Help to speed up bulk update

We use Dgraph version 21.03 with the schema below:

type Teacher {
    teacherId: ID!
    internalId: String! @search
    yearLevel: [YearLevel]
    learningAreas: [LearningArea]
    similarTeacher: [SimilarTeacher]
}

type LearningArea {
    areaId: ID!
    areaName: String! @search
}

type YearLevel {
    levelId: ID!
    level: String! @search
}

type SimilarTeacher {
    similarTeacherId: ID!
    internalId: String! @search
    teacher: Teacher!
    score: Float! @search
}

There are about 130,000 teachers loaded into Dgraph.

We run a simple comparison of every teacher against every other teacher, and a pair is added to each teacher's similar-teachers list if it matches.

The following is an example of our update query:

upsert {
    query {
        qsource(func: eq(Teacher.internalId, "53513")) {
            source as uid
        }
        qt1(func: eq(Teacher.internalId, "76637")) {
            t1 as uid
        }
        qsim1(func: eq(SimilarTeacher.internalId, "5351376637")) {
            sim1 as uid
        }
        qrev_sim1(func: eq(SimilarTeacher.internalId, "7663753513")) {
            rev_sim1 as uid
        }
        qt2(func: eq(Teacher.internalId, "56968")) {
            t2 as uid
        }
        qsim2(func: eq(SimilarTeacher.internalId, "5351356968")) {
            sim2 as uid
        }
        qrev_sim2(func: eq(SimilarTeacher.internalId, "5696853513")) {
            rev_sim2 as uid
        }
    }

    mutation {
        set {
            uid(sim1) <SimilarTeacher.internalId> "5351376637" .
            uid(sim1) <SimilarTeacher.teacher> uid(t1) .
            uid(sim1) <SimilarTeacher.score> "0.5270462766947299" .
            uid(sim1) <dgraph.type> "SimilarTeacher" .
            uid(source) <Teacher.similarTeacher> uid(sim1) .
            uid(rev_sim1) <SimilarTeacher.internalId> "7663753513" .
            uid(rev_sim1) <SimilarTeacher.teacher> uid(source) .
            uid(rev_sim1) <SimilarTeacher.score> "0.5270462766947299" .
            uid(rev_sim1) <dgraph.type> "SimilarTeacher" .
            uid(t1) <Teacher.similarTeacher> uid(rev_sim1) .
            uid(sim2) <SimilarTeacher.internalId> "5351356968" .
            uid(sim2) <SimilarTeacher.teacher> uid(t2) .
            uid(sim2) <SimilarTeacher.score> "0.5163977794943223" .
            uid(sim2) <dgraph.type> "SimilarTeacher" .
            uid(source) <Teacher.similarTeacher> uid(sim2) .
            uid(rev_sim2) <SimilarTeacher.internalId> "5696853513" .
            uid(rev_sim2) <SimilarTeacher.teacher> uid(source) .
            uid(rev_sim2) <SimilarTeacher.score> "0.5163977794943223" .
            uid(rev_sim2) <dgraph.type> "SimilarTeacher" .
            uid(t2) <Teacher.similarTeacher> uid(rev_sim2) .
        }
    }
}
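
(For illustration, one "forward" block of an upsert like this could be sent from a client as in the following sketch. It assumes pydgraph and an Alpha at localhost:9080; upsert_similar_pair is a hypothetical helper name, not part of the original setup.)

# Sketch only: assumes pydgraph (pip install pydgraph) and an Alpha at
# localhost:9080. upsert_similar_pair is an illustrative helper, not real code
# from this setup.
import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

def upsert_similar_pair(source_id: str, target_id: str, score: float) -> None:
    # Look up the two teachers and the (possibly existing) SimilarTeacher node.
    sim_id = source_id + target_id
    query = f"""{{
        qsource(func: eq(Teacher.internalId, "{source_id}")) {{ source as uid }}
        qt(func: eq(Teacher.internalId, "{target_id}")) {{ t as uid }}
        qsim(func: eq(SimilarTeacher.internalId, "{sim_id}")) {{ sim as uid }}
    }}"""
    # Same N-Quads as one forward block of the upsert above.
    nquads = f"""
        uid(sim) <SimilarTeacher.internalId> "{sim_id}" .
        uid(sim) <SimilarTeacher.teacher> uid(t) .
        uid(sim) <SimilarTeacher.score> "{score}" .
        uid(sim) <dgraph.type> "SimilarTeacher" .
        uid(source) <Teacher.similarTeacher> uid(sim) .
    """
    txn = client.txn()
    try:
        mutation = txn.create_mutation(set_nquads=nquads)
        request = txn.create_request(query=query, mutations=[mutation], commit_now=True)
        txn.do_request(request)
    finally:
        txn.discard()

upsert_similar_pair("53513", "76637", 0.5270462766947299)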

When running the update, one teacher is typically updated with 500 similar teachers per call.

One update takes about 250 ms.

We have 8,449,935,000 rows to write in total (one per pair of teachers, i.e. 130,000 × 129,999 / 2, batched into roughly 17 million update statements), and at the moment the ETA to complete all of the updates is very long.

So the ETA to update all the records is:

8,449,935,000 rows / 500 (rows per batch) × 250 (ms per batch) / 1,000 (ms to s) / 60 (s to min) / 60 (min to h) / 24 (h to days) ≈ 48.9 days
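
(One lever that is independent of hardware: each call spends most of its 250 ms waiting on the network and disk, so keeping several upserts in flight at once raises throughput. A rough sketch, assuming pydgraph and that the (query, nquads) payload for each 500-row batch is built elsewhere:)

# Sketch only: issue upsert batches concurrently instead of one at a time.
from concurrent.futures import ThreadPoolExecutor

import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

def send_batch(batch):
    # batch = (query, set_nquads) for one upsert block like the example above.
    query, nquads = batch
    txn = client.txn()
    try:
        mutation = txn.create_mutation(set_nquads=nquads)
        request = txn.create_request(query=query, mutations=[mutation], commit_now=True)
        txn.do_request(request)
    finally:
        txn.discard()

batches = []  # to be filled with (query, nquads) tuples, one per 500-row batch

# With 16 requests in flight at ~250 ms each, throughput is roughly
# 16 / 0.25 = 64 batches/s instead of 4 batches/s serially.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(send_batch, batches))

(At 64 batches/s, the ~17 million batches would take about 3 days instead of 49, assuming the cluster can absorb the load. Note that concurrent transactions touching the same teacher can abort with conflicts in Dgraph, so each worker should handle a disjoint set of source teachers and aborted transactions need a retry.)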

Are there any suggestions for us to speed up this process?
We really appreciate your help.

Can you share details about your cluster? How many Alphas and Zeros? How many resources does each of them have (RAM, CPU, and so on)? Are you using NVMe? Are you balancing the requests? It isn't recommended to concentrate the queries on a single Alpha.

Hey Michel,
thanks for the reply.
Below are our details:

We have one server running Zero and Alpha on the same machine. It has 8 cores and 32 GB of RAM.
The server is hosted in a virtual environment on a SAS flash drive.

These resources aren't good enough. A single machine with 32 GB of RAM, in a VM using SAS storage, won't perform well in the context you want. You should use Dgraph's principle, "Distributed Graph": scale it horizontally as much as you can. Use NVMe and give native or close-to-native access to I/O, e.g. use KVM instead of a regular VM.

Add six machines with 32 GB each for the Alphas and three machines with 16 GB for the Zeros, with a separate SSD for each of them and a good network (LAN or some cloud provider). Always try to balance the load between the Alphas. With that you can get better numbers.
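
(For the balancing part, the client itself can spread requests across Alphas. A minimal sketch, assuming pydgraph and hypothetical hostnames alpha1..alpha3:)

# Sketch only: pydgraph picks one of the supplied stubs per transaction,
# which spreads the upsert load across the Alphas.
# alpha1..alpha3 are placeholder hostnames.
import pydgraph

client = pydgraph.DgraphClient(
    pydgraph.DgraphClientStub("alpha1:9080"),
    pydgraph.DgraphClientStub("alpha2:9080"),
    pydgraph.DgraphClientStub("alpha3:9080"),
)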