How to solve mutation conflict

Hello, developers.
I have a problem with batch mutations.

What I want to do

We have a lot of data to insert into Dgraph, and we created several indexed predicates in our schema. To speed up the inserts, we run multiple scripts that batch-insert into the Dgraph cluster. But each time new data is inserted, the index is updated, so we always get Exception: Transaction has been aborted. Please retry. That means we can only insert one record at a time.
I'm confused about how to solve this. With this much data, we can't insert records one by one!

Dgraph metadata

dgraph version

Dgraph version : v21.03.0
Dgraph codename : rocket-mod
Dgraph SHA-256 : 4ca26023e812146d88fc3f5b364589a4de2776fa3dce849d2eff103f3fa9ae60
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : release/v21.03
Go version : go1.15.9
jemalloc enabled : true

Can you elaborate on how you're doing the batch mutations? Do you commit after each txn is done?

Yes, I commit after each txn is done.
I use pydgraph for that, and the code looks roughly like this:

import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub('localhost:9080'))

for data in datas:
    new_person = data.get('name')
    # Upsert block: bind existing nodes with this name to the "person" variable.
    query = '''{ person as var(func: eq(name, "%s")) }''' % new_person
    nquad = f'uid(person) <name> "{new_person}" .'
    txn = client.txn()
    try:
        mutation = txn.create_mutation(set_nquads=nquad)
        req = txn.create_request(query=query, mutations=[mutation], commit_now=True)
        print(txn.do_request(req))
    finally:
        txn.discard()
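(The upsert block works like this: the query binds the uids of any existing nodes with that name to the person variable, and the mutation writes to uid(person); if the variable is empty, Dgraph assigns a new uid, so this is insert-or-update. commit_now=True commits the transaction in the same request.)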

Ah… pydgraph is yet to be updated for v21.03.

Hi @Ro0tk1t ,
I’ll release a new version asap. You can update Pydgraph and try again.

Thanks!

@Anurag @chewxy
Hmm… I see the latest pydgraph is still v20.07 on pypi.org. Is my usage of pydgraph correct?
Will it be fixed when I update pydgraph?

Yes, your usage of Pydgraph is correct for Dgraph v20.07 or v20.11. Note that Dgraph has since been updated to v21.03, which introduced a few breaking changes for the clients. We are updating the clients to support the newer version; once I release the new version, this problem should go away. As a quick check, you can test whether your code works with Dgraph v20.07 right now.

Running the script alone works fine, but as I said, we always get Exception: Transaction has been aborted. Please retry when running multiple scripts.

Does this happen with a different Dgraph version as well, e.g. Dgraph v20.11?

Yes.

@Anurag
After installing the newest pydgraph from the GitHub repo, nothing has changed:
Transaction has been aborted. Please retry

I think that's not because of the client, but a limitation on the Dgraph server side.
How can I speed up batch mutations? :sob:

Hey @Ro0tk1t, there might be conflicts between the transactions, due to which they get aborted.
If you want to load the data in bulk, have you considered the live loader or bulk loader? They manage this conflict resolution for you.
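If you do stay with client-side transactions, note that an aborted transaction has committed nothing, so it is safe to simply retry it. A minimal sketch with pydgraph (the helper name and retry count are just illustrative):

import pydgraph

def upsert_with_retry(client, query, nquad, max_retries=5):
    # An aborted txn committed nothing, so re-running it is safe.
    for _ in range(max_retries):
        txn = client.txn()
        try:
            mutation = txn.create_mutation(set_nquads=nquad)
            req = txn.create_request(query=query, mutations=[mutation], commit_now=True)
            return txn.do_request(req)
        except pydgraph.AbortedError:
            continue  # conflicted with a concurrent txn; try again
        finally:
            txn.discard()
    raise Exception('transaction still aborting after %d retries' % max_retries)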

Fine, I'll use bulk. But there is a problem there too, and no error detail is printed.


I give up.

Hey @Ro0tk1t, sorry for the inconvenience. There was a minor bug that caused the bulk loader to crash without printing the error message.
Basically, you are trying to insert data whose type is defined as a scalar (string, int, float, etc.) in the schema, while in the data file it is of type uid.
Example:
Schema: name: string .
Data: _:a <name> _:b . or _:a <name> uid(0x10) .
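(Matching data for that schema would instead give the predicate a string value, for example: _:a <name> "some name" .)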
This PR, "fix(bulk): throw the error instead of crashing" (dgraph-io/dgraph#7722 on GitHub), should fix this issue. But you would still need to correct your data. With my change, you will see this crash error message:

2021/04/14 12:43:01 RDF doesn't match schema: Input for predicate "name" of type scalar is uid. Edge: entity:4600001 attr:"\000\000\000\000\000\000\000\000name" value_type:UID value_id:4700001 

Do let me know if you have any queries or need help with data loading.
Thanks for reporting the bug. :slight_smile:

@Naman
Thanks. One more problem… after bulk loading into Dgraph, all edges were missing. I think my schema and RDF are right.
The test schema file:

field1: string .
type A {
    field1
}
A: [uid] @reverse . 

field2: string .
type B {
    field2
}

The test RDF file:

<_:A_1> <dgraph.type> "A" .       
<_:A_1> <field1> "value1" .       
<_:A_2> <dgraph.type> "A" .       
<_:A_2> <field1> "value2" .       
                                  
<_:B_1> <dgraph.type> "B" .       
<_:B_1> <field2> "aaaaaaaaaaaa" . 
<_:B_1> <A> <_:A_1> .               

And the bulk command is:

dgraph bulk -s s.schema -f test.rdf --zero localhost:5080

After the Dgraph cluster is up, I run a query in Ratel:

{
  a(func: has(field2)){
    expand(_all_)
    {
      ~A{
        expand(_all_)
      }
      A{
        expand(_all_)
      }
    }
  }
}

But I can only see the single B data node; there is no ~A edge and no A data node.

{
  "data": {
    "a": [
      {
        "field2": "aaaaaaaaaaaa"
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 215458,
      "processing_ns": 640911,
      "encoding_ns": 25487,
      "assign_timestamp_ns": 1072510,
      "total_ns": 2323218
    },
    "txn": {
      "start_ts": 233
    },
    "metrics": {
      "num_uids": {
        "_total": 1,
        "field2": 1,
        "~A": 0
      }
    }
  }
}

TL;DR: edge A should not be a child of the first expand in your query.

In your example set, has(field2) matches the type B node, for which expand returns field2… but type B has no edge predicates in its definition, so the rest of your query does nothing.

You have a forward edge (:B)-[:A]->(:A), so I think you want this:

{
  a(func: has(field2)){ # equivalent to type(B) here
      expand(_all_) #gives you field2
      A{
        expand(_all_) #gives you field1
      }
  }
}

If the edge predicate A were in the type definition for type B, expand would traverse it for you as well.
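With the test data above, that corrected query should return something like this (a sketch; the extensions block is omitted):

{
  "data": {
    "a": [
      {
        "field2": "aaaaaaaaaaaa",
        "A": [
          {
            "field1": "value1"
          }
        ]
      }
    ]
  }
}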

A few points to note:

Each predicate must be declared with a type; this can be a simple scalar (string, int, etc.), a uid, or an array of these.

I'd suggest the following schema for you:

type A {
    field1
}
field1: string .

type B {
    field2
    hasConnectionTo
}
field2: string .
hasConnectionTo: [uid] @reverse . 

And accordingly the test RDF would be:

<_:A_1> <dgraph.type> "A" .       
<_:A_1> <field1> "value1" .       
<_:A_2> <dgraph.type> "A" .       
<_:A_2> <field1> "value2" .       
                                  
<_:B_1> <dgraph.type> "B" .       
<_:B_1> <field2> "aaaaaaaaaaaa" . 
<_:B_1> <hasConnectionTo> <_:A_1> .               

Now you can run something like the following:

{
    getField2(func: has(field2)){
        hasConnectionTo{
            expand(_all_)
        }

        ~hasConnectionTo{
            expand(_all_)
        }
    }
}

While this makes sense for now, you might want to move to type-based selection (e.g. func: type(B)) with filters on predicates in the future.
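For example, a type-based version of the query above might look like this (a sketch; the block name getB is just illustrative):

{
    getB(func: type(B)){
        field2
        hasConnectionTo{
            expand(_all_)
        }
    }
}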

Thanks!
For now the test files above work fine.
For our production files, many data nodes and edges were lost after the bulk load. Also, the bulk loader usually fails at the REDUCE stage because of OOM (it only succeeded 2 times), so we are trying the -j 1 option to test again. It's a little slow.
Is there some way to restart a bulk load from the mapped files in the tmp directory?

And I'm a little confused: if the RDF file looks like this:

#<_:A_1> <dgraph.type> "A" .       
#<_:A_1> <field1> "value1" .       
<_:A_2> <dgraph.type> "A" .       
<_:A_2> <field1> "value2" .       
                                  
<_:B_1> <dgraph.type> "B" .       
<_:B_1> <field2> "aaaaaaaaaaaa" . 
<_:B_1> <hasConnectionTo> <_:A_1> .

will the node A_1 exist after the bulk load?

Hey @Ro0tk1t, can you elaborate on this, please? It would be helpful to see a sample of the kind of data that was not loaded correctly, if any.

Do you have any memory profiles? I assume you are on v21.03; correct me if I am wrong. Also, a couple of questions:

  • What are the specifications of the machine you are running the bulk loader on?
  • What is the data size?

Yes, you can use the --skip_map_phase flag to skip the map phase and --tmp to point at the tmp directory generated earlier.
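Building on your earlier bulk command, that would look something like this (assuming the map output from the previous run is still in the default tmp directory):

dgraph bulk -s s.schema -f test.rdf --skip_map_phase --tmp tmp --zero localhost:5080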

No, A_1 will not be loaded.

Here is a piece of the log from /var/log/messages about the OOM:

Apr 15 10:46:50 localhost kernel: ls invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 15 10:46:50 localhost kernel: ls cpuset=/ mems_allowed=0
Apr 15 10:46:50 localhost kernel: CPU: 1 PID: 28654 Comm: ls Kdump: loaded Not tainted 3.10.0-1160.21.1.el7.x86_64 #1
Apr 15 10:46:50 localhost kernel: Hardware name: Bochs Bochs, BIOS rel-1.7.5.1-20190822_073655 04/01/2014
Apr 15 10:46:50 localhost kernel: Call Trace:
Apr 15 10:46:50 localhost kernel: [<ffffffff90d8305a>] dump_stack+0x19/0x1b
Apr 15 10:46:50 localhost kernel: [<ffffffff90d7d97a>] dump_header+0x90/0x229
Apr 15 10:46:50 localhost kernel: [<ffffffff9090eb3b>] ? cred_has_capability+0x6b/0x120
Apr 15 10:46:50 localhost kernel: [<ffffffff907c221d>] oom_kill_process+0x2cd/0x490
Apr 15 10:46:50 localhost kernel: [<ffffffff9090ec1e>] ? selinux_capable+0x2e/0x40
Apr 15 10:46:50 localhost kernel: [<ffffffff907c290a>] out_of_memory+0x31a/0x500
Apr 15 10:46:50 localhost kernel: [<ffffffff90d7e497>] __alloc_pages_slowpath+0x5db/0x729
Apr 15 10:46:50 localhost kernel: [<ffffffff907c8e86>] __alloc_pages_nodemask+0x436/0x450
Apr 15 10:46:50 localhost kernel: [<ffffffff90818b58>] alloc_pages_current+0x98/0x110
Apr 15 10:46:50 localhost kernel: [<ffffffff907bdcd7>] __page_cache_alloc+0x97/0xb0
Apr 15 10:46:50 localhost kernel: [<ffffffff907c0c70>] filemap_fault+0x270/0x420
Apr 15 10:46:50 localhost kernel: [<ffffffffc029791e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Apr 15 10:46:50 localhost kernel: [<ffffffffc0297b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
Apr 15 10:46:50 localhost kernel: [<ffffffff907edf5a>] __do_fault.isra.61+0x8a/0x100
Apr 15 10:46:50 localhost kernel: [<ffffffff907ee50c>] do_read_fault.isra.63+0x4c/0x1b0
Apr 15 10:46:50 localhost kernel: [<ffffffff907f5d50>] handle_mm_fault+0xa20/0xfb0
Apr 15 10:46:50 localhost kernel: [<ffffffff90d90653>] __do_page_fault+0x213/0x500
Apr 15 10:46:50 localhost kernel: [<ffffffff90d90a26>] trace_do_page_fault+0x56/0x150
Apr 15 10:46:50 localhost kernel: [<ffffffff90d8ffa2>] do_async_page_fault+0x22/0xf0
Apr 15 10:46:50 localhost kernel: [<ffffffff90d8c7a8>] async_page_fault+0x28/0x30
Apr 15 10:46:50 localhost kernel: Mem-Info:
Apr 15 10:46:50 localhost kernel: active_anon:7548052 inactive_anon:137382 isolated_anon:0#012 active_file:38 inactive_file:1482 isolated_file:31#012 unevictable:0 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:92639 slab_unreclaimable:14747#012 mapped:10563 shmem:405626 pagetables:47771 bounce:0#012 free:49990 free_pcp:143 free_cma:0
Apr 15 10:46:50 localhost kernel: Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 15 10:46:50 localhost kernel: lowmem_reserve[]: 0 2829 31991 31991
Apr 15 10:46:50 localhost kernel: Node 0 DMA32 free:122524kB min:5972kB low:7464kB high:8956kB active_anon:2526928kB inactive_anon:45416kB active_file:0kB inactive_file:1760kB unevictable:0kB isolated(anon):0kBisolated(file):0kB present:3129216kB managed:2897760kB mlocked:0kB dirty:0kB writeback:0kB mapped:4980kB shmem:146248kB slab_reclaimable:152112kB slab_unreclaimable:10172kB kernel_stack:816kB pagetables:17280kBunstable:0kB bounce:0kB free_pcp:196kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2018 all_unreclaimable? no
Apr 15 10:46:50 localhost kernel: lowmem_reserve[]: 0 0 29161 29161
Apr 15 10:46:50 localhost kernel: Node 0 Normal free:61544kB min:61572kB low:76964kB high:92356kB active_anon:27665280kB inactive_anon:504112kB active_file:204kB inactive_file:4168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29864624kB mlocked:0kB dirty:0kB writeback:0kB mapped:37272kB shmem:1476256kB slab_reclaimable:218444kB slab_unreclaimable:48800kB kernel_stack:3904kB pagetables:173804kB unstable:0kB bounce:0kB free_pcp:376kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4913 all_unreclaimable? no
Apr 15 10:46:50 localhost kernel: lowmem_reserve[]: 0 0 0 0
Apr 15 10:46:50 localhost kernel: Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Apr 15 10:46:50 localhost kernel: Node 0 DMA32: 1095*4kB (UE) 809*8kB (UE) 1696*16kB (UEM) 2060*32kB (UEM) 200*64kB (UEM) 31*128kB (UE) 10*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 123236kB
Apr 15 10:46:50 localhost kernel: Node 0 Normal: 1511*4kB (UE) 1466*8kB (UEM) 1149*16kB (UEM) 469*32kB (UEM) 126*64kB (UEM) 27*128kB (UEM) 2*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63196kB
Apr 15 10:46:50 localhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 15 10:46:50 localhost kernel: 407831 total pagecache pages
Apr 15 10:46:50 localhost kernel: 0 pages in swap cache
Apr 15 10:46:50 localhost kernel: Swap cache stats: add 0, delete 0, find 0/0
Apr 15 10:46:50 localhost kernel: Free swap  = 0kB
Apr 15 10:46:50 localhost kernel: Total swap = 0kB
Apr 15 10:46:50 localhost kernel: 8388478 pages RAM
Apr 15 10:46:50 localhost kernel: 0 pages HighMem/MovableOnly
Apr 15 10:46:50 localhost kernel: 193905 pages reserved
Apr 15 10:46:50 localhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr 15 10:46:50 localhost kernel: [  471]     0   471    31363    12826      67        0             0 systemd-journal
Apr 15 10:46:50 localhost kernel: [  499]     0   499    12158      559      26        0         -1000 systemd-udevd
Apr 15 10:46:50 localhost kernel: [  638]     0   638    13883      112      26        0         -1000 auditd
Apr 15 10:46:50 localhost kernel: [  665]   999   665   153256     2291      61        0             0 polkitd
Apr 15 10:46:50 localhost kernel: [  668]    81   668    16571      152      33        0          -900 dbus-daemon
Apr 15 10:46:50 localhost kernel: [  674]   998   674    30147      122      29        0             0 chronyd
Apr 15 10:46:50 localhost kernel: [  750]     0   750     6596       75      19        0             0 systemd-logind
Apr 15 10:46:50 localhost kernel: [  793]     0   793    31598      160      17        0             0 crond
Apr 15 10:46:50 localhost kernel: [  908]     0   908    89710     5621      97        0             0 firewalld
Apr 15 10:46:50 localhost kernel: [ 2765]     0  2765   143572     2831      97        0             0 tuned
Apr 15 10:46:50 localhost kernel: [ 2770]     0  2770   173343     8717     182        0             0 rsyslogd
Apr 15 10:46:50 localhost kernel: [ 2827]     0  2827    27552       34      10        0             0 agetty
Apr 15 10:46:50 localhost kernel: [ 2862]     0  2862     6117      115      16        0         -1000 sshd
Apr 15 10:46:50 localhost kernel: [ 3133]     0  3133    22436      260      42        0             0 master
Apr 15 10:46:50 localhost kernel: [ 3136]    89  3136    22479      256      45        0             0 qmgr
Apr 15 10:46:50 localhost kernel: [29274]     0 29274     6967      283      19        0             0 sshd
Apr 15 10:46:50 localhost kernel: [24872]     0 24872     5988       95      16        0             0 sftp-server
Apr 15 10:46:50 localhost kernel: [13398]     0 13398     6845      130      18        0             0 sshd
Apr 15 10:46:50 localhost kernel: [16576]     0 16576    28887      104      13        0             0 bash
Apr 15 10:46:50 localhost kernel: [12019]     0 12019     6738     1290      18        0             0 tmux
Apr 15 10:46:50 localhost kernel: [12020]     0 12020    28887      116      13        0             0 bash
Apr 15 10:46:50 localhost kernel: [31979]     0 31979    40558      208      36        0             0 top
Apr 15 10:46:50 localhost kernel: [15285]     0 15285 72341147  7258379   46500        0             0 dgraph
Apr 15 10:46:50 localhost kernel: [14476]     0 14476    28887      100      13        0             0 bash
Apr 15 10:46:50 localhost kernel: [14980]     0 14980     5011       69      15        0             0 tmux
Apr 15 10:46:50 localhost kernel: [ 3844]    89  3844    22462      252      44        0             0 pickup
Apr 15 10:46:50 localhost kernel: [25393]     0 25393    27014       19      10        0             0 sleep
Apr 15 10:46:50 localhost kernel: [27707]     0 27707    27014       19       9        0             0 sleep
Apr 15 10:46:50 localhost kernel: [27710]     0 27710    27014       18      10        0             0 sleep
Apr 15 10:46:50 localhost kernel: [28642]     0 28642    27014       24      10        0             0 sleep
Apr 15 10:46:50 localhost kernel: [28645]     0 28645    27014       24      10        0             0 sleep
Apr 15 10:46:50 localhost kernel: [28650]     0 28650    27014       23       9        0             0 sleep
Apr 15 10:46:50 localhost kernel: [28653]     0 28653     4853       37      14        0             0 ls
Apr 15 10:46:50 localhost kernel: [28654]     0 28654     4853       38      13        0             0 ls
Apr 15 10:46:50 localhost kernel: [28655]     0 28655    27014       22       9        0             0 sleep
Apr 15 10:46:50 localhost kernel: Out of memory: Kill process 15285 (dgraph) score 864 or sacrifice child
Apr 15 10:46:50 localhost kernel: Killed process 15285 (dgraph), UID 0, total-vm:289364588kB, anon-rss:29033516kB, file-rss:0kB, shmem-rss:0kB

Yes, I use v21.03.

CentOS 7.6 / 20 cores / 48 GB memory / 2 TB SSD storage.
The RDF dataset is approximately 700 GB.