Does triple order matter for bulk loading

danieljuhl · March 11, 2021, 12:01pm

I haven’t been able to find anything on this question: when using the bulk loader, if the same edge is defined multiple times, which edge/triple “wins”?

If I perform the following mutation, the last triple wins. But when I bulk load on 20.11.2, that doesn’t seem to be the case. I don’t know whether the first one wins or whether it’s lazy and could therefore change from run to run.

Can any clever people at Dgraph explain what should happen?

    mutation {
      set {
        _:a <relations> _:b (meta="foo") .
        _:a <relations> _:b (meta="bar") .
      }
    }

danieljuhl · March 18, 2021, 5:54pm

@MichelDiz Sorry for the tag, but can you either point me to someone who knows about this, or tell me whether it is working as designed or is a possible bug?

MichelDiz · March 18, 2021, 8:09pm

In theory, it should always be the second. I have never noticed this behavior. How did you notice it? Is the order important? Is it intermittent?

All loaders work in a transactional manner. So by that logic, the second RDF line and the second transaction should always win.

danieljuhl · March 18, 2021, 8:29pm

I’ve done some more testing, and it looks to be the opposite. Using the bulk loader, the first RDF wins, while the second RDF wins during a mutation.

The order is important because only one of them will be present after the import. The reason I have two, and don’t simply filter before writing the RDFs, is that we are initially loading a huge amount of data from another database where one relationship can have multiple attributes. If only one can be present, we know which one is most important, but we can’t write an efficient query against the exporting database that handles this. So we simply export all relationships, writing the less important ones to one set of RDF files and the more important ones to another set, so that during import we can control the order.

To summarize, it looks like the order is handled differently for bulk loading vs. mutation, at least for v20.11.2.

ibrahim · March 19, 2021, 10:14am

Hey @danieljuhl, in the case of the bulk loader (or live loader), the order is not guaranteed. The mutations may be processed in parallel, and there’s no way of guaranteeing the order in which they will be processed.
If the ordering is important, you can try this (assuming you have two RDF files, one from the existing database and one from the new one):

1. Bulk load the old data (from the existing database).
2. Live load the new data. If there are duplicates, the live-loaded data will take precedence over the existing data.

danieljuhl · March 19, 2021, 12:43pm

@ibrahim the data is not from two different data sources. The existing data contains multiple edges with different facets for the same from/to pair. Because we can’t query the existing data in a way that lets us control the order (and avoid the duplicates), we currently write the data to two RDF files prior to bulk loading. The issue was that we assumed the second to overwrite the first, but it looked to be the other way around.

You have clarified that order is not guaranteed, but is it completely random or is it random within chunks of data loaded with bulk loader?

ibrahim · March 19, 2021, 12:54pm

@danieljuhl

> The issue was that we assumed the second to overwrite the first, but it looked to be the other way around.

Live load the second RDF file and it will overwrite the duplicates. I am guessing your data already has uids. If you bulk load the triple _:a <foo> "12" . and then live load the triple _:a <foo> "13" . the two _:a blank nodes would be treated as different nodes.
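
The blank-node caveat can be pictured with a toy model. This is only a sketch under the assumption that each loader run keeps its own table from blank-node labels to uids; the names loadRun, newRun and uidFor are made up for illustration and are not Dgraph's actual code.

```go
package main

import "fmt"

// loadRun models one loader invocation with its own blank-node table.
// Hypothetical type, for illustration only.
type loadRun struct {
	blank map[string]uint64
	next  *uint64 // shared counter standing in for the cluster's uid allocation
}

func newRun(next *uint64) *loadRun {
	return &loadRun{blank: map[string]uint64{}, next: next}
}

// uidFor returns the uid for a blank-node label, assigning a fresh uid the
// first time the label is seen within this run.
func (r *loadRun) uidFor(label string) uint64 {
	if uid, ok := r.blank[label]; ok {
		return uid
	}
	*r.next++
	r.blank[label] = *r.next
	return *r.next
}

func main() {
	var counter uint64
	bulk := newRun(&counter)
	live := newRun(&counter)

	// Within one run, _:a always resolves to the same uid.
	fmt.Println(bulk.uidFor("_:a"), bulk.uidFor("_:a")) // prints "1 1"
	// A second run starts with an empty table, so _:a gets a new uid.
	fmt.Println(live.uidFor("_:a")) // prints "2"
}
```

In this model the second run cannot overwrite anything written by the first unless the triples carry real uids or the label-to-uid mapping is carried over between runs. Whether and how the mapping can be persisted depends on the loader options in your Dgraph version.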

> You have clarified that order is not guaranteed, but is it completely random or is it random within chunks of data loaded with bulk loader?

It is not completely random. We read chunks, and these chunks are processed in parallel.

If you look at the following code, you’ll notice that the files are processed in parallel:

https://github.com/dgraph-io/dgraph/blob/1c3a3e27e0588d4242299c2d2eaa09cb1083ce16/dgraph/cmd/bulk/loader.go#L268-L296


      
          	for i, file := range files {
          		x.Check(thr.Do())
          		fmt.Printf("Processing file (%d out of %d): %s\n", i+1, len(files), file)
          
          		go func(file string) {
          			defer thr.Done(nil)
          
          			key := ld.opt.EncryptionKey
          			if !ld.opt.Encrypted {
          				key = nil
          			}
          			r, cleanup := fs.ChunkReader(file, key)
          			defer cleanup()
          
          			chunk := chunker.NewChunker(loadType, 1000)
          			for {
          				chunkBuf, err := chunk.Chunk(r)
          				if chunkBuf != nil && chunkBuf.Len() > 0 {
          					ld.readerChunkCh <- chunkBuf
          				}

This file has been truncated.
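
To see why this structure gives no ordering guarantee across files, here is a small standalone sketch of the same shape: one goroutine per file, all sending chunks into a shared channel. It is a toy model of the reading stage only, not the bulk loader itself; the file type, the collectChunks function and the file names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// file is a stand-in for an RDF file that has already been split into chunks.
type file struct {
	name   string
	chunks []string
}

// collectChunks starts one goroutine per file, all writing to one channel,
// and returns the chunks in the order they arrived.
func collectChunks(files []file) []string {
	chunkCh := make(chan string)
	var wg sync.WaitGroup

	for _, f := range files {
		wg.Add(1)
		go func(f file) {
			defer wg.Done()
			for _, c := range f.chunks {
				chunkCh <- f.name + ":" + c
			}
		}(f)
	}

	// Close the channel once every reader goroutine has finished.
	go func() {
		wg.Wait()
		close(chunkCh)
	}()

	var arrived []string
	for c := range chunkCh {
		arrived = append(arrived, c)
	}
	return arrived
}

func main() {
	files := []file{
		{"1_less_important.rdf", []string{"chunk0", "chunk1", "chunk2"}},
		{"2_more_important.rdf", []string{"chunk0", "chunk1", "chunk2"}},
	}
	// The interleaving of the two files can differ from run to run.
	fmt.Println(collectChunks(files))
}
```

Even though the files are started in name order, chunks from the second file can arrive before chunks from the first, so naming the "winning" file last does not make its triples arrive last.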

hardik · March 26, 2021, 9:34am

@danieljuhl does the suggestion above resolve your question?

danieljuhl · April 9, 2021, 4:52am

@hardik we tried the bulk-then-live-load approach and it worked, so that’s what we are using at the moment to handle this case.

At first glance the parallel loading made sense, but we tried ensuring that the “winning” data was in the last RDF file to be loaded (through file name ordering), and it still did not behave as expected.
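
Since load order cannot be relied on, another option is to resolve the duplicates before loading, so that each edge appears only once in the input. The following is a minimal sketch of that idea, not something proposed in this thread. It assumes the triples fit in memory and that the object of each triple is a node reference, so the first " (" on a line starts the facet list. The function names edgeKey and mergeByPriority are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// edgeKey returns the subject/predicate/object part of a triple, with the
// facet list "(...)" and the trailing "." removed.
// Assumption: the object is a node reference, not a literal containing " (".
func edgeKey(line string) string {
	line = strings.TrimSpace(line)
	line = strings.TrimSuffix(line, ".")
	if i := strings.Index(line, " ("); i >= 0 {
		line = line[:i]
	}
	return strings.Join(strings.Fields(line), " ")
}

// mergeByPriority keeps one triple per edge. Sources are passed from least
// to most important; a later source replaces an earlier one for the same edge.
func mergeByPriority(sources ...[]string) []string {
	chosen := map[string]string{}
	var order []string // first-seen order of edges, for stable output

	for _, src := range sources {
		for _, line := range src {
			line = strings.TrimSpace(line)
			if line == "" {
				continue
			}
			k := edgeKey(line)
			if _, seen := chosen[k]; !seen {
				order = append(order, k)
			}
			chosen[k] = line
		}
	}

	out := make([]string, 0, len(order))
	for _, k := range order {
		out = append(out, chosen[k])
	}
	return out
}

func main() {
	less := []string{`_:a <relations> _:b (meta="foo") .`}
	more := []string{`_:a <relations> _:b (meta="bar") .`}
	for _, l := range mergeByPriority(less, more) {
		fmt.Println(l) // prints: _:a <relations> _:b (meta="bar") .
	}
}
```

The merged output can then go through a single bulk load with no dependence on processing order. For exports too large to hold in a map, the same rule could be applied after an external sort on the edge key.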
