Does triple order matter for bulk loading?

I haven’t been able to find anything on this question: when using the bulk loader, if the same edge is defined multiple times, which edge/triple “wins”?

If I perform the following mutation, the last triple wins. But it looks like that is not the case when I bulk load on 20.11.2. I don’t know whether the first one wins or whether it’s lazy and could thus change from time to time.

Any clever people at Dgraph who can explain what should happen?

    mutation {
      set {
        _:a <relations> _:b (meta="foo") .
        _:a <relations> _:b (meta="bar") .
      }
    }

@MichelDiz Sorry for the tag, but could you either point me to someone who knows about this, or tell me whether this is working as designed or a possible bug?

In theory, it should always be the second. I have never noticed this behavior. How did you notice it? Is the order important? Is it intermittent?

All loaders work in a transactional manner. So, the second RDF line and the second transaction should always win by this concept.

I’ve done some more testing, and it looks to be the opposite. Using the bulk loader, the first RDF wins, while the second RDF wins during a mutation.

The order is important because only one of them will be present after the import. The reason I have two, and don’t simply filter before writing the RDFs, is that we are initially loading a huge amount of data from another database where one relationship can have multiple attributes. If only one can be present, we know which one is most important. But we can’t write an efficient query against the exporting database that handles this, so we simply export all relationships, writing the less important ones to one set of RDF files and the more important ones to another set, so that we can control the order during import.

To summarize, it looks like the order is handled differently for bulk loading vs. mutation, at least for v20.11.2.
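
One way to avoid depending on load order at all would be to deduplicate the exported RDF before loading. The following is only a sketch under assumptions: the function name `dedupe` and the regular expression are hypothetical, the pattern only handles edges whose object is a uid or blank node with an optional facet list (as in the mutation above), and everything is held in memory.

```python
import re

# Hypothetical, simplified pattern: subject, <predicate>, object, optional (facets), dot.
# It does not handle literal objects that contain spaces.
TRIPLE = re.compile(r'^\s*(\S+)\s+(<[^>]+>)\s+(\S+)\s*(\([^)]*\))?\s*\.\s*$')

def dedupe(low_priority, high_priority):
    """Keep one line per (subject, predicate, object); later lines replace earlier ones."""
    winners = {}
    for line in list(low_priority) + list(high_priority):
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        m = TRIPLE.match(stripped)
        # Lines the simple pattern cannot parse are kept unchanged, keyed by themselves.
        key = m.groups()[:3] if m else stripped
        winners[key] = stripped
    return list(winners.values())

low = ['_:a <relations> _:b (meta="foo") .']
high = ['_:a <relations> _:b (meta="bar") .']
result = dedupe(low, high)
print(result)
```

Because the more important lines are fed in last, they replace the less important ones for the same edge, and the loader only ever sees one triple per edge.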

Hey @danieljuhl, in the case of the bulk loader (or live loader), the order is not guaranteed. The mutations may be processed in parallel, and there’s no way of guaranteeing the order in which they will be processed.
If the ordering is important, you can try this (assuming you have two RDF files, one from the existing database and one from the new one):

  1. Bulk load the old data (from the existing database)
  2. Live load the new data. If there are duplicates, the live-loaded data will take precedence over the existing data.
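
The two steps above could be scripted roughly as follows. This is a sketch, not a verified recipe: the file names, the schema file, and the addresses are placeholders, and starting an Alpha on the bulk loader's output between the two steps is not shown. As noted further down the thread, blank nodes are not shared between the two loads, so the files are assumed to carry explicit uids.

```python
def build_commands(old_rdf, new_rdf, schema,
                   zero="localhost:5080", alpha="localhost:9080"):
    """Return argv lists for the bulk load (step 1) and the live load (step 2)."""
    # Step 1: bulk load the data that should lose on conflict.
    bulk = ["dgraph", "bulk", "-f", old_rdf, "-s", schema, "--zero", zero]
    # Step 2: live load the data that should win on conflict.
    live = ["dgraph", "live", "-f", new_rdf, "--alpha", alpha, "--zero", zero]
    return bulk, live

# Placeholder file names, chosen only for illustration.
bulk_cmd, live_cmd = build_commands(
    "less_important.rdf", "more_important.rdf", "schema.txt")

print(" ".join(bulk_cmd))
print(" ".join(live_cmd))
```

The commands are only printed here; they could be passed to `subprocess.run(..., check=True)` once an Alpha serving the bulk output is running.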

@ibrahim the data is not from two different data sources. It is because the existing data contains multiple edges with different facets for the same set of from/to. And because we can’t query the existing data in a way so that we can control the order (avoid the duplicates) we are currently sending the data to two RDF files prior to bulk loading. The issue was that we assumed the second to overwrite the first, but it looked to be the other way around.

You have clarified that order is not guaranteed, but is it completely random or is it random within chunks of data loaded with bulk loader?

@danieljuhl

The issue was that we assumed the second to overwrite the first, but it looked to be the other way around.

Live load the second RDF file and it will overwrite the duplicates. I am guessing your data already has uids. If you bulk load _:a <foo> "12" . and then live load _:a <foo> "13" . , the two _:a would be treated as different nodes.

You have clarified that order is not guaranteed, but is it completely random or is it random within chunks of data loaded with bulk loader?

It is not completely random. We read chunks, and these chunks are processed in parallel.
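
To illustrate why parallel chunks make the winner unpredictable, here is a toy model. It is not the bulk loader's implementation: it simply assumes that a later write replaces an earlier one and enumerates every order in which two chunks could finish. The name `apply_in_order` is made up for this sketch.

```python
from itertools import permutations

def apply_in_order(chunks, completion_order):
    """Apply chunks in the order they finish; a later write replaces an earlier one."""
    store = {}
    for idx in completion_order:
        for edge, facet in chunks[idx]:
            store[edge] = facet
    return store

EDGE = ("_:a", "relations", "_:b")
chunks = [
    [(EDGE, "foo")],  # chunk 0, read first
    [(EDGE, "bar")],  # chunk 1, read second
]

# Collect the surviving facet for every possible completion order.
outcomes = {apply_in_order(chunks, order)[EDGE]
            for order in permutations(range(len(chunks)))}
print(sorted(outcomes))
```

In this model either facet can survive, depending only on which chunk finishes last, so the position of a triple in the input does not determine the result.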

If you look at the loader code, you’ll notice that the files are processed in parallel.

@danieljuhl does the suggestion above help resolve your question?

@hardik we tried the bulk-then-live-load approach and it worked, so that’s what we are using at the moment to handle this case.

At first glance the parallel loading made sense, but we tried ensuring that the “winning” data was in the last RDF file to be loaded (through file name ordering), and it still did not behave as expected.
