Bulk load - missing predicates

Hi, I’m trying out bulk load into a multi-node cluster (1 zero, 3 servers). The dgraph bulk command runs fine and generates as many p directories as the number of reducers (3). I then copied these directories [out/0/p, out/1/p, out/2/p] to the 3 different server nodes and started each of them (with the -p flag pointing to the correct directory).
But when I query this data, I can only see data for a few predicates; the other predicates are missing. Is there something simple I’m missing here? Should I copy something else from the bulk output to the server nodes, in addition to the out/n/p directories?
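
For reference, this is roughly how I started things (IPs, ports and memory values are placeholders; only the -p path differs per node):

dgraph zero --my=<zero-ip>:5080
dgraph server --lru_mb=4096 --zero=<zero-ip>:5080 --my=<node1-ip>:7080 -p out/0/p
dgraph server --lru_mb=4096 --zero=<zero-ip>:5080 --my=<node2-ip>:7080 -p out/1/p
dgraph server --lru_mb=4096 --zero=<zero-ip>:5080 --my=<node3-ip>:7080 -p out/2/p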

Please show your bulk config.
Are the RDFs you’re using public? Can you share them?
Are all the servers running at the same time?
Can you check manually, with a text editor, whether these RDFs actually contain the predicates you are looking for?

Cheers

I can confirm that there is no issue with the RDFs, because I’ve loaded the same RDF with a single reducer and everything is fine; I was able to find all the predicates. But when I do a bulk load with reduce shards > 1, things don’t work. Here is my bulk config:
dgraph bulk -r totalPayLoad.rdf -s ~/dgraph/totalSchema --map_shards=10 --reduce_shards=2 --http localhost:8765 --zero=localhost:5080

And yes, all the servers are running together.

If you want me to test and report the error, I need access to the RDFs so I can reproduce the issue as you describe it. Please share a link to them.

The data is not publicly shareable; I’ll try to reproduce the issue on the 21million movies dataset and get back to you.

I’ve tried the same approach using the 21million.rdf.gz and 21million.schema from benchmarks/data at master · dgraph-io/benchmarks · GitHub, and I was able to replicate the issue: I can only see a few predicates.

I’m sharing the zero’s state, obtained from the {zero’s-ip}:6080/state endpoint:

https://drive.google.com/file/d/1NgoVt7Et99eY-5U9JqX9s-rIKRqd8V2R/view?usp=sharing
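
(For reference, the zero’s state can be fetched with something like the following; the host and port here assume a zero running locally on its default HTTP port:)

curl http://localhost:6080/state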

We can clearly see that only a few predicates occupy a significant amount of space, and those are the ones available for querying.

The following query returns only the name and writer.film predicates (which, if you look at the zero’s state, are the ones occupying a decent amount of space).

{
  getPredicates(func: eq(name@en, "Steven Spielberg")) {
    predicate
  }
}

Result:

{
  "data": {
    "getPredicates": [
      {
        "predicate": [
          "name"
        ]
      },
      {
        "predicate": [
          "name"
        ]
      },
      {
        "predicate": [
          "name",
          "writer.film"
        ]
      },
      {
        "predicate": [
          "name"
        ]
      }
    ]
  }
}

The command I used for the bulk load is:

dgraph bulk -r ~/dgraph/21million.rdf.gz -s ~/dgraph/21million.schema --map_shards=10 --reduce_shards=2 --http localhost:8765 --zero=localhost:5080

{
  "RDFDir": "~/dgraph/21million.rdf.gz",
  "SchemaFile": "~/dgraph/21million.schema",
  "DgraphsDir": "out",
  "TmpDir": "tmp",
  "NumGoroutines": 4,
  "MapBufSize": 67108864,
  "ExpandEdges": true,
  "SkipMapPhase": false,
  "CleanupTmp": true,
  "NumShufflers": 1,
  "Version": false,
  "StoreXids": false,
  "ZeroAddr": "localhost:5080",
  "HttpAddr": "localhost:8765",
  "MapShards": 10,
  "ReduceShards": 2
}

Hey, sorry for the delay.

I believe your case is related to this issue. Seems the same to me.

I would recommend the following (Pawan recommends something similar): if you want to use instances with shards, the first instance should hold a complete copy and the others can (perhaps) hold the reduced shards; the instances will then communicate with each other as needed. Otherwise, don’t use --reduce_shards=2. Use a single reduce shard and copy its contents to the other Dgraph Servers, if you want the Dgraph Servers to act as replicas.

In the issue from "Veludurai109", Pawan recommends replicating the folder without reducing it. That way you will have complete replicas.

My recommendation would be the following process:

First, create a bulk output for the "leader" (the original, complete copy).

1st Terminal => Start Zero (and keep it running at all times):

dgraph zero

2nd Terminal => Start the bulk load:

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=1 --zero=localhost:5080

You will get a single out/0 output. Move it to an organized directory, or handle it however you like.
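
For example (the directory name is just a suggestion, chosen to match the -p flag used further below):

mv out/0/p leader   # keep the complete copy for the "leader" server
rm -r out           # clear the bulk output before the next run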

Then create reduced shards.

2nd Terminal => Start another bulk load:

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=2 --shufflers=2 --zero=localhost:5080

PS: the --shufflers default is 1.

You will get two outputs. Move them to an organized directory, or handle them however you like.
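
For example (again, the names are just suggestions that match the -p flags below):

mv out/0/p 0   # first reduced shard
mv out/1/p 1   # second reduced shard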

Now you have 3 outputs: one complete, which will be your leader, and two reduced.

Start your Dgraph Servers:

2nd Terminal => Start the "leader" Dgraph Server:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7080 -o=0 -p leader

3rd Terminal => Start Dgraph Server 2:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7081 -o=1 -p 0

4th Terminal => Start Dgraph Server 3:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7082 -o=2 -p 1

If something here is wrong, someone please correct me; I only know the basics of the Bulk Loader.

But this should really be 3 shards here, since you are using 3 servers. Dgraph is normally used with odd numbers, never even.

All Dgraph configurations should use odd numbers.

Try the following: create three shards and start them as shown above. It should work.

dgraph bulk -r totalPayLoad.rdf -s ~/dgraph/totalSchema --map_shards=10 --reduce_shards=3 --http localhost:8765 --zero=localhost:5080

Hi, I’ve tried with 3 servers, and the query response still remained the same.

Hi, I’m afraid this solution doesn’t work for me. I want the data to be distributed evenly across all the servers. If I could fit all of the data into any one server, I wouldn’t need more than one server. Please suggest an alternative approach. Thanks.

Are they on 3 different hosts/machines? If you are creating Dgraph instances on the same machine, I don’t see how this would lead to higher performance; one instance could do the job. I believe you’re using different machines, but I’m just asking to make sure.

Another question: are you using SSDs? How much memory is available to each Dgraph Server?

The point is, you can bulk load into one instance and connect the rest of your servers without bulk loading them. Dgraph will eventually sync them through Dgraph Zero; it takes time, but it happens.
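
A minimal sketch of what I mean (file names, ports and memory values are placeholders): bulk load with a single reduce shard, point the first server at that p directory, and start the remaining servers with empty data directories; Zero will move predicates to them over time.

dgraph bulk -r data.rdf.gz -s data.schema --map_shards=10 --reduce_shards=1 --zero=localhost:5080
dgraph server --lru_mb=2048 --zero=localhost:5080 --my=localhost:7080 -o=0 -p out/0/p
dgraph server --lru_mb=2048 --zero=localhost:5080 --my=localhost:7081 -o=1   # starts empty
dgraph server --lru_mb=2048 --zero=localhost:5080 --my=localhost:7082 -o=2   # starts empty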

Update

I ran a new test with the following commands, using the same machine.

Zero

dgraph zero --my=localhost:5080

Bulk

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=1 --zero=localhost:5080 --out NEW

I copied the p folder produced by the bulk load and pasted the same folder onto all the servers.
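
Roughly like this (the per-server directories are just how I laid things out; each server is then started from its own directory, so the default p location picks up the copy):

cp -r NEW/0/p server0/p
cp -r NEW/0/p server1/p
cp -r NEW/0/p server2/p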

For out/0 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7080 -o=0 --bindall

The latency results:

# I started only one server and ran 4 queries.
"processing_ns": 977500 Server - latency: 977μs
"processing_ns": 976500 Server - latency: 977μs
"processing_ns": 976300 Server latency: 3ms
"processing_ns": 976200 Server latency: 977μs

For out/1 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7081 -o=1 --bindall

The latency results:

# I started another server and ran 3 queries.
"processing_ns": 975900  Server latency: 2ms
"processing_ns": 976200, Server latency: 2ms
"processing_ns": 975900, Server latency: 977μs

For out/2 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7082 -o=2 --bindall

The latency results:

"processing_ns": 967300, Server latency: 976μs
"processing_ns": 976200, Server latency: 2ms
"processing_ns": 975900, Server latency: 977μs

This is more of a correctness problem than a performance issue. We should look deeper and confirm whether this is really a bug, i.e. that we’re somehow not picking up the shards built by the bulk loader.

Hi, I have the same problem.

After a bulk import with

dgraph bulk -r rdfs/ -s schema.dgraph --zero=172.23.3.34:32541 --reduce_shards=3 --shufflers=3 --map_shards=3 --num_go_routines=16

I cannot query some of the predicates:

{
  org(func: has(organization.id)) {
    count(uid)
  }
}

response:

{
  "data": {
    "org": [
      {
        "count": 0
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 11087,
      "processing_ns": 699972,
      "encoding_ns": 548135
    },
    "txn": {
      "start_ts": 21,
      "lin_read": {
        "ids": {
          "2": 3
        }
      }
    }
  }
}

I ran the bulk loader several times, and some predicates are always missing.

@selmeci The fix for this issue has been pushed to master. You can refer to this topic here:

