Bulk load - missing predicates

sriharshaboda · May 24, 2018, 1:37am

Hi, I’m trying out bulk load into a multi-node cluster (1 zero, 3 servers). The dgraph bulk command runs fine and generates as many p directories as there are reducers (3). I then copied these directories [out/0/p, out/1/p, out/2/p] to the three different server nodes and started each server with the -p flag pointing to the correct directory.
But when I query this data, I can see data for only a few predicates; the other predicates are missing. Is there something simple I’m missing here? Should I copy anything else from the bulk output to the server nodes in addition to the out/n/p directories?
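
For reference, the copy step described above can be sketched as a small dry-run planner. This is only an illustration: the host names and the remote path /data/p are hypothetical, and it assumes one server per reduce shard with no replication. It builds the scp commands as strings and does not execute anything.

```python
import shlex


def plan_copy(out_dir, hosts, remote_p_dir="/data/p"):
    """Pair each bulk-loader output shard (out/<i>/p) with one server host.

    hosts and remote_p_dir are hypothetical placeholders, not values from
    the thread. Returns the scp commands as strings without running them.
    """
    commands = []
    for shard, host in enumerate(hosts):
        src = f"{out_dir}/{shard}/p"
        dst = f"{host}:{remote_p_dir}"
        parts = ["scp", "-r", src, dst]
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return commands


if __name__ == "__main__":
    # Three hypothetical hosts, one per reduce shard.
    for cmd in plan_copy("out", ["server-0", "server-1", "server-2"]):
        print(cmd)
```

Printing the plan first makes it easy to confirm that each server receives a different shard directory before anything is copied.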

MichelDiz · May 24, 2018, 8:35pm

Please show your bulk config.
Are the RDFs you’re using public? Can you share them?
Are all servers running at the same time?
Can you check manually, with a text editor, whether the RDFs contain the predicates you are looking for?

Cheers

sriharshaboda · May 25, 2018, 1:10am

I can confirm that there is no issue with the RDFs, because I’ve tried loading the same RDF with a single reducer and everything was fine: I was able to find all predicates. But when I do a bulk load with more than one reduce shard, things don’t work. Here is my bulk config:
dgraph bulk -r totalPayLoad.rdf -s ~/dgraph/totalSchema --map_shards=10 --reduce_shards=2 --http localhost:8765 --zero=localhost:5080

And yes, all the servers are running together.

MichelDiz · May 25, 2018, 1:40am

If you want me to test and report the error, I need access to the RDFs so that I can reproduce it in the same context you describe. Please share a link to them.

sriharshaboda · May 25, 2018, 3:40am

The data is not publicly shareable. I will try to reproduce the issue on the 21million movies dataset and get back to you.

sriharshaboda · May 25, 2018, 4:32am

I’ve tried the same approach using 21million.rdf.gz and 21million.schema from the benchmarks/data directory (master branch) of the dgraph-io/benchmarks repository on GitHub, and I was able to replicate the issue. I can only see a few predicates.

I’m sharing the zero’s state obtained from {zero’s-ip}:6080/state endpoint

https://drive.google.com/file/d/1NgoVt7Et99eY-5U9JqX9s-rIKRqd8V2R/view?usp=sharing

We can clearly see that only a few predicates occupy significant space, and those are the ones available for querying.
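
One way to read such a state dump is to list which predicates (tablets) Zero has assigned to each group, largest first. The sketch below assumes the /state JSON has a top-level "groups" object whose entries carry a "tablets" map with "predicate" and "space" fields. That layout is an assumption about the Dgraph version in use, and the sample values are invented for illustration rather than taken from the shared file.

```python
import json


def tablets_by_group(state):
    """Return {group_id: [(predicate, space_in_bytes), ...]} from a /state dump.

    Assumes state["groups"][gid]["tablets"][name] has "predicate" and "space"
    keys; "space" may arrive as a string, so it is converted with int().
    """
    result = {}
    for gid, group in state.get("groups", {}).items():
        tablets = group.get("tablets") or {}
        rows = [(t.get("predicate", name), int(t.get("space", 0)))
                for name, t in tablets.items()]
        # Largest tablets first, so near-empty predicates stand out at the end.
        result[gid] = sorted(rows, key=lambda row: row[1], reverse=True)
    return result


# Invented sample in the assumed shape (not the real state from the thread).
SAMPLE = json.loads("""
{"groups": {
  "1": {"tablets": {"name": {"groupId": 1, "predicate": "name", "space": "52428800"},
                    "genre": {"groupId": 1, "predicate": "genre", "space": "0"}}},
  "2": {"tablets": {"writer.film": {"groupId": 2, "predicate": "writer.film", "space": "1048576"}}}
}}
""")

for gid, rows in tablets_by_group(SAMPLE).items():
    print(gid, rows)
```

Predicates that appear with a space of 0, or that do not appear in any group at all, would be the first candidates to check against the schema.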

The following query returns only the name and writer.film predicates (which, as the zero’s state shows, occupy a decent amount of space):

{
  getPredicates(func: eq(name@en, "Steven Spielberg")) {
    predicate
  }
}

Result:

{
  "data": {
    "getPredicates": [
      {
        "predicate": [
          "name"
        ]
      },
      {
        "predicate": [
          "name"
        ]
      },
      {
        "predicate": [
          "name",
          "writer.film"
        ]
      },
      {
        "predicate": [
          "name"
        ]
      }
    ]
  }
}
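
To compare a response like this against the schema, the per-node predicate lists can be flattened into one sorted list. A minimal sketch, using the response above as input; the helper name is made up for this example.

```python
def distinct_predicates(response, block="getPredicates"):
    """Collect the distinct predicate names listed across all matched nodes."""
    nodes = response.get("data", {}).get(block, [])
    found = set()
    for node in nodes:
        found.update(node.get("predicate", []))
    return sorted(found)


# The response from the post above, rewritten as a Python literal.
RESPONSE = {
    "data": {
        "getPredicates": [
            {"predicate": ["name"]},
            {"predicate": ["name"]},
            {"predicate": ["name", "writer.film"]},
            {"predicate": ["name"]},
        ]
    }
}

print(distinct_predicates(RESPONSE))  # ['name', 'writer.film']
```

Subtracting this list from the predicate names declared in the schema file would give the predicates that are missing for the matched nodes.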

Command I used for bulk load is:

dgraph bulk -r ~/dgraph/21million.rdf.gz -s ~/dgraph/21million.schema --map_shards=10 --reduce_shards=2 --http localhost:8765 --zero=localhost:5080

{
  "RDFDir": "~/dgraph/21million.rdf.gz",
  "SchemaFile": "~/dgraph/21million.schema",
  "DgraphsDir": "out",
  "TmpDir": "tmp",
  "NumGoroutines": 4,
  "MapBufSize": 67108864,
  "ExpandEdges": true,
  "SkipMapPhase": false,
  "CleanupTmp": true,
  "NumShufflers": 1,
  "Version": false,
  "StoreXids": false,
  "ZeroAddr": "localhost:5080",
  "HttpAddr": "localhost:8765",
  "MapShards": 10,
  "ReduceShards": 2
}

MichelDiz · May 25, 2018, 10:20pm

Hey, sorry for the delay.

I believe your case is related to this issue. Seems the same to me.

github.com/dgraph-io/dgraph

Data missing in Dgraph cluster after bulk loading

opened 03:54PM - 15 Feb 18 UTC

closed 12:48AM - 23 Apr 19 UTC

veludurai106

kind/bug

DGraph Version - 1.0.3
OS - Centos 7
Steps to reproduce the issue …
- Started zero server:
  nohup dgraph zero --my=10.111.111.101:5080 --replicas=3 --idx=01 &
- Ran bulk loader:
  dgraph bulk -r /home/mapr/DgraphResources/sample.rdf -s /home/mapr/DgraphResources/sample.schema --map_shards=6 --reduce_shards=3 -z 10.111.111.101:5080
  [sample.schema.txt](https://github.com/dgraph-io/dgraph/files/1728240/sample.schema.txt) [sample.rdf.txt](https://github.com/dgraph-io/dgraph/files/1728248/sample.rdf.txt)
- Copied the p folders to all 3 nodes (/opt/dgraph/data)
- Started dgraph server in 3 nodes (inside /opt/dgraph/data):
  nohup dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero=10.111.111.101:5080 &
  nohup dgraph server --memory_mb=16000 --my=10.111.111.104:7080 --zero=10.111.111.101:5080 &
  nohup dgraph server --memory_mb=16000 --my=10.111.111.107:7080 --zero=10.111.111.101:5080 &
- Expected behaviour: I will get result if I query any of the 3 nodes (using IP:port/query API)
- Actual behaviour: Only one node returns partial result. whereas when I query other 2 nodes, I am getting empty result.
End point - http://10.111.111.107:8080/query?debug=true
Post Body - [query.txt](https://github.com/dgraph-io/dgraph/files/1728295/query.txt)
Response - [response.txt](https://github.com/dgraph-io/dgraph/files/1728327/response.txt)
Result of cluster state api - /state [cluster state.txt](https://github.com/dgraph-io/dgraph/files/1728360/cluster.state.txt)
Node: I initially tried with ~50M edges, faced same issue. Please help me to understand what is missing!.

I would recommend the following (Pawan recommends something similar). If you want to use sharded instances, the first instance should hold the complete data set and the others (maybe) the reduced shards; the instances will then communicate with each other as needed. Otherwise, do not use --reduce_shards=2. Set it to 1 and then copy the output to the other Dgraph Servers, if you want those servers to act as replicas.

In the issue from “veludurai106”, Pawan recommends replicating the folder without reducing it. That way you will have complete replicas.

My recommendation would be the following process:

First, create a bulk output for the “leader” (original shard).

Terminal 1 => Start Zero (and keep it running the whole time):

dgraph zero

Terminal 2 => Start the bulk load:

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=1 --zero=localhost:5080

You will get only an out/0 output. Move it to an organized directory, or handle it as you wish.

Then create the reduced shards.

Terminal 2 => Start another bulk load:

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=2 --shufflers=2 --zero=localhost:5080

PS: the default for --shufflers is 1.

You will get two outputs. Move them to an organized directory, or handle them as you wish.

Now you have 3 outputs: one complete output that will be your leader, and two reduced ones.

Start your Dgraph Servers:

Terminal 2 => Start the “leader” Dgraph Server:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7080 -o=0 -p leader

Terminal 3 => Start Dgraph Server 2:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7081 -o=1 -p 0

Terminal 4 => Start Dgraph Server 3:

dgraph server --lru_mb=12288 --zero=localhost:5080 --my=localhost:7082 -o=2 -p 1

If anything here is wrong, someone please correct me. I know only the basics of the Bulk Loader.

MichelDiz · May 25, 2018, 10:31pm

But there should be 3 shards here, since you are using 3 servers. Dgraph is normally run with odd counts, never even ones.

All Dgraph configs must be odd.

Try the following: create three shards and start them as in the example above. It should certainly work.

dgraph bulk -r totalPayLoad.rdf -s ~/dgraph/totalSchema --map_shards=10 --reduce_shards=3 --http localhost:8765 --zero=localhost:5080

sriharshaboda · May 30, 2018, 5:38am

Hi, I’ve tried with 3 servers, and the query response still remained the same.

sriharshaboda · May 30, 2018, 5:42am

Hi, I’m afraid this solution doesn’t work for me. I want the data to be distributed evenly across all the servers. If I could fit all of the data into any one server, I wouldn’t need more than one server. Please suggest an alternative approach. Thanks.

MichelDiz · May 30, 2018, 10:18pm

Are they on 3 different hosts/machines? If you are creating the Dgraph instances on the same machine, I do not see how this could lead to higher performance; one instance could do the job. I believe you’re using different machines, but I’m asking to make sure.

Also, are you using SSDs? How much memory is available to each Dgraph Server?

The point is, you can bulk load one instance and connect the rest of your servers without bulk loading them. Eventually Dgraph will sync them through Dgraph Zero. It takes time, but it will.

Update

I ran a new test with the following commands, all on the same machine.

Zero

dgraph zero --my=localhost:5080

Bulk

dgraph bulk -r 21million.rdf.gz -s release.schema --map_shards=10 --reduce_shards=1 --zero=localhost:5080 --out NEW

I copied the resulting ‘p’ folder from the bulk output to all servers.

For out/0 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7080 -o=0 --bindall

The latency results:

# I started only one server and ran 4 queries.
"processing_ns": 977500, Server latency: 977μs
"processing_ns": 976500, Server latency: 977μs
"processing_ns": 976300, Server latency: 3ms
"processing_ns": 976200, Server latency: 977μs

For out/1 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7081 -o=1 --bindall

The latency results:

# I started another server and ran 3 queries.
"processing_ns": 975900, Server latency: 2ms
"processing_ns": 976200, Server latency: 2ms
"processing_ns": 975900, Server latency: 977μs

For out/2 folder:

dgraph server --lru_mb=1024 --zero=localhost:5080 --my=localhost:7082 -o=2 --bindall

The latency results:

"processing_ns": 967300, Server latency: 976μs
"processing_ns": 976200, Server latency: 2ms
"processing_ns": 975900, Server latency: 977μs
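
The processing_ns values are in nanoseconds, so a quick conversion makes them easier to compare with the reported server latency. A minimal sketch; the helper name is made up for illustration, and the samples are a selection of the processing_ns values listed above.

```python
def format_ns(ns):
    """Render a nanosecond duration in the largest unit that keeps it >= 1."""
    for unit, factor in (("s", 1_000_000_000), ("ms", 1_000_000), ("μs", 1_000)):
        if ns >= factor:
            return f"{ns / factor:.1f}{unit}"
    return f"{ns}ns"


# processing_ns samples copied from the three runs above.
samples = [977500, 976500, 976300, 976200, 975900, 967300]
for ns in samples:
    print(ns, "->", format_ns(ns))
```

Every sample comes out just under one millisecond, which suggests the processing time is essentially the same on all three servers in this test.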

mrjn · June 1, 2018, 9:11pm

This is more of a correctness problem than a performance issue. We should look deeper and confirm whether this is really a bug, i.e., that we are somehow not picking up the shards built by the bulk loader.

selmeci · June 2, 2018, 6:20pm

Hi, I have the same problem.

After bulk import

dgraph bulk -r rdfs/ -s schema.dgraph --zero=172.23.3.34:32541 --reduce_shards=3 --shufflers=3 --map_shards=3 --num_go_routines=16

I cannot query some predicates

{
  org(func: has(organization.id)) {
    count(uid)
  }
}

response:

{
  "data": {
    "org": [
      {
        "count": 0
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 11087,
      "processing_ns": 699972,
      "encoding_ns": 548135
    },
    "txn": {
      "start_ts": 21,
      "lin_read": {
        "ids": {
          "2": 3
        }
      }
    }
  }
}

I ran the bulk loader several times, and some predicates are always missing.

sriharshaboda · June 26, 2018, 11:15am

@selmeci The fix for this issue has been pushed to master. You can refer to the issue here:

github.com/dgraph-io/dgraph

Data missing in Dgraph cluster after bulk loading


system · July 26, 2018, 11:15am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
