Bulk load returns unexpected results

Hey, I am testing the bulk loader of Dgraph v1.1 and I am running into data loss, and in some configurations the type system does not work correctly.
I tested with 1 and 3 shards.
When shards is 1, the type system works correctly, but the resulting data size is obviously smaller than the original.
When shards is 3, the type system does not work correctly, and the data size is the same as mentioned above.
I have no idea how to resolve this problem. Below are my test procedure and data.

I'm uploading the file to Google Cloud and will provide the source link when it's finished.
UPDATE:
https://drive.google.com/open?id=1ndT1O1EllhL9FY814NCJwJgzWjdtC6zc
Dgraph version output:

[Decoder]: Using assembly version of decoder

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb28
Commit timestamp : 2019-09-04 00:12:51 -0700
Branch : HEAD
Go version : go1.12.7

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph , visit https://discuss.dgraph.io.
To say hi to the community , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2018 Dgraph Labs, Inc.

When reduce_shards is 1:

a.schema:
a.rdf:831MB

clear directory:
rm -rf /data/sdv2/dgraph/data/z && mkdir /data/sdv2/dgraph/data/z && rm -rf /data/sdv2/dgraph/data/0 && mkdir /data/sdv2/dgraph/data/0

start zero:
/data/sdv2/dgraph/opt/dgraph zero --idx 1 --replicas 1 --cwd /data/sdv2/dgraph/data/z --log_dir /data/sdv2/dgraph/data/z --my dl01:5080

do bulk load:
/data/sdv2/dgraph/opt/dgraph bulk \
  --files a.rdf \
  --schema a.schema \
  --format rdf \
  --map_shards 15 \
  --reducers 1 \
  --reduce_shards 1 \
  --num_go_routines 1 \
  --store_xids \
  --logtostderr \
  --v 10 \
  --log_dir log \
  --ignore_errors \
  --zero dl01:5080

The output directory was created, but it's only 359M, much less than a.rdf's 751MB:
out/0/p: 359M

copy p to the alpha's working directory:
cp -r /data/sdv2/dgraph/home/out/0/p /data/sdv2/dgraph/data/0

start alpha:
/data/sdv2/dgraph/opt/dgraph alpha --idx 1 --lru_mb 2048 --zero dl01:5080 --port_offset 1 --cwd /data/sdv2/dgraph/data/0 --log_dir /data/sdv2/dgraph/data/0

start dgraph-ratel and test in browser.
/data/sdv2/dgraph/opt/dgraph-ratel -addr dl01:5080

I know this data is in a.rdf:
_:Q103 <id> "Q103" .
_:Q103 <dgraph.type> "Entity" .
_:Q103 <name> "Supercalifragilisticexpialidocious" .
_:Q103 <name> "超級酷斃宇宙世界霹靂無敵棒" .
_:Q103 <desc> "song from the film and musical Mary Poppins" .
_:Q103 <Tiel> _:Q1860 .
_:Q1860 <id> "Q1860" .
_:Q1860 <dgraph.type> "Entity" .
_:Q1860 <name> "English" .
_:Q1860 <name> "英语" .
_:Q1860 <desc> "West Germanic language originating in England with linguistic roots in French, German and Vulgar Latin" .
_:Q1860 <desc> "起源於英格蘭的一種語言" .
_:Q1860 <alias> "English language" .
_:Q1860 <alias> "en" .
_:Q1860 <alias> "eng" .
_:Q1860 <alias> "英文" .
_:Q1860 <alias> "英語" .
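As a cross-check on the `count(uid)` numbers below, the distinct subjects in the source file can be counted directly. A minimal Python sketch, assuming one triple per line with the subject as the first whitespace-delimited token (the sample lines are a tiny stand-in for a.rdf):

```python
# Count distinct subjects in an N-Triples-style file.
# Assumes one triple per line; the subject is the first token.
def count_subjects(lines):
    subjects = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subjects.add(line.split()[0])
    return len(subjects)

sample = [
    '_:Q103 <dgraph.type> "Entity" .',
    '_:Q1860 <dgraph.type> "Entity" .',
]
print(count_subjects(sample))  # 2 distinct subjects in this excerpt
```

Running it over the whole a.rdf and comparing against the `count(uid)` result shows how many nodes were dropped.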

{
  # Type system works correctly
  q(func: type("Entity")) {
    count(uid)
  }
}
{
  "data": {
    "q": [
      {
        "count": 472755
      }
    ]
  }
}
{
  # Type system works correctly
  q(func: eq(id, "Q103")) {
    expand(_all_)
  }
}

{
  # returned data
  "data": {
    "q": [
      {
        "desc": [
          "song from the film and musical Mary Poppins"
        ],
        "id": "Q103",
        "name": [
          "超級酷斃宇宙世界霹靂無敵棒",
          "Supercalifragilisticexpialidocious"
        ]
      }
    ]
  }
}
{
  # returns nothing, but the Tiel edge is in a.rdf
  q(func: eq(id, "Q103")) {
    Tiel {
      name
    }
  }
}
{
  # returns nothing; node "Q1860" was lost
  q(func: eq(id, "Q1860")) {
    id
    name
  }
}

When reduce_shards is 3:

clear directory:
rm -rf /data/sdv2/dgraph/data/z && mkdir /data/sdv2/dgraph/data/z && rm -rf /data/sdv2/dgraph/data/0 && mkdir /data/sdv2/dgraph/data/0 && rm -rf /data/sdv2/dgraph/data/1 && mkdir /data/sdv2/dgraph/data/1 && rm -rf /data/sdv2/dgraph/data/2 && mkdir /data/sdv2/dgraph/data/2

start zero:
/data/sdv2/dgraph/opt/dgraph zero --idx 1 --replicas 1 --cwd /data/sdv2/dgraph/data/z --log_dir /data/sdv2/dgraph/data/z --my dl01:5080

do bulk load:
/data/sdv2/dgraph/opt/dgraph bulk \
  --files a.rdf \
  --schema a.schema \
  --format rdf \
  --map_shards 15 \
  --reducers 3 \
  --reduce_shards 3 \
  --num_go_routines 3 \
  --store_xids \
  --logtostderr \
  --v 10 \
  --log_dir log \
  --ignore_errors \
  --zero dl01:5080

Three directories were created. Running du -sh out gives 360MB, much less than a.rdf's 751MB:
out/0/p, out/1/p, out/2/p

copy each p to the corresponding alpha's working directory:
cp -r /data/sdv2/dgraph/home/out/0/p /data/sdv2/dgraph/data/0 && cp -r /data/sdv2/dgraph/home/out/1/p /data/sdv2/dgraph/data/1 && cp -r /data/sdv2/dgraph/home/out/2/p /data/sdv2/dgraph/data/2

start alpha:
/data/sdv2/dgraph/opt/dgraph alpha --idx 1 --lru_mb 2048 --zero dl01:5080 --port_offset 1 --cwd /data/sdv2/dgraph/data/0 --log_dir /data/sdv2/dgraph/data/0
/data/sdv2/dgraph/opt/dgraph alpha --idx 2 --lru_mb 2048 --zero dl01:5080 --port_offset 2 --cwd /data/sdv2/dgraph/data/1 --log_dir /data/sdv2/dgraph/data/1
/data/sdv2/dgraph/opt/dgraph alpha --idx 3 --lru_mb 2048 --zero dl01:5080 --port_offset 3 --cwd /data/sdv2/dgraph/data/2 --log_dir /data/sdv2/dgraph/data/2
/data/sdv2/dgraph/opt/dgraph-ratel -addr dl01:5080

test in http://ip:8000/?local
The test queries are the same as in the 1-shard case:

{
  # Type system does not work correctly
  q(func: type("Entity")) {
    count(uid)
  }
}
{
  "data": {
    "q": [
      {
        "count": 0
      }
    ]
  }
}
{
  # Type system does not work correctly
  q(func: eq(id, "Q103")) {
    expand(_all_)
  }
}

{
  # returns nothing, but node Q103 has dgraph.type Entity and the attributes id, name, desc
  "data": {
    "q": []
  },
  "extensions": …
}
{
  # returns nothing
  q(func: eq(id, "Q103")) {
    Tiel {
      name
    }
  }
}

{
  # returns nothing; node "Q1860" was lost
  q(func: eq(id, "Q1860")) {
    id
    name
  }
}

Looks like your a.rdf data file isn’t gzipped, so I’m not too concerned by the size difference. Dgraph stores the data in its own format, not directly as RDF text.

This same issue was reported and fixed: "Can't Query Type Data Inserted by Bulk Loader" (dgraph-io/dgraph#3968 on GitHub). The fix will be in v1.1.1. We're planning a release candidate today that includes it.

I have confirmed that some of the data in the source RDF file goes missing during the bulk load phase, and I can't find it with queries.
I'm not concerned about the size difference either; I only brought it up to emphasize the incorrect result.
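One way to pin down exactly which nodes go missing is to extract every xid from the source file and probe each one individually. A rough sketch, assuming each node's identifier is stored under an `id` predicate (as the eq(id, ...) queries above suggest); the helper names are hypothetical:

```python
import re

# Collect xids (objects of the id predicate) from an N-Triples-style dump,
# then emit one probe query per xid. The <id> predicate name is an assumption.
def extract_xids(lines):
    pat = re.compile(r'<id>\s+"([^"]+)"')
    return sorted({m.group(1) for line in lines for m in pat.finditer(line)})

def probe_query(xid):
    # A DQL query that should match exactly one node if it survived the load.
    return '{ q(func: eq(id, "%s")) { count(uid) } }' % xid

sample = [
    '_:Q103 <id> "Q103" .',
    '_:Q1860 <id> "Q1860" .',
]
for xid in extract_xids(sample):
    print(probe_query(xid))
```

Any xid whose probe returns a count of 0 is a node the bulk loader dropped.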

So bulk loader works for you for a single-group cluster.

If you need data sharding, can you try bulk loader with v1.1.1-rc1 which contains the fix for issue #3968?

Hello, I compiled and tried v1.1.1-rc1 today with 3 shards. The type system now works, i.e.
q(func: type("Entity")) {
  count(uid)
}
returns a value instead of being empty as before. But another problem appeared: type "Entity" has four attributes (id, name, desc, alias), yet a query with expand(_all_) returns only id. And the data loss still exists.
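To see exactly which attributes expand(_all_) is dropping, the keys of its response can be diffed against a query that names the predicates explicitly. A small sketch with hypothetical response payloads, shaped like the JSON responses earlier in this thread:

```python
import json

# Compare the predicate keys returned by two queries for the same node.
def missing_keys(expand_resp, explicit_resp):
    expand_keys = set().union(*(node.keys() for node in expand_resp["data"]["q"]))
    explicit_keys = set().union(*(node.keys() for node in explicit_resp["data"]["q"]))
    return sorted(explicit_keys - expand_keys)

# Hypothetical payloads: expand(_all_) returns only id,
# while the explicit query returns all four attributes.
expand_resp = json.loads('{"data": {"q": [{"id": "Q1860"}]}}')
explicit_resp = json.loads(
    '{"data": {"q": [{"id": "Q1860", "name": ["English"], '
    '"desc": ["..."], "alias": ["en", "eng"]}]}}'
)
print(missing_keys(expand_resp, explicit_resp))  # ['alias', 'desc', 'name']
```

If the explicit query returns name, desc, and alias while expand(_all_) does not, the predicates are in the store and the problem is in expansion, not in the load.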