Crash when online write

Moved from GitHub /5752

Posted by JimWen:

What version of Dgraph are you using?

Dgraph version : v20.03.1
Dgraph SHA-256 : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1 : c201611d6
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch : HEAD
Go version : go1.14.1

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

128G mem & 1.8T SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

Online write data through java api or liveload, it’s ok at first days, but crash after some days.

Expected behaviour and actual result.

The crash log is as followings, it seems that this is related to badger.

panic: runtime error: slice bounds out of range [4172166119:1963281920]

goroutine 25348428 [running]:
github.com/dgraph-io/badger/v2/table.(*Builder).addHelper(0xc01fc8ccc0, 0xc154c416c0, 0x1c, 0x20, 0x440, 0x0, 0xc203897e8d, 0x55, 0xe7, 0x0, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/table/builder.go:222 +0x4a1
github.com/dgraph-io/badger/v2/table.(*Builder).Add(0xc01fc8ccc0, 0xc154c416c0, 0x1c, 0x20, 0x440, 0x0, 0xc203897e8d, 0x55, 0xe7, 0x0, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/table/builder.go:339 +0xe0
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables(0xc000322070, 0x2, 0x1c394a0, 0xc02be6c820, 0xc0005d6960, 0xc0005d69c0, 0xc1225a9c40, 0x1, 0x1, 0xc1225a9c48, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:607 +0x92a
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef(0xc000322070, 0x2, 0x1c394a0, 0xc02be6c820, 0xc0005d6960, 0xc0005d69c0, 0xc1225a9c40, 0x1, 0x1, 0xc1225a9c48, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:835 +0xc6
github.com/dgraph-io/badger/v2.(*levelsController).doCompact(0xc000322070, 0x2, 0x3ff03c83ae800000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:904 +0x4b7
github.com/dgraph-io/badger/v2.(*levelsController).runWorker(0xc000322070, 0xc0040d1800)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:365 +0x319
created by github.com/dgraph-io/badger/v2.(*levelsController).startCompact
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:340 +0x88

jarifibrahim commented :

@JimWen can you show me the contents of the p directory on which the crash happened? From the information in the crash it looks like the builder was trying to build a table with around 4.0G size (4172166119 == 4.0G). The usual size for tables is 70 MB so this size is definitely odd.

I want to see the size of the other SST files in this p directory.

JimWen commented :

@jarifibrahim I’m afraid that i can’t show you the contents of the p directory because i have to clear all the alpha data and reimport data to solve this problem .

jarifibrahim commented :

@JimWen Does the crash show up when you bulk load a lot of data like 5 billion nquads? I’ll try to insert 5 billion nquads in dgraph and see how it goes.

JimWen commented :

@JimWen Does the crash show up when you bulk load a lot of data like 5 billion nquads? I’ll try to insert 5 billion nquads in dgraph and see how it goes.

No, just happen when liveload or online mutation through api.

jarifibrahim commented :

No, just happen when liveload or online mutation through api.

Got it. I’m running dgraph on 2billion rdf. Let me see if it goes through.

jarifibrahim commented :

@JimWen my 2billion test didn’t crash. Do you have any suggestions or maybe specifics what you did so that I can reproduce the failure? I haven’t tried using the java API because that shouldn’t have any effect on this crash.

JimWen commented :

@JimWen my 2billion test didn’t crash. Do you have any suggestions or maybe specifics what you did so that I can reproduce the failure? I haven’t tried using the java API because that shouldn’t have any effect on this crash.

If the schema is like
type User {
follower:uid
}
But actually the write data is [uid] which means 1-N, would this just cause data lose or may cause crash?

  1. It works on Ludicrous Mode, would this cause crash?
  2. I notice that it aways crash when i stop write for a long time and restart the cluster alpha node, then it works fine for a while and crash.
    @jarifibrahim

JimWen commented :

@jarifibrahim
it crash again after wrting on line for 7days using v20.03.3, the total data is about 7 billion edges. it works on Ludicrous Mode and crash log is as followings:

panic: runtime error: slice bounds out of range [:4172166156] with capacity 1963281920

goroutine 21418 [running]:
github.com/dgraph-io/badger/v2/table.(*Builder).addBlockToIndex(…)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:296
github.com/dgraph-io/badger/v2/table.(*Builder).finishBlock(0xc1b382c180)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:290 +0x64d
github.com/dgraph-io/badger/v2/table.(*Builder).Add(0xc1b382c180, 0xc18ac559c0, 0x1e, 0x20, 0x440, 0x0, 0xc13621a025, 0x154, 0x13db, 0x0, …)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:333 +0xf8
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, …)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:613 +0x90b
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, …)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:862 +0xe1
github.com/dgraph-io/badger/v2.(*levelsController).doCompact(0xc0001c2070, 0x2, 0x3ff0303383000000, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:931 +0x49d
github.com/dgraph-io/badger/v2.(*levelsController).runWorker(0xc0001c2070, 0xc01e3893e0)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:368 +0x2fd
created by github.com/dgraph-io/badger/v2.(*levelsController).startCompact
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:343 +0x88
“log/alpha1.log” 29871L, 2672378C

And i also check the p directory, there is no table near 4G size, top size file list is as followings

-rw------- 1 work work 70M Jul 4 13:51 001918.vlog
-rw------- 1 work work 70M Jul 5 16:51 002277.vlog
-rw------- 1 work work 70M Jul 4 14:41 001940.vlog
-rw------- 1 work work 70M Jul 6 12:41 002565.vlog
-rw------- 1 work work 70M Jul 4 09:00 001796.vlog
-rw------- 1 work work 70M Jul 4 16:51 001994.vlog
-rw------- 1 work work 70M Jul 4 13:40 001914.vlog
-rw------- 1 work work 70M Jul 4 09:11 001800.vlog
-rw------- 1 work work 70M Jul 6 12:51 002569.vlog
-rw------- 1 work work 70M Jul 4 15:50 001970.vlog
-rw------- 1 work work 70M Jul 4 16:20 001982.vlog
-rw------- 1 work work 70M Jul 5 16:01 002260.vlog
-rw------- 1 work work 70M Jul 4 16:30 001986.vlog
-rw------- 1 work work 70M Jul 4 16:10 001978.vlog
-rw------- 1 work work 70M Jul 5 19:01 002314.vlog
-rw------- 1 work work 70M Jul 5 15:01 002239.vlog
-rw------- 1 work work 70M Jul 5 17:01 002280.vlog
-rw------- 1 work work 70M Jul 5 11:41 002178.vlog
-rw------- 1 work work 70M Jul 5 16:31 002270.vlog
-rw------- 1 work work 70M Jul 5 16:21 002267.vlog
-rw------- 1 work work 70M Jul 5 09:41 002133.vlog
-rw------- 1 work work 70M Jul 5 21:41 002352.vlog
-rw------- 1 work work 70M Jul 4 17:01 001998.vlog
-rw------- 1 work work 70M Jul 4 17:21 002005.vlog
-rw------- 1 work work 70M Jul 3 20:00 001714.vlog
-rw------- 1 work work 70M Jul 5 21:01 002342.vlog
-rw------- 1 work work 70M Jul 3 20:51 001726.vlog
-rw------- 1 work work 71M Jul 5 14:01 002218.vlog
-rw------- 1 work work 71M Jul 4 08:50 001793.vlog
-rw------- 1 work work 71M Jul 4 12:01 001885.vlog
-rw------- 1 work work 71M Jul 5 12:41 002195.vlog
-rw------- 1 work work 71M Jul 4 17:40 002012.vlog
-rw------- 1 work work 71M Jul 3 12:21 001499.vlog
-rw------- 1 work work 71M Jul 4 12:51 001899.vlog
-rw------- 1 work work 71M Jul 3 18:51 001696.vlog
-rw------- 1 work work 71M Jul 5 12:21 002190.vlog
-rw------- 1 work work 71M Jul 3 19:20 001704.vlog
-rw------- 1 work work 71M Jul 4 12:11 001888.vlog
-rw------- 1 work work 71M Jul 2 17:31 001216.vlog
-rw------- 1 work work 71M Jul 4 18:20 002024.vlog
-rw------- 1 work work 72M Jul 4 10:50 001851.vlog
-rw------- 1 work work 146M Jul 6 11:50 036406.sst
-rw------- 1 work work 202M Jul 6 12:01 036593.sst
-rw------- 1 work work 249M Jul 6 10:01 034726.sst
-rw------- 1 work work 327M Jul 5 22:42 032620.sst
-rw------- 1 work work 422M Jul 5 11:21 029259.sst
-rw------- 1 work work 459M Jul 5 14:40 030111.sst
-rw------- 1 work work 501M Jul 6 10:01 034725.sst
-rw------- 1 work work 719M Jul 6 10:56 035595.sst
-rw------- 1 work work 1.1G Jul 6 10:12 034913.sst
-rw------- 1 work work 1.3G Jul 6 12:02 036595.sst

And the zero state is as followings

{“counter”:“27638547”,“groups”:{“1”:{“members”:{“1”:{“id”:“1”,“groupId”:1,“addr”:“172.21.46.35:7080”,“leader”:true,“lastUpdate”:“1593611550”}},“tablets”:{“chl”:{“groupId”:1,“predicate”:“chl”,“space”:“4695775026”},“dgraph.graphql.schema”:{“groupId”:1,“predicate”:“dgraph.graphql.schema”},“dgraph.graphql.xid”:{“groupId”:1,“predicate”:“dgraph.graphql.xid”},“dgraph.type”:{“groupId”:1,“predicate”:“dgraph.type”,“space”:“41421216460”},“extmsg”:{“groupId”:1,“predicate”:“extmsg”},“from”:{“groupId”:1,“predicate”:“from”,“space”:“223127171705”},“net”:{“groupId”:1,“predicate”:“net”,“space”:“4220898548”},“title”:{“groupId”:1,“predicate”:“title”,“space”:“12325089358”},“to”:{“groupId”:1,“predicate”:“to”,“space”:“13864305792”},“withExpect”:{“groupId”:1,“predicate”:“withExpect”,“space”:“10265741211”}},“snapshotTs”:“54649802”,“checksum”:“3759641937265695043”},“2”:{“members”:{“2”:{“id”:“2”,“groupId”:2,“addr”:“172.21.46.43:7080”,“leader”:true,“lastUpdate”:“1593611555”}},“tablets”:{“aid”:{“groupId”:2,“predicate”:“aid”,“space”:“12185695440”},“gid_gps”:{“groupId”:2,“predicate”:“gid_gps”,“space”:“2246383293”},“gid_ip”:{“groupId”:2,“predicate”:“gid_ip”,“space”:“4275951255”},“gid_user”:{“groupId”:2,“predicate”:“gid_user”,“space”:“10186607933”},“src”:{“groupId”:2,“predicate”:“src”,“space”:“9066055157”}},“snapshotTs”:“54848420”,“checksum”:“16815000284943703647”},“3”:{“members”:{“3”:{“id”:“3”,“groupId”:3,“addr”:“172.21.46.44:7080”,“leader”:true,“lastUpdate”:“1593611557”}},“tablets”:{“gid_device”:{“groupId”:3,“predicate”:“gid_device”,“space”:“5483695993”},“gid_did”:{“groupId”:3,“predicate”:“gid_did”,“space”:“304280668”},“gid_idfa”:{“groupId”:3,“predicate”:“gid_idfa”,“space”:“1199109308”},“gid_wifi”:{“groupId”:3,“predicate”:“gid_wifi”,“space”:“1457055681”},“loc”:{“groupId”:3,“predicate”:“loc”,“space”:“16879986572”},“status”:{“groupId”:3,“predicate”:“status”,“space”:“354027403”},“t”:{“groupId”:3,“predicate”:“t”,“space”:“10332890221”}},“snapshotTs”:“55094804”,“checksum”:“3247556968549744124”},“4”:{“members”:{“4”:{“id”:“4”,“groupId”:4,“addr”:“172.21.46.59:7080”,“leader”:true,“lastUpdate”:“1593611559”}},“tablets”:{“gid_brand”:{“groupId”:4,“predicate”:“gid_brand”,“moveTs”:“39”},“gid_job”:{“groupId”:4,“predicate”:“gid_job”,“space”:“3165219418”},“gid_mac”:{“groupId”:4,“predicate”:“gid_mac”,“space”:“1763035672”},“msg”:{“groupId”:4,“predicate”:“msg”,“space”:“176492074”},“pl”:{“groupId”:4,“predicate”:“pl”,“space”:“9285853378”},“withDID”:{“groupId”:4,“predicate”:“withDID”,“space”:“114209649”},“withDeviceId”:{“groupId”:4,“predicate”:“withDeviceId”,“space”:“14719347936”},“withIdfa”:{“groupId”:4,“predicate”:“withIdfa”,“moveTs”:“24”},“withIp”:{“groupId”:4,“predicate”:“withIp”,“space”:“14052854988”},“withMac”:{“groupId”:4,“predicate”:“withMac”,“space”:“3683865987”}},“snapshotTs”:“54637590”,“checksum”:“876440663310665151”},“5”:{“members”:{“5”:{“id”:“5”,“groupId”:5,“addr”:“172.21.46.65:7080”,“leader”:true,“lastUpdate”:“1593611561”}},“tablets”:{“bg”:{“groupId”:5,“predicate”:“bg”,“space”:“10036594964”},“gid_company”:{“groupId”:5,“predicate”:“gid_company”,“moveTs”:“40”},“gid_expect”:{“groupId”:5,“predicate”:“gid_expect”,“space”:“5837675755”},“gid_interview”:{“groupId”:5,“predicate”:“gid_interview”,“space”:“174559486”},“withBrand”:{“groupId”:5,“predicate”:“withBrand”,“moveTs”:“28”},“withCompany”:{“groupId”:5,“predicate”:“withCompany”,“moveTs”:“29”},“withGps”:{“groupId”:5,“predicate”:“withGps”,“space”:“7782805485”},“withInterview”:{“groupId”:5,“predicate”:“withInterview”,“space”:“221288458”},“withJob”:{“groupId”:5,“predicate”:“withJob”,“space”:“10772862262”},“withWifi”:{“groupId”:5,“predicate”:“withWifi”,“space”:“4211920663”}},“snapshotTs”:“54489876”,“checksum”:“11427293922357654662”}},“zeros”:{“1”:{“id”:“1”,“addr”:“172.21.46.35:5080”,“leader”:true}},“maxLeaseId”:“15000000000000020000”,“maxTxnTs”:“55300000”,“maxRaftId”:“5”,“cid”:“dca06408-70ba-4f46-ae60-7df73f943dd9”,“license”:{“maxNodes”:“18446744073709551615”,“expiryTs”:“1596203542”,“enabled”:true}}

And here is my exec param

if [ $# -lt 2 ]; then
echo ‘too less input, at least 2’
exit 1
fi

type=$1
offset=$2

echo “type:$type, offset:$offset”

path=./log
if [ ! -d $path ]; then
mkdir $path
fi

replica=1
zero_1=172.21.46.35
dgpath=/data/app/Dgraph/bin/dgraph
uipath=/data/app/Dgraph/bin/dgraph-ratel
cfpath=/data/app/Dgraph/config/config.json

local_host="hostname --fqdn"
local_ip=/sbin/ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v inet6|awk '{print $2}'|tr -d "addr:"

echo “ip:$local_ip”

case $type in
zero1)
index=expr $offset + 1
port=expr $offset + 5080
nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --rebalance_interval 12000h > ./log/zero$index.log 2>&1 &
;;
zero)
index=expr $offset + 1
port=expr $offset + 5080
nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --peer $zero_1:5080 > ./log/zero$index.log 2>&1 &
;;
alpha)
index=expr $offset + 1
port=expr $offset + 7080
nohup $dgpath alpha --enable_sentry=false --ludicrous_mode --snapshot_after=1000000 --abort_older_than 15m --my=$local_ip:$port -o $offset --zero=$zero_1:5080 --config $cfpath --cwd alpha$index > ./log/alpha$index.log 2>&1 &
;;
ratel)
index=expr $offset + 1
port=expr $offset + 8000
nohup $uipath -addr “http://$local_ip:$port” -port $port > ./log/web$index.log 2>&1 &
;;
*)
echo ‘wrong type input’
;;
esac

jarifibrahim commented :

@JimWen can you share alpha logs as well?

JimWen commented :

@JimWen can you share alpha logs as well?

Sorry to that i have clear all the data to run it not on Ludicrous Mode . I notice nothing unnormal except the crash log in that file.

jarifibrahim commented :

@JimWen Can you save the big SST files and the logs next time it crashes? I’d like to look at the SST files which are over 500 MB. These files should be 65 MB by default. If the size goes anywhere around 4 GB, badger will crash because of uint32 overflow.

JimWen commented :

@jarifibrahim ok, i will save it next time.

jarifibrahim commented :

@JimWen would you be able to share your rdfs? It might be possible that this crash is because of some specific input. I don’t know how plausible is that but I’d like to explore that option as well.
You can send it to Ibrahim[at]dgraph.io

JimWen commented :

@JimWen would you be able to share your rdfs? It might be possible that this crash is because of some specific input. I don’t know how plausible is that but I’d like to explore that option as well.
You can send it to Ibrahim[at]dgraph.io

@jarifibrahim Sorry, i‘m afraid that i can’t share these business data.

lgalatin commented :

hi @JimWen We would like to setup a call with you to see if we can further troubleshoot this problem. Can I contact you on your email to set up a session with @jarifibrahim?

JimWen commented :

@jarifibrahim do we have any tools to help inspect keys in sst file? It also crash when running not on Ludicrous Mode, and is it normal if sst file size is bigger than 65 MB?

jarifibrahim commented :

@JimWen dgraph debug tool can show the keys and its metadata. It’s not normal to have an sst file larger than 65 MB. The SST are bigger than 65MB only if you have too many duplicate keys. To get a ~3 GB of sst file, you’ll need too many duplicate keys (which doesn’t happen under normal usage).

Badger uses uint32 at multiple places and a values bigger than 4 GB will cause the uin32 to overflow and cause a crash.