Dgraph v20.11 running in production suddenly crashed. The error message is in the attachment. Please help urgently!

The Dgraph instance running in production suddenly crashed!

unexpected fault address 0x7f1fdf99a000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f1fdf99a000 pc=0xa9e21e]

dgraph.log (44.4 KB)

It would probably help to detail your environment.

The reason is that we ran out of disk space. The puzzle is that we have three nodes, one of which has a p directory of up to 1.1 TB, while the other two consume very little disk; the data is completely unevenly distributed. Doesn’t Dgraph support load balancing internally? What can I do to get the data evenly distributed across the nodes?

[root@yzsjhl19-91 data2]# du -sh  *|grep 'p$'
333G	p

[root@yzsjhl19-92 ~]# du -sh  *|grep 'p$'
12G	p

[root@yzsjhl30-26 data2]# du -sh  *|grep 'p$'
1.1T	p

Is your data heavily based on a single predicate? Data should be spread across groups by predicate automatically.
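For what it’s worth, when tablets really are skewed across groups, Zero can be told to move one by hand via its /moveTablet admin endpoint (this only helps if the alphas are in different groups; three replicas of one group hold the same data set). A minimal sketch — the 6080 port is the default Zero HTTP port, and the predicate and target group here are just examples:

```python
# Sketch: build the Dgraph Zero admin URL that reassigns a tablet
# (predicate) to another group. Ports/names are assumptions; adjust
# for your deployment. Issuing a GET to the URL triggers the move.
from urllib.parse import urlencode

def move_tablet_url(zero_addr: str, tablet: str, group: int) -> str:
    """Return the /moveTablet URL for moving `tablet` to `group`."""
    return f"http://{zero_addr}/moveTablet?{urlencode({'tablet': tablet, 'group': group})}"

url = move_tablet_url("localhost:6080", "ugcId", 2)
print(url)
# e.g. urllib.request.urlopen(url) would perform the move on a live cluster.
```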

This is production user data, so it should not all be on one predicate. My schema is as follows; which predicates does the balancing calculation actually use?

<forwardCount>: int .
<forwardFrom>: uid @reverse .
<isDeleted>: int .
<isRoot>: int .
<rootUgcId>: int @index(int) .
<test>: int @index(int) .
<type>: string @index(term) .
<ugcId>: int @index(int) .
<ugcUid>: int @index(int) .
<updateTime>: int .

type <RR_UGC> {
	forwardFrom
	forwardCount
	isDeleted
	isRoot
	rootUgcId
	ugcId
	ugcUid
	createTime
	updateTime
}

Dgraph should auto-balance the predicates. I was thinking that maybe most of your data was on a single predicate, but that doesn’t appear to be the case. I don’t have any advice on what to try from here; I don’t do server config for Dgraph anymore, so I don’t know whether there are other commands you could run to debug this further.

See if this helps: Unbalanced disk usage - #3 by dmai

I don’t know who to tag to help more here. Maybe @dmai ?

UncompressedBytes is relatively large. Which parameter can be used to reduce it?

Tablet sizes reported by Zero (all tablets: groupId 1, force=false, remove=false, readOnly=false, moveTs=0):

predicate           onDiskBytes    uncompressedBytes
dgraph.type           364537219           1829024082
dgraph.user.group             0                    0
dgraph.xid                    0                    0
forwardCount         1632884892           5122086792
forwardFrom          1767072417           3287967454
isDeleted            1130632662           4013628802
isRoot               1144710246           3922396123
rootUgcId            1789223434           4169587811
test                          0                    0
type                 3286410623           3647302252
ugcId                3194198127           6839584343
ugcUid               2197521869           5312250272
updateTime           1040495592           2550994861

snapshotTs=329122847, checksum=4895759682130352111, checkpointTs=0; zeros: {id 1, addr 10.4.19.91:5180, leader=true}; maxLeaseId=110552334, maxTxnTs=329140000, maxRaftId=3, cid=ad22c31d-11e2-4a2f-a8f1-955eb5230deb; license: maxNodes=18446744073709551615, expiryTs=1611302192, enabled=false.
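As an aside, uncompressedBytes is the raw size of a tablet’s data before compression, while onDiskBytes is what it actually occupies; there is no knob that shrinks the raw size itself. A small sketch of ranking tablets from Zero’s /state output — the values are copied from the dump above, and on a live cluster you would fetch the JSON from Zero (by default on port 6080) instead:

```python
import json

# Sketch: rank tablets by raw (uncompressed) size from a /state fragment.
# The three entries below are sample values taken from the dump above.
state_fragment = """
{
  "ugcId":  {"predicate": "ugcId",  "onDiskBytes": "3194198127", "uncompressedBytes": "6839584343"},
  "type":   {"predicate": "type",   "onDiskBytes": "3286410623", "uncompressedBytes": "3647302252"},
  "ugcUid": {"predicate": "ugcUid", "onDiskBytes": "2197521869", "uncompressedBytes": "5312250272"}
}
"""

tablets = json.loads(state_fragment)
for name, t in sorted(tablets.items(),
                      key=lambda kv: int(kv[1]["uncompressedBytes"]),
                      reverse=True):
    print(f'{name:8s} disk={int(t["onDiskBytes"]) / 2**30:5.2f} GiB '
          f'raw={int(t["uncompressedBytes"]) / 2**30:5.2f} GiB')
```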

@llooper-dev,
What version of Dgraph are you using?
Are you creating+deleting some predicates continuously?

Dgraph version : v20.11.0

Users do create data in the course of normal usage. My question now is: why isn’t the data stored uniformly within the same group? Almost all of it sits on one node, and in its p directory I see many SST and vlog files taking up a lot of space. Please see the attachment!

sst.log (76.9 KB)
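To see at a glance whether the SST files (the LSM tree) or the vlog files (the value log) dominate the p directory, a quick sketch — the path here is an assumption, so point it at the affected node’s actual p directory:

```python
import os
from collections import defaultdict

# Sketch: total file sizes in a Badger p directory, grouped by extension
# (.sst = LSM tree tables, .vlog = value log files).
def usage_by_ext(path: str) -> dict:
    totals = defaultdict(int)
    for entry in os.scandir(path):
        if entry.is_file():
            ext = os.path.splitext(entry.name)[1] or "(none)"
            totals[ext] += entry.stat().st_size
    return dict(totals)

p_dir = "/data2/p"  # assumption: replace with the node's real p directory
if os.path.isdir(p_dir):
    for ext, size in sorted(usage_by_ext(p_dir).items(), key=lambda kv: -kv[1]):
        print(f"{ext:8s} {size / 2**30:8.2f} GiB")
```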

hey @llooper-dev, the sst.log file you’ve shared contains output of the node with 350 GB of data. Can you please share the sst.log output for the node with 1.1 TB of data?

This may be related to: Vlog files use lots of disk space: Add option to set LSMOnly option when opening p dir

Hi, are you using the LSMOnly option to solve the problem of vlog files taking up a lot of space? Thanks!

There is sadly no such option. I am still waiting for a fix to this issue.

There’s a fix in place already. @ibrahim can point you to it. It involves increasing value log threshold.

@mrjn @ibrahim There are too many vlog files in production, and they take up too much space. Can I clean up these vlog files from time to time? Will it affect the system?

@llooper-dev Please don’t delete the vlog files. They store useful data.

The vlog files contain data and vlog GC is supposed to clean it up. In this case, the vlog GC is unable to clean it up.

The following commit improves the vlog usage.

@llooper-dev if you can share the exact version of dgraph that you’re running, I can create a new binary with the necessary patch for you. The fix hasn’t been released yet and it is in dgraph master.

@ibrahim First of all, I am very grateful to you. The tag I am using now is v20.11.2, running on CentOS 7.6. I am looking forward to the new binary.

@ibrahim @anand @mrjn Is there a new release that fixes the problem of vlog files taking up disk space? I am still waiting for the fixed version. Production is having problems, and I am close to giving up.

@hardik is the value log threshold change going out in the next T’challa release?

@mrjn Yes, it will be present in v20.11.3, which should be available tomorrow. cc: @llooper-dev