Dgraph v20.11 running in production suddenly crashed; the error message is in the attachment. Please help urgently.

The Dgraph instance running in production suddenly crashed!

unexpected fault address 0x7f1fdf99a000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f1fdf99a000 pc=0xa9e21e]

dgraph.log (44.4 KB)

It would probably help to detail your environment.

The reason is that we ran out of disk space. The puzzling part is that we have three nodes: one of them has a p directory of up to 1.1 TB, while the other two consume very little disk, so the data is completely unevenly distributed. Doesn’t Dgraph support load balancing internally? What can I do to get the data evenly distributed across the nodes?

[root@yzsjhl19-91 data2]# du -sh  *|grep 'p$'
333G	p

[root@yzsjhl19-92 ~]# du -sh  *|grep 'p$'
12G	p

[root@yzsjhl30-26 data2]# du -sh  *|grep 'p$'
1.1T	p
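Since the crash came from running out of disk, the remaining space on each node can also be checked from inside the Dgraph data directory, for example:

df -h .    # free space on the filesystem holding the p and w directories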

Is your data heavily based on a single predicate? Data should be spread across groups by predicate automatically.
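If you want to check how the tablets (predicates) are assigned, Zero’s /state endpoint lists every predicate per group. A quick sketch, assuming jq is installed and Zero is on its default HTTP port 6080 (adjust for any --port_offset):

# List which predicates (tablets) are served by each group.
curl -s localhost:6080/state | jq '.groups | map_values(.tablets | keys)'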

This is production user data, so it should not all be on the same predicate. My schema is as follows; which predicates would be used for the balancing calculation?

<forwardCount>: int .
<forwardFrom>: uid @reverse .
<isDeleted>: int .
<isRoot>: int .
<rootUgcId>: int @index(int) .
<test>: int @index(int) .
<type>: string @index(term) .
<ugcId>: int @index(int) .
<ugcUid>: int @index(int) .
<updateTime>: int .

type <RR_UGC> {
	forwardFrom
	forwardCount
	isDeleted
	isRoot
	rootUgcId
	ugcId
	ugcUid
	createTime
	updateTime
}

Dgraph should auto-balance the predicates. I was just thinking that maybe most of your data was on a single predicate, but that doesn’t appear to be the case. I don’t have any advice on what to try from here; I don’t do server config for Dgraph anymore, so I don’t know whether there are other commands you could run to debug this further.

See if this helps: Unbalanced disk usage - #3 by dmai
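If the cluster actually has more than one Alpha group, a tablet can also be moved by hand via Zero’s /moveTablet endpoint. A sketch only, with a placeholder predicate and target group, and again assuming Zero’s default HTTP port 6080:

# Ask Zero to move the "ugcId" tablet to group 2.
curl "localhost:6080/moveTablet?tablet=ugcId&group=2"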

I don’t know who to tag to help more here. Maybe @dmai ?

The uncompressedBytes values are relatively large. Which parameter can be used to improve this?

{"groupId":1,"predicate":"dgraph.type","force":false,"onDiskBytes":"364537219","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"1829024082"},"dgraph.user.group":{"groupId":1,"predicate":"dgraph.user.group","force":false,"onDiskBytes":"0","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"0"},"dgraph.xid":{"groupId":1,"predicate":"dgraph.xid","force":false,"onDiskBytes":"0","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"0"},"forwardCount":{"groupId":1,"predicate":"forwardCount","force":false,"onDiskBytes":"1632884892","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"5122086792"},"forwardFrom":{"groupId":1,"predicate":"forwardFrom","force":false,"onDiskBytes":"1767072417","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"3287967454"},"isDeleted":{"groupId":1,"predicate":"isDeleted","force":false,"onDiskBytes":"1130632662","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"4013628802"},"isRoot":{"groupId":1,"predicate":"isRoot","force":false,"onDiskBytes":"1144710246","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"3922396123"},"rootUgcId":{"groupId":1,"predicate":"rootUgcId","force":false,"onDiskBytes":"1789223434","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"4169587811"},"test":{"groupId":1,"predicate":"test","force":false,"onDiskBytes":"0","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"0"},"type":{"groupId":1,"predicate":"type","force":false,"onDiskBytes":"3286410623","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"3647302252"},"ugcId":{"groupId":1,"predicate":"ugcId","force":false,"onDiskBytes":"3194198127","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"6839584343"},"ugcUid":{"groupId":1,"predicate":"ugcUid","force":false,"onDiskBytes":"2197521869","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"5312250272"},"updateTime":{"groupId":1,"predicate":"updateTime","force":false,"onDiskBytes":"1040495592","remove":false,"readOnly":false,"moveTs":"0","uncompressedBytes":"2550994861"}},"snapshotTs":"329122847","checksum":"4895759682130352111","checkpointTs":"0"}},"zeros":{"1":{"id":"1","groupId":0,"addr":"10.4.19.91:5180","leader":true,"amDead":false,"lastUpdate":"0","clusterInfoOnly":false,"forceGroupId":false}},"maxLeaseId":"110552334","maxTxnTs":"329140000","maxRaftId":"3","removed":[],"cid":"ad22c31d-11e2-4a2f-a8f1-955eb5230deb","license":{"user":"","maxNodes":"18446744073709551615","expiryTs":"1611302192","enabled":false}}

@llooper-dev,
What version of Dgraph are you using?
Are you creating+deleting some predicates continuously?

Dgraph version : v20.11.0

Users do create predicates as they use the system. My question now is: why isn’t the data stored uniformly within the same group? Almost all of it is stored on one node. In that node’s p directory I observed many SST and vlog files that take up a lot of space; please see the attachment!

sst.log (76.9 KB)
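To break the usage down inside the large p directory, the LSM tables (.sst) and the value log files (.vlog) can be totalled separately, for example:

# Run inside the p directory of the 1.1 TB node.
du -ch *.sst | tail -n1    # total space used by SST (LSM) files
du -ch *.vlog | tail -n1   # total space used by value log files
ls *.vlog | wc -l          # number of vlog files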

hey @llooper-dev, the sst.log file you’ve shared contains output of the node with 350 GB of data. Can you please share the sst.log output for the node with 1.1 TB of data?

This may be related to: Vlog files use lots of disk space: Add option to set LSMOnly option when opening p dir

Hi, are you using the LSMOnly option to solve the problem of the vlog taking up a lot of space? Thanks!

There is sadly no such option. I am still waiting for a fix to this issue.

There’s a fix in place already. @ibrahim can point you to it. It involves increasing value log threshold.

@mrjn @ibrahim There are too many Vlog files in production, which take up too much space. Can I clean these generated Vlog files from time to time? Will it affect the system?

@llooper-dev Please don’t delete the vlog files. They store useful data.

The vlog files contain data and vlog GC is supposed to clean it up. In this case, the vlog GC is unable to clean it up.

The following commit improves the vlog usage.
https://github.com/dgraph-io/badger/commit/6c35ad6c28e00ecd933960dcf54c7dc6e8a0fea3

@llooper-dev if you can share the exact version of dgraph that you’re running, I can create a new binary with the necessary patch for you. The fix hasn’t been released yet and it is in dgraph master.
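You can get the exact tag and commit from the binary itself:

# Prints the release tag, commit SHA, and build details of the running binary.
dgraph version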

@ibrahim First of all, I am very grateful to you. The tag I am using now is v20.11.2, running on CentOS 7.6. I am looking forward to the new binary.

@ibrahim @anand @mrjn Is there a new release that fixes the problem of the vlog taking up disk space? I am still waiting for the fixed version. Production is having problems, and I am about to give up.

@hardik is the value log threshold change going out in the next release of T’Challa?

@mrjn Yes, it will be present in v20.11.3, which should be available tomorrow. cc: @llooper-dev