Error caused by a leader switch during live loading

Hi all,
I would like to know whether the following error really causes any data loss:
“draft.go:467] Lastcommit 10591 > current 10575. This would cause some commits to be lost.”

I found the above error in a Dgraph Alpha’s log while live-loading the “A bigger dataset” tutorial data on a 3-node Dgraph cluster built on GCP.
https://tour.dgraph.io/moredata/1/

I would be glad if you could help me.

Which version are you using?

The version is 1.0.10.

The following are the commands used to run each process.

  • Zero @ node1
    dgraph zero --idx=1 --replicas=3 --my=10.146.0.2:5080 --bindall
  • Alpha @ node1
    dgraph alpha --idx=1001 --my=10.146.0.2:7080 --lru_mb=3072 --badger.vlog=disk
  • Others are similar to node1; see the node2 sketch below.
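For reference, node2 would look something like this (a sketch; the internal IP 10.146.0.3 is an assumption, and --peer joins the second Zero to the first):

  • Zero @ node2
    dgraph zero --idx=2 --replicas=3 --my=10.146.0.3:5080 --peer 10.146.0.2:5080 --bindall
  • Alpha @ node2
    dgraph alpha --idx=1002 --my=10.146.0.3:7080 --lru_mb=3072 --badger.vlog=disk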

To reproduce, I needed to run the following operations several times:

curl -X POST http://127.0.0.1:8080/alter -d'{"drop_all": true}'
curl -X POST http://127.0.0.1:8080/alter -d'
director.film: uid @reverse .
genre: uid @reverse .
initial_release_date: dateTime @index(year) .
name: string @index(term) @lang .
'
dgraph live -r dgraph/1million.rdf.gz --zero 10.146.0.2:5080 -c 1 -b 2000

Thank you

What do you mean by “switching Leader in live loading”?

How many Alpha instances do you have? Why “--idx=1001”? Do you have more than 1001 Alphas?

Please don’t use “-b 2000” while you’re using “--badger.vlog=disk”. I suspect you have HDD storage, so you get less performance, and increasing the batch size may cause issues in that situation. Leave it at the default, or try SSDs or NVMe.

Can you share your specs?

Thank you for the reply.

What do you mean by “switching Leader in live loading”?

I found the error after a new leader had been elected during live loading.
I think it is triggered by high load.
Here is the log around the error; I modified it a little, combining the logs and adding the node name to each line.

How many Alpha instances do you have? Why “--idx=1001”? Do you have more than 1001 Alphas?

I have 3 Alphas.

  • node1: Zero idx:1, Alpha idx:1001
  • node2: Zero idx:2, Alpha idx:1002
  • node3: Zero idx:3, Alpha idx:1003

Please don’t use “-b 2000” while you’re using “--badger.vlog=disk”. I suspect you have HDD storage, so you get less performance, and increasing the batch size may cause issues in that situation. Leave it at the default, or try SSDs or NVMe.

The reason I used -b 2000 was to see how Dgraph behaves under high load.
However, I will use --badger.vlog=mmap and leave -b at its default in normal operation.
I believed --badger.vlog=disk would give me more safety, because the vlog is the WAL, and in an RDBMS like PostgreSQL the WAL must be flushed to storage.

Can you share your specs?
On GCP: n1-standard-2 (2 vCPUs, 7.5 GB RAM), with a standard persistent disk (which should be HDD, not SSD).


In my opinion (this is a personal comment): if you are going to use HDDs, you will necessarily need to increase the amount of memory, and consequently the lru_mb cache. HDDs are very slow: the fastest of them, at 15k RPM, reach about 400 IOPS, while the most basic SSD does 5K IOPS and an NVMe drive does around 120K IOPS, up to 10 million read IOPS. In theory, DDR4 RAM can give you 1.7 million write IOPS. What SSDs, NVMe, and RAM have in common is low latency and fast access.

See? More memory compensates for physical-storage bottlenecks.

Dgraph is a DB designed to take full advantage of SSDs or NVMe. If you use HDDs you have to compensate for this, and compensate a lot, because in this scenario you are tripling Dgraph’s work. With less memory and a greater workload you will have problems, as with any DB.

Even PostgreSQL gets better with SSDs; see the chart.


On load-testing Dgraph: I think you would be better off creating a test with clients, like this guy did.

This is the best way to test Dgraph. The live loader needs some adjustments to keep up with recent changes in Dgraph, so I do not recommend using it for that purpose, or increasing its default values.
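For example, here is a minimal sketch of such a client test in Go using the dgo client (the gRPC endpoint 127.0.0.1:9080 and the writer/iteration counts are assumptions; adjust them for your cluster):

package main

import (
    "context"
    "fmt"
    "log"
    "sync"

    "github.com/dgraph-io/dgo"
    "github.com/dgraph-io/dgo/protos/api"
    "google.golang.org/grpc"
)

func main() {
    // Connect to one Alpha's gRPC endpoint (assumed address).
    conn, err := grpc.Dial("127.0.0.1:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    var wg sync.WaitGroup
    for i := 0; i < 4; i++ { // number of concurrent writers is arbitrary
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                // One small mutation per transaction, committed immediately.
                nq := fmt.Sprintf(`_:n <name> "client-%d-op-%d" .`, id, j)
                _, err := dg.NewTxn().Mutate(context.Background(), &api.Mutation{
                    SetNquads: []byte(nq),
                    CommitNow: true,
                })
                if err != nil {
                    log.Printf("writer %d: %v", id, err)
                }
            }
        }(i)
    }
    wg.Wait()
}

Running several writers like this puts the same kind of concurrent commit pressure on the cluster as the live loader, but fully under your control.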