Commands:
Update the schema through the Dgraph web UI (copy and paste schema.txt into the schema bulk edit); a scripted equivalent is sketched below.
Then run the following from a shell in the directory where the downloaded trait.txt.gz and graph.gz exist.
Dgraph should process the graph, but instead it spins on it and never completes processing.
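For completeness, the schema step can also be scripted rather than pasted into Ratel; assuming the default ports, something like this against the /alter endpoint should be equivalent:

    curl -X POST localhost:8080/alter --data-binary '@schema.txt'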
Other
I know the graph itself is sound, as I have processed smaller data sets with the exact same layout. It seems that once a graph reaches a certain size, Dgraph can no longer process it. Maybe I have exceeded some sort of internal limit. I have also attached a log of the Dgraph server, and it looks like a Raft consensus is timing out.
This is on a single box using one dgraph/zero/ratel server (e.g. the getting-started docker-compose). It uses the standard localhost:8080/mutate endpoint (with commitNow=true). The graph.gz is a dump of what I am posting to the endpoint.
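Roughly, the post looks like the sketch below (the actual payload is whatever is in graph.gz; the Content-Type shown assumes an RDF-style mutation, and I decompress first rather than relying on compressed request bodies):

    gunzip -k graph.gz
    curl -X POST 'localhost:8080/mutate?commitNow=true' \
         -H 'Content-Type: application/rdf' \
         --data-binary '@graph'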
RAM usage would spike occasionally and then eventually sit at 6-10 GB. I would have to run it again to make sure, and I won't be at my dev box for a while. But the graph.gz is only ~168 MB unzipped, so I am kind of surprised it was taking so much RAM.
@MichelDiz I updated the description with proper repro steps (sorry, I should have done that to begin with). I got onto a laptop and performed the repro; I killed Dgraph once it started to exceed 10 GB of RAM.
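For anyone else watching memory in the docker-compose setup, plain docker stats is enough to see the container climb:

    docker stats --format 'table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}'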
As you have a lot of RAM available, try to run your cluster with these flags (just for this situation):
--badger.tables=ram
--badger.vlog=mmap
dgraph alpha -h
--badger.tables string [ram, mmap, disk]
Specifies how Badger LSM tree is stored. Option sequence consume most to least RAM while providing best to worst read performance respectively. (default "mmap")
--badger.vlog string [mmap, disk]
Specifies how Badger Value log is stored. mmap consumes more RAM, but provides better performance. (default "mmap")
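Putting it together, a single-box invocation would look something like this (the --lru_mb value and zero address here are just placeholders for your setup):

    dgraph alpha --badger.tables=ram --badger.vlog=mmap --lru_mb=2048 --zero=localhost:5080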
What I have indicated is a way of assessing whether it could be bottlenecked by I/O.
I would then recommend that you run an actual cluster, where each group has its own resources. A single instance with a slow HDD tends to have poor performance. But if you have multiple instances with their own resources, even with slow HDDs and well-defined groups, there is no performance loss.
So I am running on a Corsair MP600, which gets about 4250 MB/s in write throughput, so I doubt it's I/O bound. You won't even see those speeds in a small cluster setup. The only thing I can think of that might be limiting this is that I am using a Docker volume, and that could be I/O limited.
So it seems these flags did not make any difference. I've also noticed that since the request failed to fully process, Dgraph tries to recover it when I restart the server. The recovered "job" causes the server to start spinning again. The server also doesn't seem to be responsive to simple query commands.
Also, I've been monitoring the server trying to process the job for over 15 minutes. It is sitting at 7.5 GB of RAM use (which has stayed consistent once reached, though slowly rising) and fluctuates between using 2 vcores and all 24 vcores. I don't see much I/O activity (via iotop). I suspect maybe Go's GC is kicking in for a massive cleanup when it hits all 24 cores?
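One way to check the GC theory: if the alpha exposes Go's standard expvar endpoint at /debug/vars (Dgraph builds generally do), the runtime memstats can be polled directly. A rough sketch, assuming jq is installed:

    curl -s localhost:8080/debug/vars \
        | jq '.memstats | {Alloc, HeapInuse, NumGC, PauseTotalNs}'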
Looks like the job finally finished (but it didn't commit the graph changes).
server_1 | W1101 14:16:42.413590 1 draft.go:916] Inflight proposal size: 189262822. There would be some throttling.
server_1 | W1101 14:16:42.413866 1 draft.go:958] Raft.Ready took too long to process: Timer Total: 1.064s. Breakdown: [{proposals 1.064s} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
server_1 | I1101 14:58:11.000551 1 draft.go:637] Blocked pushing to applyCh for 41m28.587s
server_1 | W1101 14:58:11.000616 1 draft.go:958] Raft.Ready took too long to process: Timer Total: 41m28.587s. Breakdown: [{proposals 41m28.587s} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
server_1 | I1101 14:58:13.865419 1 log.go:34] 1 is starting a new election at term 5
server_1 | I1101 14:58:13.865450 1 log.go:34] 1 became pre-candidate at term 5
server_1 | I1101 14:58:13.865457 1 log.go:34] 1 received MsgPreVoteResp from 1 at term 5
server_1 | I1101 14:58:13.865498 1 log.go:34] 1 became candidate at term 6
server_1 | I1101 14:58:13.865504 1 log.go:34] 1 received MsgVoteResp from 1 at term 6
server_1 | I1101 14:58:13.865546 1 log.go:34] 1 became leader at term 6
server_1 | I1101 14:58:13.865554 1 log.go:34] raft.node: 1 elected leader 1 at term 6
server_1 | I1101 14:58:14.765325 1 groups.go:808] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
server_1 | I1101 14:58:14.765349 1 groups.go:817] Got Zero leader: zero:5080
Well, I've been running your dataset for the last 26 minutes. The only thing in the logs is the transaction being aborted (maybe some typo in the dataset). There's no issue as far as I can tell.
(I'm not using Docker, just a local instance of Dgraph on Darwin.)
It uses 10.13 GB of RAM, which is normal, and has some spikes in CPU, which are also normal due to the background activities.
Using an Upsert Block with 185 MB of data (decompressed) was never tested, nor was it ever the intended use for that kind of query. Upsert is very powerful, and its use must be done with care.
I would be surprised if there were a typo in the data set, since it is programmatically generated and it works on smaller data sets (but I guess there is always the chance). Are you saying I should use the mutate endpoint instead and should expect better performance? I really just use upsert so that I can get the trait node IDs and insert new data with them in one call, roughly the shape sketched below.
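To be concrete about the shape (the predicate and variable names here are made up; the real ones are in graph.gz): the query block resolves an existing trait's uid, and the mutation attaches new nodes to it in the same call:

    curl -X POST 'localhost:8080/mutate?commitNow=true' \
         -H 'Content-Type: application/rdf' \
         -d 'upsert {
               query {
                 t as var(func: eq(trait_name, "strength"))
               }
               mutation {
                 set {
                   _:item <has_trait> uid(t) .
                   _:item <item_name>  "sword" .
                 }
               }
             }'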