DGraph Times Out Processing Graph

Moved from GitHub dgraph/4231

Posted by jm4games:

Dgraph Version?

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb28

Hardware Spec?

OS: Arch Linux, (running dgraph in docker env)
CPU: AMD 3900x
RAM: 64GB

Steps to reproduce the issue (command/config used to run Dgraph).

schema.txt - graph layout
trait.txt - initial seed data
graph.gz - (full graph)

Commands:
Update schema through dgraph web UI (copy and paste schema.txt into schema bulk edit)
Then run the following from a shell in the directory where the downloaded trait.txt.gz and graph.gz exist.

$ curl -X POST -H 'Content-Encoding: gzip' -H "Content-Type: application/rdf" localhost:8080/mutate\?commitNow=true --data-binary @trait.txt.gz
$ curl -X POST -H 'Content-Encoding: gzip' -H "Content-Type: application/rdf" localhost:8080/mutate\?commitNow=true --data-binary @graph.gz

Expected behaviour and actual result.

Dgraph should process the graph, but instead spins on it and never completes processing.

Other

I know the graph itself is sound, as I have processed smaller data sets with the exact same layout. It seems that once a graph reaches a certain size, dgraph can no longer process it. Maybe I have exceeded some sort of internal limit. I have also attached a log of the dgraph server, and it looks like raft consensus is timing out.

Files

graph.gz
error.log
schema.txt
trait.txt.gz

MichelDiz commented :

What exactly are you doing? Is it a bulk load? live load?

The file graph.gz isn’t correct for bulk. Bulk and Live loading only accept N-Quads or JSON. No upserts, and not even this format:

mutation {
  set {

  }
}

If you wanna change something, you should do it after the load via upsert. Not during the load.
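For reference, a rough sketch of what Live load expects: plain N-Quads (optionally gzipped) plus the schema, with no mutation or upsert wrapper. The predicates below are made up, and the flags are from v1.1 (check dgraph live --help on your version):

# made-up sample data, plain N-Quads
cat > sample.rdf <<'EOF'
_:alice <name> "Alice" .
_:alice <has_trait> _:t1 .
_:t1 <trait_name> "Example" .
EOF
gzip sample.rdf

# assuming alpha gRPC on localhost:9080 and zero on localhost:5080
dgraph live -f sample.rdf.gz -s schema.txt -a localhost:9080 -z localhost:5080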

What is the config of your cluster? How many groups and instances do you have?

jm4games commented :

This is on a single box using 1 dgraph/zero/ratel server (e.g. the getting started docker-compose). This uses the standard localhost:8080/mutate endpoint (with commitNow=true). The graph.gz is a dump of what I am posting to the endpoint.

MichelDiz commented :

How much RAM does it consume doing this?

jm4games commented :

It would spike occasionally and then eventually sit at 6~10 GB. I would have to run it again to make sure, and I won’t be at my dev box for a while. But graph.gz is only ~168 MB unzipped, so I am kind of surprised it was taking so much RAM.

jm4games commented :

@MichelDiz I updated the description with proper repro steps (sorry, should have done that to begin with). Got onto a laptop and performed the repro; I killed dgraph once it started to exceed 10 GB of RAM.

Also fixed traits.txt (now traits.txt.gz)

MichelDiz commented :

As you have a lot of RAM available, try to run your cluster with these flags (Just for this situation):

--badger.tables=ram
--badger.vlog=mmap

dgraph alpha -h

  --badger.tables string   [ram, mmap, disk] Specifies how Badger LSM tree is stored.
                           Option sequence consume most to least RAM while providing
                           best to worst read performance respectively. (default "mmap")
  --badger.vlog string     [mmap, disk] Specifies how Badger Value log is stored.
                           mmap consumes more RAM, but provides better performance.
                           (default "mmap")
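
One way to apply them is to adjust the alpha command in your compose file; a sketch (the --my, --zero and --lru_mb values are just guesses at a typical single-node setup):

dgraph alpha --badger.tables=ram --badger.vlog=mmap \
  --my=server:7080 --zero=zero:5080 --lru_mb=4096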

jm4games commented :

OK, I will have to give it a shot tonight; my current laptop doesn’t have nearly as much RAM :slight_smile:

MichelDiz commented :

What I have indicated is a way of assessing whether it could be bottlenecked by I/O.

I would then recommend that you have an actual cluster, where each group has its own resources. A single instance with a slow HDD tends to have poor performance. But if you have multiple instances with their own resources and well-defined groups, even with slow HDDs, there is no performance loss.

Also, Dgraph/BadgerDB works better with SSD/NVMe.
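
As a rough sketch of what I mean by separate instances (hostnames and --lru_mb are placeholders; with zero's default --replicas=1 each alpha becomes its own group):

dgraph zero --my=host1:5080 &
dgraph alpha --my=host1:7080 --zero=host1:5080 --lru_mb=4096 &
dgraph alpha --my=host2:7080 --zero=host1:5080 --lru_mb=4096 &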

jm4games commented :

So I am running on a Corsair MP600, which gets about 4250 MB/s in write throughput, so I doubt it’s I/O bound. You won’t even see those speeds in a small cluster setup. The only thing I can think of that might affect this is that I am using a docker volume, and that could be I/O limited.
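
A quick way to rule the volume out (the container name and path here are just placeholders for whatever the compose file creates):

docker stats server_1        # watch BLOCK I/O and memory while the mutation runs
docker exec server_1 sh -c 'dd if=/dev/zero of=/dgraph/ddtest bs=1M count=1024 oflag=direct; rm -f /dgraph/ddtest'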

jm4games commented :

So it seems these flags did not make any difference. I’ve also noticed that since the request failed to fully process, dgraph tries to recover it when I restart the server. The recovered “job?” causes the server to start spinning again. The server also doesn’t seem to be responsive to simple query commands.

jm4games commented :

Also, I’ve been monitoring the server trying to process the job for over 15 minutes. It is sitting at 7.5 GB of RAM use (which has stayed consistent since it was reached, though slowly rising) and fluctuates between using 2 vcores and all 24 vcores. I don’t see much I/O activity (via iotop). I suspect maybe Golang’s GC is kicking in for a massive cleanup when it hits all 24 cores?
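
If it helps, alpha exposes Go's pprof handlers on its HTTP port, so a profile can be grabbed while it spins; a sketch, assuming the default 8080:

go tool pprof http://localhost:8080/debug/pprof/heap      # where the RAM is going
go tool pprof http://localhost:8080/debug/pprof/profile   # 30s CPU profile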

MichelDiz commented :

Quick question: how do you run it? 185 MB via Ratel?

jm4games commented :

How do I run the request? I run all servers via docker-compose and send the request via localhost:8080/mutate, which is alpha.

jm4games commented :

Looks like the job finally finished (but didn’t commit the graph changes):

server_1  | W1101 14:16:42.413590       1 draft.go:916] Inflight proposal size: 189262822. There would be some throttling.
server_1  | W1101 14:16:42.413866       1 draft.go:958] Raft.Ready took too long to process: Timer Total: 1.064s. Breakdown: [{proposals 1.064s} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
server_1  | I1101 14:58:11.000551       1 draft.go:637] Blocked pushing to applyCh for 41m28.587s
server_1  | W1101 14:58:11.000616       1 draft.go:958] Raft.Ready took too long to process: Timer Total: 41m28.587s. Breakdown: [{proposals 41m28.587s} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
server_1  | I1101 14:58:13.865419       1 log.go:34] 1 is starting a new election at term 5
server_1  | I1101 14:58:13.865450       1 log.go:34] 1 became pre-candidate at term 5
server_1  | I1101 14:58:13.865457       1 log.go:34] 1 received MsgPreVoteResp from 1 at term 5
server_1  | I1101 14:58:13.865498       1 log.go:34] 1 became candidate at term 6
server_1  | I1101 14:58:13.865504       1 log.go:34] 1 received MsgVoteResp from 1 at term 6
server_1  | I1101 14:58:13.865546       1 log.go:34] 1 became leader at term 6
server_1  | I1101 14:58:13.865554       1 log.go:34] raft.node: 1 elected leader 1 at term 6
server_1  | I1101 14:58:14.765325       1 groups.go:808] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
server_1  | I1101 14:58:14.765349       1 groups.go:817] Got Zero leader: zero:5080

MichelDiz commented :

Ah, just saw your update in the first comment:

curl -X POST -H 'Content-Encoding: gzip' -H "Content-Type: application/rdf" localhost:8080/mutate\?commitNow=true --data-binary @graph.gz

jm4games commented :

FYI, the server is sitting at 8.8 GB of RAM use after job completion. That seems really high given how small the data set was.

MichelDiz commented :

Well, I’ve been running your dataset for the last 26 minutes. The only thing in the logs is the transaction being aborted (maybe some typo in the dataset). There’s no issue as far as I can tell.

(I’m not using Docker. Just a local instance of Dgraph on Darwin.)

It uses 10.13 GB of RAM, which is normal, and has some spikes in CPU, which is also normal due to the background activities.

Using an Upsert Block with 185 MB of data (decompressed) was never tested, nor was it the intended use of that kind of query. Upsert is very powerful, and it must be used with care.

I believe the issue below covers what you need:

https://github.com/dgraph-io/dgraph/issues/4048

jm4games commented :

I would be surprised if there was a typo in the data set, since it is programmatically generated and it works on smaller data sets (but I guess there is always the chance). Are you saying I should use the mutate endpoint instead and should expect better performance? I really just use upsert so that I can get the trait node IDs and insert new data with them in one call.
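
To be concrete, this is the shape of what I’m sending, just much larger (predicate and variable names here are made up, not my real schema, and eq() assumes trait_name is indexed):

curl -X POST -H "Content-Type: application/rdf" localhost:8080/mutate\?commitNow=true --data-binary @- <<'EOF'
upsert {
  query {
    t as var(func: eq(trait_name, "Example"))
  }
  mutation {
    set {
      _:item <name> "New item" .
      _:item <has_trait> uid(t) .
    }
  }
}
EOF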

MichelDiz commented :

I would personally recommend that you do this in small chunks with longer intervals, and add load-balancing logic to your code.

For example, instead of firing all chunks for just one instance, you can fire like this:

localhost:8080/mutate\?commitNow=true --data-binary @graphchunk1.gz
localhost:8081/mutate\?commitNow=true --data-binary @graphchunk2.gz
localhost:8082/mutate\?commitNow=true --data-binary @graphchunk3.gz
localhost:8083/mutate\?commitNow=true --data-binary @graphchunk4.gz

Send chunks to each available instance. That way you split the load well across your cluster.
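
A sketch of that round-robin firing, assuming the chunks already exist as separate, self-contained files and four alphas are exposed on 8080-8083:

i=0
for f in graphchunk*.gz; do
  port=$((8080 + i % 4))
  curl -X POST -H 'Content-Encoding: gzip' -H "Content-Type: application/rdf" \
    "localhost:${port}/mutate?commitNow=true" --data-binary @"$f"
  i=$((i + 1))
done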