Bulk loader becomes slow when memory gets full

I went through the post https://github.com/dgraph-io/dgraph/issues/1323 and found my situation is very similar to it, except that I have already tried dgraph-bulk-loader.

I have a personal server with a 32-core CPU, 64 GB of memory, and a 1.5 TB SSD, and my data set is around 3 billion edges.
At first the edge speed was very high, up to 2M edges/sec. However, once memory usage reached about 90% of the total (it then stayed at that level, which I assumed was the loader's deliberate strategy to avoid running out of memory), the edge speed dropped gradually.

I haven't finished the import yet, so I can only share the report at around the 4-hour mark below. I believe the speed will drop below 100k before it's done:

MAP 04h15m28s rdf_count:570.9M rdf_speed:37.24k/sec edge_count:1.671G edge_speed:109.0k/sec

While the memory and SSD are busy, the CPU is mostly idle:

%Cpu(s): 4.4 us, 2.6 sy, 0.0 ni, 37.6 id, 55.3 wa, 0.0 hi, 0.1 si, 0.0 st

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18464 hj 20 0 0.099t 0.058t 37216 S 347.4 94.1 1447:13 dgraph-bulk-loa


From that post, the bulk loader showed amazing import power; however, it's a pity that it doesn't take memory scalability into consideration. Many users like me and the reporter of issues/1323 run Dgraph on personal servers, where memory is not as easy to scale as on cloud servers. I could upgrade my memory to 128 GB, but I'm not sure it would make much difference, since it would still fill up easily.

I just wonder whether the bulk loader could optimize its swapping strategy, or make more use of the CPU when memory fills up.

To add some information: I followed the advice in issue 1323 and disabled expand_edges and removed the facets.

What are the flags that you ran bulk loader with?

dgraph-bulk-loader -r=~/rdf -expand_edges=false -s=~/my.schema -x

schema:

p : uid @reverse @count .
tx : uid @reverse .
i : uid @reverse @count .
ts : dateTime @index(hour) .
a : uid @reverse @count .
v : int @index(int) .
o : uid @reverse @count .

Hi @forchain, with some tweaking of parameters I’m sure we’ll be able to help you get the bulk loader running at a decent speed.

In the MAP phase, CPU should be utilised at close to 100% all of the time. We need to fix that first (by tuning parameters).

The bulk loader doesn’t do anything smart to avoid out of memory. The memory usage is controlled indirectly via the flags that get passed in.

I suspect memory swapping is causing the slowdown.

Answers to a few questions will help diagnose the problem:

1. What version of the bulk loader are you using?

2. Do you have any swap space set up? As soon as Dgraph starts swapping memory, its performance drops dramatically. I've found it's best (if possible) to run the bulk loader with swap disabled, so that memory problems are immediately obvious and you can restart the loader with tweaked parameters rather than having it just slow down.
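As a concrete illustration (my own sketch, not commands from this thread), on Linux you can check whether any swap is configured before starting a load:

```shell
# Total configured swap in kB, read from /proc/meminfo; 0 means swap
# is effectively disabled.
swap_total_kb=$(awk '/SwapTotal/ {print $2}' /proc/meminfo)
echo "SwapTotal: ${swap_total_kb} kB"

# To disable all swap until the next reboot (requires root):
#   swapoff -a
# and to re-enable it afterwards:
#   swapon -a
```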

3. How long does it take for the slow down to occur? Does it only occur once your memory fills up?

Once we can confirm that the issue is memory swapping, we can start tuning params. The first param to tune would be reducing the value of --map_output_mb.


Thanks Peter,

My answers:

  1. What version of the bulk loader are you using?
    0.83

  2. Do you have any swap space setup?
    No, by default.

  3. How long does it take for the slow down to occur?
    Sorry, I didn't expect the process to be this slow, so I forgot to run it in the background (with nohup, screen, etc.) and didn't record when the memory filled up. I can't give you an exact time, but I'm sure it was under an hour. Also, I'm sure it hasn't finished the map phase, since the out directory is empty:

$ du -hd 1
196G ./tmp
12K ./out
196G .

At the time I wrote this reply, the import was still in progress and had taken more than 20 hours. I don't know the current speed; as I said, I ran it in the foreground, and after I got home (the server is in my office), the terminal window on my laptop disconnected.
Although the bulk loader didn't shut down when the terminal disconnected (a deliberate feature for long-running tasks, right?), I don't know how to attach to it and inspect its current status. Is there a log file or something similar I can check? If not, take this as a suggestion: many users will underestimate how long the process takes, run it in the foreground in a terminal window, and easily lose the connection.

Let me add some information about my data; I hope it helps:

My data consists of around 1000 rdf.gz files, about 70 MB each, so the total size is 70 GB.
Each file has around 3M edges. Here is a sample that covers all the predicates.

<0000000000000000003d73b305dbfc318d4b8cf8f6d7fcd0cad1d9eeb0ef4c3d>

<0000000000000000002cecef1c5a0213c20c31f261c6d26ef475c0848d72d3cf> .
<0000000000000000003d73b305dbfc318d4b8cf8f6d7fcd0cad1d9eeb0ef4c3d> "2017-09-09T20:04:07+08:00"^^xs:dateTime .
<0000000000000000003d73b305dbfc318d4b8cf8f6d7fcd0cad1d9eeb0ef4c3d> .
<0000000000000000003d73b305dbfc318d4b8cf8f6d7fcd0cad1d9eeb0ef4c3d.0> .
<df1c85ac76cbbdfa98bf85008fecdc6746d7f141e5f6534e184b71ca86e5aace.0> <1Hz96kJKF2HLPGY15JWLB5m9qGNxvt8tHJ> .
<df1c85ac76cbbdfa98bf85008fecdc6746d7f141e5f6534e184b71ca86e5aace.0> "1442812613"^^xs:int .
<df1c85ac76cbbdfa98bf85008fecdc6746d7f141e5f6534e184b71ca86e5aace.0> .

From your top screenshot, I can see that there is some swapping happening. You've got ~64 GB of RAM in use, and ~12 GB of swap in use. This will be the cause of the slowdown.

You will need to tweak the parameters passed to the bulk loader in order to get the memory usage down below 64GB.

--map_output_mb is the first place to start. By default, it's set to 64. Could you try setting it to 16 and running again? This will reduce the amount of memory used.


I'd even recommend turning off swap while you're doing this.


Agreed^^. Swap will be a hindrance; it's better for the bulk loader to crash with an out-of-memory error so that you can change the params and start again.

OK, I'll try disabling swap and -mapoutput_mb with 16G and start over; just a little concerned about OOM.

You mean --mapoutput_mb=16 (16MB).

The number of count indices that you have is also a bit concerning to me. Count index is expensive, and I’d recommend reducing its usage, if possible.


Yeah, typo, thanks for the reminder.

start:
╰─$ dgraph-bulk-loader -r=/home/hj/rdf -expand_edges=false -s=/home/hj/my.schema -x -mapoutput_mb=16
{
"RDFDir": "/home/hj/rdf",
"SchemaFile": "/home/hj/my.schema",
"DgraphsDir": "out",
"LeaseFile": "LEASE",
"TmpDir": "tmp",
"NumGoroutines": 32,
"MapBufSize": 16777216,
"ExpandEdges": false,
"BlockRate": 0,
"SkipMapPhase": false,
"CleanupTmp": true,
"NumShufflers": 1,
"Version": false,
"StoreXids": true,
"MapShards": 1,
"ReduceShards": 1
}

The bulk loader needs to open many files at once. This number depends on the size of the data set loaded, the map file output size, and the level of indexing. 100,000 is adequate for most data set sizes. See man ulimit for details of how to change the limit.
Current max open files limit: 4096
MAP 01s rdf_count:194.2k rdf_speed:194.1k/sec edge_count:670.6k edge_speed:670.3k/sec
MAP 02s rdf_count:951.2k rdf_speed:475.4k/sec edge_count:3.268M edge_speed:1.634M/sec
MAP 03s rdf_count:1.602M rdf_speed:532.2k/sec edge_count:5.482M edge_speed:1.821M/sec
MAP 04s rdf_count:2.144M rdf_speed:531.6k/sec edge_count:7.309M edge_speed:1.812M/sec
MAP 05s rdf_count:2.861M rdf_speed:568.3k/sec edge_count:9.722M edge_speed:1.931M/sec
MAP 06s rdf_count:3.439M rdf_speed:569.8k/sec edge_count:11.64M edge_speed:1.929M/sec
MAP 07s rdf_count:3.929M rdf_speed:558.5k/sec edge_count:13.30M edge_speed:1.890M/sec
MAP 08s rdf_count:4.141M rdf_speed:515.0k/sec edge_count:14.00M edge_speed:1.741M/sec
MAP 09s rdf_count:4.521M rdf_speed:500.1k/sec edge_count:15.27M edge_speed:1.689M/sec
MAP 10s rdf_count:4.845M rdf_speed:482.5k/sec edge_count:16.35M edge_speed:1.628M/sec
MAP 11s rdf_count:5.688M rdf_speed:515.2k/sec edge_count:19.15M edge_speed:1.734M/sec

end:

MAP 44m12s rdf_count:438.2M rdf_speed:165.2k/sec edge_count:1.295G edge_speed:488.1k/sec
MAP 44m13s rdf_count:438.2M rdf_speed:165.2k/sec edge_count:1.295G edge_speed:487.9k/sec
[1] 29475 killed dgraph-bulk-loader -r=/home/hj/rdf -expand_edges=false -x -mapoutput_mb=16

I turned off swap and set mapoutput_mb to 16, then got OOM :joy:. Is there any documentation about recommended memory requirements, e.g. 3 billion edges needing 488 GB of memory? Then I could tell whether Dgraph is a fit for my project.
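A side note on the open-files warning in the loader output above ("Current max open files limit: 4096" against a recommended 100,000): before retrying, it may be worth raising the soft limit. This is my own sketch; a persistent change belongs in /etc/security/limits.conf.

```shell
# Show the current soft and hard limits on open file descriptors.
ulimit -Sn
ulimit -Hn

# Raise the soft limit up to the hard limit, for this shell session only.
# The bulk loader started from this shell inherits the raised limit.
ulimit -n "$(ulimit -Hn)"
```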

Is the data publicly available? We could try running it at our end, and see if there’s something special about the data which causes bulk loader to OOM.

Of course, you could also try upgrading the machine to have more RAM. We don't have any specific recommendations mapping data size to RAM usage; tweaking flag values to decrease RAM usage is what you can do.

You could try reducing --num_go_routines. That would reduce memory usage at the expense of performance.
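Putting the suggestions so far together, a sketch of a retry invocation (flag spellings follow the commands quoted earlier in this thread; the values 16/16 are illustrative examples, not recommendations):

```shell
# Retry with smaller map output buffers and fewer goroutines to cut
# peak memory usage. Paths match the earlier posts in this thread.
dgraph-bulk-loader \
  -r=/home/hj/rdf \
  -s=/home/hj/my.schema \
  -x \
  -expand_edges=false \
  -mapoutput_mb=16 \
  -j=16
```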


Thanks,
The data is generated by me and is a bit huge, so I'll try to solve this myself first. I think reducing the number of goroutines may be good advice, since I found that my other laptop, with only 16 GB RAM, an 8-core CPU, and a 256 GB SSD, didn't OOM on the same data, and its swap stayed idle.

Problem solved. I guess (just a guess; I have no idea how to verify it) that the problem is mainly related to the OS memory management strategy, with the number of goroutines as a secondary factor.

My personal server (64 GB RAM, 32-core CPU, 480 GB + 1 TB SSD) was previously installed with CentOS 7, and I found that once the required memory grew larger than physical memory, it started consuming swap; when swap was used up, the bulk loader crashed with OOM.

However, my laptop (Ubuntu 17.10) with less memory (16 GB RAM, 8-core CPU, 250 GB SSD) didn't OOM and worked much faster on the same data. I suspected an OS issue, so I reinstalled my server with Ubuntu 17.10. Although it still crashed with OOM, it no longer consumed swap when the required memory grew beyond 64 GB.

Then I followed Manish's advice to reduce the number of goroutines from 32 (the default) to 16, and it worked well and stayed stable.

MAP 01h16m06s rdf_count:963.1M rdf_speed:210.9k/sec edge_count:2.786G edge_speed:610.1k/sec

As you can see, in only around an hour it finished nearly 1 billion RDFs at a speed of around 600k/sec. In my last test on CentOS, it took more than 20 hours to reach 1 billion, and the speed had declined to 30k.

P.S.:
This is just a guess; I'm not sure it's a CentOS problem. I'm only recording what I observed, so anyone with a similar setup can give it a try.

Awesome, I’m glad you got the data successfully loaded :smiley:

A bit of an explanation about what is most likely happening:

  1. The bulk loader requires memory proportional to the number of goroutines (-j flag).
  2. By reducing -j, you keep all memory in RAM, rather than having to go out to swap.
  3. Because RAM is very fast compared to swap, the total time is lower (1 hr vs. >20 hrs), even though there are fewer workers running.
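A back-of-envelope sketch of point 1 (my own illustration, not an official formula; real usage also includes Go runtime overhead and posting-list buffers, so treat it as a lower bound for tuning):

```shell
# Hypothetical estimate: each map goroutine holds on the order of one
# map output buffer, so buffer memory scales linearly with -j.
map_output_mb=16     # the -mapoutput_mb value used in this thread
num_goroutines=16    # the -j value after tuning
buffer_mb=$((num_goroutines * map_output_mb))
echo "~${buffer_mb} MB of map output buffers with -j=${num_goroutines}"
```

With the default values (-j=32, -mapoutput_mb=64), the same arithmetic gives 2048 MB of buffers, which is why reducing either flag lowers peak memory.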

@peter : Let’s mention these two flags in the Deploy page where we talk about bulk loader, and explain how they could be used to decrease memory usage.


Excellent idea, I’ve now added some notes there.
