Dgraph bulk loader pollutes /tmp even when --tmp is given

The Dgraph bulk loader pollutes /tmp even though --tmp /dgraph/tmp is given. It works fine with v20.07.1, but with the later version v20.11.0-g1003e71bd, plenty of buffer and dgraph files appear in /tmp. The bulk loader logs "TmpDir": "/dgraph/tmp".

This is how to reproduce:

mkdir tmp
seq 1 1000000 | while read i; do echo "<uri$i> <p> \"uri $i\" ."; done | gzip > tmp/data.rdf.gz
echo "<p>: string @index(fulltext) ." > tmp/schema.dgraph
docker run --rm -it -v $(pwd)/tmp:/dgraph dgraph/dgraph:1003e71bd /bin/bash -c "dgraph zero > /dev/null 2>&1 & dgraph bulk --tmp /dgraph/tmp -f data.rdf.gz -s schema.dgraph --out /dgraph/out > /dev/null 2>&1 & while jobs -r | grep -v zero > /dev/null; do ls -lah /tmp; sleep 2; echo; done"

You will see the output of ls -lah /tmp every two seconds. After some time, the following appears:

total 163M
drwxrwxrwt  2 root root 4.0K Oct 11 18:25 .
drwxr-xr-x 36 root root 4.0K Oct 11 18:25 ..
-rw-------  1 root root  32M Oct 11 18:25 buffer049356621
-rw-------  1 root root  32M Oct 11 18:25 buffer170668734
-rw-------  1 root root 2.2G Oct 11 18:25 buffer174371183
-rw-------  1 root root  32M Oct 11 18:25 buffer181006997
-rw-------  1 root root  32M Oct 11 18:25 buffer191023236
-rw-------  1 root root  32M Oct 11 18:25 buffer201006853
-rw-------  1 root root  32M Oct 11 18:25 buffer237318031
-rw-------  1 root root  32M Oct 11 18:25 buffer247489981
-rw-------  1 root root  32M Oct 11 18:25 buffer286216890
-rw-------  1 root root  32M Oct 11 18:25 buffer287581995
-rw-------  1 root root  32M Oct 11 18:25 buffer398772464
-rw-------  1 root root  32M Oct 11 18:25 buffer401250772
-rw-------  1 root root  32M Oct 11 18:25 buffer417570888
-rw-------  1 root root  32M Oct 11 18:25 buffer418930726
-rw-------  1 root root  32M Oct 11 18:25 buffer483733369
-rw-------  1 root root  32M Oct 11 18:25 buffer486332408
-rw-------  1 root root  32M Oct 11 18:25 buffer552307975
-rw-------  1 root root  32M Oct 11 18:25 buffer561689665
-rw-------  1 root root  32M Oct 11 18:25 buffer564563946
-rw-------  1 root root  32M Oct 11 18:25 buffer604423889
-rw-------  1 root root 2.2G Oct 11 18:25 buffer628146768
-rw-------  1 root root  32M Oct 11 18:25 buffer652625038
-rw-------  1 root root  32M Oct 11 18:25 buffer657592841
-rw-------  1 root root  32M Oct 11 18:25 buffer677176492
-rw-------  1 root root  32M Oct 11 18:25 buffer682327423
-rw-------  1 root root  32M Oct 11 18:25 buffer698381559
-rw-------  1 root root  32M Oct 11 18:25 buffer775770710
-rw-------  1 root root  32M Oct 11 18:25 buffer778523132
-rw-------  1 root root  32M Oct 11 18:25 buffer781450259
-rw-------  1 root root  32M Oct 11 18:25 buffer800820946
-rw-------  1 root root  32M Oct 11 18:25 buffer907102880
-rw-------  1 root root  32M Oct 11 18:25 buffer907358210
-rw-------  1 root root  32M Oct 11 18:25 buffer910655515
-rw-------  1 root root  32M Oct 11 18:25 buffer979370275
-rw-r--r--  1 root root  189 Oct 11 18:25 dgraph.4eae0e89a4de.root.log.ERROR.20201011-182528.31
-rw-r--r--  1 root root  189 Oct 11 18:25 dgraph.4eae0e89a4de.root.log.INFO.20201011-182528.31
-rw-r--r--  1 root root  189 Oct 11 18:25 dgraph.4eae0e89a4de.root.log.INFO.20201011-182528.8
-rw-r--r--  1 root root  189 Oct 11 18:25 dgraph.4eae0e89a4de.root.log.WARNING.20201011-182528.31
lrwxrwxrwx  1 root root   53 Oct 11 18:25 dgraph.ERROR -> dgraph.4eae0e89a4de.root.log.ERROR.20201011-182528.31
lrwxrwxrwx  1 root root   52 Oct 11 18:25 dgraph.INFO -> dgraph.4eae0e89a4de.root.log.INFO.20201011-182528.31
lrwxrwxrwx  1 root root   55 Oct 11 18:25 dgraph.WARNING -> dgraph.4eae0e89a4de.root.log.WARNING.20201011-182528.31

Those buffer* files should go into /dgraph/tmp, as requested via --tmp /dgraph/tmp. It is also questionable whether the log files belong in /tmp.

When you run that with v20.07.1, there will be no buffer files in /tmp.

To clarify, --tmp defaults to the ./tmp folder (not /tmp). This directory contains temporary files like map_output, shards and split100651177. These files can be cleaned up or retained, depending on the --cleanup_tmp flag.
The buffer* files are always cleaned up eventually, so they do not stay in /tmp. I will look into the changed behavior you mentioned.

There is a separate flag, --log_dir, which specifies the directory for logs. You can also use --logtostderr instead.

You are right, --tmp defaults to tmp in the current directory. But the bulk loader should not have two temp directories. Users need to be able to control where temporary files like buffer* go, even if they are cleaned up eventually. Thanks for looking into this.

These are coming from z.Buffer @Naman

Any insights on this?

Hi @EnricoMi, the bulk loader uses a map-reduce model for data loading. These buffer* files come from the map phase, where we create an mmap-based buffer for the mapper. This change was made as part of recent memory optimizations.
In the doMmap function, we create a buffer backed by a buffer* file. That is why such temporary files show up in /tmp.
Out of curiosity, is there any specific issue with the /tmp directory being used for these temporary files?

Well, what is the purpose of having a --tmp option in the first place? You want full control over where your application is consuming disk space, in this case temporary disk space. In an environment where /tmp does not fit your needs (size, speed, contention, visibility, …) you want to be able to point those files somewhere else.

I personally prefer a limited /tmp, so that it does not kill the machine when it fills up and does not waste space when unused. That leaves the remainder of my disks available to / or /home or /data, and I want the bulk loader to consume space there.

The bulk loader is making a strong assumption that /tmp is always the best location for those buffer* files for everyone.

@EnricoMi Yeah, this looks like a valid point, thanks. A ticket has been created for it. The corresponding PR is "chore(bulk): use --tmp directory for temporary buffers" by NamanJain8 (Pull Request #6833, dgraph-io/dgraph on GitHub).

Thanks a lot, looks good!