Dgraphassigner doesn't work on a Mac

If I run the following command with an empty data directory:

$ dgraphassigner --numInstances 1 --instanceIdx 0 --numCpu 1 --rdfgzips benchmarks/data/actor-director.gz --uids data/u 

After a while the process crashes with the following error. The numCpu setting doesn’t seem to matter.

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fff9615bf06 m=2

This happens because dgraphassigner runs out of threads: it crashes once it has created 2048 threads.

On a Mac:

$ sysctl kern.num_taskthreads
kern.num_taskthreads: 2048

According to the man page of sysctl this value cannot be changed.

A couple of questions:

  1. Why do we need so many threads? Can the threads be reclaimed when their work is done?
  2. Is dgraph supported on a Mac, or should I try running it on AWS instead?

You could try decreasing the number of cores used. See --help.

I ran it with -numCpu 1. It still fails.

Can you add the logs here? How long after start does it fail? Also, which version of Go are you on?

Try 2 things:

  1. https://wiki.dgraph.io/Beginners_Guide#Dgraph_loader_fails_with_.27too_many_open_files.27_error
  2. Try decreasing the number of goroutines here:
    https://github.com/dgraph-io/dgraph/blob/master/loader/loader.go#L273

Hi @kostub, it seems to work fine on my Macbook. Here are some of my outputs.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

$ sysctl kern.num_taskthreads
kern.num_taskthreads: 2048

$ go version
go version go1.7 darwin/amd64

$ cat run.sh 
set -e

rm -Rf u

dgraphassigner --numInstances 1 --instanceIdx 0 --rdfgzips rdf-films.gz --uids u --numCpu=1

dgraphloader --numInstances 1 --instanceIdx 0 --rdfgzips rdf-films.gz --uids u --postings p --numCpu=1

Worked fine for me too without playing around with any kernel values.

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                7168

$ sysctl kern.num_taskthreads
kern.num_taskthreads: 2048

$ sysctl kern.num_threads
kern.num_threads: 10240

$ go version
go version go1.7 darwin/amd64

$ rm -rf ~/dgraph/u
$ ./dgraphassigner --numInstances 1 --instanceIdx 0 --rdfgzips ~/work/src/github.com/dgraph-io/benchmarks/data/actor-director.gz --uids ~/dgraph/u --numCpu=1

You might want to build the latest binary from the code inside cmd/dgraphassigner on master because that has more accurate memory usage info.

It does take a while (more than an hour) to finish on my machine (4 GB of RAM and 1 processor), so you might be better off using an EC2 instance if your machine isn’t very powerful, @kostub.

Awesome, guys. Please also remember to update the wiki page with these values so our users can check theirs when troubleshooting.

My settings are the same

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                7168

$ go version
go version go1.7 darwin/amd64

$ sysctl kern.num_taskthreads 
kern.num_taskthreads: 2048

$ sysctl kern.num_threads 
kern.num_threads: 10240

$ dgraphassigner --version
Dgraph version 0.4.3

I just built the latest version using go get -v github.com/dgraph-io/dgraph/.... I assume that builds master?

Here is the log output when I run it: https://www.dropbox.com/s/jepzfmikcssfi80/debugassignerlog.txt?dl=0

Yes, that should fetch master and build binaries.

You don’t need to run go get every time to build master, though. You can just run

$ cd $GOPATH/src/github.com/dgraph-io/dgraph
$ cd cmd/dgraphassigner
$ go install

to install the dgraphassigner binary.

While dgraphassigner was running, I ran a script to count its threads. (Note: the count is off by 1, because ps prints a header line.)

$ while true
do
# 91953 is the PID of the running dgraphassigner; ps M prints one line per thread
echo threads in dgraphassigner $(ps M 91953 | wc -l)
sleep 1
done

This is the output


threads in dgraphassigner 18
threads in dgraphassigner 33
threads in dgraphassigner 59
threads in dgraphassigner 105
threads in dgraphassigner 286
threads in dgraphassigner 726
threads in dgraphassigner 1562
threads in dgraphassigner 2049
threads in dgraphassigner 2049
threads in dgraphassigner 2049
threads in dgraphassigner 1

The count stays at 18 for quite a while, then climbs rapidly, and the process crashes as soon as it reaches 2048 threads.
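For anyone who wants to watch the same number from inside a Go process instead of polling ps, here is a minimal sketch. It assumes the standard runtime/pprof "threadcreate" profile is a good enough proxy: it counts OS threads the Go runtime has created, which may not match what ps reports exactly. threadCount is a hypothetical helper and not part of dgraph.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"time"
)

// threadCount reports how many OS threads the Go runtime has created so far,
// according to the built-in "threadcreate" profile.
// Hypothetical helper for illustration; dgraph does not ship this.
func threadCount() int {
	return pprof.Lookup("threadcreate").Count()
}

func main() {
	// Sample a few times, one second apart, like the shell loop does.
	for i := 0; i < 3; i++ {
		fmt.Println("threads created by the Go runtime:", threadCount())
		time.Sleep(time.Second)
	}
}
```

In a real program the sampling loop would run in its own goroutine next to the actual work.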

I changed the number of goroutines here
https://github.com/dgraph-io/dgraph/blob/master/loader/loader.go#L273
and
https://github.com/dgraph-io/dgraph/blob/master/loader/loader.go#L316

to 1000 but that did not help either.

Here are details about my Mac:

Model Name: Mac mini
Model Identifier: Macmini6,1
Processor Name: Intel Core i5
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 16 GB

System Version: OS X 10.11.6 (15G31)
Kernel Version: Darwin 15.6.0
Boot Volume: Macintosh HD
Boot Mode: Normal

I figured out why this is a problem and temporarily resolved my issue by reducing the number of goroutines in loader.go#L316 to 1000. Ignore the previous message that said it didn’t work (maybe I didn’t build it correctly).

Now I can run it with -numCpu 4 and it still does not crash.

The reason this happens is that all the goroutines are blocked in a cgo call to rocksdb. Here is a stack trace of a blocked goroutine:

goroutine 38 [runnable, locked to thread]:
github.com/dgraph-io/dgraph/vendor/github.com/tecbot/gorocksdb._Cfunc_rocksdb_get(0x5a42d30, 0x4f1fbe0, 0xc482119420, 0xf, 0xc482119470, 0xc48209ecc8, 0x0)
        github.com/dgraph-io/dgraph/vendor/github.com/tecbot/gorocksdb/_obj/_cgo_gotypes.go:1059 +0x4e
github.com/dgraph-io/dgraph/vendor/github.com/tecbot/gorocksdb.(*DB).Get(0xc42010d420, 0xc42002c1a0, 0xc482119420, 0xf, 0x10, 0x0, 0x0, 0x0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/vendor/github.com/tecbot/gorocksdb/db.go:224 +0x28e
github.com/dgraph-io/dgraph/store.(*Store).Get(0xc4201504b0, 0xc482119420, 0xf, 0x10, 0x0, 0x0, 0xc45bae6000, 0x40ef1d0, 0xc482128dd0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/store/store.go:63 +0x73
github.com/dgraph-io/dgraph/posting.(*List).getPostingList(0xc482128dd0, 0x43cc530)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/posting/list.go:256 +0xd3
github.com/dgraph-io/dgraph/posting.(*List).init(0xc482128dd0, 0xc482119420, 0xf, 0x10, 0xc4201504b0, 0x0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/posting/list.go:222 +0x142
github.com/dgraph-io/dgraph/posting.GetOrCreate(0xc482119420, 0xf, 0x10, 0xc4201504b0, 0x0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/posting/lists.go:274 +0x1f3
github.com/dgraph-io/dgraph/uid.GetOrAssign(0xc47282a425, 0x9, 0x0, 0x1, 0x8573ec69b78b3600, 0x0, 0x0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/uid/assigner.go:206 +0xbb
github.com/dgraph-io/dgraph/loader.(*state).assignUid(0xc420013540, 0xc47282a425, 0x9, 0x4c92563671bd5246, 0x9)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/loader/loader.go:197 +0x86
github.com/dgraph-io/dgraph/loader.(*state).assignUidsOnly(0xc420013540, 0xc420118ad0)
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/loader/loader.go:233 +0x2d9
created by github.com/dgraph-io/dgraph/loader.AssignUids
        /Users/kostub/Work/go/src/github.com/dgraph-io/dgraph/loader/loader.go:317 +0x27e

When a goroutine is blocked in a cgo call, the Go scheduler creates new threads to run the remaining goroutines. See the discussion at https://groups.google.com/forum/#!topic/golang-nuts/8gszDBRZh_4

A good explanation can be found at: https://www.cockroachlabs.com/blog/the-cost-and-complexity-of-cgo/

Most likely, the disk on my Mac isn’t fast enough (it’s not an SSD) for the rocksdb calls to return in time. As the blocked calls pile up, Go ends up creating as many threads as there are goroutines, and so goes over the thread limit.

A simple solution is to lower the maximum number of goroutines like I did. However, this is not a good solution, as we still end up creating thousands of threads. A better way might be to avoid cgo altogether and either write a native Go storage layer or use a separate C process to write to rocksdb.
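As a rough sketch of the "limit the concurrency" idea (this is not dgraph's loader code): a buffered channel can act as a semaphore, so many goroutines can exist while only a bounded number are inside the blocking call at once. Goroutines waiting on the channel are parked by the scheduler and do not hold an OS thread. runBounded, the limit of 100, and the sleep that stands in for a RocksDB read are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded calls work(i) for every i in [0, n), one goroutine per call, but
// lets at most limit calls run at the same time. It returns the highest number
// of simultaneous calls it observed. Hypothetical helper, not part of dgraph.
func runBounded(n, limit int, work func(int)) int64 {
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; parks the goroutine if full
			defer func() { <-sem }() // release the slot when done
			cur := atomic.AddInt64(&inFlight, 1)
			for { // record the peak number of in-flight calls
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			work(i)
			atomic.AddInt64(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	// blockingGet stands in for a slow cgo call such as a RocksDB read.
	blockingGet := func(int) { time.Sleep(time.Millisecond) }
	peak := runBounded(5000, 100, blockingGet)
	fmt.Println("peak concurrent calls:", peak)
}
```

With a real cgo call in place of the sleep, the number of threads stuck in C code should stay near the limit rather than near the number of goroutines.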

Nice analysis! This makes a lot of sense. Can you please add this to the troubleshooting section of the Beginners Guide? In fact, we could expose the maximum number of goroutines as a flag. Maybe create a PR for that.
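A minimal sketch of what such a flag could look like, using the standard flag package. The flag name maxroutines, its default of 1000, and the processAll helper are assumptions for illustration. None of them is an existing dgraph flag or function.

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// processAll runs handle over every item using a fixed pool of worker
// goroutines. Values below 1 are treated as 1.
// Hypothetical helper, not dgraph's loader.
func processAll(items []string, workers int, handle func(string)) {
	if workers < 1 {
		workers = 1
	}
	ch := make(chan string)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range ch {
				handle(item)
			}
		}()
	}
	for _, item := range items {
		ch <- item
	}
	close(ch) // lets the workers exit once the channel is drained
	wg.Wait()
}

func main() {
	// Hypothetical flag name; shown only to illustrate the idea.
	maxRoutines := flag.Int("maxroutines", 1000, "maximum number of worker goroutines")
	flag.Parse()

	var mu sync.Mutex
	seen := 0
	processAll([]string{"a", "b", "c"}, *maxRoutines, func(string) {
		mu.Lock()
		seen++
		mu.Unlock()
	})
	fmt.Println("processed", seen, "items with up to", *maxRoutines, "goroutines")
}
```

Someone on a machine with a low thread limit could then run it with something like -maxroutines 100.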

We’re stuck with Cgo until someone writes RocksDB in Go. RocksDB is very efficient, though, and has been worked on by both Google and Facebook engineers over many years. It’s a solid piece of engineering, and I doubt anything else comes close. Here’s my take on a similar request:

What about using a separate process to read/write to rocksdb and then communicate with that process from dgraph using rpc/http?

I don’t know if this problem occurs outside of assigner/loader, but if it does occur under high load then it might be worth considering.
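To make the proposal concrete, here is a rough sketch using Go's standard net/rpc package. It runs server and client in one process over loopback for brevity, and a map stands in for RocksDB. The KV type, the serve and lookup helpers, and the key format are assumptions for illustration and say nothing about how dgraph would actually structure this.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// KV is a hypothetical storage service; the map stands in for RocksDB.
type KV struct {
	data map[string]string
}

// Get has the method shape net/rpc expects: an argument, a pointer reply,
// and an error result. A missing key yields an empty string.
func (kv *KV) Get(key string, value *string) error {
	*value = kv.data[key]
	return nil
}

// serve registers kv on a loopback listener and returns its address.
func serve(kv *KV) (string, error) {
	srv := rpc.NewServer()
	if err := srv.Register(kv); err != nil {
		return "", err
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: let the OS pick one
	if err != nil {
		return "", err
	}
	go srv.Accept(ln) // serve connections in the background
	return ln.Addr().String(), nil
}

// lookup dials the storage service and fetches a single key.
func lookup(addr, key string) (string, error) {
	client, err := rpc.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer client.Close()
	var value string
	err = client.Call("KV.Get", key, &value)
	return value, err
}

func main() {
	addr, err := serve(&KV{data: map[string]string{"uid:alice": "0x1"}})
	if err != nil {
		log.Fatal(err)
	}
	v, err := lookup(addr, "uid:alice")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("uid:alice =", v)
}
```

Every read here pays for a network round trip plus encoding and decoding, which is the cost to weigh against staying in-process with cgo.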

That’d be inefficient. Interprocess communication is slow. Cgo isn’t Go; that’s a known thing. The problem only occurs in the loader, and as @pawan and @jchiu tested, it doesn’t happen on newer systems. I think just having a flag that lets people decrease the number of goroutines is a sufficient solution here, with the fewest side effects.
