Fatal error when live loading data


(Aaron Clay) #1

Trying to dgraph-live load some data to test with:

... (lots of processed data)
[ 6m56s] Txns: 10981 RDFs: 10981000 RDFs/sec: 26396 Aborts: 0                                                                                                                                                                               
[ 6m58s] Txns: 10981 RDFs: 10981000 RDFs/sec: 26270 Aborts: 0                                                                                                                                                                               
[  7m0s] Txns: 10981 RDFs: 10981000 RDFs/sec: 26145 Aborts: 0                                                                                                                                                                               
[  7m2s] Txns: 10981 RDFs: 10981000 RDFs/sec: 26021 Aborts: 0                                                                                                                                                                               
[  7m4s] Txns: 10981 RDFs: 10981000 RDFs/sec: 25898 Aborts: 0                                                                                                                                                                               
[  7m6s] Txns: 10981 RDFs: 10981000 RDFs/sec: 25777 Aborts: 0                                                                                                                                                                               
[  7m8s] Txns: 10981 RDFs: 10981000 RDFs/sec: 25657 Aborts: 0                                                                                                                                                                               
[ 7m10s] Txns: 10981 RDFs: 10981000 RDFs/sec: 25537 Aborts: 0                                                                                                                                                                               
2018/12/19 18:06:16 transport is closing                                                                                                                                                                                                    
github.com/dgraph-io/dgraph/x.Fatalf                                                                                                                                                                                                        
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115                                                                                                                                                                            
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError                                                                                                                                                                                     
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:140                                                                                                                                                              
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).request                                                                                                                                                                               
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:182                                                                                                                                                              
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).makeRequests                                                                                                                                                                          
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:194                                                                                                                                                              
runtime.goexit                                                                                                                                                                                                                              
        /usr/local/go/src/runtime/asm_amd64.s:1333                                                                                                                                                                                          
2018/12/19 18:06:16 transport is closing                                                                                                                                                                                                    
github.com/dgraph-io/dgraph/x.Fatalf                                                                                                                                                                                                        
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115                                                                                                                                                                            
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError                                                                                                                                                                                     
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:140                                                                                                                                                              
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).request                                                                                                                                                                               
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:182                                                                                                                                                              
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).makeRequests                                                                                                                                                                          
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:194                                                                                                                                                              
runtime.goexit                                                                                                                                                                                                                              
        /usr/local/go/src/runtime/asm_amd64.s:1333                                     

Alpha logs:

Zero logs:

This is running the kubernetes ha (3 alphas, 3 zeros, 1 ratel) on m4.2xl AWS instances. Memory seems steady @ 10GB for alpha-0, and I don’t see metrics for the alpha-1 or alpha-2 pods. Earlier it made it about 20 minutes into the job when I realized I ran out of disk space (when looking at logs). The crash had the same error though - “transport is closing”


(Aaron Clay) #2

1st update: thought this might be an idle timeout issue (it is not)

2nd Update:

as writing this, I had another crash (ELB set with a 4000 second idle timeout)

[ 34m0s] Txns: 38741 RDFs: 38741000 RDFs/sec: 18991 Aborts: 0
[ 34m2s] Txns: 38741 RDFs: 38741000 RDFs/sec: 18972 Aborts: 0
[ 34m4s] Txns: 38741 RDFs: 38741000 RDFs/sec: 18954 Aborts: 0
2018/12/19 19:30:09 transport is closing
github.com/dgraph-io/dgraph/x.Fatalf
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:140
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).request
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:182
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).makeRequests
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:194
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1333
2018/12/19 19:30:09 transport is closing
github.com/dgraph-io/dgraph/x.Fatalf
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:140
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).request
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:182
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).makeRequests
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:194
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1333
2018/12/19 19:30:09 transport is closing
github.com/dgraph-io/dgraph/x.Fatalf
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:140
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).request
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:182
github.com/dgraph-io/dgraph/dgraph/cmd/live.(*loader).makeRequests
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/live/batch.go:194
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1333
2018/12/19 19:30:09 transport is closing
github.com/dgraph-io/dgraph/x.Fatalf
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:115
github.com/dgraph-io/dgraph/dgraph/cmd/live.handleError
        /ext-go/1/src/github.com/dgrap

memory at idle (post crash):

|dgraph-alpha-0 (dafdcb77-03f9-11e9-92f7-0a8fc384c01e) | 22%
dgraph-alpha-1 () | 21%|
dgraph-alpha-2 (06edc6c6-03fa-11e9-92f7-0a8fc384c01e) | 54%
dgraph-ratel-7cff746589-shbq6 (db0c06de-03f9-11e9-92f7-0a8fc384c01e) | 0.02%
dgraph-zero-0 (daef5151-03f9-11e9-92f7-0a8fc384c01e) | 1.90%
dgraph-zero-1 (eb68b0d9-03f9-11e9-92f7-0a8fc384c01e) | 0.42%
dgraph-zero-2 (fb19e78e-03f9-11e9-92f7-0a8fc384c01e) | 0.47%

weirdly unbalanced memory is unexpected, not sure if related.


(Daniel Mai) #3

Is there a pod memory limit set for Alpha? It looks like the Alpha is getting out-of-memory killed and requires more memory to run. The OOM kill can be confirmed by looking at the pod event log.

m4.2xlarge instances have 32GB of available memory, so the Alpha shouldn’t be killed if it’s memory usage is consistent at 10GB.


(Aaron Clay) #4

@dmai there is not. (previous post earlier today has the config I used).

Where do you see an OOM error? Or is your suggestion to look at pod errors. (I can try to run the import again, and watch to see if the pod gets killed)

I’m not seeing anything that looks like that around the time of the error.
That being said, the memory at idle seems really imbalanced.Is that expected? I’m hitting the ELB with the dgraph-live binary

During my second crash, memory usage looked like 29GB, but again super unbalanced.


(Aaron Clay) #5

update 3

I forgot to use --previous on kubectl logs and I missed the real one. dgraph-alpha-1 crashed and had these logs:

I1220 03:26:23.729441       1 node.go:83] 2 [logterm: 11, index: 54938, vote: 2] rejected MsgPreVote from 3 [logterm: 11, index: 54927] at term 11
I1220 03:26:26.008364       1 node.go:83] 2 [logterm: 11, index: 54949, vote: 2] rejected MsgPreVote from 3 [logterm: 11, index: 54927] at term 11
I1220 03:26:28.251600       1 draft.go:323] Creating snapshot at index: 54912. ReadTs: 95543.
I1220 03:26:45.217459       1 predicate.go:186] Got StreamSnapshot request: context:<id:1 group:1 addr:"dgraph-alpha-0.dgraph-alpha.dgraph.svc.cluster.local:7080" > index:51385 read_ts:89199 
I1220 03:26:46.263479       1 stream.go:240] Sending SNAPSHOT Time elapsed: 01s, bytes sent: 4.3 MB, speed: 4.3 MB/sec
I1220 03:26:56.498622       1 stream.go:240] Sending SNAPSHOT Time elapsed: 11s, bytes sent: 90 MB, speed: 8.2 MB/sec
I1220 03:27:02.762681       1 stream.go:240] Sending SNAPSHOT Time elapsed: 17s, bytes sent: 180 MB, speed: 11 MB/sec
I1220 03:27:20.403172       1 stream.go:240] Sending SNAPSHOT Time elapsed: 35s, bytes sent: 347 MB, speed: 9.9 MB/sec
I1220 03:27:26.250321       1 stream.go:240] Sending SNAPSHOT Time elapsed: 40s, bytes sent: 377 MB, speed: 9.4 MB/sec
I1220 03:27:26.717617       1 stream.go:240] Sending SNAPSHOT Time elapsed: 41s, bytes sent: 459 MB, speed: 11 MB/sec
I1220 03:27:28.614460       1 stream.go:240] Sending SNAPSHOT Time elapsed: 43s, bytes sent: 463 MB, speed: 11 MB/sec
I1220 03:27:34.355329       1 stream.go:240] Sending SNAPSHOT Time elapsed: 49s, bytes sent: 544 MB, speed: 11 MB/sec
I1220 03:27:50.844276       1 stream.go:240] Sending SNAPSHOT Time elapsed: 01m05s, bytes sent: 676 MB, speed: 10 MB/sec
I1220 03:27:52.982660       1 stream.go:240] Sending SNAPSHOT Time elapsed: 01m07s, bytes sent: 755 MB, speed: 11 MB/sec
I1220 03:28:07.905067       1 stream.go:240] Sending SNAPSHOT Time elapsed: 01m22s, bytes sent: 901 MB, speed: 11 MB/sec
I1220 03:28:08.511391       1 stream.go:240] Sending SNAPSHOT Time elapsed: 01m23s, bytes sent: 982 MB, speed: 12 MB/sec
I1220 03:28:22.506652       1 stream.go:240] Rolling up Time elapsed: 01s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:23.503283       1 stream.go:240] Rolling up Time elapsed: 02s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:24.502313       1 stream.go:240] Rolling up Time elapsed: 03s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:25.502341       1 stream.go:240] Rolling up Time elapsed: 04s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:26.502365       1 stream.go:240] Rolling up Time elapsed: 05s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:27.502365       1 stream.go:240] Rolling up Time elapsed: 06s, bytes sent: 0 B, speed: 0 B/sec
I1220 03:28:28.502621       1 stream.go:240] Rolling up Time elapsed: 07s, bytes sent: 4.3 MB, speed: 612 kB/sec
I1220 03:28:32.425304       1 stream.go:240] Rolling up Time elapsed: 10s, bytes sent: 25 MB, speed: 2.5 MB/sec
I1220 03:28:47.357539       1 stream.go:240] Rolling up Time elapsed: 25s, bytes sent: 63 MB, speed: 2.5 MB/sec
I1220 03:29:06.561122       1 stream.go:240] Sending SNAPSHOT Time elapsed: 02m21s, bytes sent: 1.5 GB, speed: 10 MB/sec
I1220 03:29:15.465833       1 stream.go:240] Rolling up Time elapsed: 53s, bytes sent: 145 MB, speed: 2.7 MB/sec
I1220 03:29:27.681793       1 stream.go:240] Sending SNAPSHOT Time elapsed: 02m42s, bytes sent: 1.6 GB, speed: 9.8 MB/sec
I1220 03:29:36.008838       1 stream.go:240] Sending SNAPSHOT Time elapsed: 02m50s, bytes sent: 1.6 GB, speed: 9.4 MB/sec
I1220 03:29:36.665754       1 stream.go:240] Sending SNAPSHOT Time elapsed: 02m51s, bytes sent: 1.6 GB, speed: 9.4 MB/sec
I1220 03:29:40.857642       1 stream.go:240] Sending SNAPSHOT Time elapsed: 02m53s, bytes sent: 1.6 GB, speed: 9.5 MB/sec
E1220 03:29:43.364550       1 pool.go:206] Echo error from dgraph-zero-2.dgraph-zero.dgraph.svc.cluster.local:5080. Err: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W1220 03:29:43.364550       1 node.go:347] No healthy connection to node Id: 1 addr: [dgraph-alpha-0.dgraph-alpha.dgraph.svc.cluster.local:7080], err: Unhealthy connection
E1220 03:29:44.712276       1 pool.go:206] Echo error from dgraph-zero-0.dgraph-zero.dgraph.svc.cluster.local:5080. Err: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1220 03:29:44.712276       1 pool.go:206] Echo error from dgraph-alpha-0.dgraph-alpha.dgraph.svc.cluster.local:7080. Err: rpc error: code = DeadlineExceeded desc = context deadline exceeded```

I suspect this is the cause of the batch job failure.

(Daniel Mai) #6

Which Alpha is the live loader connecting to? Perhaps you can share the live loader command you’re using to make it clear.

If Kubernetes is indeed killing the Alpha because of memory issues you should see an OOMKilled event when using kubectl describe pod <alpha-pod>.


(Aaron Clay) #7
dgraph live -r test.rdf.gz -d 'redacted.us-west-2.elb.amazonaws.com:9080' -z 'redacted.us-west-2.elb.amazonaws.com:5080

While investigating with a colleague, we see that we had an error code 137 on the container, which seems to be an OOM event killing the container (alpha-1 which is above). We also see that container idling at 10GB RAM usage. The total dataset is ~1.5Gb cleartext.

edit: we didn’t have any events, but we see it was exited with code 137, which google tells me is OOM. Not sure why we don’t have an event for it, unless we’re skipping it over.

 lastState:
      terminated:
        containerID: docker://6619bbe79e174d91d7a7c97fad25f1580e1bacecedc76fd0d2718872ea1b7639
        exitCode: 137
        finishedAt: 2018-12-20T03:30:24Z
        reason: Error
        startedAt: 2018-12-20T01:53:19Z

(Daniel Mai) #8

During live loading Dgraph would certainly require more memory than if it was not serving any requests. And the Alpha that is taking the incoming live loading requests would be busier than the others.

Depending on your schema and indices used the memory requirements can be more than expected. I suggest either trying with a higher memory size or using the bulk loader for the initial data loading.