Alpha crashes when loading data

Hello,

I am currently testing out Dgraph for a small project involving customer data. Everything is fine when loading small amounts of data:

e.g. one table has 19k rows (~38k nodes) and loads fine;
another table has 52k rows (~52k nodes) and also loads fine.

I noticed, however, that when loading slightly larger tables, alpha crashes at around 180k rows and doesn't recover; I have to start alpha again and restart my data load. I tried chunking it into loads of 10k rows each, but it crashes at around the same amount of data, ~180k rows (~180k nodes). The whole table has about 1.8M rows, so it's not even scratching the surface, and I have far bigger tables to load (~300M rows in total). I'm not sure what's causing the crash. Scanning the stdout, there are lines mentioning a mem flush around the time it crashed.

Is this due to a configuration issue? I'm running a single zero / single alpha on Windows (yes, I know, why am I on Windows: existing servers, and company devops is stingy and won't give me a few Linux instances). I forgot to capture the error message; I'll be sure to capture it next time.

After I restart alpha, I get random errors scattered throughout the stdout: "applying proposal. error cannot retrieve posting for uid <> from list with key". Is this something to be worried about?

I suspect you will ask me to load it via the offline method, but it's normal for us to have 200k new rows of data, at minimum, per day. It would not be feasible for me to stop and start Dgraph just so I can use the offline loader.

Please help.

Thanks
Enrico

Edit: I forgot to mention I'm on v20.03.0.

Hi @eleon00, welcome to the Dgraph community.

It would be really helpful if you could provide the logs as and when you get them. I assume you are using the live loader. You can look at the troubleshooting guide in the meantime; the exact logs would help us pin this down.

Hello Naman,

As experienced earlier, alpha crashed at more or less the same number of rows. The following are alpha's last few lines; I'm trying to get the full stderr/stdout. Any help towards solving this would be appreciated!

Thanks

I0701 14:24:14.636190 11180 oracle.go:209] ProcessDelta: Max Assigned: 4746566
I0701 14:24:14.636190 11180 oracle.go:210] ProcessDelta: Group checksum: map[1:11902495650763678008]
I0701 14:24:14.640188 11180 server.go:135] Received ALTER op: schema:"\n\t\tcontact.email: string @index(exact) .\n\t\tcontact.phone: string @index(exact) .\t\t\n\t\tcontact.ekey: string .\n\t\tcontact.pkey: string .\n\t\ttype Contact {\n\t\t\tcontact.email\n\t\t\tcontact.phone\n\t\t\tcontact.ekey\n\t\t\tcontact.pkey\n\t\t}\n\t"
I0701 14:24:14.640188 11180 server.go:1122] Got Alter request from "127.0.0.1:55433"
I0701 14:24:14.641190 11180 groups.go:968] Batched 1 updates. Max Assigned: 4746568. Proposing Deltas:
I0701 14:24:14.641190 11180 groups.go:974] Committed: 4746566 → 4746567
I0701 14:24:14.641190 11180 server.go:275] Got schema: &{Preds:[predicate:"contact.email" value_type:STRING directive:INDEX tokenizer:"exact" predicate:"contact.phone" value_type:STRING directive:INDEX tokenizer:"exact" predicate:"contact.ekey" value_type:STRING predicate:"contact.pkey" value_type:STRING ] Types:[type_name:"Contact" fields:<predicate:"contact.email" > fields:<predicate:"contact.phone" > fields:<predicate:"contact.ekey" > fields:<predicate:"contact.pkey" > ]}
I0701 14:24:14.643189 11180 oracle.go:209] ProcessDelta: Max Assigned: 4746568
I0701 14:24:14.643189 11180 oracle.go:210] ProcessDelta: Group checksum: map[1:11902495650763678008]
I0701 14:24:14.643189 11180 oracle.go:215] ProcessDelta Committed: 4746566 → 4746567
I0701 14:24:14.648188 11180 draft.go:104] Operation completed with id: opRollup
panic: close of closed channel

goroutine 306 [running]:
github.com/dgraph-io/badger/v2/y.(*Closer).Signal(...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/y/y.go:205
github.com/dgraph-io/badger/v2/y.(*Closer).SignalAndWait(...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/y/y.go:232
github.com/dgraph-io/dgraph/worker.(*node).startTask(0xc0000d2150, 0x3, 0x0, 0x0, 0x0)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:125 +0x1ca
github.com/dgraph-io/dgraph/worker.runSchemaMutation(0x1886ca0, 0xc03ab43300, 0xc0718d22e0, 0x4, 0x4, 0x486d48, 0x0, 0x0)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:164 +0x103
github.com/dgraph-io/dgraph/worker.(*node).applyMutations(0xc0000d2150, 0x1886ca0, 0xc03ab43300, 0xc03a9010e0, 0x0, 0x0)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:315 +0xe63
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted(0xc0000d2150, 0xc03a9010e0, 0x210ea20, 0x17)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:458 +0xe72
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1(0xc03ab38218, 0x1, 0x1)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:587 +0x19c
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh(0xc0000d2150)
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:628 +0x24d
created by github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1547 +0x4c4
[Sentry] 2020/07/01 14:24:15 ModuleIntegration wasn't able to extract modules: module integration failed
[Sentry] 2020/07/01 14:24:15 Sending fatal event [04c9977a6b1f45dc96ac1f9fd0a6d738] to sentry.io project: 1805390
[Sentry] 2020/07/01 14:24:15 Buffer flushed successfully.

Edit: the alpha command:

dgraph alpha --my=localhost:7080 --lru_mb=4096 --zero=localhost:5080 --logtostderr -v=3 --acl_secret_file ./acl/hmac-secret

It is running on a Windows AWS EC2 i3.2xlarge with 61 GB of RAM (I'm currently sharing it with some other application).

We have seen this issue earlier as well. We have a ticket open for it, and it is being investigated. @ibrahim, FYI.

Would you be willing to share your sample data and the steps that can reliably reproduce this?
Also, how are you loading this into Dgraph? curl, dgo, Ratel, or something else?

And I believe you are on v20.03.0 and, as you mentioned, "running on windows aws ec2 i3.2xlarge with 61gb of ram (im currently sharing it with some other application)".

@eleon00 The crash you're seeing was fixed in a newer Dgraph release by Fix panic in Task FrameWork, fixes #5034 (#5081) · dgraph-io/dgraph@8482527 · GitHub

This is the original issue: Task management framework tries to stop a task twice (on master) · Issue #5034 · dgraph-io/dgraph · GitHub

Hey Paras,

Unfortunately, it's customer email data and I'm constrained by privacy laws, so I won't be able to share it. I am loading through dgo.

At a high level, it starts with a query to the SQL database and goes row by row. It first upserts the email address to get the uid, and then updates the node with other details (scalars) or edges to other nodes.

What it was doing here was upserting the email addresses before it crashed. As it reaches the 10k-row threshold, it collates the uids and updates each email node with the other relevant data. Once it's finished with the update, it drops the Dgraph and SQL connections and creates new ones for the next 10k rows. It repeats until the max keys in SQL and in Dgraph match.
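For reference, the upsert step looks roughly like the sketch below. This is a simplified, self-contained version: the predicate names come from my schema above, but the address, function names, sample email, and error handling are placeholders, and the real code also does the ACL login, the SQL reads, and the 10k batching.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

// upsertEmail looks up a Contact node by contact.email and creates it if it
// does not exist yet, committing the transaction before returning.
func upsertEmail(ctx context.Context, dg *dgo.Dgraph, email string) error {
	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	req := &api.Request{
		// Find an existing node with this email, if any, and bind its uid to v.
		Query: fmt.Sprintf(`query {
  q(func: eq(contact.email, %q)) {
    v as uid
  }
}`, email),
		// uid(v) reuses the matched node; if v is empty, a new node is created.
		Mutations: []*api.Mutation{{
			SetNquads: []byte(fmt.Sprintf(`
				uid(v) <contact.email> %q .
				uid(v) <dgraph.type> "Contact" .`, email)),
		}},
		CommitNow: true,
	}
	_, err := txn.Do(ctx, req)
	return err
}

func main() {
	// Placeholder address; the real loader also calls dg.Login for the ACL setup.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	if err := upsertEmail(context.Background(), dg, "someone@example.com"); err != nil {
		log.Fatal(err)
	}
}

The uids returned by these upserts are what the later pass uses to attach the remaining scalar values and edges.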

There are about 1.8 million emails to be stored. It crashes at 180k.

Thank you for looking into it!

Hello Ibrahim,

I’ll check out the new release and update here.

Thank you!

Thanks @ibrahim

@eleon00 Please upgrade to v20.03.3.

(Another related fix was made in git commit dont set n.ops map entries to nil. Instead just delete them (#5551) · dgraph-io/dgraph@b05c525 · GitHub.)