Update schema error

Hi, I want to confirm: is the timeout caused by the node holding too much data? And if so, why does the node crash outright?

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version

Have you tried reproducing the issue with the latest release?

no

What is the hardware spec (RAM, OS)?

K8s

Steps to reproduce the issue (command/config used to run Dgraph).

After importing 15 million nodes, update the schema via Ratel.

Expected behaviour and actual result.

Expected the schema update to succeed; instead, the Alpha node crashed.

I0317 08:40:04.432980      19 log.go:34] 2 became pre-candidate at term 18
I0317 08:40:04.432985      19 log.go:34] 2 received MsgPreVoteResp from 2 at term 18
I0317 08:40:04.433000      19 log.go:34] 2 [logterm: 18, index: 271378] sent MsgPreVote request to 1 at term 18
I0317 08:40:04.433009      19 log.go:34] 2 [logterm: 18, index: 271378] sent MsgPreVote request to 3 at term 18
I0317 08:41:45.148732      19 log.go:34] Block cache metrics: hit: 213338 miss: 245032 keys-added: 200574 keys-updated: 73 keys-evicted: 41616 cost-added: 880891165 cost-evicted: 182959635 sets-dropped: 0 sets-rejected: 29635 gets-dropped: 256 gets-kept: 361984 gets-total: 458370 hit-ratio: 0.47
I0317 08:41:45.149886      19 run.go:744] Caught Ctrl-C. Terminating now (this may take a few seconds)...
E0317 08:41:45.149952      19 run.go:393] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
badger 2021/03/17 08:41:45 WARNING: Block cache might be too small. Metrics: hit: 7 miss: 248698 keys-added: 503 keys-updated: 30 keys-evicted: 491 cost-added: 9181836070 cost-evicted: 8914029757 sets-dropped: 0 sets-rejected: 15452 gets-dropped: 0 gets-kept: 242560 gets-total: 248705 hit-ratio: 0.00
W0317 08:41:45.150319      19 groups.go:869] No membership update for 10s. Closing connection to Zero.
I0317 08:41:45.150337      19 groups.go:807] Got address of a Zero leader: dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
badger 2021/03/17 08:41:45 WARNING: Cache life expectancy (in seconds): 
 -- Histogram: 
Min value: 0 
Max value: 3 
Mean: 0.02 
Count: 491 
[0 B, 2 B) 488 99.39% 
[2 B, 4 B) 3 0.61% 
 --

I0317 08:41:45.150475      19 groups.go:821] Starting a new membership stream receive from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080.
I0317 08:41:45.150582      19 log.go:34] 2 became follower at term 18
I0317 08:41:45.150596      19 log.go:34] raft.node: 2 elected leader 3 at term 18
E0317 08:41:45.151009      19 server.go:24] error from cmux serve: accept tcp [::]:8080: use of closed network connection
I0317 08:41:45.151040      19 run.go:788] GRPC and HTTP stopped.
I0317 08:41:45.151108      19 worker.go:120] Stopping group...
E0317 08:41:45.151199      19 groups.go:829] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0317 08:41:45.151333      19 groups.go:773] While sending membership update: rpc error: code = Canceled desc = context canceled
I0317 08:41:45.151353      19 groups.go:885] Closing processOracleDeltaStream
I0317 08:41:45.151359      19 groups.go:786] Closing receiveMembershipUpdates
E0317 08:41:45.151339      19 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": while receiving from stream: rpc error: code = Canceled desc = context canceled. closer err: context canceled
E0317 08:41:45.151039      19 server.go:66] Stopped taking more http(s) requests. Err: mux: listener closed
I0317 08:41:45.151410      19 groups.go:1112] SubscribeForUpdates closing for prefix: "\x00\x00\vdgraph.cors\x00"
E0317 08:41:45.151604      19 groups.go:773] While sending membership update: rpc error: code = Canceled desc = context canceled
I0317 08:41:45.151623      19 groups.go:739] Closing sendMembershipUpdates
I0317 08:41:45.151632      19 worker.go:124] Updating RAFT state before shutting down...
I0317 08:41:45.151718      19 worker.go:129] Stopping node...
I0317 08:41:45.151745      19 draft.go:1021] Stopping node.Run
W0317 08:41:45.151808      19 raft_server.go:239] Error while raft.Step from 0x3: raft: stopped. Closing RaftMessage stream.
I0317 08:41:45.151812      19 draft.go:1095] Raft node done.
I0317 08:41:45.651594      19 server.go:70] All http(s) requests finished.
badger 2021/03/17 08:42:40 WARNING: Block cache might be too small. Metrics: hit: 7 miss: 265674 keys-added: 525 keys-updated: 30 keys-evicted: 514 cost-added: 9505201004 cost-evicted: 9244666483 sets-dropped: 0 sets-rejected: 19726 gets-dropped: 0 gets-kept: 259456 gets-total: 265681 hit-ratio: 0.00
badger 2021/03/17 08:42:40 WARNING: Cache life expectancy (in seconds): 
 -- Histogram: 
Min value: 0 
Max value: 106 
Mean: 0.83 
Count: 514 
[0 B, 2 B) 507 98.64% 
[2 B, 4 B) 3 0.58% 
[64 B, 128 B) 4 0.78% 
 --

badger 2021/03/17 08:43:34 WARNING: Block cache might be too small. Metrics: hit: 7 miss: 292225 keys-added: 623 keys-updated: 30 keys-evicted: 612 cost-added: 10671137786 cost-evicted: 10409863005 sets-dropped: 0 sets-rejected: 26408 gets-dropped: 0 gets-kept: 285504 gets-total: 292233 hit-ratio: 0.00
badger 2021/03/17 08:43:34 WARNING: Cache life expectancy (in seconds): 
 -- Histogram: 
Min value: 0 
Max value: 162 
Mean: 2.01 
Count: 612 
[0 B, 2 B) 600 98.04% 
[2 B, 4 B) 3 0.49% 
[64 B, 128 B) 4 0.65% 
[128 B, 256 B) 5 0.82% 
 --

W0317 08:43:34.533296      19 raft_server.go:239] Error while raft.Step from 0x3: raft: stopped. Closing RaftMessage stream.
badger 2021/03/17 08:44:50 WARNING: Block cache might be too small. Metrics: hit: 7 miss: 333738 keys-added: 715 keys-updated: 30 keys-evicted: 700 cost-added: 11137024532 cost-evicted: 10870634281 sets-dropped: 0 sets-rejected: 29451 gets-dropped: 0 gets-kept: 326912 gets-total: 333764 hit-ratio: 0.00
badger 2021/03/17 08:44:50 WARNING: Cache life expectancy (in seconds): 
 -- Histogram: 
Min value: 0 
Max value: 162 
Mean: 2.03 
Count: 700 
[0 B, 2 B) 685 97.86% 
[2 B, 4 B) 3 0.43% 
[32 B, 64 B) 2 0.29% 
[64 B, 128 B) 5 0.71% 
[128 B, 256 B) 5 0.71% 
 --

badger 2021/03/17 08:45:30 WARNING: Block cache might be too small. Metrics: hit: 7 miss: 336908 keys-added: 715 keys-updated: 30 keys-evicted: 700 cost-added: 11137024532 cost-evicted: 10870634281 sets-dropped: 0 sets-rejected: 29451 gets-dropped: 0 gets-kept: 329728 gets-total: 336915 hit-ratio: 0.00
badger 2021/03/17 08:45:30 WARNING: Cache life expectancy (in seconds): 
 -- Histogram: 
Min value: 0 
Max value: 162 
Mean: 2.03 
Count: 700 
[0 B, 2 B) 685 97.86% 
[2 B, 4 B) 3 0.43% 
[32 B, 64 B) 2 0.29% 
[64 B, 128 B) 5 0.71% 
[128 B, 256 B) 5 0.71%

Hard to tell from your logs. It looks like you don’t have enough resources to handle 15 million nodes at once.

At present, K8s is configured with 8 CPUs and 16 GB of memory. Is that too little? Which logs would help locate the problem? I will look again.

It depends. If those 16 GB are divided across pods, probably yes. For 21 million nodes I’d say you need 22 GB of RAM in the pod receiving the load.

You can try to “balance” the load by adding all the Alpha addresses to the live load config. The problem is sending the whole load to a single Alpha instance; it can’t handle it with, e.g., 5.33 GB of RAM (assuming the 16 GB is split across 3 Alphas).

Yes, I do have insufficient memory: after the node went down, only two nodes restarted successfully, and the last one failed to start due to insufficient resources. Do I need 22 GB of RAM for each pod, or 22 GB total across the three pods? And is there any recommendation for the number of nodes and the corresponding resource configuration?

Only in the main one. But if you add the list of Alphas/pods to the live load, this error will happen far less often.

e.g

dgraph live (...) --alpha "127.0.0.1:9080,127.0.0.1:9081,127.0.0.1:9082,127.0.0.1:9083"

22 GB applies when you are using a single bare-metal instance. Multiple instances are different, and each case is unique.

You can follow these https://dgraph.io/docs/deploy/production-checklist/#cluster-requirements

A common configuration for Dgraph is 16 CPUs and 32 GiB of memory per machine.

That’s a lot, but if your loading is balanced you don’t need that many resources. As I said, though, each case is different: you have to work out your needs based on the size of your load and your balancing strategy.

Cheers.

Sorry, I recall that this configuration is used to import data in batches, but I don’t quite understand what it means. Can you explain it in detail?
Also, is there another way to update the schema once a certain amount of data already exists?

Which one?

I’m confused about which part confused you hehehehe

Schema? Why schema?

I don’t quite understand this configuration.

This one:
https://dgraph.io/docs/deploy/fast-data-loading/live-loader/

If there are already a large number of nodes in the database, is there another way to update the schema?

Okay. In K8s, things get really complex for anyone new to it. You need the address of each Alpha to be able to send an even load via live load. There are two ways of doing it. One is exposing the pods to the host (or, in the cloud, to the internet); you have to create a Service to expose them. The other is running the live load in a sidecar or an init container and using the internal SVC addresses.
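As a sketch of that second approach, a live load run from inside the cluster might look like this. The Alpha service names and namespace here are assumptions, inferred from the dgraph-zero address visible in the logs above; adjust them to your deployment.

```shell
# Hypothetical: run the live loader from a sidecar or init container,
# spreading batches across all three Alphas via their internal
# headless-service DNS names (assumed to mirror the dgraph-zero
# naming seen in the logs). The data file path is a placeholder.
dgraph live \
  --files /data/import.rdf.gz \
  --alpha "dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:9080,dgraph-alpha-1.dgraph-alpha.crm-test.svc.cluster.local:9080,dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:9080" \
  --zero "dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080"
```

With multiple addresses in `--alpha`, the loader round-robins batches across the listed instances instead of pushing everything through one pod.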

Yes, live load loads in batches. But it can also balance the batches between Alphas if you configure it.

Update the schema why? If you already have a cluster up and running and you are just importing data, you don’t need a schema for that. If the tool asks for a schema, you can give it an empty one.

The only ways to update a schema are via Ratel, via an Alter operation, or by giving the new schema to an import tool.
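For the Alter route, a minimal sketch against an Alpha’s HTTP endpoint (the address and the `name` predicate are placeholders, not taken from your cluster):

```shell
# Post a schema change to an Alpha's /alter endpoint.
# localhost:8080 and the predicate definition are assumptions
# for illustration; substitute your Alpha address and schema.
curl -s -X POST localhost:8080/alter -d '
name: string @index(term) .
'
```

This is the same operation Ratel performs under the hood when you apply a schema change from its Schema tab.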

Update schema error

Hmm, so you are updating the schema in a massive way, during an import?

If you are not using live load and it died because of the schema update alone, it should not fail due to low resources. That is new to me, though; the internal process has never failed in my hands. If it happens again, please share more details, ideally something we could reproduce.

PS: I think I’m mixing up some topics with the other issues you have opened. Sorry about that.

Yes, but importing data did not fail. The problem is updating the existing schema on a cluster that already holds data, such as adding indexes or changing data types. If a node fails due to insufficient resources, is it necessary to expand the existing resources (CPU, memory)? Or are there other methods?

Yes.

Reduce the work. Try to change each predicate separately, and avoid changing everything at once. With 15 million nodes, there is no time to breathe. Either increase resources or adopt a strategy that reduces the internal work.
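The “change each predicate separately” advice could look like this as a sketch (the Alpha address and the `email`/`age` predicates are hypothetical):

```shell
# Instead of one large alter that rebuilds many indexes at once,
# issue small alters one predicate at a time, letting each index
# build finish before starting the next. Address and predicate
# names below are placeholders.
curl -s -X POST localhost:8080/alter -d 'email: string @index(exact) .'
# ...wait for the index build to complete, then:
curl -s -X POST localhost:8080/alter -d 'age: int @index(int) .'
```

Each small alter keeps the peak memory needed for index rebuilding much lower than one alter that touches every predicate.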

Predicates whose names start with “dgraph.” are internal predicates; you can’t change them.

Well, thank you very much for your detailed explanation. I will keep an eye on the relevant memory metrics to avoid node failures.
