Dgraph filters with eq() lt() or gt() stop working after a while for string, bool and int predicates

We’re running dgraph 1.1.1 in production and we noticed that simple queries like the following return the count as 0.

{ q(func: type(ORGANIZATION_USER)) @filter(eq(isDisabled, false)) {

isDisabled is indexed and is also part of the ORGANIZATION_USER type and there are many vertices that satisfy this condition. The predicate is indexed like so.

<isDisabled>: bool @index(bool) .

The query used to work and stopped working. This is just one example and we have noticed this across predicates. This issue is observed in bool, string and int type predicates as well. The issue has been observed in lt() and gt() functions as well for int predicates. This issue has caused some side effects in atleast one use case as the query did not return the expected data and the application, created unnecessary duplicates.When I dropped the bool index and reapplied it, the query started working again. Reapplying the index works for int type too. But then, it stops working AGAIN after a while/some hours. I know this cause we have reapplied the index multiple times for strings and finally went ahead and used regexp instead of eq() as a workaround. But what do we do about ints and bool?

Hey @joseph, thanks for reporting this.
Would you be able to share your data set so that we can try to reproduce it? We have tests for count(..) queries that seem to work. Maybe this issue shows up with some specific data set.

Also, have you tried running one of the latest releases of dgraph?

Thanks Ibrahim, appreciate your help. We have the issue in a lower environment as well but exports dont work as we have crossed the 30 day trial. Is there any other way?

Hey @joseph, you can zip and share your data directories. Each alpha has p and w directory and each zero has a zw directory. Please share all the three directories.

You can share the directories here or you can email them to me at ibrahim[at]dgraph.io .

Hi Ibrahim, I have sent you an email regarding this. Thanks.

Hey Joseph,
I’ve replied to your email and copied over the following two questions from our email conversation.

Can I import this data in my local machine running v20.03.0? If so what steps should I follow? The folders above are from v1.1.1. Since this was not exported, will live & bulk loader work?.

The data format of dgraph v1.1.1. is not compatible with v20.03.0. The only way to run the same data on v20.03.0 would be by doing an export from v1.1.1. and then an import in v20.03.0 (using bulk/live loader) . @MichelDiz Please correct me if I’m wrong.

We just have one zero and one alpha (for now) and we store p, w and zw in the same folder.
If we used docker and added more alphas in the same host, should we map them to different volumes?

Ideally, you should map different alphas to different volumes so that loss of one volume would not result in the loss of all the data.

Yeah. that’s right.

Hum, weird, export feature is a free open source feature. What is trial is the binary backup.
see https://dgraph.io/docs/deploy/#exporting-database

I was able to reproduce this issue after running the same query for about 4 hours. This is related to roll ups in alpha.
These are the alpha logs when the first query failed at 22:57:31 and it has been failing since then.
One important observation is that roll ups were running in alpha at 22:57:31

I0416 22:19:23.103866   23073 log.go:34] Compaction for level: 0 DONE
I0416 22:54:29.544599   23073 groups.go:604] No healthy Zero leader found. Trying to find a Zero leader...
I0416 22:54:29.551467   23073 groups.go:897] Zero leadership changed. Renewing oracle delta stream.
E0416 22:54:29.552915   23073 groups.go:873] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0416 22:54:29.645984   23073 groups.go:641] Found connection to leader: localhost:5080
I0416 22:54:29.696129   23073 groups.go:833] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0416 22:54:29.696150   23073 groups.go:842] Got Zero leader: localhost:5080
I0416 22:55:30.942289   23073 draft.go:678] [0x1] Set Raft progress to index: 562031.
I0416 22:55:32.108007   23073 draft.go:383] Creating snapshot at index: 562083. ReadTs: 6160730.
I0416 22:55:32.108031   23073 storage.go:589] CreateSnapshot i=562083, cs=nodes:1 
I0416 22:55:34.090251   23073 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0416 22:55:34.090315   23073 log.go:34] Running for level: 0
I0416 22:55:41.604493   23073 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 7.514154308s
I0416 22:55:41.604526   23073 log.go:34] Compaction for level: 0 DONE
I0416 22:56:29.701271   23073 draft.go:678] [0x1] Set Raft progress to index: 564962.
I0416 22:57:29.710277   23073 draft.go:678] [0x1] Set Raft progress to index: 567688.
I0416 22:57:30.696727   23073 log.go:34] Rolling up Time elapsed: 01s, bytes sent: 0 B, speed: 0 B/sec
I0416 22:57:31.450630   23073 log.go:34] Rolling up Created batch of size: 1.5 MB in 4.53621ms.
I0416 22:57:31.450650   23073 log.go:34] Rolling up Sent 668 keys
I0416 22:57:31.454309   23073 draft.go:1076] Rolled up 665 keys. Done
I0416 22:57:31.454331   23073 draft.go:425] List rollup at Ts 6160730: OK.
I0416 22:57:31.456136   23073 draft.go:1109] Sent tablet size update to Zero. Total size: 279 MB
I0416 22:58:29.699431   23073 draft.go:678] [0x1] Set Raft progress to index: 567775.

I don’t see any activity in zero at the time of failure

I0416 19:01:02.070633   22755 log.go:34] 1 is starting a new election at term 15
I0416 19:01:02.070731   22755 log.go:34] 1 became pre-candidate at term 15
I0416 19:01:02.070779   22755 log.go:34] 1 received MsgPreVoteResp from 1 at term 15
I0416 19:01:02.070940   22755 log.go:34] 1 became candidate at term 16
I0416 19:01:02.070969   22755 log.go:34] 1 received MsgVoteResp from 1 at term 16
I0416 19:01:02.071167   22755 log.go:34] 1 became leader at term 16
I0416 19:01:02.071204   22755 log.go:34] raft.node: 1 elected leader 1 at term 16
I0416 19:01:02.071292   22755 raft.go:663] I've become the leader, updating leases.
I0416 19:01:02.071343   22755 assign.go:41] Updated Lease id: 690001. Txn Ts: 6170001
W0416 19:01:03.971531   22755 node.go:669] [0x1] Read index context timed out
I0416 19:01:18.607681   22755 zero.go:415] Got connection request: cluster_info_only:true 
I0416 19:01:18.609582   22755 zero.go:433] Connected: cluster_info_only:true 
I0416 19:01:18.612966   22755 zero.go:415] Got connection request: addr:"localhost:7080" 
I0416 19:01:18.615132   22755 zero.go:452] Found a member with the same address. Returning: id:1 group_id:1 addr:"localhost:7080" leader:true last_update:1582009811 
I0416 19:01:18.615213   22755 zero.go:454] Connected: addr:"localhost:7080" 
I0416 22:54:29.645034   22755 zero.go:415] Got connection request: cluster_info_only:true 
I0416 22:54:29.645551   22755 zero.go:433] Connected: cluster_info_only:true 
I0416 22:55:34.697404   22755 oracle.go:107] Purged below ts:6160730, len(o.commits):0, len(o.rowCommit):0

Hey @joseph, the bug you’re seeing has been fixed in the latest release of dgraph. Can you upgrade to dgraph v20.03?

1 Like

This is awesome news! Thanks for the update, Ibrahim!

Can you point me to the relevant PR which fixes this? Another user has reported a similar issue here. I feel best solution would be to upgrade, I want to ensure that their queries would need minimum changes once they go from v1.1.1 to v20.3 before I suggest the switch.

https://github.com/dgraph-io/dgraph/pull/4574 and https://github.com/dgraph-io/dgraph/pull/4908 are needed to fix the issue.