What's the point of logging this message every few milliseconds?


(Nikita Zaletov) #1


We have 40 million such messages for the last day alone. It costs money if you use a tool like Datadog for log aggregation, and it also just looks like spam, making it hard to find anything else.

Also, all 3 nodes answer "ok" on /health and /state, but the cluster is actually unhealthy: one node has been spamming in this manner for more than a day, and all queries, even very simple ones like "give me the first 10 nodes for some predicate", run indefinitely without any answer on all 3 nodes, whereas they should complete in a few milliseconds.
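A /health "ok" only tells you the process is up. One way to catch the situation above is to cross-check each member's lastUpdate in the Zero /state response against the clock. A minimal sketch of that idea (the `stale_members` helper and the 600-second threshold are my own, not a Dgraph API):

```python
def stale_members(state, now, max_age_secs=600):
    """Return (groupId, memberId) pairs whose lastUpdate is older than max_age_secs."""
    stale = []
    for gid, group in state.get("groups", {}).items():
        for mid, member in group.get("members", {}).items():
            last = int(member.get("lastUpdate", 0))
            if now - last > max_age_secs:
                stale.append((gid, mid))
    return stale

if __name__ == "__main__":
    # Two lastUpdate values taken from the /state dump later in this thread.
    sample = {"groups": {"1": {"members": {
        "1": {"lastUpdate": "1549173168"},
        "2": {"lastUpdate": "1549175728"},
    }}}}
    print(stale_members(sample, now=1549175728))  # → [('1', '1')]
```

Feeding this the real /state response (fetched from the Zero HTTP port) would have flagged the silent member even while /health kept answering "ok".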


(Reda Bourial) #2

Can you give more information:

  • What version are you running (I assume all nodes are on the same release)?
  • What type of orchestration is it running on?
  • Do backups work (link here)?
  • Did you try to back up the data, update Dgraph, and reload the data to see whether the problem is reproducible with the latest version (1.0.11)? I see in the changelog here that a race condition has been fixed since the last release.
  • What's the size of your data?

I will fill this part in later with a Dockerfile for starting Dgraph with the latest fixes!

(Nikita Zaletov) #3

We have 3 nodes on v1.0.10 running in Docker on 3 Amazon instances.
We can't upgrade to 1.0.11 due to the https://github.com/dgraph-io/dgraph/issues/2832 bug, so we're waiting for 1.0.12 (hopefully it will be released soon).

If this issue is fixed in 1.0.11, then it's fine; we'll check it after 1.0.12 is released.
Thanks!


(Michel Conrado) #4

Hey Reda, this link isn't correct. There's no backup endpoint in today's version; that link is from v0.7.6.
The right one is https://docs.dgraph.io/master/deploy/#export-database
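For reference, the export documented at that link is triggered with a plain HTTP GET against an Alpha in v1.0.x. A minimal sketch, assuming an Alpha listening on localhost:8080 (the address and the `export_url` helper are illustrative, not part of Dgraph):

```python
from urllib.request import urlopen

ALPHA_HTTP = "http://localhost:8080"  # assumed address of one Alpha's HTTP port

def export_url(base):
    """Build the /admin/export endpoint URL for a given Alpha base address."""
    return base.rstrip("/") + "/admin/export"

if __name__ == "__main__":
    # Asks the Alpha to write .rdf.gz and .schema.gz files into its export directory.
    with urlopen(export_url(ALPHA_HTTP)) as resp:
        print(resp.read().decode())
```

The exported files land on the Alpha's own disk, so the data can then be reloaded into a fresh cluster with the bulk or live loader.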

How did you end up on that version and not the most current one? I want to understand, so that in the future other people won't do the same.


(Nikita Zaletov) #5

Old documentation might appear first on Google.

Oh, as I can see, it was introduced as a new bug in 1.0.11, so it's not my case…


(Michel Conrado) #6

Hard to say. Please share your state.


(Nikita Zaletov) #7

Unfortunately, we have already torn down that environment (it was a test one), but I have the /state response saved:

{
   "counter":"316339",
   "groups":{  
      "1":{  
         "members":{  
            "1":{  
               "id":"1",
               "groupId":1,
               "addr":"<ip>:7080",
               "lastUpdate":"1549173168"
            },
            "2":{  
               "id":"2",
               "groupId":1,
               "addr":"<ip>:7080",
               "lastUpdate":"1549175728"
            },
            "3":{  
               "id":"3",
               "groupId":1,
               "addr":"<ip>:7080",
               "lastUpdate":"1549172556"
            }
         },
         "tablets":{  
            "_predicate_":{  
               "groupId":1,
               "predicate":"_predicate_",
               "space":"1795084516"
            },
            "attr_description":{  
               "groupId":1,
               "predicate":"attr_description",
               "space":"1137958"
            },
            "attr_label":{  
               "groupId":1,
               "predicate":"attr_label",
               "space":"5794283"
            },
            "attr_type":{  
               "groupId":1,
               "predicate":"attr_type",
               "space":"64737411"
            },
            "attr_workType":{  
               "groupId":1,
               "predicate":"attr_workType",
               "space":"64471321"
            },
            "class_instance":{  
               "groupId":1,
               "predicate":"class_instance",
               "space":"59707827"
            },
            "class_metadata":{  
               "groupId":1,
               "predicate":"class_metadata",
               "space":"44836296"
            },
            "class_work":{  
               "groupId":1,
               "predicate":"class_work",
               "space":"56013341"
            },
            "dummy":{  
               "groupId":1,
               "predicate":"dummy",
               "space":"38503372"
            },
            "entity_key":{  
               "groupId":1,
               "predicate":"entity_key",
               "space":"385374879"
            },
            "entity_type":{  
               "groupId":1,
               "predicate":"entity_type",
               "space":"179020003"
            },
            "i2wlink":{  
               "groupId":1,
               "predicate":"i2wlink",
               "space":"57968337"
            },
            "inherited":{  
               "groupId":1,
               "predicate":"inherited",
               "space":"113199231"
            },
            "label":{  
               "groupId":1,
               "predicate":"label",
               "space":"276165116"
            },
            "link_agg_key":{  
               "groupId":1,
               "predicate":"link_agg_key",
               "space":"443638432"
            },
            "link_key":{  
               "groupId":1,
               "predicate":"link_key",
               "space":"1047520397"
            },
            "m2mlink":{  
               "groupId":1,
               "predicate":"m2mlink",
               "space":"23"
            },
            "m2mlink_agg":{  
               "groupId":1,
               "predicate":"m2mlink_agg",
               "space":"31"
            },
            "r2mlink":{  
               "groupId":1,
               "predicate":"r2mlink",
               "space":"153135658"
            },
            "r2mlink_agg":{  
               "groupId":1,
               "predicate":"r2mlink_agg",
               "space":"170779412"
            },
            "r2mlink_direct":{  
               "groupId":1,
               "predicate":"r2mlink_direct",
               "space":"49689744"
            },
            "rel_955314c8-1dd8-405c-b351-2679bdddb961":{  
               "groupId":1,
               "predicate":"rel_955314c8-1dd8-405c-b351-2679bdddb961",
               "space":"358"
            },
            "rel_a97be0d2-29aa-4e6a-8f1a-45b58ce6aa00":{  
               "groupId":1,
               "predicate":"rel_a97be0d2-29aa-4e6a-8f1a-45b58ce6aa00",
               "space":"176"
            },
            "rel_c9938da6-211b-40b4-bd25-1be5e5095aa4":{  
               "groupId":1,
               "predicate":"rel_c9938da6-211b-40b4-bd25-1be5e5095aa4",
               "space":"43342492"
            },
            "rel_ddc00115-e7db-44f8-b99b-11aab93af237":{  
               "groupId":1,
               "predicate":"rel_ddc00115-e7db-44f8-b99b-11aab93af237",
               "space":"178"
            },
            "rel_e0cda17c-7b66-45be-bcc1-e95393c39650":{  
               "groupId":1,
               "predicate":"rel_e0cda17c-7b66-45be-bcc1-e95393c39650",
               "space":"358"
            },
            "type_Agent":{  
               "groupId":1,
               "predicate":"type_Agent",
               "space":"24469719"
            },
            "type_Concept":{  
               "groupId":1,
               "predicate":"type_Concept",
               "space":"18185354"
            },
            "type_Instance":{  
               "groupId":1,
               "predicate":"type_Instance",
               "space":"58790067"
            },
            "type_Work":{  
               "groupId":1,
               "predicate":"type_Work",
               "space":"55097175"
            },
            "type_archival":{  
               "groupId":1,
               "predicate":"type_archival",
               "space":"405"
            },
            "type_collection":{  
               "groupId":1,
               "predicate":"type_collection",
               "space":"783"
            },
            "type_electronic":{  
               "groupId":1,
               "predicate":"type_electronic",
               "space":"41839027"
            },
            "type_manuscript":{  
               "groupId":1,
               "predicate":"type_manuscript",
               "space":"3619"
            },
            "type_print":{  
               "groupId":1,
               "predicate":"type_print",
               "space":"17377245"
            },
            "w2mlink":{  
               "groupId":1,
               "predicate":"w2mlink",
               "space":"152329298"
            },
            "workType_audio":{  
               "groupId":1,
               "predicate":"workType_audio",
               "space":"3626268"
            },
            "workType_cartography":{  
               "groupId":1,
               "predicate":"workType_cartography",
               "space":"19184"
            },
            "workType_mixedMaterial":{  
               "groupId":1,
               "predicate":"workType_mixedMaterial",
               "space":"1205"
            },
            "workType_movingImage":{  
               "groupId":1,
               "predicate":"workType_movingImage",
               "space":"581750"
            },
            "workType_notatedMusic":{  
               "groupId":1,
               "predicate":"workType_notatedMusic",
               "space":"652241"
            },
            "workType_object":{  
               "groupId":1,
               "predicate":"workType_object",
               "space":"103"
            },
            "workType_stillImage":{  
               "groupId":1,
               "predicate":"workType_stillImage",
               "space":"597"
            },
            "workType_text":{  
               "groupId":1,
               "predicate":"workType_text",
               "space":"54019457"
            }
         }
      }
   },
   "zeros":{  
      "10":{  
         "id":"10",
         "addr":"<ip>:5080",
         "leader":true
      },
      "11":{  
         "id":"11",
         "addr":"<ip>:5080"
      },
      "12":{  
         "id":"12",
         "addr":"<ip>:5080"
      }
   },
   "maxLeaseId":"30370000",
   "maxTxnTs":"1040000",
   "maxRaftId":"3",
   "cid":"a407c413-c0ee-4e09-99e3-36e4aa618e2a"
}

Dgraph is running with replicas set to 3, so all data is replicated across all 3 nodes.
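Incidentally, this /state dump also answers the earlier question about data size: each tablet's "space" field is its on-disk size in bytes. A small sketch for summarizing it (the helper names are mine):

```python
def tablet_sizes(state, group="1", top=5):
    """Largest predicates of one group as (name, bytes), descending by size."""
    tablets = state["groups"][group]["tablets"]
    sizes = [(name, int(t.get("space", 0))) for name, t in tablets.items()]
    return sorted(sizes, key=lambda kv: kv[1], reverse=True)[:top]

def total_bytes(state, group="1"):
    """Sum of all tablet sizes in the group, in bytes."""
    return sum(int(t.get("space", 0)) for t in state["groups"][group]["tablets"].values())

if __name__ == "__main__":
    # Three of the tablets from the dump above.
    sample = {"groups": {"1": {"tablets": {
        "_predicate_": {"space": "1795084516"},
        "link_key": {"space": "1047520397"},
        "m2mlink": {"space": "23"},
    }}}}
    print(tablet_sizes(sample, top=2))  # → [('_predicate_', 1795084516), ('link_key', 1047520397)]
    print(total_bytes(sample))         # → 2842604936
```

On the full dump, _predicate_, link_key, and link_agg_key dominate, at roughly 1.8 GB, 1 GB, and 440 MB respectively.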


(Michel Conrado) #8

What's the point of logging this message every few milliseconds?

I think it's to make clear that there's a communication problem. But maybe we could add something like a "silent mode", or reduce the interval.

Please, if it isn't asking too much, file an issue requesting this.

Just guessing here: analyzing your state, if it is the same as in the previous case, it seems to me that your leader died at some point, and maybe its address changed or was inaccessible for some reason when it came back. Or it did not come back at all, because of a lack of resources or some AWS policy, I'm not sure. I thought you had more than one group, but you only have one. (BTW, you can delete the state now.)


(Reda Bourial) #9

My mistake, I used a documentation link from a GitHub issue.


(Nikita Zaletov) #10

OK, I will create an issue.
The same problem applies to other messages.


(Manish R Jain) #11

TBH, it is a bit serious that the Alpha which needs to receive a snapshot isn't able to talk to the leader. We did do some work around this issue: we now use not only the context information present in the request, but also the asking Alpha's own membership state of the cluster, to try to reach any other member in the group. That should fix the problem, as long as the caller Alpha is connected to the cluster.

Until the caller Alpha gets the snapshot, it can't do anything. The logs are just reflecting that reality.


(Nikita Zaletov) #12

We did do some work around this issue

Good news. Is it already in rc5, so we can test it?

I agree that nothing can be done here, but logging this problem several times per millisecond seems excessive. It would be good to reduce the frequency of such messages; we had 40M+ of them on one node in one day.


(Daniel Mai) #13

Yes, the commit is in v1.0.12-rc5.

To quiet the logs, you can use an event processor such as Riemann to roll up similar logs before shipping them to Datadog.
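The same roll-up idea can be sketched in-process, before logs ever leave the box. This is hypothetical code, not a Dgraph or Riemann API: ship the first occurrence of a message immediately, suppress repeats within a window, and emit one summary line when the window rolls over.

```python
class LogRollup:
    """Suppress repeated log messages, emitting one summary per time window."""

    def __init__(self, window_secs=60):
        self.window = window_secs
        self.seen = {}  # message -> [window_start_ts, count_in_window]

    def offer(self, msg, now):
        """Return a line to ship, or None if the message is suppressed."""
        entry = self.seen.get(msg)
        if entry is None or now - entry[0] >= self.window:
            line = msg
            if entry is not None and entry[1] > 1:
                # The window rolled over: summarize everything seen in it.
                line = "%s (seen %d times in last %ds)" % (msg, entry[1], self.window)
            self.seen[msg] = [now, 1]
            return line
        entry[1] += 1
        return None
```

Either way, the aggregator then receives one summary line per window instead of millions of duplicates.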


(Nikita Zaletov) #14

Unfortunately, rc5 is much less stable than 1.0.10. We had the same issues there, with an Alpha losing the leader; we also had OOM and "panic: interface conversion: interface {} is nil, not string" crashes every few minutes, just running simple mutations and lookups, so we switched back to 1.0.10.


(Michel Conrado) #15

That's odd. Did you export the data and start from scratch?


(Nikita Zaletov) #16

We just removed the p/w/zw directories and started from scratch.


(Daniel Mai) #17

Can you share the logs where you see the panic?


(Nikita Zaletov) #18

Here they are. Please read them from bottom to top:

|Timestamp|Host|Service|Message|
|---|---|---|---|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|created by google.golang.org/grpc.(*Server).serveStreams.func1|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/google.golang.org/grpc/server.go:680 +0x9f|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc1993d5f90, 0xc000130380, 0x1595340, 0xc054d94e00, 0xc0fc793900)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/google.golang.org/grpc/server.go:1249 +0x1308|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|google.golang.org/grpc.(*Server).handleStream(0xc000130380, 0x1595340, 0xc054d94e00, 0xc0fc793900, 0x0)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/google.golang.org/grpc/server.go:1011 +0x4cd|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|google.golang.org/grpc.(*Server).processUnaryRPC(0xc000130380, 0x1595340, 0xc054d94e00, 0xc0fc793900, 0xc00047f1d0, 0x1ed05b8, 0x0, 0x0, 0x0)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgo/protos/api/api.pb.go:1749 +0x23e|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgo/protos/api._Dgraph_Query_Handler(0x13c1c20, 0x1f9a2d0, 0x158fba0, 0xc1460dc960, 0xc168324e00, 0x0, 0x0, 0x0, 0xc0fc793900, 0x1)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/edgraph/server.go:495 +0x4bf|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/edgraph.(*Server).Query(0x1f9a2d0, 0x158fba0, 0xc1460dc9c0, 0xc16d46a410, 0xc0f1df2540, 0x0, 0x0)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/query/query.go:2614 +0x5e|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/query.(*QueryRequest).Process(0xc0149a5a30, 0x158fba0, 0xc1460dc9c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/query/query.go:2520 +0x61d|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/query.(*QueryRequest).ProcessQuery(0xc0149a5a30, 0x158fba0, 0xc1460dc9c0, 0x1, 0x1)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/query/query.go:1557 +0x39|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/query.(*SubGraph).recursiveFillVars(0xc0d5250840, 0xc1460dd9b0, 0xc0f1df3da0, 0x1)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/query/query.go:1617 +0x249|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/query.(*SubGraph).fillVars(0xc0d5250840, 0xc1460dd9b0, 0x1, 0xc0000e3b77)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/query/query.go:1646 +0x3e9|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/query.(*SubGraph).replaceVarInFunc(0xc0d5250840, 0xc1460dd9b0, 0xc0000e3b77)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|/ext-go/1/src/github.com/dgraph-io/dgraph/types/conversion.go:349 +0x108a|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|github.com/dgraph-io/dgraph/types.Marshal(0x0, 0x0, 0x0, 0xc0149a5320, 0xc0000e3b34, 0xa)|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|goroutine 28820 [running]:|
|Feb 05 22:58:01.826|i-0dd4493d3f656b465|dgraph-alpha|panic: interface conversion: interface {} is nil, not string|
|Feb 05 22:58:01.816|i-0dd4493d3f656b465|dgraph-alpha|Lastcommit 30161 > current 30074. This would cause some commits to be lost.|
|Feb 05 22:58:01.816|i-0dd4493d3f656b465|dgraph-alpha|Lastcommit 30161 > current 30074. This would cause some commits to be lost.|

Also, there were OOM crashes, which is odd, since as far as I know rc5 doesn't use the LRU cache anymore and should use less memory. So it looks like there are memory leaks as well.

BTW, after going back to 1.0.10, there is an interesting observation: when I ran Dgraph with expand_edges=false, it started working much better. Not ideal, we still lose some commits in the logs, but the cluster has stayed alive all day under load.


(Manish R Jain) #19

Hey @makitka, it would help a lot if you could create a GitHub issue with all the relevant information. This seems like a recent bug.


(Nikita Zaletov) #20

here you go: https://github.com/dgraph-io/dgraph/issues/2980