Skewed Query Results with Ongoing Mutations

Moved from GitHub dgraph/5021

Posted by grantsavage:

What version of Dgraph are you using?

v1.2.1

Have you tried reproducing the issue with the latest release?

No

System Background

We are running Dgraph on Kubernetes using the supplied Helm charts. We are running 3 Dgraph Zeros and 6 Dgraph Alphas, with each pod on its own node. For storage, we are using storage attached directly to each node (could this be the issue?). The access mode is ReadWriteMany.

Steps to reproduce the issue (command/config used to run Dgraph).

We are running a system that executes at most 60 mutations per second. Our mutations are sent over the HTTPS API (TLS enabled) to the route /mutate?commitNow=true, and each one is an upsert operation like so:

upsert {
    query {
         var (...) { a as uid }
         var (...) { b as uid }
         ...
    }

    mutation {
        set {
             uid(a) <something> "something" .
             uid(b) <something> "something" .
        }
    }
}

The query we use to aggregate and report on our data uses a @groupby and count aggregate like so:

{
  var(func: type(typeA)) @filter(...) {
    relationshipA @filter(...) {
      relationshipB @filter(...) {
        relationshipC @filter(...) {
          a as uid
        }
      }
    }
  }

  query(func: uid(a)) @groupby(b,c,d)  {
    count(uid)
  }
}

When we run the above query over the HTTPS API, using the endpoint /query?ro=true&be=true, we see skewed results if the same query is run multiple times within a short time frame (10 seconds). For example, one element of the grouping comes back with a value of roughly 5000 on the first query, but on the next query the value comes back as roughly 100. Examples below:

Query 1

[
  {
    "count": 5671
  },
  {
    "count": 23535
  },
  ...
]

Query 2

[
  {
    "count": 113
  },
  {
    "count": 3000
  },
  ...
]

The drop in the counts and the large variance between runs do not seem correct to us.
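To put numbers on the difference between the two runs, the groups can be compared position by position. The following is only a minimal sketch in Python: the helper name count_deltas is hypothetical, it assumes both responses list the groups in the same order, and it uses only the two groups shown above (the elided groups are left out).

```python
from typing import Dict, List


def count_deltas(first: List[Dict[str, int]], second: List[Dict[str, int]]) -> List[int]:
    """Return second["count"] - first["count"] for each pair of groups.

    Assumption: both runs return the groups in the same order, so that
    position i in each list refers to the same (b, c, d) grouping.
    """
    if len(first) != len(second):
        raise ValueError("runs returned a different number of groups")
    return [b["count"] - a["count"] for a, b in zip(first, second)]


# The two groups shown under Query 1 and Query 2 (remaining groups elided).
query_1 = [{"count": 5671}, {"count": 23535}]
query_2 = [{"count": 113}, {"count": 3000}]

for delta in count_deltas(query_1, query_2):
    print(delta)
```

A negative delta means the count dropped between the first and second run, which is the surprising part here, since ongoing upserts would be expected to hold or raise the counts rather than lower them.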

However, if we stop all mutations to Dgraph and run the same query experiment, we start to see reasonable and consistent results. What could be causing this behavior? Could this be user error on our side?

Expected behaviour and actual result.

We expect to see consistent results when querying our data and not see fluctuations of +/- 5,000.

harshil-goel commented:

Hi, we are looking into the issue. If possible, could you provide us with your schema and mutations so that we can replicate it?