Mutations can break queries on unrelated nodes

Moved from GitHub dgraph/5217

Posted by fwereade:

May be related to https://github.com/dgraph-io/dgraph/issues/5160

What version of Dgraph are you using?

Seen in 1.1.1, 1.2.1, 1.2.2, and 20.03.0; it seems worse in the latest version.

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

Ubuntu 18.04.4 on an EC2 m5a.8xlarge (128 GB of RAM), using an io1 EBS volume with 3000 provisioned IOPS.

Steps to reproduce the issue (command/config used to run Dgraph).

dgraph alpha --lru_mb 6000 --zero localhost:5080 --query_edge_limit 9223372036854775807
dgraph zero -w zw --telemetry=false

  1. Load about 23M edges representing the source code of a large Go project.
  2. Run a messy generated query (see below) and get results indicating success (below).
  3. Repeatedly load about 500 more edges, representing multiple independent copies of the source code of a small test project, and run a simple query against the new project. (The only point of contact between the two is that each project is reachable from the root node with UID 1 via ___child edges.)
  4. Run the original query (for the large project), and see results indicating failure (below).
  5. Restart zero and alpha, and observe that the failure still happens exactly as in (4).

Note that I’m not certain that (3) is necessary to trigger the failure, but the mutation spam seems to be sufficient to make it happen reliably within an hour or so.
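For anyone trying to reproduce this without the original data, the mutation spam in step (3) can be approximated with a loop like the one below. This is only a minimal sketch under stated assumptions: the real ingest was not posted, so build_nquads produces a hypothetical, tiny stand-in project that reuses only predicate names visible in the query; the alpha is assumed to listen on the default HTTP port 8080; and the content types are the ones accepted by the HTTP API of the versions listed above.

```python
import json
import urllib.request

ALPHA = "http://localhost:8080"  # assumption: default alpha HTTP port


def build_nquads(copy_id: int) -> str:
    """Return N-Quads for one hypothetical, tiny stand-in project.

    The node shapes are invented for illustration; only the predicate
    names (___child, ___kind, ___s_*) come from the query in the report.
    """
    repo = f"_:repo{copy_id}"
    commit = f"_:commit{copy_id}"
    lines = [
        f"<0x1> <___child> {repo} .",  # hang the copy off the root node
        f'{repo} <___kind> "git.repo" .',
        f'{repo} <___s_host> "example.invalid" .',
        f'{repo} <___s_owner> "test" .',
        f'{repo} <___s_name> "small-{copy_id}" .',
        f"{repo} <___child> {commit} .",
        f'{commit} <___kind> "git.commit" .',
        f'{commit} <___s_sha> "{copy_id:040x}" .',
    ]
    return "\n".join(lines)


def post(path: str, body: str, content_type: str) -> dict:
    """POST a request body to the alpha and decode the JSON response."""
    req = urllib.request.Request(
        ALPHA + path,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def mutate_and_query(copy_id: int) -> dict:
    """Load one copy of the stand-in project, then query it by name."""
    mutation = "{ set {\n" + build_nquads(copy_id) + "\n} }"
    post("/mutate?commitNow=true", mutation, "application/rdf")
    query = '{ q(func: eq(___s_name, "small-%d")) { uid ___s_name } }' % copy_id
    return post("/query", query, "application/graphql+-")
```

Calling mutate_and_query(i) for increasing i, and re-running the large query every so often, mirrors steps (3) and (4). Whether such a small stand-in is enough to trigger the failure is untested.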

Expected behaviour and actual result.

We expect that adding more data to one part of the graph would not change the results returned by a query concentrating on another part of the graph. We actually see that some parts of the query which filter on the ___s_name predicate start returning no results.

Note that:

  1. one part of the query (_z), which uses the same input nodes and doesn’t filter on ___s_name, still returns correct results.
  2. another part of the query (_a), which uses different input nodes and does filter on ___s_name, still returns correct results.
  3. ___s_name has a “hash” index.

Query

Forgive the mess; it’s autogenerated, and I’m sure it could be made much nicer, but it currently runs well enough for our purposes in general.

{
  root as root(func: uid(1))  {
    uid
  }

  _Y(func: uid(root))  {
    _Z as ___child {
      uid
    }
    uid
  }

  # Note that there's a successful filter on ___s_name here.
  _X as _a(func: eq(___kind,"git.repo"))  @filter(eq(___s_owner,"juju") and eq(___s_host,"github.com") and eq(___s_name,"juju") and uid(_Z)) {
    ___s_owner
    ___s_host
    ___s_name
    uid
  }

  # The following 9 blocks are not interesting.
  _V(func: uid(_X))  {
    _W as ___child {
      uid
    }
    uid
  }

  _f as _b(func: eq(___kind,"git.commit"))  @filter(eq(___s_sha,"ad1c30d8cad8736ff19de9440a066bacee58b743") and uid(_W)) {
    ___s_sha
    uid
  }

  _d(func: uid(_f))  {
    _e as ___child {
      uid
    }
    uid
  }

  _c(func: eq(___kind,"gotypes.project"))  @filter(uid(_e)) {
    uid
  }

  _U(func: uid(_f))  @recurse(depth: 6) {
    _T as ___child
    uid
  }

  _S as _g(func: eq(___commonkind,"common.dir"))  @filter(eq(___s_filename,".") and uid(_T)) {
    ___s_filename
    uid
  }

  _Q(func: uid(_S))  {
    _R as ___child {
      uid
    }
    uid
  }

  _P as _h(func: eq(___kind,"gotypes.package"))  @filter(uid(_R)) {
    ___s_name
    uid
  }

  _N(func: uid(_P))  {
    _O as ___child {
      uid
    }
    uid
  }

  # This is where it starts to get interesting; there are 3 very similar constructs, all based on _q.
  _q as _i(func: eq(___kind,"gotypes.named"))  @filter(uid(_O)) {
    ___s_name
    uid
  }

  # First example, down to _k
  _o(func: uid(_q))  {
    _p as ___child {
      uid
    }
    uid
  }

  _n as _j(func: eq(___kind,"gotypes.method"))  @filter(uid(_p)) {
    uid
  }

  _l(func: uid(_n))  {
    _m as ___reference {
      uid
    }
    uid
  }

  _k(func: eq(___kind,"gotypes.func"))  @filter(eq(___s_name,"Kill") and uid(_m)) {
    ___s_name
    uid
  }

  # Second example, down to _s
  _w(func: uid(_q))  {
    _x as ___child {
      uid
    }
    uid
  }

  _v as _r(func: eq(___kind,"gotypes.method"))  @filter(uid(_x)) {
    uid
  }

  _t(func: uid(_v))  {
    _u as ___reference {
      uid
    }
    uid
  }

  _s(func: eq(___kind,"gotypes.func"))  @filter(eq(___s_name,"Wait") and uid(_u)) {
    ___s_name
    uid
  }

  # Third example, down to _z
  _L(func: uid(_q))  {
    _M as ___child {
      uid
    }
    uid
  }

  _K as _y(func: eq(___kind,"gotypes.method"))  @filter(uid(_M)) {
    uid
  }

  _I(func: uid(_K))  {
    _J as ___reference {
      uid
    }
    uid
  }

  _H as _z(func: eq(___kind,"gotypes.func"))  @filter(uid(_J)) {
    uid
  }

  # Irrelevant from here on.
  _F(func: uid(_H))  {
    _G as ___link {
      uid
    }
    uid
  }

  _E as _A(func: eq(___kind,"gotypes.func_decl"))  @filter(uid(_G)) {
    uid
  }

  _D(func: uid(_E))  @recurse(depth: 1001) {
    _C as ___child
    uid
  }

  _B(func: eq(___kind,"gotypes.go_stmt"))  @filter(uid(_C)) {
    ___s_filename
    ___i_start_offset
    ___i_end_offset
    uid
  }
}

Success

block: bytes-of-json -> result-count

root: 15 -> 1
_a: 82 -> 1
_b: 74 -> 1
_c: 20 -> 1
_d: 52 -> 1
_g: 40 -> 1
_h: 30733 -> 706
_i: 341963 -> 6971
_j: 1335866 -> 68934
_k: 6324 -> 165
_l: 3844097 -> 68934
_o: 1532028 -> 6971
_r: 1335866 -> 68934
_s: 6834 -> 178
_t: 3844097 -> 68934
_w: 1532028 -> 6971
_y: 1335866 -> 68934
_z: 440528 -> 22738
_A: 337067 -> 17396
_B: 6740 -> 99
_D: 42797376 -> 17396
_F: 1128102 -> 22738
_I: 3844097 -> 68934
_L: 1532028 -> 6971
_N: 830905 -> 706
_Q: 13646 -> 1
_U: 7847179 -> 1
_V: 67 -> 1
_Y: 2348 -> 1

Failure

block: bytes-of-json -> result-count

root: 15 -> 1
_a: 82 -> 1
_b: 74 -> 1
_c: 20 -> 1
_d: 52 -> 1
_g: 40 -> 1
_h: 30733 -> 706
_i: 341963 -> 6971
_j: 1335866 -> 68934
_k: 2 -> 0              * smaller, filter is rejecting everything
_l: 3844097 -> 68934
_o: 1532028 -> 6971
_r: 1335866 -> 68934
_s: 2 -> 0              * smaller, filter is rejecting everything
_t: 3844097 -> 68934
_w: 1532028 -> 6971
_y: 1335866 -> 68934
_z: 440528 -> 22738
_A: 337067 -> 17396
_B: 6740 -> 99
_D: 42797376 -> 17396
_F: 1128102 -> 22738
_I: 3844097 -> 68934
_L: 1532028 -> 6971
_N: 830905 -> 706
_Q: 13646 -> 1
_U: 7847179 -> 1
_V: 67 -> 1
_Y: 3982 -> 1           * bigger, but not unexpected, root has more children now

Note that _l, _t, and _I are identical in both cases; they feed into _k, _s, and _z respectively. Of those, _k and _s filter on ___s_name (and start returning no results), while _z doesn’t (and returns exactly the same results even when _k and _s start failing).
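A comparison like the one above can be automated. The sketch below is not the tooling used for this report; it is one plausible way to compute a "bytes-of-json -> result-count" summary per block from a query response and to list the blocks that changed between two runs. It assumes the response has the usual top-level "data" object keyed by block name, and it measures bytes with compact JSON serialization, which is an assumption (although it is consistent with an empty block being reported as 2 bytes).

```python
import json


def summarize(response: dict) -> dict:
    """Map each query block to (bytes of compact JSON, result count)."""
    summary = {}
    for block, results in response.get("data", {}).items():
        encoded = json.dumps(results, separators=(",", ":")).encode("utf-8")
        summary[block] = (len(encoded), len(results))
    return summary


def diff(before: dict, after: dict) -> dict:
    """Return the blocks whose summary changed, as {block: (before, after)}."""
    blocks = sorted(set(before) | set(after))
    return {
        b: (before.get(b), after.get(b))
        for b in blocks
        if before.get(b) != after.get(b)
    }
```

Running diff(summarize(good_response), summarize(bad_response)) on a saved known-good response and a later one would surface only the blocks that changed, such as _k, _s, and _Y here.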

fwereade commented :

Note also that an export/import causes the query to succeed again.

fwereade commented :

We have also seen the query start to find nothing at _b (and in every block from there onward). ___s_sha similarly has a “hash” index. It may be worth noting that the small ingests which (probably) trigger the problem do write the same predicates (___s_sha and ___s_name) that we’ve seen failing to filter correctly.

harshil-goel commented :

Hi, we are looking into the issue. If possible, could you provide us with your schema and mutations so that we can replicate it?

OmarAyo commented :

Hi @fwereade,

Could you provide what Harshil asked for in his previous message (the schema and mutations)?

Thanks,