Sum over facets is incorrect

Moved from GitHub dgraph/4160

Posted by campoy:

What version of Dgraph are you using?

master

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

n/a

Steps to reproduce the issue (command/config used to run Dgraph).

Given the dataset generated by this mutation:

{
  set {
    _:a <name> "Anne" .
    _:b <name> "Brian" .
    
    _:jp <name> "Jurassic Park" .
    _:ij <name> "Indiana Jones" .
    
    _:a <rated> _:jp (rating=5) .
    _:a <rated> _:ij (rating=2) .
    _:b <rated> _:ij (rating=2) .
  }
}

Then run the following query:

{
  q(func: has(rated)) {
    name
    rated @facets(r as rating)
    partial_sum: sum(val(r))
  }
      
  sum() {
    total_sum: sum(val(r))
  }
}

Expected behaviour and actual result.

I’d expect partial_sum to be 7 for Anne (5 + 2) and 2 for Brian, and total_sum to be 9.

Instead, the result is as follows:

{
  "data": {
    "q": [
      {
        "name": "Anne",
        "rated": [
          {
            "rated|rating": 5
          },
          {
            "rated|rating": 2
          }
        ],
        "partial_sum": 9
      },
      {
        "name": "Brian",
        "rated": [
          {
            "rated|rating": 2
          }
        ],
        "partial_sum": 4
      }
    ],
    "sum": [
      {
        "total_sum": 9
      }
    ]
  }
}

I have a theory about why we’re getting these weird numbers.

Variables attach values to UIDs, but that is not the right behavior here: the value of the variable should be attached neither to the UID of the person nor to the UID of the movie, but to the pair of both, linked by the predicate.

You can see the artifact by querying this value on all of the nodes.

{
  var(func: has(rated)) {
    rated @facets(r as rating)
  }
      
  sum(func: has(name)) {
    name
    val(r)
  }
}

This returns:

{
  "data": {
    "sum": [
      {
        "name": "Jurassic Park",
        "val(r)": 5
      },
      {
        "name": "Indiana Jones",
        "val(r)": 4
      },
      {
        "name": "Anne"
      },
      {
        "name": "Brian"
      }
    ]
  }
}

This shows that the variable r has been attached to the movie UIDs, with the facet values on all edges pointing to the same movie added together: Indiana Jones gets 2 + 2 = 4.

Once we understand this, it makes sense that the sum of the ratings for Anne is 9 instead of 7, as it’s the sum of the ratings for the two movies. Same goes for the ratings for Brian being 4 instead of 2.
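The theory above can be checked with a small simulation. This is only a sketch of the hypothesised behaviour, not Dgraph's actual implementation: it assumes the facet variable is stored as a map from the target UID to a value, and that values landing on the same target are added together. Names stand in for UIDs, and the helper functions are hypothetical.

```python
from collections import defaultdict

# (person, movie, rating) triples from the mutation above.
EDGES = [
    ("Anne", "Jurassic Park", 5),
    ("Anne", "Indiana Jones", 2),
    ("Brian", "Indiana Jones", 2),
]


def bind_facet_by_target(edges):
    """Assumed model of `r as rating`: a map from the target UID to a value,
    where facet values that land on the same target are summed."""
    r = defaultdict(int)
    for _person, movie, rating in edges:
        r[movie] += rating
    return dict(r)


def partial_sums(edges, r):
    """Assumed model of sum(val(r)) per person: add r over the movies rated."""
    sums = {}
    for person in dict.fromkeys(p for p, _m, _v in edges):
        movies = {m for p, m, _v in edges if p == person}
        sums[person] = sum(r[m] for m in movies)
    return sums


r = bind_facet_by_target(EDGES)
observed = partial_sums(EDGES, r)
total = sum(r.values())

print(r)         # {'Jurassic Park': 5, 'Indiana Jones': 4}
print(observed)  # {'Anne': 9, 'Brian': 4}
print(total)     # 9
```

Under these assumptions the model reproduces the reported numbers: 9 and 4 for the partial sums, and a total that happens to be correct because each edge is still counted exactly once overall.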

Fixing this might be complicated, as it might mean making variables work as a map from <uid, uid> to value rather than from uid to value.
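To illustrate the idea, here is a minimal sketch of a variable keyed by the (source, target) pair. It is a hypothetical model, not a proposal for how Dgraph should implement it; names stand in for UIDs, and both helper functions are made up for this example.

```python
from collections import defaultdict

# (person, movie, rating) triples from the mutation above.
EDGES = [
    ("Anne", "Jurassic Park", 5),
    ("Anne", "Indiana Jones", 2),
    ("Brian", "Indiana Jones", 2),
]


def bind_facet_by_edge(edges):
    """Hypothetical binding: key each facet value by its (source, target) pair,
    so two edges pointing to the same movie never share a slot."""
    return {(person, movie): rating for person, movie, rating in edges}


def partial_sums_by_edge(r_edge):
    """Sum the facet values per source, one entry per edge."""
    sums = defaultdict(int)
    for (person, _movie), rating in r_edge.items():
        sums[person] += rating
    return dict(sums)


r_edge = bind_facet_by_edge(EDGES)
expected = partial_sums_by_edge(r_edge)
expected_total = sum(r_edge.values())

print(expected)        # {'Anne': 7, 'Brian': 2}
print(expected_total)  # 9
```

With one entry per edge, the per-person sums match the expected 7 and 2, and the total stays at 9.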

MichelDiz commented:

I think this query can work around it, but I’m not sure it covers all scenarios.

{
  var(func: has(rated)) {
    rated {
      ~rated @facets(r as rating)
    }
  }
  partial(func: uid(r), orderdesc: val(r)) {
    name
    partial_rated_sum: val(r)
  }

  sum() {
    total_sum: sum(val(r))
  }

}

Result

{
  "data": {
    "partial": [
      {
        "name": "Anne",
        "partial_rated_sum": 7
      },
      {
        "name": "Brian",
        "partial_rated_sum": 2
      }
    ],
    "sum": [
      {
        "total_sum": 9
      }
    ]
  }
}
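A possible reading of why this workaround gives the expected numbers: following ~rated from each movie ends on the person, so the facet values are accumulated on person UIDs instead of movie UIDs. The sketch below models that reading under the same assumption as the original theory (values landing on the same UID are added together). It is not Dgraph's implementation, and the function name is hypothetical. Note that using ~rated requires the rated predicate to be indexed with @reverse in the schema.

```python
from collections import defaultdict

# (person, movie, rating) triples from the mutation above.
EDGES = [
    ("Anne", "Jurassic Park", 5),
    ("Anne", "Indiana Jones", 2),
    ("Brian", "Indiana Jones", 2),
]


def bind_facet_via_reverse(edges):
    """Assumed model of `~rated @facets(r as rating)`: the reverse edge runs
    from movie to person, so each value is added to the person's slot."""
    r_rev = defaultdict(int)
    for person, _movie, rating in edges:
        r_rev[person] += rating
    return dict(r_rev)


r_rev = bind_facet_via_reverse(EDGES)
workaround_total = sum(r_rev.values())

print(r_rev)             # {'Anne': 7, 'Brian': 2}
print(workaround_total)  # 9
```

If this reading is right, the workaround holds only when a per-person total is all that is needed. The individual per-movie values are collapsed into one number per person, which may be one of the scenarios it does not cover.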

MiLeung commented:

I am having the same issue with facet variables being scoped improperly and causing incorrect aggregation values. Is there an ETA on when this will be fixed?