Count distinct values of predicate: @groupby + count performance question

Hi Dgraph team,

I want to to count distinct values of a predicate. I have objects with a predicate o.color and would like to know how often I have green, blue etc.

Here is what I did:

curl -H "Content-Type: application/dql" -X POST localhost:8080/query -d $'{
 
    var(func: eq(o.tenant, "TheOneAndOnly")) {
			filtered as uid
    }
    o.color(func: uid(filtered)) @groupby(o.color){
        count(uid)
    }
	
}' | jq

Result:

"data": {
    "o.color": [
      {
        "@groupby": [
          {
            "o.color": "green",
            "count": 82439
          },
          {
            "o.color": "blue",
            "count": 82536
          },
          {
            "o.color": "yellow",
            "count": 82603
          },
          {
            "o.color": "orange",
            "count": 82676
          },
          {
            "o.color": "purple",
            "count": 82746
          },
          {
            "o.color": "red",
            "count": 83000
          }
        ]
      }
    ]
  }

This query gives me the desired results, but on a cluster with 500k objects it took 7 seconds to respond.

Using query blocks, i can significantly reduce the response time:

curl -H "Content-Type: application/dql" -X POST dgraphalpha2:8081/query -d $'{
   
		green(func: eq(o.color, "green")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    blue(func: eq(o.color, "blue")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    yellow(func: eq(o.color, "yellow")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    purple(func: eq(o.color, "purple")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    red(func: eq(o.color, "red")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    orange(func: eq(o.color, "orange")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}    	
}' | jq

Result:

"data": {
    "green": [
      {
        "count": 82439
      }
    ],
    "blue": [
      {
        "count": 82536
      }
    ],
    "yellow": [
      {
        "count": 82603
      }
    ],
    "purple": [
      {
        "count": 82746
      }
    ],
    "red": [
      {
        "count": 83000
      }
    ],
    "orange": [
      {
        "count": 82676
      }
    ]
  }

Responds in 135 ms!!

Big BUT: Now, I need to know all distinct values in advance.

Does anybody have an idea how to achieve my goal of counting distinct values of a predicate, in a performant way? Is there a way, performance for this operation could be optimized?

This was run on a 6 alpha cluster (2 shards @ 3 replicas). Dgraph version 21.03.2.

1 Like

This illustrates why Dgraph needs a query optimizer!

During a personal call with @gajanan we discussed the need for an optimizer and his ideas for such as well.

Even if you had a way to select distinct values (which afaik is impossible without restructuring your schema or doing it client side) you would still need to form the optimized query in a second round trip to the db. To do what you want, you would also need loops support in DQL so you could run the same query fpr each value element from another query all within a single trip to the db.