Count distinct values of predicate: @groupby + count performance question

Benjamin_Rupp · March 13, 2023, 3:39pm

Hi Dgraph team,

I want to to count distinct values of a predicate. I have objects with a predicate o.color and would like to know how often I have green, blue etc.

Here is what I did:

curl -H "Content-Type: application/dql" -X POST localhost:8080/query -d $'{
 
    var(func: eq(o.tenant, "TheOneAndOnly")) {
			filtered as uid
    }
    o.color(func: uid(filtered)) @groupby(o.color){
        count(uid)
    }
	
}' | jq

Result:

"data": {
    "o.color": [
      {
        "@groupby": [
          {
            "o.color": "green",
            "count": 82439
          },
          {
            "o.color": "blue",
            "count": 82536
          },
          {
            "o.color": "yellow",
            "count": 82603
          },
          {
            "o.color": "orange",
            "count": 82676
          },
          {
            "o.color": "purple",
            "count": 82746
          },
          {
            "o.color": "red",
            "count": 83000
          }
        ]
      }
    ]
  }

This query gives me the desired results, but on a cluster with 500k objects it took 7 seconds to respond.

Using query blocks, i can significantly reduce the response time:

curl -H "Content-Type: application/dql" -X POST dgraphalpha2:8081/query -d $'{
   
		green(func: eq(o.color, "green")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    blue(func: eq(o.color, "blue")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    yellow(func: eq(o.color, "yellow")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    purple(func: eq(o.color, "purple")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    red(func: eq(o.color, "red")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}
    orange(func: eq(o.color, "orange")) @filter(eq(o.tenant,  "TheOneAndOnly")) {
			count(uid)
		}    	
}' | jq

Result:

"data": {
    "green": [
      {
        "count": 82439
      }
    ],
    "blue": [
      {
        "count": 82536
      }
    ],
    "yellow": [
      {
        "count": 82603
      }
    ],
    "purple": [
      {
        "count": 82746
      }
    ],
    "red": [
      {
        "count": 83000
      }
    ],
    "orange": [
      {
        "count": 82676
      }
    ]
  }

Responds in 135 ms!!

Big BUT: Now, I need to know all distinct values in advance.

Does anybody have an idea how to achieve my goal of counting distinct values of a predicate, in a performant way? Is there a way, performance for this operation could be optimized?

This was run on a 6 alpha cluster (2 shards @ 3 replicas). Dgraph version 21.03.2.

amaster507 · March 13, 2023, 3:48pm

This illustrates why Dgraph needs a query optimizer!

During a personal call with @gajanan we discussed the need for an optimizer and his ideas for such as well.

Even if you had a way to select distinct values (which afaik is impossible without restructuring your schema or doing it client side) you would still need to form the optimized query in a second round trip to the db. To do what you want, you would also need loops support in DQL so you could run the same query fpr each value element from another query all within a single trip to the db.

Topic		Replies	Views
Select distinct values / deduplicate Dgraph kind:question , dql	2	1505	April 27, 2021
Remove Duplicates from Results GraphQL graphql , dgraph	3	3066	September 1, 2020
Find distinct values of predicates and sort them Dgraph	3	405	June 27, 2023
Count Queries in GraphQL GraphQL rfc , area:graphql	22	19908	March 19, 2021
Best way to count unique values? Dgraph	4	1449	September 12, 2023

Count distinct values of predicate: @groupby + count performance question

Related topics