Sorting and counting issue

Hello,

I am running the following two queries to count the number of nodes. I expected the results to be the same, but somehow I am getting different results. Is this the correct behavior?

{
  infoLevel(func: has(typedonation), orderdesc: amount) {
    count(uid)
  }
}

Result:
{
  "data": {
    "infoLevel": [
      {
        "count": 1000
      }
    ]
  }
}

When I just remove the sorting:

{
  infoLevel(func: has(typedonation)) {
    count(uid)
  }
}

Result:
{
  "data": {
    "infoLevel": [
      {
        "count": 191662
      }
    ]
  }
}

Can somebody help explain this behavior?

Thanks,
Marcelo


Let me ask you:
Why do you need to sort by amount? There is no need to sort if you are just trying to count the total number of donations. That query counts the total number of nodes that have typedonation, and you aren't querying any other predicates.

Also, maybe some of your nodes don't have amount values. In that case, the first query returns only the nodes that have an amount value, while the second one returns everything, whether it has an amount value or not.

But of course it's strange. It should return everything in both queries.

I am not trying to do a count. I was just trying to debug why I am getting different results on some aggregations when I use sorting.

I checked the query without sorting and all the nodes returned have amount.

I am starting to think there is a bug.

Thanks,
Marcelo

Sorry but I see a count here :stuck_out_tongue:

Well, when you query like this, do the two queries return different results?

{
  infoLevel(func: has(typedonation)) {
    uid
    name
    some
    someother
  }
}

{
  infoLevel(func: has(typedonation), orderdesc: amount) {
    uid
    name
    some
    someother
  }
}

If so, it could be a bug, but it's hard to say without looking at your schema and mutations. I would need to see the full context, or enough details to reproduce it here and confirm a bug.

As I said, my intention is not to have count in my final query. I am now using count to debug why an aggregation query returns different values when sorting is used.

When I run the query as you indicated, without count, the lists returned are definitely different. The one with sorting is much smaller.
The query latency is also much higher when sorting is not used, since the result set is much larger.
That is the reason I did the count; counting manually would be tedious and time-consuming.

Thanks,
Marcelo

I think it doesn't matter whether there is a sort or not; the result should not be different for this query.

If it happened, I believe this is not a bug. It can happen when your Dgraph is still inserting data and not yet stable. Please check your Dgraph servers' logs and see what happened exactly.

@shanghai-Jerry That is not the case. The data was loaded several days ago using the bulk loader, and the query returns the same results consistently.

I really think there is a bug.

Thanks,
Marcelo

Wow, it might be. I have no more ideas on this.

orderdesc:amount

Hi, I have a hypothesis.
Does all of your data have the amount predicate? Both

{
  typedonation
  amount
}

and

{
  typedonation
}

match has(typedonation). If some of the data doesn't have the amount predicate, could that be the reason for the difference?

I didn't run any test, it's just a hypothesis. :joy:

Good, that makes sense. It just needs more testing to prove that data missing the amount predicate influences the result.
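One way to test this hypothesis (a sketch, assuming the predicate names used above) is to count how many typedonation nodes are missing the amount predicate:

{
  missingAmount(func: has(typedonation)) @filter(NOT has(amount)) {
    count(uid)
  }
}

If this count is non-zero, those nodes could explain part of the difference between the two queries.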

try

{
  infoLevel(func: has(typedonation), orderdesc: amount) @filter(has(amount)) {
    count(uid)
  }
}

@margallardo

the same question.

{
  A(func: has(elementId), orderdesc: <panorama#Taxi/行驶距离>) @filter(has(<panorama#Taxi/行驶距离>)) {
    count(uid)
  }
  B(func: has(elementId)) @filter(has(<panorama#Taxi/行驶距离>)) {
    count(uid)
  }
}

Result:


{
  "A": [
    {
      "count": 1000
    }
  ],
  "B": [
    {
      "count": 22157
    }
  ]
}

dgraph version

Dgraph version : v1.0.5
Commit SHA-1 : 82787414
Commit timestamp : 2018-04-20 15:50:53 +1000
Branch : HEAD

I believe this problem has already been clarified. To work around the limitation, simply use pagination greater than 1000.

e.g:

A(func: has(price), orderdesc: <price>, first: 10000) {

Quoting below:

Well folks, it’s not a bug. This is a limitation by default.

// Sort and paginate directly as it’d be expensive to iterate over the index which
// might have millions of keys just for retrieving some values.

// Only retrieve up to 1000 results by default.

“if no “first” or “last” etc. argument is specified, it would default to 1000.” mrjn.
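Applied to the original query in this thread, the workaround would look something like this (a sketch; first: 1000000 is an arbitrary value chosen to exceed the total number of matching nodes):

{
  infoLevel(func: has(typedonation), orderdesc: amount, first: 1000000) {
    count(uid)
  }
}

With the explicit first argument, the sorted query is no longer capped at the 1000-result default and both queries should return the same count.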

Thank you for reporting this.

Cheers.


Thanks …

However, is it reasonable?


At the risk of necroing a thread, this has come up a bunch of times across the forum over the years, and I'd like to surface it to the new Dgraph team. I'm replying to this one as I feel it is your most complete response on the matter, @MichelDiz.

I just came across this issue again recently while my data science team was performing some risk analysis on our data sets. I don't think this default behaviour really makes sense.

I completely understand that you don't want users on shared clusters doing this, as it would degrade other users' experience.

However…

Given that one of the main reasons people choose Dgraph is to analyse large and disparate data sets with high performance, you'd imagine it would be very common to want to query a whole data set with sorting. 1000 results is not many when doing something like fraud detection based on user interactions over the last month (for example).

I'd like to propose removing these limitations from dedicated and self-hosted clusters. If it's for quality-of-service protection on shared clusters, I can understand the limitation, but for dedicated or self-hosted clusters I'm not sure I agree with it. In my opinion it's also a very reasonable 'upsell' for getting users to upgrade to a dedicated cluster.

If you don't want to do that, it would be great if you could include an option to disable this 'safety' feature in scenarios where it won't affect other users.

@Raphael

We could add an "unsafe" option to the --limit flag (Limit options).

Please open a feature request. It would be good to have more places where ~unsafe can be set, though.

@Mentioum Those are good points. Let me dig into this topic with Michel's help and decide how we can better support those use cases.
