Sorting and counting issue

Hello,

I am running the following two queries to count the number of nodes. I expected the results to be the same, but somehow I am getting different results. Is this the correct behavior?

{
  infoLevel(func: has(typedonation), orderdesc: amount) {
    count(uid)
  }
}

Result:
{
  "data": {
    "infoLevel": [
      {
        "count": 1000
      }
    ]
  }
}

When I just remove the sorting:

{
  infoLevel(func: has(typedonation)) {
    count(uid)
  }
}

Result:
{
  "data": {
    "infoLevel": [
      {
        "count": 191662
      }
    ]
  }
}

Can somebody help explain this behavior?

Thanks,
Marcelo


Let me ask you:
Why do you need to sort by amount? There is no need to sort if you are just trying to count the total number of donations. That query counts the total number of nodes that have typedonation, and you aren't querying any other predicates.

Also, maybe some of your nodes don't have amount values. In that case, the first query returns only the nodes that have an amount value, while the second one returns everything, whether it has an amount value or not.

But of course it's strange. It should return everything in both queries.

I am not trying to do a count. I was just trying to debug why I am getting different results on some aggregations when I use sorting.

I checked the query without sorting and all the nodes returned have amount.

I am starting to think there is a bug.

Thanks,
Marcelo

Sorry but I see a count here :stuck_out_tongue:

Well, when you query like this, do the two queries return different results?

{
  infoLevel(func: has(typedonation)) {
    uid
    name
    some
    someother
  }
}

{
  infoLevel(func: has(typedonation), orderdesc: amount) {
    uid
    name
    some
    someother
  }
}

If so, it could be a bug, but it's hard to say without looking at your schema and mutations. I would need to see the full context, or enough details to reproduce it here and confirm a bug.

As I said, my intention is not to have count in my final query. I am now using count to debug why an aggregation query returns different values when sorting is used.

When I run the query as you indicated, without count, the lists returned are definitely different. The one with sorting is much smaller.
The query latency is also much higher when sorting is not used, since the result set is much larger.
That is the reason I did the count; counting manually would be tedious and time-consuming.

Thanks,
Marcelo

I think it doesn't matter whether there is a sort or not; the result should not be different for this query.

If it happened, I believe this is not a bug. It can happen when your Dgraph is still inserting data and not yet stable. Please check your Dgraph servers' logs and see what happened exactly.

@shanghai-Jerry That is not the case. The data was loaded several days ago using the bulk loader, and the query returns the same results consistently.

I really think there is a bug.

Thanks,
Marcelo

Wow, it might be. I have no more ideas on this.

orderdesc:amount

Hi, I have a hypothesis.
Does all of your data have the amount predicate? Both

{
  typedonation
  amount
}

and

{
  typedonation
}

match has(typedonation). If some of the data doesn't have the amount predicate, could that be the reason for the difference?

I didn't run any test, it's just a hypothesis. :joy:

Good, that makes sense. It just needs more testing to prove that data missing the amount predicate influences the result.
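One way to test this hypothesis (a sketch, assuming the predicate names used above) is to count how many typedonation nodes are missing the amount predicate:

{
  missingAmount(func: has(typedonation)) @filter(NOT has(amount)) {
    count(uid)
  }
}

If this count is non-zero, those nodes could explain part of the difference between the two queries.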

try

{
  infoLevel(func: has(typedonation), orderdesc: amount) @filter(has(amount)) {
    count(uid)
  }
}

@margallardo

the same question.

{
  A(func: has(elementId), orderdesc: <panorama#Taxi/行驶距离>) @filter(has(<panorama#Taxi/行驶距离>)) {
    count(uid)
  }
  B(func: has(elementId)) @filter(has(<panorama#Taxi/行驶距离>)) {
    count(uid)
  }
}

Result:


{
  "A": [
    {
      "count": 1000
    }
  ],
  "B": [
    {
      "count": 22157
    }
  ]
}

dgraph version

Dgraph version : v1.0.5
Commit SHA-1 : 82787414
Commit timestamp : 2018-04-20 15:50:53 +1000
Branch : HEAD

I believe this problem has already been clarified. To work around the limitation, simply use pagination greater than 1000.

e.g:

A(func: has(price), orderdesc: <price>, first: 10000) {

Quoting below:

Well folks, it’s not a bug. This is a limitation by default.

// Sort and paginate directly as it’d be expensive to iterate over the index which
// might have millions of keys just for retrieving some values.

// Only retrieve up to 1000 results by default.

“if no “first” or “last” etc. argument is specified, it would default to 1000.” mrjn.
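Applied to the original query in this thread, the workaround would look something like this (a sketch; first: 1000000 is an arbitrary value chosen to exceed the total number of matching nodes):

{
  infoLevel(func: has(typedonation), orderdesc: amount, first: 1000000) {
    count(uid)
  }
}

With the explicit first argument, the sorted query is no longer capped at the 1000-result default and both queries should return the same count.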

Thank you for reporting this.

Cheers.


Thanks …

However, is it reasonable?


At the risk of necroing a thread, this has come up a bunch of times across the forum over the years, and I'd like to surface it to the new Dgraph team. I'm replying to this one as I feel it is your most complete response on the matter, @MichelDiz.

I just came across this issue again recently while my data science team was performing some risk analysis on our data sets. I don't think this default behaviour really makes sense.

I completely understand that you don't want users on shared clusters doing this, as it would degrade other users' experience.

However…

Given that one of the main reasons people choose Dgraph is to analyse large and disparate data sets with high performance, you'd imagine it would be very common to want to query a whole data set with sorting. 1000 results is not many when doing something like fraud detection based on user interactions over the last month (for example).

I'd like to propose removing these limitations from dedicated and self-hosted clusters. If it's for quality-of-service protection on shared clusters, I can understand the limitation, but for dedicated or self-hosted clusters I'm not sure I agree with it. In my opinion it's also a very reasonable 'upsell' for getting users to upgrade to a dedicated cluster.

If you don't want to do that, it would be great if you could include an option to disable this 'safety' feature in scenarios where it won't affect other users.

@Raphael

We could add an "unsafe" option to the --limit flag (Limit options).

Please open a feature request. It would be good to have more places where ~unsafe can be set, though.

@Mentioum Those are good points. Let me dig into this topic with Michel's help and decide how we can better support those use cases.
