Query is very slow when adding an le function on a float predicate in the filter


Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
Dgraph version   : v22.0.0
Dgraph codename  : dgraph
Dgraph SHA-256   : bc4cc6d649fa2328df1d9d5702b30e2204934ba0b58cbbfa6816d52e98a9d191
Commit SHA-1     : c36206a
Commit timestamp : 2022-10-21 11:32:45 +0000
Branch           : release/v22.0.0
Go version       : go1.18.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud   , visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.

Performance problem

When I run the following query, it is extremely slow:

{
    q(func: eq(cid3_1, 717), first: 1000) @filter(eq(brand_1, "GODEX") and le(price_1, 2000)) {
        sku_id_1
        cid3_1
        price_1
    }
}

It takes 3.6 seconds!

However, it’s much faster after I remove the le(price_1, 2000) function from the filter:

{
    q(func: eq(cid3_1, 717), first: 1000) @filter(eq(brand_1, "GODEX")) {
        sku_id_1
        cid3_1
        price_1
    }
}

It only takes 67.6 ms!

Schema:

Define Types

type Product_1 {
    sku_id_1
    title_1
    brand_1
    cid3_1
    price_1
    click_cnt_1
    attrs_1
    has_tag_1
    relate_sku_1
}

Define Directives and Indexes


sku_id_1: string @index(exact) @upsert .
title_1: string @index(fulltext) .
brand_1: string @index(hash) .
cid3_1: string @index(hash) .
price_1: float @index(float) .
click_cnt_1: int @index(int) .
attrs_1: string @index(term) .
has_tag_1: [uid] @reverse @count .
relate_sku_1: [uid] @count .

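To confirm the indexes above are actually applied (in particular the float index on price_1), a DQL schema query along these lines can be used; the predicate names are taken from the schema above:

schema(pred: [cid3_1, brand_1, price_1]) {
    type
    index
    tokenizer
}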
Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

CPU: Intel(R) Xeon(R) CPU @ 2.60GHz, 16 Cores.
RAM: 32G
OS: CentOS Linux release 7.2

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Run the query without the float predicate in the filter and record the time taken.
  2. Run the query with the float predicate in the filter and record the time taken.
     You will see that the second query is much slower.

Slow query:

{
    q(func: eq(cid3_1, 717), first: 1000) @filter(eq(brand_1, "GODEX") and le(price_1, 2000)) {
        sku_id_1
        cid3_1
        price_1
    }
}

Fast query:

{
    q(func: eq(cid3_1, 717), first: 1000) @filter(eq(brand_1, "GODEX")) {
        sku_id_1
        cid3_1
        price_1
    }
}

Expected behaviour and actual result.

Expected: the query should also be fast when using a float predicate (with a float index) in the filter. Actual: the query with le(price_1, 2000) takes 3.6 seconds, versus 67.6 ms without it.


Hey @zedzhu,

Thanks for your post. I created a similar (?) environment with 250k “Product” nodes across 50 CID groups, with 5000 products in each group. Brand IDs for each product were randomly selected from a group of 9 brands, and prices were randomly selected between 10 and 3000. Not sure if that’s similar to your data, but it seemed like a good guess.
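Roughly, each generated product corresponds to a mutation like the one below (an illustrative sketch, not literal output from my generator; the specific values are made up):

{
    set {
        _:p1 <dgraph.type> "Product_1" .
        _:p1 <sku_id_1>    "SKU-000001" .
        _:p1 <cid3_1>      "717" .
        _:p1 <brand_1>     "GODEX" .
        _:p1 <price_1>     "1999.5" .
    }
}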

On my local, single machine cluster (Mac Mini M1), the query latency is pretty reasonable, ~250ms.

  "extensions": {
    "server_latency": {
      "parsing_ns": 2272209,
      "processing_ns": **267562792**,
      "encoding_ns": 11001667,
      "assign_timestamp_ns": 9761292,
      "total_ns": 293255666
    },
    "txn": {
      "start_ts": 159
    },
    "metrics": {
      "num_uids": {
        "": 5000,
        "_total": 16125,
        "brand_1": 5000,
        "cid3_1": 375,
        "price_1": 5375,
        "sku_id_1": 375
      }
    }
  }

Can you share if the data I created is similar to yours?

Also, here’s the code I used to create my test environment: GitHub - matthewmcneely/dgraph-v21.03-sandbox at testdata-products-v22

There are 2 Alpha groups with 3 nodes in each group in my environment, and your data is similar to mine. I wonder whether, when there are multiple functions in @filter, each one runs independently and the results are then intersected; if the query runs that way, it’s going to be slow.

@zedzhu I haven’t had time to set up an HA cluster to test this query yet, but in the meantime could you try the following query? One thing about DQL is that it really is just as much a set query language as it is a graph query language. The idea of the query below is to reduce the initial set using the most restrictive predicate first, then keep reducing the set with additional predicates. You can use the num_uids values in the metrics of the extended results as a guide.

{
  CID_SET as var(func: eq(cid3_1, "717"))
  
  BRAND_SET as var(func: uid(CID_SET)) @filter(eq(brand_1, "GODEX"))

  PRICE_SET as var(func: uid(BRAND_SET)) @filter(le(price_1, 2000))
    
  q(func: uid(PRICE_SET), first: 1000, offset: 0) {
    sku_id_1
    cid3_1
    price_1
  }
}

Great! Your query runs much faster in my cluster; now it takes only 202.7 ms! Thanks so much!

"extensions": {
   "server_latency": {
     "parsing_ns": 92916,
     "processing_ns": 199125991,
     "encoding_ns": 2606398,
     "assign_timestamp_ns": 683008,
     "total_ns": 202727903
   },
   "txn": {
     "start_ts": 5967681
   },
   "metrics": {
     "num_uids": {
       "": 0,
       "_total": 547233,
       "brand_1": 544081,
       "cid3_1": 639,
       "price_1": 1874,
       "sku_id_1": 639
     }
   }

@matthewmcneely could you say more about why this is faster? Naively, it seems the original query would also first query for UIDs using cid3_1=“717” and then filter by the brand and price. But obviously there is a big difference here. When you say to reduce the initial set, what is that initial set, and why is it reduced? Thanks!

@Damon My understanding is that the DQL parser is not capable of optimizing queries in the manner I recommended zedzhu try. Take a look at this comparison of the query.Result objects returned from DQL parsing of both queries: the result of the original query is on the left, and that of the query I recommended to zedzhu is on the right.

Link: JSONCompare - The Advanced JSON Linting & Comparison Tool

Note the chain of QueryVars starting at 383 (right side). It’s my understanding that this forces query processing to build subsequent, dependent subgraphs using the parameters defined in each query step, which reduces the first subgraph to just the first query’s results. The original, slower query gloms everything together in “one query” (and consequently visits all UIDs in the query space).

Whenever I query a large graph, I always approach it by trying to find the smallest set (subgraph) first. Of course that requires an understanding of the composition of one’s graph, but hey, there’s no free lunch.
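For example, one quick way to see which predicate yields the smallest starting set is to count each candidate root separately, along the lines of:

{
    cid_count(func: eq(cid3_1, "717")) {
        count(uid)
    }
    brand_count(func: eq(brand_1, "GODEX")) {
        count(uid)
    }
    price_count(func: le(price_1, 2000)) {
        count(uid)
    }
}

The block names (cid_count, etc.) are arbitrary; the predicates and values come from zedzhu’s query.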

An original team member wrote some thoughts on a query planner that I think might address some of the issues and is an interesting read: Query Planner - #3 by chewxy