May be adding BTree index

Hi!

As I understand, Dgraph aims to become a next-gen operational database, which can replace SQL databases in some cases. I personally really like this vision.

To make this vision come true I’d like to propose an essential feature without which replacing SQL isn’t possible: a BTree index.

There are many use cases for it. For example: autocomplete in a search field is usually implemented as a prefix-based search using btree indexes.

ArangoDB has btree indexes, Dgraph doesn’t, yet this functionality can be simulated by trigram index.

Dataset is cities of the world(where at least 1000 peoples lives): http://download.geonames.org/export/dump/cities1000.zip

Dataset is 132595 cities, where this predicates( name - row, alternatives, region, country_code, population, coordinate, timezone ) are saved in database.

Desired feature is autocomplete, which shows 10 cities (sorted from biggest to smallest) to select.
I have tested, how long query takes (uncached by Arangodb as first call always slower, than subsequent queries on the same query) with Dgraph.

Dgraph + trigram on name looks like (GRPC interface):

query all($name: string, $limit: int)
{
  results(func: regexp(location.name, $name), orderdesc: location.population, first: $limit) {
    expand(_all_)
  }
}

Limitation, autocomplete can start from 3 letters (because of trigram index, btree index doesn’t have this limitation).

Queries results (search term: time):

ber: 78ms
fra: 39ms
lon: 50ms
new: 29ms
sam: 30ms
par: 50ms
tok: 25ms
sin: 40ms

Arangodb + prefix search on name, where query looks like (via HTTP interface):

{
  "filters": ["FILTER e.population != @v1"],
  "for": "FOR e IN FULLTEXT(location, \"name\", @v0)",
  "limit": "LIMIT 0, 10",
  "return": "RETURN e",
  "sort": "SORT e.population DESC",
  "vars": {"v0": "prefix:ber", "v1": 0}
}

Queries results ( search term: time ):

ber: 11ms
fra: 10ms
lon: 6ms
new: 6ms
sam: 6ms
par: 12ms
tok: 5ms
sin: 14ms

Depending on a search term arangodb is from 4-5 to 8 times faster.

If you want to implement pagination, based on something sorted, than without btree index, such query probably will be very slow, because it needs to do full scan of population predicate before returning results:

{
  locations(func: type(Location), ordedesc: population, after: <some_uid>, first: 10) {
      name
  }
}

My proposal would be to support btree index to allow implementing autocomplete and sorted pagination and other features(there are definitely more, which depends on btree index) on top of DGraph.

8 Likes

Dgraph doesn’t do Btree, because it uses bucketing as a mechanism for indexing, which works better with the existing design of posting lists. That’s how date index does sorting, by bucketing all the events with the same granularity (day, hour, minute, second) into the same bucket. So, when iteration needs to happen, Dgraph looks at the contents of the entire bucket in one loop.

Now, population is most likely an integer. And we didn’t implement a granularity for integer indexing. So it is possible that the bucket sizes for integer indices can be 1. We could look into allowing granularity specification for integer indexing as well, like we do for datetime.

Thank you for your answer!

So I checked now the queries with population sorting and without population sorting, they are twice faster without population sorting, still slower than prefix-based query on BTree index in ArangoDB, but it could be definitely improvement.

Do I understand, that bucketing should replace BTree index and have equivalent speed?

And will make all this queries much faster?

(from linked thread: Query performance of large database (over 12g edges) - #2 by liveforeverx)

{ q(func:has(price), first: 10, orderdesc: price){ uid } } --> takes 247sec

This query with BTree index should return results in no time, which complexity it will have with bucketing strategy?

What if for sorting the string is used? Does bucketing strategy is available for strings? ( Example: No cursor-based pagination and offset-based pagination is slow ) where sorting slowdown.

I still can’t get pagination to work fast in my tech-demo because of #3472 and #3473 but Btree indexes would be the solution I, and anyone else facing a similar problem, desperately need.

Let’s say we have many (100k+) post entities consisting of the following nodes:

Post.id: string @index(exact) .
Post.publication: dateTime .
Post.title: string .
Post.contents: string .
Post.author: uid .
Post.reactions: uid .

And a GraphQL API:

type Post {
  id: ID!
  publication: Time!
  title: String!
  contents: String!
  author: User!
  reactions: ReactionList!
}

type PostList {
  size: Int!
  version: String!

  # list lists all posts sorted by their publication time (desc)
  list(after: ID, limit: Int!): [Post!]!
}

query {
  posts: PostList!
}

PostList.list is paginable and we can use it this way to get the first page of posts:

query($max: Int!) {
  posts {
    size
    version
    list(limit: $max) {
      id
      title
      author { d displayName }
    }
  }
}

…and the second page…

query($max: Int!, $after: ID!) {
  posts {
    size
    version
    list(limit: $max, after: $after) {
      id
      title
      author { id displayName }
    }
  }
}

Because order & after don’t work together I have to resort to offset-based pagination which is incredibly slow on large datasets though:

type PostList {
  size: Int!
  version: String!

  # list lists all posts sorted by their publication time (desc)
  list(offset: Int!, limit: Int!): [Post!]!
}
{
  posts(
    func: has(Post.id),
    orderdesc: Post.publication,
    first: 10,
    offset: 99990
  ) {
    uid
    Post.id
    Post.title
    Post.contents
    Post.publication
  }
}

In SQL, I’d just apply an index on the posts.publication column:

CREATE TABLE posts (
  id varchar(32) PRIMARY KEY,
  title text,
  contents text,
  publication timestamp,
  author varchar(32)
);

CREATE INDEX post_pub ON posts (publication);

Which would make these kinds of queries very fast:

-- Page #1 (p9,p8,p7)
SELECT * FROM "posts"
ORDER BY "publication" DESC, "id" DESC
LIMIT 3;

-- Page #2 (p6,p5,p4)
SELECT * FROM "posts"
WHERE "id" < 'p7'
ORDER BY "publication" DESC, "id" DESC
LIMIT 3;

Working demo: SQL Fiddle

Since the database engine doesn’t need to do a full-table-scan then a sort and then a reduction of the resulting data set to the first 3 elements, but this is what Dgraph seems to be doing all the time!

A Btree index would allow Dgraph to avoid scanning all posts in a query like this one using a combination of after and first:

{
  posts(
    func: has(Post.id),
    orderdesc: Post.publication,
    orderdesc: Post.id,
    first: 10,
    after: 0xb
  ) {
    uid
    Post.id
    Post.title
    Post.contents
    Post.publication
  }
}

…where 0xb is the node we found by previously searching for func: eq(Post.id, "p3")

If Dgraph aims to be an actual operational database and an SQL replacement then Btree indexes are inevitable it seems. So far I’ve not found a way to make these kinds of queries fast in Dgraph, and they’re terribly slow (several seconds on datasets smaller than a million entries is nuts!) and the issues on GitHub seem to be stalled.

2 Likes

We’re struggle with same problem in some use cases where we need paginate large sorted result set.

has func is designed to NOT use any index. It is slow, because it has to iterate to generate the entire list of uids, all the nodes in the predicate. So, I think you’re confusing one thing with another.

If you want to use an index, you could index price as float or int. And then ask for q(func: lt(price, _some_very_high_value), first: 10, orderdesc: price). That should be much faster.

1 Like