May be adding BTree index

(Dmitry Russ) #1


As I understand, Dgraph aims to become a next-gen operational database, which can replace SQL databases in some cases. I personally really like this vision.

To make this vision come true I’d like to propose an essential feature without which replacing SQL isn’t possible: a BTree index.

There are many use cases for it. For example: autocomplete in a search field is usually implemented as a prefix-based search using btree indexes.

ArangoDB has btree indexes, Dgraph doesn’t, yet this functionality can be simulated by trigram index.

Dataset is cities of the world(where at least 1000 peoples lives):

Dataset is 132595 cities, where this predicates( name - row, alternatives, region, country_code, population, coordinate, timezone ) are saved in database.

Desired feature is autocomplete, which shows 10 cities (sorted from biggest to smallest) to select.
I have tested, how long query takes (uncached by Arangodb as first call always slower, than subsequent queries on the same query) with Dgraph.

Dgraph + trigram on name looks like (GRPC interface):

query all($name: string, $limit: int)
  results(func: regexp(, $name), orderdesc: location.population, first: $limit) {

Limitation, autocomplete can start from 3 letters (because of trigram index, btree index doesn’t have this limitation).

Queries results (search term: time):

ber: 78ms
fra: 39ms
lon: 50ms
new: 29ms
sam: 30ms
par: 50ms
tok: 25ms
sin: 40ms

Arangodb + prefix search on name, where query looks like (via HTTP interface):

  "filters": ["FILTER e.population != @v1"],
  "for": "FOR e IN FULLTEXT(location, \"name\", @v0)",
  "limit": "LIMIT 0, 10",
  "return": "RETURN e",
  "sort": "SORT e.population DESC",
  "vars": {"v0": "prefix:ber", "v1": 0}

Queries results ( search term: time ):

ber: 11ms
fra: 10ms
lon: 6ms
new: 6ms
sam: 6ms
par: 12ms
tok: 5ms
sin: 14ms

Depending on a search term arangodb is from 4-5 to 8 times faster.

If you want to implement pagination, based on something sorted, than without btree index, such query probably will be very slow, because it needs to do full scan of population predicate before returning results:

  locations(func: type(Location), ordedesc: population, after: <some_uid>, first: 10) {

My proposal would be to support btree index to allow implementing autocomplete and sorted pagination and other features(there are definitely more, which depends on btree index) on top of DGraph.

Query performance of large database (over 12g edges)
Is dgraph fit for time series and finding equal value in array