Understanding why `count(~fooPredicate) @filter(eq(dgraph.type, "fooType"))` is slow

geoyws · May 8, 2020, 5:34pm

Just need some clarification on my understanding.

Schema:

type Hotel {
	name
}

type Room {
	hotel
	name
}

type Ledger {
	hotel
	room
	createdTs
	amount
}

type Cafe {
    name
    hotel
}

name: string @index(exact, term) .
hotel: uid @reverse .
room: uid @reverse .
createdTs: datetime @index(hour) .
amount: int @index(int) .

Query:

{
  getRoomCountForAllHotels(func: eq(dgraph.type, "Hotel")) {
    roomCount: count(~hotel) @filter(eq(dgraph.type, "Room"))
  }
}

I’ve got 100 hotels, and each hotel has 1000 rooms and 1 cafe.
After a few minutes this returns a `context deadline exceeded` error.

I’m assuming because Badger is inspired by RocksDB which was inspired by LevelDB, the storage engine would be sort of a hexastore… is this true?

So then my intuition told me:

Perhaps the query planner might use the indexed hexastore to find the hotels via Predicate(dgraph.type)-Object(“Hotel”)-Subject(uid) and load 100 of the hotel uids into memory,
and for each of these hotel uids, we’ll go into a nested loop and resolve their ~hotel via Predicate(hotel)-Object(hotel_uid)-Subject(room_uid or cafe_uid) which would bring up 100K room_uids and 100 cafe_uids that would be loaded into memory.
Then we’d have to filter these 100K + 100 records by querying each of them (my goodness) via Subject(room_uid or cafe_uid)-Predicate(dgraph.type)-Object(“Room”) to make sure they’re a dgraph.type of Room and not Cafe.
Then we’d sum them up according to their hotel uid (the RDBMS equivalent of GROUP BY) and return the values.
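The steps above can be sketched in a few lines of Python. This only models the mental picture described in this post (a predicate-object-subject index plus a per-uid type check). It is not Dgraph's actual query execution, and the helper names `build_triples` and `room_counts`, the toy uids, and the data sizes are all made up for illustration.

```python
from collections import defaultdict


def build_triples(num_hotels=3, rooms_per_hotel=4):
    """Generate toy (subject, predicate, object) triples: each hotel gets
    some rooms and one cafe, all pointing back through `hotel`."""
    triples = []
    next_uid = 1
    for _ in range(num_hotels):
        hotel = next_uid
        next_uid += 1
        triples.append((hotel, "dgraph.type", "Hotel"))
        for _ in range(rooms_per_hotel):
            room = next_uid
            next_uid += 1
            triples.append((room, "dgraph.type", "Room"))
            triples.append((room, "hotel", hotel))
        cafe = next_uid
        next_uid += 1
        triples.append((cafe, "dgraph.type", "Cafe"))
        triples.append((cafe, "hotel", hotel))
    return triples


def room_counts(triples):
    """Count rooms per hotel the way the post imagines it."""
    pos = defaultdict(list)  # (predicate, object) -> [subject]
    spo = defaultdict(list)  # (subject, predicate) -> [object]
    for s, p, o in triples:
        pos[(p, o)].append(s)
        spo[(s, p)].append(o)

    counts = {}
    # Find every hotel through the type index.
    for hotel in pos[("dgraph.type", "Hotel")]:
        # Resolve ~hotel: every uid (room or cafe) pointing at this hotel.
        candidates = pos[("hotel", hotel)]
        # Check each candidate's type one by one, then count per hotel.
        counts[hotel] = sum(
            1 for uid in candidates if "Room" in spo[(uid, "dgraph.type")]
        )
    return counts


if __name__ == "__main__":
    print(room_counts(build_triples()))
```

The per-candidate type check is the part that worries me: it is one lookup for each of the 100K + 100 uids.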

Am I sort of correct?

pawan · May 11, 2020, 10:04am

This filtering step is essentially an intersection of two sorted lists where the two lists are

List 1 => uids reached through ~hotel for a hotel (its rooms, plus its cafe)
List 2 => uids of all rooms

This intersection would happen for every hotel.
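As a rough illustration of that filtering step, here is a minimal Python sketch of intersecting two sorted uid lists with a linear merge, repeated once per hotel. Dgraph itself is written in Go and may pick a different intersection strategy depending on list sizes, so treat the functions `intersect_sorted` and `room_count_per_hotel` and the sample uids as hypothetical.

```python
def intersect_sorted(a, b):
    """Return the uids present in both sorted lists, using a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def room_count_per_hotel(reverse_hotel_edges, all_room_uids):
    """For every hotel, intersect its ~hotel uids (List 1) with the
    uids of all rooms (List 2) and count what is left."""
    return {
        hotel: len(intersect_sorted(uids, all_room_uids))
        for hotel, uids in reverse_hotel_edges.items()
    }


if __name__ == "__main__":
    # Made-up uids: hotel 1 has rooms 2, 3, 4 and cafe 9;
    # hotel 5 has rooms 6, 7, 8 and cafe 10.
    reverse_hotel_edges = {1: [2, 3, 4, 9], 5: [6, 7, 8, 10]}
    all_room_uids = [2, 3, 4, 6, 7, 8]
    print(room_count_per_hotel(reverse_hotel_edges, all_room_uids))
```

With a linear merge, each hotel's intersection walks List 2 in the worst case, so the total work grows with the number of hotels times the number of rooms overall, not just the rooms of that hotel.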

I am still surprised that your query doesn’t return after several minutes with such a small amount of data. Could you provide us with a sample data set? That would help us dig deeper into your issue.

Another suggestion: wouldn’t it be better if your schema looked like this instead?

type Hotel {
	name
	rooms
	ledgers
	cafes
}

type Room {
	name
}

type Ledger {
	rooms
	createdTs
	amount
}

type Cafe {
    name
}

name: string @index(exact, term) .
rooms: [uid] @reverse .
ledgers: [uid] @reverse .
cafes: [uid] @reverse .
createdTs: datetime @index(hour) .
amount: int @index(int) .

Then your query won’t need the @filter on dgraph.type at all:

{
  getRoomCountForAllHotels(func: eq(dgraph.type, "Hotel")) {
    roomCount: count(rooms)   
  }
}
