I’m using Dgraph to import some arbitrary data sets and join them together on certain matching keys.
In my tests, I’ve got four large data sets ranging from 700k to 3 million documents each. The sets are joined on predicates such as email or UUID.
Ratel tells me this is the data usage across the tablets:
last_name, 1.5TB
full_name, 320.0GB
_system_mapped_title, 50.7GB
first_name, 27.5GB
user_id, 18.7GB
email, 18.6GB
phone, 18.0GB
created_at, 5.7GB
mobile, 5.2GB
dgraph.type, 4.6GB
id, 164.3MB
address, < 64MB
city, < 64MB
company_name, < 64MB
... and 15 more ..., < 64MB
Each predicate has both “hash” and “fulltext” indices enabled. The tool searches through all of the predicates using alloftext() queries joined with an “OR”.
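For reference, this is roughly the shape of the query the tool issues — a sketch only, assuming a `Person` type (the type name and search term are hypothetical; the predicates are from the tablet list above):

```
{
  matches(func: type(Person)) @filter(
      alloftext(full_name, "jane doe") OR
      alloftext(email, "jane doe") OR
      alloftext(last_name, "jane doe")) {
    uid
    full_name
    email
  }
}
```

In the real tool the `alloftext()` clauses span all of the indexed predicates, OR’d together as described above.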
During testing with smaller data sets (~50k documents on average), the query time was very fast. Now that it has these larger data sets, the query time has increased massively to 10 seconds or longer.
Will a clustered setup bring the query time back down, or should I be doing something else here? I was contemplating a 3- or 6-node cluster, but I would like some advice on how much RAM each node should have and whether this would definitely fix the problem.
Thanks in advance for any help or advice you can give.