I’m using Dgraph to import some arbitrary data sets and join them together on certain matching keys.
In my tests, I’ve got four large data sets ranging from 700k to 3 million documents each. The sets are joined on predicates such as email or UUID.
Ratel tells me this is the data usage across the tablets:
last_name, 1.5TB
full_name, 320.0GB
_system_mapped_title, 50.7GB
first_name, 27.5GB
user_id, 18.7GB
email, 18.6GB
phone, 18.0GB
created_at, 5.7GB
mobile, 5.2GB
dgraph.type, 4.6GB
id, 164.3MB
address, < 64MB
city, < 64MB
company_name, < 64MB
... and 15 more ..., < 64MB
Each predicate has both “hash” and “fulltext” indices enabled. The tool searches through all of the predicates using alloftext() queries joined with an “OR”.
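For reference, this is roughly the shape of the query the tool issues — a sketch only, assuming a `Person` type (the type name and search term are hypothetical; the predicates are from the tablet list above):

```
{
  matches(func: type(Person)) @filter(
      alloftext(full_name, "jane doe") OR
      alloftext(email, "jane doe") OR
      alloftext(last_name, "jane doe")) {
    uid
    full_name
    email
  }
}
```

In the real tool the `alloftext()` clauses span all of the indexed predicates, OR’d together as described above.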
During testing with smaller data sets (~50k documents on average), the query time was very fast. Now that it has these larger data sets, the query time has increased massively to 10 seconds or longer.
Will a clustered setup bring the query time back down, or should I be doing something else here? I was contemplating a 3- or 6-node cluster, but I would like some advice on how much RAM each node should have and whether this would definitely fix the problem.
Thanks in advance for any help or advice you can give.