Complex edge filtering on large data is too slow. I'm not sure what mistake I'm making. Can you give me some advice and help?



I now plan to use Dgraph in production to solve the inefficiency of complex join queries in a relational database, but I have run into difficulties.
I converted some relational database tables into Dgraph nodes and created uid predicates according to the relationships between the tables. After building the model, I filter node data with Dgraph scalar predicates and traverse the relationships between nodes with uid predicates. My environment is as follows.

Test environment:
Dgraph version v1.0.14
Hardware: 8 GB RAM, 8-core Intel Core i7-6700HQ CPU @ 2.60 GHz, Windows 10

  1. Seventeen tables (20 columns each) in the relational database hold 200,000 rows in total, with 18 relationships among them. I converted this data into an RDF file and a schema file.
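For example, one row value and one table relationship become triples roughly like this (the blank-node names here are just illustrative, not my actual export):

<_:jiedao_1> <kjk_jiedao.tydzbm> "440305007" .
<_:jzw_1> <jiedaogldylink> <_:jiedao_1> .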

  2. Loaded the generated RDF file (200 MB, 3 million lines of RDF) and the schema file into a fresh single-instance Dgraph using dgraph bulk.
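The bulk load command was roughly like this (flag names may differ slightly between Dgraph versions; this is the v1.0.x form, and the file names are placeholders):

dgraph bulk -r data.rdf -s data.schema --map_shards=1 --reduce_shards=1 --zero=localhost:5081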

  3. Started Dgraph with the following commands

dgraph zero --my=localhost:5081 -o=1
dgraph alpha --lru_mb 1024 --badger.tables ram --my=localhost:7081 --zero=localhost:5081 -o=1
  4. Then I run some filtering queries along the edges, like this
{
  var(func: eq(kjk_jiedao.tydzbm, "440305007")) {
    ~jiedaogldylink {
      jzwgldylink {
        fwgldylink {
          ~frzcdlink @filter(eq(jck_fr_jcxx.frlb, "11", "12")) {
            frssztlink @filter(eq(ywk_fr_sszt.hyfl, "I", "O") AND eq(ywk_fr_sszt.qyzt, "6", "8") AND eq(ywk_fr_sszt.ssztlx, "01")) {
              result as ~frssztlink @filter(eq(jck_fr_jcxx.frlb, "91", "92"))
            }
          }
        }
      }
    }
  }
  total(func: uid(result)) { total: count(uid) }
  body(func: uid(result), first: 10, offset: 0) { expand(_all_) }
}
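For reference, my schema declares indexes for the filtered scalar predicates (eq() only uses an index if one is declared) and @reverse for the edges I traverse backwards with ~. A sketch of the relevant entries (predicate names are from the query above; the exact index types are from memory):

kjk_jiedao.tydzbm: string @index(exact) .
jck_fr_jcxx.frlb: string @index(exact) .
ywk_fr_sszt.hyfl: string @index(exact) .
ywk_fr_sszt.qyzt: string @index(exact) .
ywk_fr_sszt.ssztlx: string @index(exact) .
jiedaogldylink: uid @reverse .
frzcdlink: uid @reverse .
frssztlink: uid @reverse .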

In such an environment, queries take 2-3 seconds.

Later, I applied the same approach to a relational database with the same table structure but 32 million rows in total (only two tables were very large, with about 14 million rows each).

The generated RDF file is about 27 GB.

I used eight virtual machines, each with 16 GB RAM, a 12-core CPU, and CentOS 7.2, and deployed one Dgraph alpha instance per virtual machine.

The start commands look like this

nohup ./dgraph zero --my=10.253.173.35:5080 >zerolog.out 2>&1 &

nohup ./dgraph alpha --lru_mb=4096 --my=10.253.173.35:7080 --zero=10.253.173.35:5080 >alphalog.out 2>&1 &

nohup ./dgraph alpha --lru_mb=4096 --my=10.253.173.34:7080 --zero=10.253.173.35:5080 >alphalog.out 2>&1 &
and so on ...

When I execute the above query in this cluster, I can hardly get any results back. I then removed all the filters and ran it again; it took nearly 10 minutes for the server to return results.

This data set is not particularly large compared to some of your examples, and my query does not filter over an especially large result set; the two very large node types sit in the middle of the query path. My understanding is that traversing connected nodes via uid predicates should be very fast. Is there something I have misunderstood?
I'm not sure where the problem is. Is there something wrong with my model design, did I set up Dgraph incorrectly, or is my query just not good enough?
Or should I store only primary keys and relationships in Dgraph and keep the detailed data outside it, for example in HBase?
Can you give me some advice and help? Thank you very much!