Complex edge filtering on large data is too slow. I'm not sure what mistake I'm making. Can you give me some advice and help?



I now plan to use Dgraph in production to solve the inefficiency of complex join queries in a relational database, but I have run into difficulties.
I converted some relational database tables into Dgraph nodes and created uid predicates according to the relationships between the tables. After building the model, I filter node data with Dgraph scalar predicates and traverse the relationships between nodes with uid predicates. My environment is as follows.

Test environment:
Dgraph version v1.0.14
Hardware: 8 GB RAM, 8-core Intel Core i7-6700HQ CPU @ 2.60 GHz, Windows 10

  1. Seventeen tables (20 columns each) in the relational database hold 200,000 rows in total, with 18 relationships among them. I converted this data into an RDF file and a schema file.
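For example, one row value and one table relationship become triples roughly like this (the blank-node names here are just illustrative, not my actual export):

<_:jiedao_1> <kjk_jiedao.tydzbm> "440305007" .
<_:jzw_1> <jiedaogldylink> <_:jiedao_1> .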

  2. Loaded the generated RDF file (200 MB, 3 million lines of RDF) and the schema file into a fresh single-instance Dgraph using dgraph bulk.
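The bulk load command was roughly like this (flag names may differ slightly between Dgraph versions; this is the v1.0.x form, and the file names are placeholders):

dgraph bulk -r data.rdf -s data.schema --map_shards=1 --reduce_shards=1 --zero=localhost:5081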

  3. Started Dgraph with the following commands

dgraph zero --my=localhost:5081 -o=1
dgraph alpha --lru_mb 1024 --badger.tables ram --my=localhost:7081 --zero=localhost:5081 -o=1
  4. Then I run some filtering queries along the edges, like this
{
  var(func: eq(kjk_jiedao.tydzbm, "440305007")) {
    ~jiedaogldylink {
      jzwgldylink {
        fwgldylink {
          ~frzcdlink @filter(eq(jck_fr_jcxx.frlb, "11", "12")) {
            frssztlink @filter(eq(ywk_fr_sszt.hyfl, "I", "O") AND eq(ywk_fr_sszt.qyzt, "6", "8") AND eq(ywk_fr_sszt.ssztlx, "01")) {
              result as ~frssztlink @filter(eq(jck_fr_jcxx.frlb, "91", "92"))
            }
          }
        }
      }
    }
  }
  total(func: uid(result)) { total: count(uid) }
  body(func: uid(result), first: 10, offset: 0) { expand(_all_) }
}
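For reference, my schema declares indexes for the filtered scalar predicates (eq() only uses an index if one is declared) and @reverse for the edges I traverse backwards with ~. A sketch of the relevant entries (predicate names are from the query above; the exact index types are from memory):

kjk_jiedao.tydzbm: string @index(exact) .
jck_fr_jcxx.frlb: string @index(exact) .
ywk_fr_sszt.hyfl: string @index(exact) .
ywk_fr_sszt.qyzt: string @index(exact) .
ywk_fr_sszt.ssztlx: string @index(exact) .
jiedaogldylink: uid @reverse .
frzcdlink: uid @reverse .
frssztlink: uid @reverse .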

In such an environment, queries take 2-3 seconds.

Later, I applied the same approach to a relational database with the same table structure but 32 million rows in total (only two tables were very large, with about 14 million rows each).

The generated RDF file is about 27 GB.

I used eight virtual machines, each with 16 GB RAM, a 12-core CPU, and CentOS 7.2, and deployed one Dgraph alpha instance per virtual machine.

The start commands look like this

nohup ./dgraph zero --my=10.253.173.35:5080 >zerolog.out 2>&1 &

nohup ./dgraph alpha --lru_mb=4096 --my=10.253.173.35:7080 --zero=10.253.173.35:5080 >alphalog.out 2>&1 &

nohup ./dgraph alpha --lru_mb=4096 --my=10.253.173.34:7080 --zero=10.253.173.35:5080 >alphalog.out 2>&1 &
and so on ...

When I execute the above query in this cluster, I can hardly get any results back. I then removed all the filters and ran it again; it took nearly 10 minutes for the server to return results.

This data set is not particularly large compared to some of your examples, and my query does not filter over an especially large result set; the two very large node types sit in the middle of the query path. My understanding is that traversing connected nodes via uid predicates should be very fast. Is there something I have misunderstood?
I'm not sure where the problem is. Is there something wrong with my model design, did I set up Dgraph incorrectly, or is my query just not good enough?
Or should I store only primary keys and relationships in Dgraph and keep the detailed data outside it, for example in HBase?
Can you give me some advice and help? Thank you very much!