How to optimize the query?

What I want to do

I want the best-performing query to build a recommendation list.

What I did

  • I have a list of users; each user has a unique address and a list of tags (a tag can be anything the user has: tokens, projects. Each user has between 5 and 500 tags).

  • I want to get the 10 users whose tags are most similar to the current user's.

  • My current query's latency grows as the number of users increases (>10M users).

  • Here is my Dgraph data model:

{
  "data": {
    "user": [
      {
        "address": "0x79852a2b8386587daad501d90674996dd19d88c9",
        "tagged": [
          {
            // token
            "name": "token:bsc_btc"
          },
          {
            // chain
            "name": "chain:bsc"
          },
          {
            // project
            "name": "project:bsc_aavev3"
          }
        ]
      }
    ]
  }
}
  • And here is my current query:
{
  my_token(func: eq(address, "0x80c1adfb1192d781a03cae1ac84faecac5c91a8a")) {
    t as tagged
  }

  var(func: type(User)) {
    x as count(tagged @filter(uid(t)))
    norm as math(1)
    score as math(x * norm)
  }

  suggestions(func: uid(score), orderdesc: val(score), first: 10) {
    address
    val(score)
  }
}

Query latency grows as the number of users increases:

  • 200 records
"server_latency": {
     "parsing_ns": 93100,
     "processing_ns": 1513700,
     "encoding_ns": 37700,
     "assign_timestamp_ns": 834700,
     "total_ns": 2531400
   },
  • 4,000 records
"extensions": {
   "server_latency": {
     "parsing_ns": 94400,
     "processing_ns": 6001900,
     "encoding_ns": 30800,
     "assign_timestamp_ns": 630300,
     "total_ns": 6824800
   },
  • 20,000 records
"extensions": {
   "server_latency": {
     "parsing_ns": 1465400,
     "processing_ns": 167639600,
     "encoding_ns": 110500,
     "assign_timestamp_ns": 775900,
     "total_ns": 170061400
   }
  • This is my current docker-compose file:
version: "3.2"
networks:
  dgraph:

services:
  zero1:
    image: dgraph/dgraph:v21.03.2
    volumes:
      - ./dgraph_data/zero1:/dgraph
    ports:
      - "5081:5080"
      - "6081:6080"
    networks:
      - dgraph
    command: dgraph zero --my=zero1:5080 --replicas 3 --raft="idx=1"

  zero2:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - zero1
    volumes:
      - ./dgraph_data/zero2:/dgraph
    ports:
      - "5082:5080"
      - "6082:6080"
    networks:
      - dgraph
    command: dgraph zero --my=zero2:5080 --replicas 3 --peer zero1:5080 --raft="idx=2"
  zero3:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - zero2
    volumes:
      - ./dgraph_data/zero3:/dgraph
    ports:
      - "5083:5080"
      - "6083:6080"
    networks:
      - dgraph
    command: dgraph zero --my=zero3:5080 --replicas 3 --peer zero1:5080 --raft="idx=3"

  alpha1:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - zero3
    volumes:
      - ./dgraph_data/alpha1:/dgraph
    ports:
      - "8081:8080"
      - "9081:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha1:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"

  alpha2:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - alpha1
    volumes:
      - ./dgraph_data/alpha2:/dgraph
    ports:
      - "8082:8080"
      - "9082:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha2:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"

  alpha3:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - alpha2
    volumes:
      - ./dgraph_data/alpha3:/dgraph
    ports:
      - "8083:8080"
      - "9083:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha3:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"
  alpha4:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - alpha3
    volumes:
      - ./dgraph_data/alpha4:/dgraph
    ports:
      - "8084:8080"
      - "9084:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha4:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"
  alpha5:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - alpha4
    volumes:
      - ./dgraph_data/alpha5:/dgraph
    ports:
      - "8085:8080"
      - "9085:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha5:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"
  alpha6:
    image: dgraph/dgraph:v21.03.2
    depends_on:
      - alpha5
    volumes:
      - ./dgraph_data/alpha6:/dgraph
    ports:
      - "8086:8080"
      - "9086:9080"
    networks:
      - dgraph
    command: dgraph alpha --my=alpha6:7080 --zero=zero1:5080,zero2:5080,zero3:5080
      --security "whitelist=0.0.0.0/0"
      --telemetry "reports=false; sentry=false;"

  ratel:
    image: dgraph/ratel:v21.03.2
    ports:
      - "8000:8000"
    networks:
      - dgraph
    command: dgraph-ratel

Instead of iterating over all users (with type(User)), you should fetch the user uids via a reverse edge in the my_token query and use them in your second query.
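For the reverse edge to exist at all, the tagged predicate has to be declared with the @reverse directive in the schema — a minimal sketch, assuming tagged is a [uid] predicate:

```
tagged: [uid] @reverse .
```

With that in place, <~tagged> can be traversed from a tag node back to every user that has it, which is what lets the second block start from the candidate users instead of from type(User).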


Indeed, vnium. I would explore the tagged edge: use the reverse edge <~tagged>, which captures only the relevant users after first limiting the set of tags. If you start from the top (all users), you make a very wide query, and it therefore consumes a lot of resources.

Ideally you would also extract all the tags and create a separate query block for each tag. This has to be done in application code, since DQL has no query loops. The more separate blocks you have for each part, the more performance you can extract.
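For illustration, a sketch of what the generated per-tag blocks could look like. The tag names here are hypothetical, and this assumes a hash index on name plus @reverse on tagged; in practice the blocks would be built in application code from the current user's tag list:

```
{
  # One var block per tag of the current user, generated in code.
  var(func: eq(name, "token:bsc_btc")) { u1 as ~tagged }
  var(func: eq(name, "chain:bsc"))     { u2 as ~tagged }

  # Union of users sharing any of those tags.
  suggestions(func: uid(u1, u2), first: 10) {
    address
  }
}
```

Each block then touches only the users behind a single tag node, rather than scanning the whole User type.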


I don’t think you’re using Docker Swarm or anything like that, correct? Then you won’t get very far, since you are limited to vertical scaling. Depending on the number of threads this may even work a little, but ideally your cluster should be spread over several machines, not just subdivided between containers on the same machine. I would only recommend several containers on the same machine if it has a lot of resources available. Ideally each container would have its own NVMe SSD; it’s just complicated to configure that via Docker.

If you want to get the most out of performance, I would recommend running the Dgraph binaries manually, on bare metal, with each Alpha instance on its own SSD, in addition to distributing the cluster well.
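A minimal sketch of that setup — hostnames and mount points below are placeholders for your own machines, with each Alpha writing to its own NVMe mount:

```shell
# On the Zero machine
dgraph zero --my=zero1:5080 --replicas 3

# On each Alpha machine: point the postings and WAL directories
# at that machine's own NVMe SSD (paths are hypothetical)
dgraph alpha --my=alpha1:7080 --zero=zero1:5080 \
  --postings /mnt/nvme1/p --wal /mnt/nvme1/w
```

This is the same topology as the compose file above, just without the container layer between Dgraph and the disks.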

Thanks so much, will start from here.

Updated the query:

{
  me as var(func: eq(address, "` + account + `")) {
    t as tagged
    c as chain
  }

  var(func: uid(t)) {
    filteredUser as tokens: ~tagged
  }

  var(func: uid(filteredUser)) {
    x as count(tagged @filter(uid(t)))
    y as count(chain @filter(uid(c)))
    norm as math(1)
    score_x as math(x * norm)
    score_y as math(y * norm)
    # parentheses needed: without them this computes score_x + (score_y / 2)
    score as math((score_x + score_y) / 2)
  }

  suggestions(func: uid(score), first: 10, orderdesc: val(score))
    @cascade
    @filter(NOT uid(me)) {
      balance: balance
      address: address
      score: val(score)
    }
}