Streaming Clickstream Data into Dgraph - High Memory Usage

We want to use Dgraph to build a user-product interaction graph for an online e-commerce website, to find which users have a high affinity for which product attributes, such as brand or category. We want to use this information in our Recommendation Engine and in our marketing campaigns.

We created a Proof of Concept cluster using Dgraph v21.03.0 in AWS with 3 Zero and 3 Alpha nodes, on 3 r5.2xlarge EC2 instances, each with a 500GB gp3 EBS volume provisioned at 3000 IOPS. Each EC2 instance runs one Zero and one Alpha node.

What I want to do

We want to write clickstream events (user clicks on the website) into the graph in real time, as they happen on the website. We already have a working setup for this: an API backed by an AWS Kinesis stream, which feeds the clickstream events to various data pipelines, one of them being Dgraph. We created an AWS Lambda function in Python using pydgraph to read the events from Kinesis and send them to Dgraph.
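For reference, here is a minimal sketch of the transformation step inside our Lambda function. The field names (`user_id`, `product_id`, `event_type`, `ts`) and the blank-node naming are illustrative, not our exact payload; the resulting N-Quads string is what we hand to pydgraph for the mutation.

```python
import base64
import json


def event_to_nquads(event: dict, idx: int) -> str:
    """Turn one clickstream event into RDF N-Quads.

    Blank-node names only dedupe within a single mutation; the real
    pipeline pairs this with an upsert block so Users/Products are not
    duplicated across batches.
    """
    user = f'_:u_{event["user_id"]}'
    product = f'_:p_{event["product_id"]}'
    interaction = f"_:i_{idx}"
    return "\n".join([
        f'{user} <user_id> "{event["user_id"]}" .',
        f'{user} <dgraph.type> "User" .',
        f'{product} <product_id> "{event["product_id"]}" .',
        f'{product} <dgraph.type> "Product" .',
        f'{interaction} <dgraph.type> "Interaction" .',
        f'{interaction} <event_type> "{event["event_type"]}" .',
        f'{interaction} <timestamp> "{event["ts"]}" .',
        f'{interaction} <user> {user} .',
        f'{interaction} <product> {product} .',
    ])


def kinesis_records_to_nquads(kinesis_event: dict) -> str:
    """Decode a Kinesis batch and concatenate the N-Quads of all events."""
    payloads = [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in kinesis_event["Records"]
    ]
    return "\n".join(event_to_nquads(p, i) for i, p in enumerate(payloads))
```

One batch of Kinesis records becomes one big set of N-Quads, sent in a single mutation (`set_nquads`) so the whole batch commits atomically.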

We can successfully send the events to Dgraph with this setup, but memory usage increases monotonically. After 3 days of streaming events into Dgraph, memory usage hits 100% and we start to get OOM crashes. The total size of the data on disk is less than 3GB, yet somehow Dgraph uses all 64GB of memory and crashes. Here is a screenshot from AWS CloudWatch:

This was a test stream, so the load was constant. We set the Lambda concurrency to one so that we wouldn't hit transaction-conflict exceptions from concurrent writes. Here are the metrics from Dgraph, plotted in Grafana:

You can see that in the beginning mutations run very fast, in under a second, and then they get slower: from about 0.5s at the start to 60s by the third day. We could send about 20 queries per minute in the beginning, but only about 1 query per minute by the third day. The number of N-Quads in a single query varies from 3k to 10k. At the end of the third day, the graph had 5 million User nodes, 11 million Interaction nodes, and 120k Product nodes.

We want an up-to-date graph that changes every day: new nodes added constantly, and old ones removed with a daily delete query. So the mutations will run indefinitely, but Dgraph's memory usage seems to increase monotonically as the graph grows and mutations keep coming, and the memory used (64GB) seems very high compared to the total data size on disk (3GB).
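For the daily cleanup, the plan is an upsert-style delete: a query block collects the uids of stale Interaction nodes into a variable, and a delete mutation wipes them with the `uid(var) * * .` pattern. A sketch (the `timestamp` predicate and cutoff value are illustrative; the two strings would be sent together as one upsert request through pydgraph):

```python
def daily_delete_request(cutoff_iso: str):
    """Build the query and delete-mutation strings for the daily cleanup.

    The query collects uids of Interactions older than the cutoff into
    the variable `stale`; the delete N-Quads remove all predicates of
    those nodes, including their edges to User and Product nodes.
    """
    query = f"""{{
      stale as var(func: type(Interaction)) @filter(lt(timestamp, "{cutoff_iso}"))
    }}"""
    del_nquads = "uid(stale) * * ."
    return query, del_nquads
```

Because the edges go from Interaction to User and Product (not the other way around), deleting the Interaction nodes also removes the connecting edges, so nothing dangles.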

What I did

We basically have three types of nodes, User, Interaction, and Product. Here is the schema:
sample_schema.dql (521 Bytes)
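Since the attachment isn't visible inline, here is a hedged sketch of the schema's shape; the predicate names are illustrative, but the structure (hash indexes plus `@upsert` on the external IDs, uid edges between the three types) matches what we run. It would be applied with pydgraph's `client.alter(pydgraph.Operation(schema=SCHEMA))`.

```python
# Illustrative schema; our real sample_schema.dql differs in details.
SCHEMA = """
user_id: string @index(hash) @upsert .
product_id: string @index(hash) @upsert .
event_type: string .
timestamp: dateTime @index(hour) .
user: uid @reverse .
product: uid @reverse .

type User {
  user_id
}

type Interaction {
  event_type
  timestamp
  user
  product
}

type Product {
  product_id
}
"""
```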

User nodes are connected to Interaction nodes, and Interaction nodes are connected to Product nodes. We use DQL for queries. Here is a sample query to upsert User and Product nodes and insert Interaction nodes:
sample_query_2.dql (6.4 KB)

The sample query has about 100 N-Quads; in the real world, we send about 3k-10k N-Quads per query.
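The sample file isn't inline either, so here is a trimmed, illustrative version of the upsert pattern for a single User (values are made up); Product nodes are upserted the same way, and Interaction nodes are then inserted as new blank nodes linked to the matched or newly created uids:

```python
# Trimmed, illustrative version of sample_query_2.dql as sent from Python.
UPSERT = """upsert {
  query {
    u as var(func: eq(user_id, "u123"))
  }

  # create the User only if the upsert query found nothing
  mutation @if(eq(len(u), 0)) {
    set {
      _:new_user <user_id> "u123" .
      _:new_user <dgraph.type> "User" .
    }
  }
}"""
```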


Why does memory usage keep increasing? Is our application not suitable for Dgraph? Is our query inefficient in terms of memory? Should we tweak Badger's memory settings for better memory behavior? What would you suggest to make this application work with Dgraph?