Given enough retries with enough spacing between them, a transaction that hits a conflict exception (more than one transaction modifying the same node in Dgraph) will eventually succeed. This means a transaction could keep retrying inside a streaming listener for a while before it succeeds. It is important to configure this behavior appropriately for your transaction profile, so you reduce unnecessary rejections or drops caused by conflict exceptions.
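As a rough illustration, a listener can wrap each commit in a bounded retry loop with exponential backoff. This is a minimal sketch assuming the dgraph4j client and its `TxnConflictException`; the attempt count and backoff values are placeholders to be tuned to your transaction profile.

```java
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.Transaction;
import io.dgraph.TxnConflictException;

public class ConflictRetry {

  // Retry a mutation with exponential backoff when Dgraph reports a
  // transaction conflict (another transaction touched the same node).
  static void mutateWithRetry(DgraphClient client, String nquads,
                              int maxAttempts, long initialBackoffMs)
      throws InterruptedException {
    long backoff = initialBackoffMs; // e.g. 100 ms; illustrative starting point
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Transaction txn = client.newTransaction();
      try {
        Mutation mu = Mutation.newBuilder()
            .setSetNquads(ByteString.copyFromUtf8(nquads))
            .build();
        txn.mutate(mu);
        txn.commit();
        return; // success
      } catch (TxnConflictException e) {
        // Conflict: back off, then try again with a fresh transaction.
        Thread.sleep(backoff);
        backoff *= 2;
      } finally {
        txn.discard(); // no-op if the transaction already committed
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " conflicts");
  }
}
```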
Kafka Partitioning
Ensure that the workload profile across partitions is similar. If the majority of the data ends up in a small number of partitions, JVM-based listeners can lose valuable time garbage collecting, which leads to rebalances or crashes.
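One way to even out the load is to re-key the stream by a higher-cardinality field before processing, so the default hash partitioner spreads records across all partitions. Here is a sketch using Kafka Streams; the topic name, partition count, and `extractEntityId` helper are hypothetical.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class EvenPartitioning {

  // Re-key by a higher-cardinality field so the default hash partitioner
  // spreads records evenly, then repartition onto an internal topic.
  static Topology buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    KStream<String, String> records = builder.stream("entity-updates"); // hypothetical topic

    KStream<String, String> balanced = records
        .selectKey((oldKey, value) -> extractEntityId(value))
        .repartition(Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
            .withNumberOfPartitions(12)); // partition count is illustrative

    balanced.foreach((key, value) -> {
      // mutate into Dgraph here
    });

    return builder.build();
  }

  // Hypothetical helper: extract a well-distributed id from the payload.
  static String extractEntityId(String value) {
    return value;
  }
}
```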
Support for Adaptive Behavior
Dgraph publishes metrics on the Alpha nodes. In unpredictable environments, you can use this information to drive adaptive behavior, such as a graceful shutdown or other signalling within the stream topology.
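For example, the stream processor could check the Alpha periodically and pause consumption or shut down gracefully when checks keep failing. This is a minimal sketch assuming the Alpha's HTTP `/health` endpoint on its HTTP port; you could equally scrape its Prometheus metrics endpoint, and the address and thresholds here are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AlphaHealthCheck {

  private static final HttpClient HTTP = HttpClient.newHttpClient();

  // Poll the Alpha's HTTP /health endpoint; on repeated failures a stream
  // processor could pause consumption or shut down gracefully.
  static boolean alphaLooksHealthy(String alphaHttpAddress) {
    try {
      HttpRequest req = HttpRequest.newBuilder()
          .uri(URI.create("http://" + alphaHttpAddress + "/health")) // e.g. "localhost:8080"
          .timeout(Duration.ofSeconds(2))
          .GET()
          .build();
      HttpResponse<String> resp = HTTP.send(req, HttpResponse.BodyHandlers.ofString());
      return resp.statusCode() == 200;
    } catch (Exception e) {
      return false; // treat timeouts and connection errors as unhealthy
    }
  }
}
```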
Things that my team has noticed while streaming data into Dgraph:
Batching up N-Quads to about 1,000 per mutation gives a good balance between performance and not hammering Dgraph too much (see the sketch after these notes).
Using a timeout interceptor set at 2 seconds does a decent job of keeping messages moving. It also gives an early indicator of when Dgraph is starting to have trouble.
Moving retry logic off of the Kafka message-processing thread is a good idea if you want to keep message processing going.
Using UID leasing is a good way to handle the mutations if you want to keep track of the UIDs yourself.
Doing this with a single KStream processor inserting into a 3-node replica cluster with no partitioning, we see good throughput up to about 650k nodes and roughly 6 million N-Quads. At that point Dgraph would become unresponsive and all mutations would fail for about 15 minutes. After that, the mutations would start succeeding again.
The Dgraph cluster is 3 machines with 4 cores and 26 GB of memory each.
We have tried this with a dataset of about 30 million nodes. In total it took about 50 hours to load.
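To make the first two notes concrete, here is a minimal sketch of batching roughly 1,000 N-Quads per transaction behind a 2-second per-call deadline, assuming the dgraph4j client over gRPC; the Alpha address and batch size are placeholders.

```java
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.Transaction;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.MethodDescriptor;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class BatchedMutations {

  // Per-call deadline so each gRPC call to Dgraph is capped at 2 seconds.
  static ClientInterceptor twoSecondDeadline() {
    return new ClientInterceptor() {
      @Override
      public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
          MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        return next.newCall(method, callOptions.withDeadlineAfter(2, TimeUnit.SECONDS));
      }
    };
  }

  static DgraphClient buildClient(String alphaGrpcAddress) {
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget(alphaGrpcAddress).usePlaintext().build();
    DgraphGrpc.DgraphStub stub =
        DgraphGrpc.newStub(channel).withInterceptors(twoSecondDeadline());
    return new DgraphClient(stub);
  }

  // Commit N-Quads in chunks of roughly batchSize per transaction.
  static void mutateInBatches(DgraphClient client, List<String> nquads, int batchSize) {
    for (int start = 0; start < nquads.size(); start += batchSize) {
      String chunk = String.join("\n",
          nquads.subList(start, Math.min(start + batchSize, nquads.size())));
      Transaction txn = client.newTransaction();
      try {
        txn.mutate(Mutation.newBuilder()
            .setSetNquads(ByteString.copyFromUtf8(chunk)).build());
        txn.commit();
      } finally {
        txn.discard(); // no-op if already committed
      }
    }
  }
}
```

With a setup like this, a call such as `mutateInBatches(buildClient("localhost:9080"), nquads, 1000)` keeps each commit small and bounded in time.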
Questions that I have:
Would partitioning improve performance?
Is there an optimal transaction timeout, or is it just trial and error?
Would more, smaller servers be expected to give better throughput?
Would adding cores improve Dgraph performance?
We have used the bulk loader to load this same dataset, and it takes 20-30 minutes.