Here are some suggestions:
- Conflict Exceptions and Retry Loops
Given enough retries with appropriate spacing, a transaction that hits a conflict exception (raised when more than one transaction modifies the same node in Dgraph) will eventually succeed. This means a transaction may keep retrying inside a streaming listener for some time before it commits. Configure the retry count and backoff to match your transaction profile so that conflict exceptions do not cause unnecessary rejections or drops.
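A minimal sketch of such a retry loop, assuming a local `ConflictException` class as a stand-in for the client library's conflict error (dgraph4j raises `TxnConflictException`; the helper names and delays here are illustrative):

```java
import java.util.concurrent.Callable;

public class RetryLoop {
    // Hypothetical stand-in for the Dgraph client's conflict exception,
    // kept local so the sketch is self-contained.
    static class ConflictException extends RuntimeException {}

    // Retry `op` up to maxAttempts times, sleeping with exponential
    // backoff between attempts; rethrow once the budget is exhausted.
    static <T> T withRetry(Callable<T> op, int maxAttempts, long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (ConflictException e) {
                if (attempt >= maxAttempts) throw e;
                long delay = baseDelayMs << (attempt - 1); // e.g. 100ms, 200ms, 400ms, ...
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a mutation that conflicts twice before succeeding.
        int[] calls = {0};
        String result = withRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new ConflictException();
            return "committed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Tuning `maxAttempts` and `baseDelayMs` is the "configure for your transaction profile" step: hotter keys warrant more attempts and longer spacing.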
- Kafka Partitioning
Ensure that the workload profile is similar across partitions. If the majority of the data ends up in a few partitions, JVM-based listeners can lose valuable time garbage collecting, leading to consumer-group rebalances or crashes.
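Skew usually comes from the choice of record key. The sketch below approximates key-based partitioning with `hash(key) mod N` (Kafka's default partitioner actually uses murmur2 on the serialized key bytes; `String.hashCode()` is used here only to illustrate skew) and shows how a low-cardinality key funnels most records into one partition:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionSkew {
    // Approximation of key-based partitioning: partition = hash(key) mod N.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        Map<Integer, Integer> counts = new HashMap<>();
        // 90% of records share one key (e.g. a dominant tenant id),
        // so one partition receives ~900 of the 1000 records.
        for (int i = 0; i < 1000; i++) {
            String key = (i % 10 == 0) ? "tenant-" + i : "tenant-big";
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

If your keys are naturally skewed, consider a composite key (e.g. appending a bounded random suffix) so the hot key spreads across several partitions.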
- Support for Adaptive behavior
Dgraph publishes metrics on the Alpha nodes. In unpredictable environments, you can use this information to drive adaptive behavior, such as graceful shutdown or other signalling within the stream topology.
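One way to act on those metrics is to scrape the Alpha's Prometheus text endpoint and react to a threshold. A small sketch, assuming the metric name and the threshold are illustrative (check your Alpha's metrics output for the exact names it exposes):

```java
public class AdaptiveGuard {
    // Extract one metric's value from a Prometheus text-format scrape body.
    static double metricValue(String body, String name) {
        for (String line : body.split("\n")) {
            if (line.startsWith(name + " ") || line.startsWith(name + "{")) {
                String[] parts = line.trim().split("\\s+");
                return Double.parseDouble(parts[parts.length - 1]);
            }
        }
        return Double.NaN; // metric not present in this scrape
    }

    public static void main(String[] args) {
        // Sample scrape body; in a real topology you would fetch this over
        // HTTP from the Alpha's metrics endpoint on each poll interval.
        String scrape = "dgraph_pending_proposals_total 42\n";
        double pending = metricValue(scrape, "dgraph_pending_proposals_total");
        // Adaptive reaction: pause the Kafka consumer when the Alpha looks
        // overloaded, resume (or shut down gracefully) otherwise.
        if (pending > 100) {
            System.out.println("backpressure: pausing Kafka consumer");
        } else {
            System.out.println("healthy: pending=" + (long) pending);
        }
    }
}
```

The same signal could instead trigger `KafkaConsumer.pause()` on the listener's partitions, which keeps the group membership alive while easing load on Dgraph.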