Correct me if I am wrong: as I understand it, user-generated identifiers are not supported any more, and the recommendation is to store them in an xid attribute instead. This is a great inconvenience for my use case, and also inefficient. I am trying to evaluate Dgraph for a fraud detection system, which is a highly connected graph, and querying for a uid before creating edges seems inefficient. Note that we have to write over 3 billion edges; reading from the db is the last thing on my mind.

For example: a given email was used for an order, a given phone was used for an order. In my use case emails, phones and order IDs are identifiers, so if 2 orders used the same email, I would expect the email to point to 2 different orders, but there should be only one email node. I was expecting to do only writes. Can you please advise?

Note that we have over 300 million transactions, with 8 attributes per order (like email, phone, address etc.). That can lead to 3-4 billion writes if we don't have to generate ids; if we do, we are looking at twice that amount plus the same amount of reads, and it will also lead to concurrency issues given Dgraph will be generating the unique ids for me.
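For illustration, this is roughly the write-only path I was hoping for, sketched with the Go client (dgo); predicate names like order_id and used_email are just placeholders, not a real schema. The catch is in the comment: blank-node labels only dedupe within a single mutation, which is exactly why a cross-transaction lookup becomes necessary.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// The blind, write-only ingestion in question: one mutation per
	// Kafka record, no lookups. Blank-node labels like _:email dedupe
	// only *within* this one mutation; a second transaction inserting
	// "a@b.com" would create a second email node.
	nquads := fmt.Sprintf(`
		_:order <order_id> %q .
		_:order <used_email> _:email .
		_:email <email> %q .`, "order-123", "a@b.com")

	txn := dg.NewTxn()
	defer txn.Discard(ctx)
	if _, err := txn.Mutate(ctx, &api.Mutation{
		SetNquads: []byte(nquads),
		CommitNow: true,
	}); err != nil {
		log.Fatal(err)
	}
}
```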
If you want a single email node for all users, then unfortunately you need to query Dgraph to get the uid. You can probably cache the email-to-uid mapping on the client side (if it can't fit in memory, use some eviction strategy and query Dgraph only on a cache miss).
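A minimal sketch of that cache-first lookup with the Go client, assuming a hash index on an `email` predicate (the predicate name and the surrounding types are placeholders):

```go
package ingest

import (
	"context"
	"encoding/json"

	"github.com/dgraph-io/dgo"
)

// emailUID consults a local email->uid cache first and queries Dgraph
// only on a miss. An empty result means the caller should create the node.
func emailUID(ctx context.Context, dg *dgo.Dgraph, cache map[string]string, email string) (string, error) {
	if uid, ok := cache[email]; ok {
		return uid, nil
	}
	txn := dg.NewTxn()
	defer txn.Discard(ctx)
	const q = `query q($email: string) {
		q(func: eq(email, $email)) { uid }
	}`
	resp, err := txn.QueryWithVars(ctx, q, map[string]string{"$email": email})
	if err != nil {
		return "", err
	}
	var r struct {
		Q []struct {
			UID string `json:"uid"`
		} `json:"q"`
	}
	if err := json.Unmarshal(resp.Json, &r); err != nil {
		return "", err
	}
	if len(r.Q) == 0 {
		return "", nil // unknown email: caller creates the node
	}
	cache[email] = r.Q[0].UID
	return r.Q[0].UID, nil
}
```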
But given that disk space is cheap, you can just store the email as a scalar value and index it. You can then easily retrieve all the orders with that email via the index.
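For example (schema and predicate names here are assumptions, not your actual model): with the email stored as a scalar on each order node, all orders sharing an email are one indexed lookup away, and ingestion stays write-only.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Index the scalar email predicate once, up front.
	if err := dg.Alter(ctx, &api.Operation{Schema: `
		order_id: string @index(exact) .
		email:    string @index(hash) .
	`}); err != nil {
		log.Fatal(err)
	}

	// One indexed lookup returns every order that used this email.
	const q = `{
		orders(func: eq(email, "a@b.com")) {
			uid
			order_id
		}
	}`
	resp, err := dg.NewTxn().Query(ctx, q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}
```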
Yes, I understand that. I don't think the issue here is storage, it's efficiency. For every new insert of an email I will have to do a read to find the relevant uid, and that creates two problems. First, it increases the number of operations on the database, which is not ideal for the amount of data we have. Second, it creates concurrency issues, because the write becomes dependent on the read (not atomic). Scenario: insert the same email at the same time and you end up with two email nodes with different uids. Maybe a transaction can solve the issue.
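For reference, this is the transactional find-or-create I mean; assuming a Dgraph version whose schema supports the `@upsert` directive on the indexed predicate (e.g. `email: string @index(exact) @upsert .`), two racing query-then-write transactions on the same email should conflict at commit, and the loser can retry. A sketch, with placeholder predicate names:

```go
package ingest

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
)

// upsertEmail finds or creates the node for an email inside a single
// transaction. With @upsert on the index, Dgraph checks the index key
// at commit time; if two consumers race on the same email, one commit
// is aborted and the retry re-reads the winner's uid.
func upsertEmail(ctx context.Context, dg *dgo.Dgraph, email string) (string, error) {
	for {
		txn := dg.NewTxn()
		const q = `query q($email: string) {
			q(func: eq(email, $email)) { uid }
		}`
		resp, err := txn.QueryWithVars(ctx, q, map[string]string{"$email": email})
		if err != nil {
			txn.Discard(ctx)
			return "", err
		}
		var r struct {
			Q []struct {
				UID string `json:"uid"`
			} `json:"q"`
		}
		if err := json.Unmarshal(resp.Json, &r); err != nil {
			txn.Discard(ctx)
			return "", err
		}
		if len(r.Q) > 0 {
			txn.Discard(ctx) // node already exists
			return r.Q[0].UID, nil
		}
		mu := &api.Mutation{SetNquads: []byte(fmt.Sprintf("_:e <email> %q .", email))}
		assigned, err := txn.Mutate(ctx, mu)
		if err != nil {
			txn.Discard(ctx)
			return "", err
		}
		switch err := txn.Commit(ctx); err {
		case nil:
			return assigned.Uids["e"], nil
		case dgo.ErrAborted:
			continue // lost the race: re-read and reuse the winner's uid
		default:
			return "", err
		}
	}
}
```

The cost is that every attribute still needs a read before its first write, so this fixes correctness, not the operation count.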
Not sure we will get the kind of performance we expected.
That solution won't work: we have multiple Kafka consumers writing to the database. We are in a distributed environment, so a local cache lookup is not possible.
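One hedged workaround, if the pipeline can be changed at all: key the Kafka topic by the attribute value, so every record for a given email lands on the same partition and therefore the same consumer, which makes a per-consumer cache consistent again. A sketch of the producer side with the segmentio/kafka-go client (broker, topic and payload are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Hash-partitioning on the message key means every record for a
	// given email is handled by one consumer, so that consumer's local
	// email->uid cache never goes stale because of a sibling consumer.
	w := kafka.NewWriter(kafka.WriterConfig{
		Brokers:  []string{"broker:9092"},
		Topic:    "orders-by-email",
		Balancer: &kafka.Hash{}, // partition = hash(Key) % #partitions
	})
	defer w.Close()

	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("a@b.com"),                  // the dedupe attribute
		Value: []byte(`{"order_id":"order-123"}`), // serialized order
	})
	if err != nil {
		log.Fatal(err)
	}
}
```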
Really guys, you can't identify existing nodes in the database? And if I want to refer to a node I have to maintain a mapping myself? I'm fighting with the same problem as amitgupta1202 and it's hard to believe that I can't match the same node in two different transactions. Any comments?
Wow, so when I start more than one client that tries to write nodes to the database I have a concurrency problem. I have to admit that version 1.0 is quite useless; it can't be scaled.