Natural id / business key

How efficient is it to retrieve Dgraph objects in DQL by a natural id, e.g. @filter(eq(email, $email)), instead of func: uid($someId)?

The reason I ask is that, instead of exposing Dgraph's native uid (e.g. 0x123) to the public, I would rather use, say, a random string, so that people can't guess how much data I have (for example).

Also, is there a native construct for that? For example, in an RDBMS we can name the key anything as long as we set the column as PRIMARY KEY.

It's very common to use externally tracked ids in Dgraph. See here and here for a bit more information, but basically every update then becomes an upsert of the form:

upsert {
  query {
    me as var(func: eq(email, "me@them.org"))
  }
  mutation {
    set {
      uid(me) <myfield> "myvalue" .
      uid(me) <email> "me@them.org" .
    }
  }
}

… which creates a new node with that email and myfield if none exists, and otherwise sets myfield=myvalue on the existing node. The query and mutation run as a single transaction. To keep concurrent upserts from each creating their own node, mark the predicate used as the external id with @upsert in the schema, so that conflicting transactions are detected at commit time; the predicate also needs an index for eq to work. See here for more on @upsert.
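As a minimal sketch of the schema side, assuming email is a string predicate and that a hash index is enough (exact equality lookups only), the declarations could look like this. The type chosen for myfield is an assumption for illustration:

email: string @index(hash) @upsert .
myfield: string .

The hash index is what lets eq(email, ...) be used as a root function, and @upsert asks Dgraph to check that index for conflicts when transactions commit.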

Note that the external id (email in this case) is written as part of the mutation. You need to set it when a new entry is being created. Conditional mutations can gate this if you want to avoid writing the same value over and over, but that is purely an optimization.
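A rough sketch of that gating, assuming a Dgraph version that accepts more than one mutation block inside an upsert (the blank node name _:new is arbitrary):

upsert {
  query {
    me as var(func: eq(email, "me@them.org"))
  }
  # No node with this email yet: create it and write the external id once.
  mutation @if(eq(len(me), 0)) {
    set {
      _:new <email> "me@them.org" .
      _:new <myfield> "myvalue" .
    }
  }
  # Node already exists: update only the field that changed.
  mutation @if(gt(len(me), 0)) {
    set {
      uid(me) <myfield> "myvalue" .
    }
  }
}

Here len(me) is the number of uids matched by the query, so exactly one of the two mutation blocks runs.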

  • Downside: slower than a direct uid lookup, but by how much depends on many things; you may not notice.
  • Upside: using integer uids as public ids is awkward, and a natural id is much nicer to work with.

I wouldn't worry about this. It isn't a big deal if someone guesses it. Also, the UID count grows with the number of nodes, whether that is many small nodes or a few nodes holding a lot of data, so it makes no sense to try to guess the size of your data from it. If you have a billion nodes of comments from, let's say, 300k users, would the guesser think that you have a billion users?

I would worry about exposing UIDs only if outsiders have free use of the API, because they could then use the collected UIDs to explore the data in your cluster. Although UIDs are leased sequentially, they aren't used sequentially, so an attacker can't exploit them the way they could with ordinary sequential (rather than random) IDs.

PS. You can also lease a billion UIDs and not use them, so the guesser would be confused. Flooding them with noise is the best way to hide the real numbers.