How hard is it to change the uid key to be a larger value

I don’t even know how possible this is, so I wanted to ask here about feasibility before I get too deep into it. There is an effort to integrate Dgraph with an internal system we have that has implied linkages between its data (to see if we can use Dgraph for interesting data introspection).

The problem is that the UIDs in this system are larger (256 bits). There are ways of converting the 256-bit numbers into 64-bit numbers (hashing), but I wanted to see how feasible it would be (possibly for myself) to change the storage/interaction layer to use 256-bit values [or n-length byte arrays in general].

So is there anyone who can point me in the right direction of what code I should look at?
or tell me this is outright lunacy and not feasible?
or have any other feedback?

Thanks!

I had a similar-ish situation here: Referencing foreign UIDs in dgraph - #2 by janardhan

One thought is you can just store the source system ID in a string predicate. So you would run a mutation like this when you’re loading your foreign entities into Dgraph:

mutation {
  set {
    _:sourceEntity1 <ForeignUID> "[string representation of source 256-bit UID]" .
    _:sourceEntity2 <ForeignUID> "[string representation of source 256-bit UID]" .
    ....
  }
}

Then, to index the ForeignUID predicate and make it searchable by the full foreign UID, you would add ForeignUID to the Dgraph schema with:

mutation {
  schema {
    ForeignUID: string @index(exact) .
  }
}
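
With that index in place, a lookup by foreign UID would look something like this (a sketch in the query syntax of this Dgraph era; the bracketed placeholder stands in for your actual ID string):

{
  result(func: eq(ForeignUID, "[string representation of source 256-bit UID]")) {
    _uid_
    ForeignUID
  }
}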

That’s a good solution, @tamethecomplex. We do something similar in dgraph loader, using xid edge.

We made the decision at the very beginning of the project to stick to uint64 IDs, because that’s the largest integer natively supported by Go. I wanted to avoid using strings as identifiers, which aren’t as efficient to store, intersect, or iterate over. Using ints allows Dgraph to act in part like a search engine, making retrieval and intersection really fast.
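
To illustrate the intersection point, a minimal sketch of intersecting two sorted uint64 posting lists, where every comparison and advance is a single machine-word operation (this is only an illustration, not Dgraph’s actual code):

package main

import "fmt"

// intersectSorted merges two ascending []uint64 posting lists,
// keeping only the values present in both.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	fmt.Println(intersectSorted(
		[]uint64{1, 3, 5, 7},
		[]uint64{3, 4, 5, 8},
	)) // prints [3 5]
}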

Thanks @mrjn. Quick question… is there anything special about the xid predicate? I’m wondering if I should have suggested using the xid predicate instead of creating a custom one.

The main problem with this is that it does not create the graph links, which essentially makes the graph itself useless for our use case. Data is being inserted into the graph in a streaming manner through morphs, so this would just create lots and lots of nodes that are not connected to anything, and the graph could not be used for queries as-is.

Because the IDs for us represent links, e.g. A → B. If we were to store the external IDs as just an attribute, the relationships in Dgraph would never be formed (just a bunch of floating nodes). We would essentially need graph queries and morphs to connect the data and create the edges before an actual graph query could be run: either re-linking every time a graph query is executed, or executing a graph query + morph with every nquad insert (which is beyond un-ideal). Doing a read for every write is a pretty bad path…

So the original question still stands, as the proposed solution would not work.
And I take it going in and modifying the code to use a different ID bit-width is nigh impossible? Does anyone actually know about the feasibility of this?

@Aselus, it would create the graph links though. Let’s say in your current database, you have two nodes: A and B. A is connected to B via the relationships:

A <WorksFor> B
A <IsCEOof> B

Let’s say in your database, the UID of A is “123” and the UID of B is “456”. When you load Dgraph with both these relationships (which will be queryable, because real links are formed in Dgraph between the nodes), it will look like this:

mutation {
  set {
    _:A <ForeignUID> "123" .
    _:B <ForeignUID> "456" .
    _:A <WorksFor> _:B .
    _:A <IsCEOof> _:B .
  }
}
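
When you run this mutation, you will get back JSON with the auto-generated Dgraph UIDs for A and B. The exact response shape depends on your Dgraph version, but it will be something like this (UID values invented for illustration):

{
  "code": "Success",
  "message": "Done",
  "uids": {
    "A": "0xd76f1d48151e7a24",
    "B": "0x5c68e9a0f2b3c411"
  }
}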

Then, when you want to query the data you loaded, you run:

{
  result (func: uid([Dgraph uid for A you got back from JSON])) {
    WorksFor {
      _uid_       # will be the Dgraph UID for B
      ForeignUID  # will be the UID for B in your external store
    }
    IsCEOof {
      _uid_       # will be the Dgraph UID for B
      ForeignUID  # will be the UID for B in your external store
    }
  }
}

Dgraph is using the Dgraph UIDs to link the data under the hood, not your UIDs. However, relationships are still preserved, and you can retrieve your foreign UIDs as shown above if you need those along with your query results.

That _: is the problem though. As I’d said, for us the links are created via the UID (from separate streaming processes). In testing we’ve found that if you do multiple inserts from different machines, the UIDs end up not matching up.

For example:

Node 1 does an insert of "_:A <IsBossOf> _:B ."
Node 2 does an insert of "_:B <IsBossOf> _:C ."

The two Bs above get two separate UIDs (as per our current testing), and as such there is no connection from A to C. This also happens within a single node across two different mutations. So if I were to ask for the top boss of C, I’d get B, even though it should be A.

We tried XIDs to fix this, but they’re cached locally on the node that does the building [and we need the thing to scale horizontally while doing streaming mutations].

I’m up for any suggestions that fix this, other than the one we have right now: we hash the item ID with a c64 to get a common UID on both systems (without introducing a distributed lock and/or race), but this creates an artificial ceiling on the number of vertices because of the birthday paradox at 64 bits… and we want to avoid that.
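
For concreteness, a minimal sketch of that hashing workaround, assuming "c64" refers to crc64 from Go’s standard library:

package main

import (
	"fmt"
	"hash/crc64"
)

var table = crc64.MakeTable(crc64.ISO)

// uidFor derives a deterministic 64-bit UID from an external 256-bit ID,
// so independent workers agree on the UID without coordination.
// Birthday bound: with 64-bit hashes, collisions become likely around
// 2^32 (~4 billion) distinct IDs.
func uidFor(externalID []byte) uint64 {
	return crc64.Checksum(externalID, table)
}

func main() {
	id := []byte("example 256-bit external id")
	fmt.Printf("0x%016x\n", uidFor(id))
}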

Maybe I’m missing something, though? Apologies if I’m being dumb.

I think it would be possible to run just one client while still getting the write throughput expected from a cluster. That way, all the UIDs would pass through just one place, and it would know what _:B was assigned in the previous mutations.

If this is not possible due to a design issue, we have been thinking of writing a Dgraph passthrough server (sort of like a load balancer), which can act as a middle layer between client and server. This passthrough has many advantages, such as splitting up mutations to go to the right servers in a cluster so as to avoid inter-machine communication during bulk loading, which can significantly speed up data loading.

We could potentially modify this passthrough to maintain the XID → UID mapping. In effect this is the same as just running one Dgraph client for loading data; but if the latter isn’t possible, then this passthrough can do that for you.
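
A minimal sketch of the XID → UID bookkeeping such a passthrough (or a single loader client) would keep; the in-memory counter here is a hypothetical stand-in for however UIDs are actually assigned by Dgraph:

package main

import (
	"fmt"
	"sync"
)

// xidMapper assigns each external ID (XID) a UID exactly once, so every
// mutation mentioning the same XID resolves to the same node, regardless
// of which worker sent it.
type xidMapper struct {
	mu   sync.Mutex
	uids map[string]uint64
	next uint64 // hypothetical allocator; real UIDs would come from Dgraph
}

func (m *xidMapper) resolve(xid string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if uid, ok := m.uids[xid]; ok {
		return uid
	}
	m.next++
	m.uids[xid] = m.next
	return m.next
}

func main() {
	m := &xidMapper{uids: make(map[string]uint64)}
	a := m.resolve("external-256bit-id")
	b := m.resolve("external-256bit-id") // same XID → same UID
	fmt.Println(a == b)                  // true
}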

That would create a scaling/fault-tolerance issue, wouldn’t it?

We have many workers forming the nquads/linkages (due to the CPU load of the calculations, and for work-queue recovery if one goes down). If we were to reduce that to one node, it might create a situation where all work stops when that node experiences issues (a single point of failure, which would be bad). It would also cap the number of items that can be processed, since there would no longer be any horizontal scaling/distribution of work.

The reason I’d brought up changing the UIDs to be 256-bit is that I could then safely use sha256 without any real-world worry of collisions: by the birthday problem, sha256 collisions only become likely around 2^128 inputs, so there is no realistic collision ceiling, unlike 64-bit hashing [~4 billion]. Meanwhile we’d still retain the ability to do work in parallel (if that makes sense, sorry if I’m being a bit confusing). So the thought for us is that if we can hash everything, we can provide for parallelism.
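
To make the parallelism argument concrete, a sketch of what deterministic 256-bit derivation would look like if Dgraph could store UIDs of that width (illustration only):

package main

import (
	"crypto/sha256"
	"fmt"
)

// uid256 derives a deterministic 256-bit UID from an external ID. Any
// worker computes the same value independently, so no lock or central
// allocator is needed. Collisions only become likely around 2^128
// inputs, vs ~2^32 (~4 billion) for a 64-bit hash.
func uid256(externalID []byte) [32]byte {
	return sha256.Sum256(externalID)
}

func main() {
	fmt.Printf("%x\n", uid256([]byte("some external id")))
}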

Given that we don’t natively support XIDs, the options mentioned above are your best bet.

Also, a single cheap passthrough server that doesn’t need to do anything other than act as a proxy can push a lot of QPS. Most likely, you’ll be bottlenecked by write throughput before this proxy becomes a problem.
