Foreign Key Integrity

It is my understanding that if you provide an arbitrary value to the uid function, and there’s no matching uid for the provided value, dgraph will interpret this as an upsert and create a new uid. For example:

uid(known_reference) <foo> "foo"

Where known_reference references a known node, so it adds/updates predicate foo to value “foo”.

vs

uid(unknown_reference) <foo> "foo"

Where unknown_reference does not reference a known node, therefore it creates a new node, assigns it a new uid, and sets predicate foo to value “foo”.

Is there a way to force dgraph not to silently create a new node, but instead err?

Something like a uid!(unknown_reference) <foo> "foo" that results in some sort of NodeNotFoundErr

Hi Michael, while we cannot change the behavior of the UID function, we can definitely detect what the upsert has done and raise the error on the client side. As a practical approach in this situation, you may consider the following:

After executing the upsert, check the uids node in the response for an entry like this:

"uids": {
"uid(unknown_reference)": "0xc"
}

This typically means that a match was not found and a new node was created. If the uids node does contain that "uid(unknown_reference)" entry, you can treat it as an exception. You can then do further error handling, such as adding the referred node and then re-executing the upsert.

In case a matching value exists, the new node will not be created and you should not see that line in the uids node.
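The check described above could be sketched on the Go side roughly as follows. This is only a minimal sketch: it assumes the uids node arrives as a map[string]string (as the dgo response exposes it) and that entries created through an empty variable are keyed as "uid(...)", as in the example above. ErrNodeNotFound and checkNoNewNodes are hypothetical names, not part of Dgraph or dgo.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrNodeNotFound is a hypothetical client-side sentinel error.
// Dgraph itself does not return it.
var ErrNodeNotFound = errors.New("node not found: upsert created a new node")

// checkNoNewNodes inspects the uids map of a mutation response and returns
// an error if any entry is keyed by a uid(...) variable, which is assumed
// to mean the variable matched nothing and a new node was created.
func checkNoNewNodes(uids map[string]string) error {
	for key, uid := range uids {
		if strings.HasPrefix(key, "uid(") {
			return fmt.Errorf("%w: %s was assigned %s", ErrNodeNotFound, key, uid)
		}
	}
	return nil
}

func main() {
	// Response shape taken from the example above.
	created := map[string]string{"uid(unknown_reference)": "0xc"}
	fmt.Println(checkNoNewNodes(created))

	// No uid(...) entry: the variable matched an existing node.
	fmt.Println(checkNoNewNodes(map[string]string{}))
}
```

Note that by the time this check fails the new node has already been written, so the caller would still need to discard the transaction or clean up afterwards.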

To be clear, I’m not suggesting that you make a backwards-incompatible change like having uid(unknown_reference) raise an exception instead of silently performing an upsert (today’s behavior). What I am suggesting is adding a new function (e.g. uid!(unknown_reference)) that raises the exception at the db layer. Your suggested approach would require the client to do work that, arguably, should be the database’s responsibility. Additionally, if the client does check the response output and sees that a new node is created, then the client has to make another call to the database to try and delete all the data that was erroneously just created.
Please reconsider adding a new function (e.g. uid!) that makes it easier for developers to trust their database in production.

On this point:

In scenarios like master data management, a database cannot reject information just because certain attributes are of poor quality; such data can exist in reality, including reference key integrity issues like the ones you mentioned. In such cases we still want to detect (and perhaps even store) the poor-quality data and kick off some kind of fixing/standardization process. IMO, we might not want to remove this flexibility from the client side.

tagging @pawan for his thoughts on this topic.

@mvpmvh You can use len(v) for a variable v along with @if to only run mutations if it’s an update, not inserting a new node: https://dgraph.io/docs/mutations/conditional-upsert/. For example,

upsert {
  query {
    v as var(func: eq(name, "abc"))
  }
  mutation @if(gt(len(v), 0)) { # only runs if the node exists already
    set {
      uid(v) <name> "abc123" .
    }
  }
}
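With the conditional upsert above, a skipped mutation does not produce an error, so a Go client would have to derive one itself. A minimal sketch, assuming the query block is written as a named block that returns uids, e.g. q(func: eq(name, "abc")) { v as uid }, and that the query results come back as JSON alongside the mutation result. requireMatch and ErrNodeNotFound are hypothetical names, not part of dgo.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrNodeNotFound is a hypothetical client-side sentinel error.
var ErrNodeNotFound = errors.New("node not found")

// requireMatch decodes the query part of an upsert response and returns
// ErrNodeNotFound when the named query block matched no nodes, which is
// the case where @if(gt(len(v), 0)) skips the mutation.
func requireMatch(respJSON []byte, block string) error {
	var data map[string][]struct {
		UID string `json:"uid"`
	}
	if err := json.Unmarshal(respJSON, &data); err != nil {
		return fmt.Errorf("decoding upsert response: %w", err)
	}
	if len(data[block]) == 0 {
		return fmt.Errorf("%w: block %q matched no nodes", ErrNodeNotFound, block)
	}
	return nil
}

func main() {
	// Assumed response shapes for a block named "q".
	fmt.Println(requireMatch([]byte(`{"q":[{"uid":"0x1"}]}`), "q"))
	fmt.Println(requireMatch([]byte(`{"q":[]}`), "q"))
}
```

Because the error is tied to the query result rather than to whether data was written, it stays separate from any other conditions placed in the @if clause.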

If I’m understanding your response, you’re stating that dgraph shouldn’t raise an exception when there’s “bad data”, because the application may have a custom process to resolve the situation. If that’s what you’re saying, that’s fine, I agree, but that doesn’t really change anything. Clients can continue to handle that use case by using uid and checking the json response (like you mentioned earlier). Adding a new uid! function would support a separate use case (a common use case), without removing any use cases that exist today. It is strictly an additive change.

At best, that is a temporary workaround until a more concrete solution (e.g. uid!) is implemented. I do not consider your suggestion a viable long-term solution because it is not explicit. As a client, I have to read the response and make an assumption that if no data was written, then it must be some sort of data integrity violation. In your simple example above, that may be true, but in practice, there could be a number of reasons why data was not written. I could have an if statement that says don’t write data if the user’s age value is even; don’t write data if the user’s email provider is not gmail. These are all arbitrary use cases, but my point is that there’s a difference between business logic and schema integrity. I would much prefer to have an explicit NodeNotFoundException returned from dgraph instead of an upsert silently being ignored.
I’m using the go client. I would like to be able to do something like this:

// an error is returned when a foreign key constraint is not met
if _, err := txn.Do(ctx, req); err != nil {
  return err
}

I think your suggestion would require me to unmarshal the response (because no error would be returned) and make an assumption as to whether or not I need to return an error. That feels odd to me. If I submit a malformed query, I get back an error; I don’t have to unmarshal the response and check to see if I need to return an error–the database handles database issues for me.