Absent uids are returned if querying via uid function

Moved from GitHub dgraph/5817

Posted by anurags92:

What version of Dgraph are you using?

Dgraph version : v2.0.0-rc1-452-g5c75240a3
Dgraph SHA-256 : c3fa23e3909376d2b5e567cbe963a81132c9c83e7e40b4051d56cba8105dece2
Commit SHA-1 : 5c75240a3
Commit timestamp : 2020-07-02 11:02:02 -0700
Branch : master
Go version : go1.14

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

RAM: 16gig
OS: Linux/Ubuntu
Model name: Intel® Core™ i7-7700 CPU @ 3.60GHz

Steps to reproduce the issue (command/config used to run Dgraph).

Query:

{
  q(func: uid("0xa", "0xc")) {
    uid
    name
  }
}

Expected behavior and actual result.

Expected results: It should return no results, or throw an error, since the uids have not been created.
Actual results:
It seems that uids are present but the predicate is missing in them.

{
  "q": [
    {
      "uid": "0xa"
    }, 
    {
      "uid": "0xc"
    }
  ]
}

sessionboy commented :

Looking forward to fixing this issue in the new version.

Any update ?

This is an old characteristic of Dgraph; it is like this by design. Several users have discussed it here on Discuss, and the conclusion was that it is normal behavior that can be ignored. If a user doesn’t want this result, they can use @cascade with their default predicates in the body of the query, or combine it with expand, like:

{
  q(func: uid("0xa", "0xc")) @cascade {
    uid
    expand(User)
  }
}

Once your DB has leased a UID, it will “exist” no matter what. And if you use a non-existent uid in the query func “uid(x)”, Dgraph won’t verify whether there is any value on that node or whether it was already leased. It will assume that you are right and that you are looking for an existing UID, because checking would take time and increase latency. (That final part is my opinion and I can be wrong, but it would surely have to be aware of the type system and so on, which would increase processing and latency.)

So the user would be better off hard-coding a cascade op or combining it with the expand method.

@Anurag does that work for you?


@MichelDiz :expressionless: I don’t think this design brings much benefit. On the contrary, it is very bad: it violates the usual database design principles.
Compared with the performance it saves, it brings more confusion and insecurity to users.

In actual development, there are usually many queries that depend on an id. When the id does not exist, it means that the data does not exist. But Dgraph violates this principle, which will cause many problems.

1, Applications with type checking will crash directly, because the query returns only an id that shouldn’t exist and none of the other required fields.

2, I cannot check whether the data exists based on the id; if I do, I get an incorrect result.
If the node does not have other unique fields, I will not be able to check whether the data exists at all.

3, In actual business, queries based on id are common, such as /post/{id}, /user/{id}, /user/{id}/{status_id}, etc. Once the correct data cannot be queried based on id, it will result in 404 or other serious errors.
This will cause the application to crash, bring a very bad experience to the user, and even create messy and wrong data.

The above are some of the problems I encountered.

I know about @cascade and expand, but these features were not designed to solve this problem.

I am very confused by this solution, but I still tried to use it, and the bad thing is that it does not work.

Suppose I have the user data as follows:

{
   "uid": "0x4",
   "name": "jack",
   "age": 22,
   "description": null,
   "birthday": null
}

1, When I execute the following query, it works normally:

{
  data(func: uid(0x4)) @cascade {
    uid
    name
  }
}

Get the result:

 "data": [
      {
        "uid": "0x4",
        "name": "jack"
      }
    ]

2, If a field with a value of null is queried, it will return empty:

{
  data(func: uid(0x4))@cascade {
    uid
    name
    description
  }
}

Get the result:

 "data": []

This is not the result I want.

3, When I combine it with expand, I also get the wrong result:

{
  data(func: uid(0x4))@cascade {
    uid
    expand(User)
  }
}

Get the result:

 "data": []

4, When I specify other fields, it will throw an exception:

{
  data(func: uid(0x4))@cascade {
    uid
    name
    expand(User)
  }
}

Get the result:

"errors": [
    {
      "message": ": Repeated subgraph: [name] while using expand()",
      "extensions": {
        "code": "ErrorInvalidRequest"
      }
    }
  ]

The above is the result of my test in ratel.

Obviously @cascade and expand cannot solve this problem. Even if they could, this would still be a very bad solution.

In short, I think it is necessary to check whether the id actually exists. Almost all databases are designed this way, but Dgraph is an exception, which is very bad.

Performance is important, but practicality and safety are even more important.

I hope the team can solve this problem as soon as possible; otherwise my project will not be able to go online, or I will have to give up Dgraph and use another database. :triumph:

Tree traversal starts from a root, and graph traversal starts from a node.

We can’t start traversing from an imaginary node, so we shouldn’t enter a random UID and hope it will work.

E.g. users are the origin nodes from which all other traversals for user-related data should happen.

P.S. Where are you getting those UIDs from before querying for their values?

@abhijit-kar

There are many situations.

For example:

1, A user maliciously fabricates an id. This can easily occur: with a URL such as /user/23435, when the user enters the id 23435, which does not exist, it will cause the above problem.

2, An id has been deleted but is still saved in other nodes.
For example, in social applications like Twitter, the reply_id is related to the status_id, and the reply_id is not deleted after the status_id is deleted. The status_id is still stored in the reply.

Regardless of whether the above situations occur, this problem should be resolved. Querying by id and getting a wrong result is in itself an unreasonable design.

If a user maliciously fabricates an id, they won’t get any result if the id doesn’t exist.
And if the id exists, auth can prevent unauthorised access.

But why would the nodes be retained?

If I am not following someone, it doesn’t make sense to keep a link to them in my follows list.

Also, if the user I follow has deleted their account, the link to their account will be severed and won’t be kept in my following list.

P.S. Tagging @amaster507, so that he might add his perspective on this, as he has experience with building a much bigger project.

You may not really understand the issue. This is the original description of the issue: the uid has not been created, but it appears in the query’s result list.

Even if the id does not exist, Dgraph will still return it.
This causes the problems mentioned above:

This makes sense for the author, because I still want to know what replies I have written, even if the status has been deleted. This is very common in social applications.

But no matter how many questions you have, what I want to express in the end is:

I know, that confused me when I first touched Dgraph too. But I got used to it.

When you say insecurity, do you mean in a self-confidence sense?

Yeah, I had the same question, like, 3 years ago with @pawan, when I knew nothing about the core design. Back then I asked whether it would be better to return an empty result, so we could ignore it at the application level. Pawan replied that a simple length check could solve the case, and I think it was a good point. But I see that if the user puts uid in the query body, the length check won’t help. So we need to do one of two things: either we do not ask for the UID (we already know what it is, because we are using it as a parameter), so the answer comes back empty and a length check works, or we do a more complicated check that ignores the uid predicate.

I know that it is a bit complicated at first, but you get used to it fast.

Yes, you can. See, the UID itself isn’t a source of truth. The other predicates are. If your query returns only the UID, you should consider it a non-existent entity.
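The "UID alone is not a source of truth" rule can be applied in the application layer. Below is a minimal Python sketch, assuming the query response has already been decoded from JSON into dicts shaped like the result in the original report. The helper name `existing_nodes` is hypothetical and not part of any Dgraph client.

```python
def existing_nodes(results):
    """Keep only result objects that carry at least one predicate besides uid.

    A node that comes back as {"uid": "..."} and nothing else is treated as
    non-existent, following the rule described above.
    """
    return [node for node in results if any(key != "uid" for key in node)]


# Response shaped like the one in the original report, plus one real node.
response = {"q": [{"uid": "0xa"}, {"uid": "0xc"}, {"uid": "0x4", "name": "jack"}]}

found = existing_nodes(response["q"])
print(found)
```

Note that this check only works if the query body asks for at least one predicate that every real node is guaranteed to have.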

Yeah, this is a REST API pattern. I see that several users are used to this pattern. But it isn’t the only one.

Yeah, it will return empty because your dataset used “null”. Dgraph doesn’t store null values, so those predicates simply don’t exist on the node. And @cascade only returns nodes that have all of the fields requested in the query body. If your node has only some of the requested fields, it is dropped, because it has to have all of them.
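To make the interaction between null values and @cascade concrete, here is a toy Python model of the behavior described above. It is only an illustration of the rule, not how Dgraph implements it; `store` and `cascade` are hypothetical helper names.

```python
def store(node):
    """Model of 'Dgraph doesn't store null': drop keys whose value is None."""
    return {key: value for key, value in node.items() if value is not None}


def cascade(nodes, requested):
    """Model of @cascade: keep a node only if it has every requested field."""
    return [
        {key: node[key] for key in requested}
        for node in nodes
        if all(key in node for key in requested)
    ]


# The user data from the earlier post, including the two null fields.
jack = store({
    "uid": "0x4",
    "name": "jack",
    "age": 22,
    "description": None,
    "birthday": None,
})

with_name = cascade([jack], ["uid", "name"])
with_description = cascade([jack], ["uid", "name", "description"])
print(with_name)
print(with_description)
```

In this model the first query keeps the node and the second drops it, which matches the two results reported earlier in the thread.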

This happens because the type User already has the predicate name, so you shouldn’t also request it before expand(User).

I kind of agree with you. But I don’t see this as the end. We can do the checks on our end (at the application level). It’s a cost-benefit choice.

PS. A better solution than cascade is to not request the uid at all, e.g.:

{
  q(func: uid("0xa", "0xc")) {
    name
  }
}

That would solve the issue, because a node without a name returns nothing.
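When uid is left out of the query body, the simple length check mentioned earlier is enough. A minimal Python sketch, assuming a decoded JSON response and a query block named `q`; `first_or_none` is a hypothetical helper name.

```python
def first_or_none(response, block="q"):
    """Return the first node of a query block, or None if the block is empty.

    With uid omitted from the query body, a uid that has no requested
    predicates yields an empty list, so an empty block means 'not found'.
    """
    nodes = response.get(block) or []
    return nodes[0] if len(nodes) > 0 else None


missing = first_or_none({"q": []})
present = first_or_none({"q": [{"name": "jack"}]})
print(missing, present)
```

A handler for a route such as /user/{id} could map a `None` result to its own "not found" response.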

Cheers.

I don’t think I can add much value to this conversation. It is nice to be aware of this, but it is not really up my alley. I am using DQL in a different kind of way. We use DQL for some of the more admin-side work, transforming data as we change the schema around while we build. But my primary use of DQL is multi-level filtering. This is just not possible with a pure GraphQL endpoint at this time (and possibly never will be). But I cannot use DQL exclusively, because I need to honor @auth rules. So what do I do? I use DQL var blocks to build a filtered set of IDs for the type where I want the filter applied on the GraphQL endpoint. Then I return this set of filtered IDs to the client with a custom query through the GraphQL endpoint, and use it to do what would be called an IN(...) clause in other terminology. GraphQL ignores any id that does not exist, and it filters out the ids that do not match the @auth rules. And as suggested above, use @cascade. I have learned to use @cascade many times over, as it is a lifesaver.


@MichelDiz
So… Based on the above discussion, the best solution at present is to create a unique id with some algorithm instead of using Dgraph’s own uid. But this seems to cause another problem: it is not possible to create an edge based on that id to associate nodes. :expressionless:

Dgraph has many strange and unreasonable designs.

I am investigating Nebula Graph, an excellent graph database written in C++, hoping it brings good results.

I also hope that Dgraph can make greater improvements, such as adjusting its architecture and providing a more convenient and reasonable API. It would be even better if Dgraph could be refactored in Rust, the best language at present, as Microsoft, Deno, etc. are currently doing.

Of course I know this is actually difficult.

Hey @pandalive, let me share the approach that we took to solve this problem with the GraphQL API. This problem doesn’t exist there because all nodes have a type.

So, for example, if you had a type Author which looks like

type Author {
  id: ID!
  name: String!
}

and you ran the following query

{
  getAuthor(id: 123456) {
    id
    name
  }
}

you would get an empty result if id 123456 didn’t exist or wasn’t an Author. This is because GraphQL gives us a nice type system, and the underlying DQL query that it runs is

{
  getAuthor(func: uid(123456)) @filter(eq(dgraph.type, "Author")) {
    id: uid
    name: Author.name
  }
}

So if you were using types with Dgraph and all your queries had a type filter, absent uids wouldn’t be returned. Hope that solves the problem that you are facing.
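The same idea can be used from plain DQL by always attaching a type filter to uid lookups. Below is a minimal Python sketch that only builds the query text; it assumes nodes carry a `dgraph.type` and uses DQL's `type(...)` filter function. The helper name `build_typed_uid_query` and its parameters are hypothetical, and the validation is there because ids often come from URLs such as /user/{id}.

```python
import re

# Dgraph uids written in hex form, e.g. 0xa
_UID_RE = re.compile(r"^0x[0-9a-fA-F]+$")
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_typed_uid_query(uids, type_name, predicates, block="q"):
    """Build a DQL query that looks up uids but keeps only nodes of one type."""
    for uid in uids:
        if not _UID_RE.match(uid):
            raise ValueError(f"not a hex uid: {uid!r}")
    if not _NAME_RE.match(type_name):
        raise ValueError(f"invalid type name: {type_name!r}")

    uid_list = ", ".join(uids)
    fields = "\n".join(f"    {predicate}" for predicate in predicates)
    return (
        "{\n"
        f"  {block}(func: uid({uid_list})) @filter(type({type_name})) {{\n"
        f"{fields}\n"
        "  }\n"
        "}"
    )


query = build_typed_uid_query(["0xa", "0xc"], "User", ["uid", "name"])
print(query)
```

The resulting string can be sent with whichever Dgraph client the application already uses. A uid that was never created has no type, so it should not pass the filter.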


@pawan Thanks for your suggestion.

In view of the following points, I am cautious about the GraphQL API.

1, At present, GraphQL is not yet widespread. Only a small number of people learn and use GraphQL, and it is less mature than REST. This means that using GraphQL will increase costs.

2, The GraphQL API is a complete back-end service. I don’t know how to integrate it into my own back-end service, which is built with Rust. This makes me very confused.

I think I could send an HTTP request directly to the GraphQL API server to get data, or forward the web request to the GraphQL API server. But either way, I think it’s bad.

The GraphQL API server seems to increase the request cost. My application’s request path would become:

web(nodejs) ---> rust backend server --> GraphQL API server  --> dgraph server 

This will be a very time-consuming request path, resulting in poor performance.

3, The GraphQL API only supports HTTP; it cannot use gRPC to improve performance.

4, The GraphQL API still does not support some advanced features of GraphQL±.

In view of the above points, I have been hesitant to use the GraphQL API.

Of course, if some of the problems mentioned above can be solved, I will be happy to use it.

@MichelDiz @pawan Please give me some advice.

You’re right, GraphQL isn’t popular yet, like several technologies that took time to gain traction. REST, although it is a bit complicated to create and maintain, is the dominant tech today. But before it, SOAP was the “king” for a long time. It was terrible: complex (the opposite of what its name claimed), full of XML serialization, and pretty hard to maintain. REST gained some visibility 8 years after SOAP (it started in the early 2000s) and began to take SOAP’s place.

It is easy to see why people migrated rapidly to REST. SOAP was the “old Java” (today Java seems way better).

So it took 5 years for REST to start taking SOAP’s place, and 3 more years to gain traction. GraphQL, on the other hand, was developed internally at Facebook starting in 2012, released publicly in 2015, and moved to the GraphQL Foundation in 2018. That is, GraphQL is still a child in its early stages of life. However, it is already solving REST’s problems.

It’s a choice that you make: learn GraphQL and take advantage of everything that Dgraph’s GraphQL brings out of the box, or use REST and maintain the entire system manually, with N+1 issues, over-fetching, under-fetching, more HTTP requests than you need, poorly documented APIs, URL path hell, and other problems that you may discover in comparisons on the Web.

It’s a personal choice: learn GraphQL or stay comfortable with the technology you already use. When REST appeared, a lot of people wrinkled their noses at it, but today I don’t see (at least in my country) many developers still using SOAP (except within old companies or the government, who still use COBOL). We may see this story again with GraphQL.

In the case of Dgraph’s GraphQL, you’re right. But GraphQL is just a “layer”, nothing more. It is a language for APIs. Dgraph, on the other hand, offers a ready-made GraphQL API out of the box. Maybe that’s where your confusion lies. There are ways to integrate Dgraph’s GraphQL with a custom GraphQL server. You could try implementations like Apollo Federation https://www.apollographql.com/docs/apollo-server/federation/introduction/ and there are other solutions that you would need to evaluate.

I’m not sure whether we already support Apollo Federation. Let me ping @michaelcompton to get his word.

This concept, although apparently logical, is not accurate. Each Dgraph instance is a “clone” of the same API (you can access the same API on any Alpha). There is no separation between GraphQL and Dgraph. It is as if the two are the same.

Is this “web(nodejs)” another server? What does it do?

You can put a load balancer (if you have multiple Alphas, of course) in front of Dgraph’s GraphQL API. That should make everything similar to or faster than gRPC.

You can use custom DQL https://dgraph.io/docs/graphql/custom/graphqlpm/

Cheers.

Not yet.
