Bug with multi-level (nested blocks) in Upsert Block

Moved from GitHub dgraph/4779

Posted by MichelDiz:

What version of Dgraph are you using?

Latest

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

N/A

Steps to reproduce the issue (command/config used to run Dgraph).

Mutation Sample

I added “dgraph.type” just to be sure.

{
  "set": [
    {
      "type": "node",
      "dgraph.type": "Node",
      "id": "0",
      "labels": ["Movie"],
      "properties": {
        "tagline": "Welcome to the Real World",
        "title": "The Matrix",
        "released": 1999,
        "dgraph.type": "Properties"
      }
    }
  ]
}

Schema Sample

type Node {
  properties
  type
  id
  labels
  tagline
  title
  released
}

type Properties {
  tagline
  title
  released
}

Query example

{
  movies(func: has(properties)) @filter(eq(type, "node")) {
    type
    id
    labels
    dgraph.type
    properties {
      tagline
      title
      released
    }
    tagline
    title
    released
  }
}

Run the Upsert Block

upsert {
  query {
    movies(func: has(type)) @filter(has(properties)){
      v as uid
      Labels as labels
      properties {
        Title as title
        Released as released
        Tagline as tagline
      }
    }
  }
  mutation {
    set {
      uid(v) <migrationDone> "True" .
      uid(v) <title> val(Title) .
      uid(v) <dgraph.type> val(Labels) .
      uid(v) <released> val(Released) .
      uid(v) <tagline> val(Tagline) .
    }
  }
}

Expected behaviour and actual result.

This operation should copy the values captured in variables from the nested block, “properties”, into the parent node, allowing a kind of “migration” using the Upsert Block. The feature is also equivalent to a “merge”: we collect the values from the nested block, paste them into the target node, and delete the nested block right after.
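For completeness, here is a rough sketch of what that final “delete the nested block” step could look like once the copy works. This is only an illustration, reusing the predicate names from the examples above and assuming migrationDone was set by the copy step.

upsert {
  query {
    migrated(func: has(migrationDone)) @filter(has(properties)) {
      v as uid
      P as properties { uid }
    }
  }
  mutation {
    delete {
      uid(v) <properties> uid(P) .  # detach the nested node from the parent
      uid(P) * * .                  # then drop the nested node entirely
    }
  }
}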

With this “hack” I was able to make it work.

WARNING: this “hack” works mutation by mutation, not as a bulk upsert. If you try to bulk upsert, you can get unpredictable results.

upsert {
  query {
    movies(func: has(type)) @filter(has(properties)){
      v as uid
      Labels as labels
      properties {
        Title as title
        Released as released
        Tagline as tagline
      }
    }
    me() {
      TitleMe as sum(val(Title))
      ReleasedMe as sum(val(Released))
      TaglineMe as sum(val(Tagline))
    }
  }
  mutation {
    set {
      uid(v) <migrationDone> "True" .
      uid(v) <title> val(TitleMe) .
      uid(v) <dgraph.type> val(Labels) .
      uid(v) <released> val(ReleasedMe) .
      uid(v) <tagline> val(TaglineMe) .
    }
  }
}

Just an Update

Testing the PR for this issue, “Fix bug, aggregate value var works with blank node in upsert” by mangalaman93 (dgraph-io/dgraph#4767). It now creates a new node instead of updating the existing one (the goal of #4692 is to just update instead of creating new ones). It works like this.

Note that this works only as a one-to-one operation. It won’t work one-to-many (like a bulk upsert).

upsert {
  query {
    movies(func: has(type), first:1) @filter(has(properties) AND not has(migrationDone)){
      uid
      qID as id
      Labels as labels
      properties {
        Title as title
        Released as released
        Tagline as tagline
      }
    }
    me() {
      LabelsMe as sum(val(Labels))
      TitleMe as sum(val(Title))
      ReleasedMe as sum(val(Released))
      TaglineMe as sum(val(Tagline))
      qIDMe as sum(val(qID))
    }
  }
  mutation {
    set {
      _:New <migrationDone> "True" .
      _:New <id> val(qIDMe) .
      _:New <title> val(TitleMe) .
      _:New <dgraph.type> val(LabelsMe) .
      _:New <released> val(ReleasedMe) .
      _:New <tagline> val(TaglineMe) .
    }
  }
}

mangalaman93 commented :

Further work that needs to be done:

  • How would SQL handle this case?
  • Is it possible to do this with existing functionality at all?

Blocks #4767

MichelDiz commented :

SQL has no exact equivalent because it is not a graph DB. But in theory it would be the same as MERGE (see “MERGE (Transact-SQL) - SQL Server | Microsoft Docs”), which is the ability to merge two tables into one. That is the case with the operation where I’m using the Upsert Block.

In the operation above, I am merging two nodes, “parent” with “child”; that is, bringing the nested data up to the parent. Then I would delete the nested data.

Yes, it is possible with the existing functionality, but it is limited. You need to create an extra block in a kind of “hacky way” for it to work properly, and even then it sometimes doesn’t work as expected. Also, bulk upsert is impossible with this approach; the user would need to run hundreds or thousands of Upsert Blocks to complete the entire process.

BTW, this issue is slightly different from #4767; the only similarity is the use of an aggregation block to make this feature work.

MichelDiz commented :

Adding a use case based on a question asked by a user on Slack.

Question:

Does anyone have experience loading big datasets into a live cluster?
The issue we’re facing is loading a large number of vertices and edges into the cluster, where some of these nodes may already exist and we don’t want duplicates.
I was hoping to be able to write all the data as files with an array of upsert blocks and feed it to the live loader, but that does not seem to work.
Any ideas?

What I would answer.

Upsert Block is too new a feature; it won’t work with any loader.
One thing you can do is add a unique edge with a unique value to each entity, and then, once everything is loaded, do upserts to merge them.
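For illustration, the kind of per-entity upsert meant here might look like the sketch below (the xid predicate, its index, and the values are hypothetical). Note that this works because the values are literals; the broken part, as this issue describes, is copying values between nodes with val().

upsert {
  query {
    # assumes a unique <xid> predicate, with an exact index, was attached during the load
    dup as var(func: eq(xid, "entity-123"))
  }
  mutation @if(eq(len(dup), 0)) {
    set {
      _:new <xid> "entity-123" .
      _:new <name> "Some Entity" .
    }
  }
  mutation @if(gt(len(dup), 0)) {
    set {
      uid(dup) <name> "Some Entity" .
    }
  }
}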

But I just remembered that this isn’t possible either, due to these issues.

mangalaman93 commented :

I was looking at how upsert works in other databases. I didn’t find any example where a database allows updates using the values of variables defined in the upsert’s query.

I also noticed that databases only allow a single-row update, or the same update applied to all rows that satisfy the query criteria.

I recommend that we keep the current behaviour of val as it is, because it provides a one-to-one mapping of nodes to their values, similar to the single-row update that other databases provide. I also think we should accept the change proposed in PR dgraph-io/dgraph#4767 (“Fix bug, aggregate value var works with blank node in upsert”) to at least allow some of the use cases.

We can keep thinking about how to better solve the problems raised by this issue. I do not have a good solution that seems worth implementing.

cc @manishrjain @MichelDiz

seanlaff commented :

@mangalaman93 Perhaps this is a new feature unrelated to variables? What if there were a function that returns a predicate’s value for a given uid? Intuitively, if I can say uid(v) in a mutation, I’d expect to also be able to say something like predicateValue(uid(v), "starttime"). I think this would solve my issues; what about you @MichelDiz ?
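For illustration, the proposed function might be used like this (hypothetical syntax; predicateValue() is not implemented in Dgraph):

upsert {
  query {
    v as var(func: eq(eventtype, "myevent"))
  }
  mutation {
    set {
      # hypothetical: read starttime directly from each matched uid,
      # with no query variable or aggregation workaround needed
      _:newNode <starttime> predicateValue(uid(v), "starttime") .
    }
  }
}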

As a small note, Dgraph does not raise an error when you try to use vars like this in bulk contexts, which is probably a small bug given that it is not currently supported. E.g., the _:newNode <endtime> val(endVar) . line below doesn’t do anything.

upsert {
  query {
    v as var(func: eq(eventtype, "myevent")) @filter(lt(starttime, "2005") AND gt(endtime, "2005")) {
      startVar as starttime
      endVar as endtime
    }
  }
  mutation @if(gt(len(v), 0)) {
    set {
      uid(v) <endtime> "2005" .
      _:newNode <starttime> "2005" .
      _:newNode <endtime> val(endVar) .
    }
  }
}

I’m using Dgraph to store event information; nodes look like this: { eventType, start, end }.

My API receives discrete beginEvent and finishEvent events from a message bus (which happens not to guarantee ordering).

When I ingest into Dgraph, I want to do conditional upserts. For example:
If I receive a beginEvent at t2, I’d like to find all events that span that time, for example an event like {eventType: myevent, start: t0, end: t5}, and, using an upsert, end up with two nodes {start: t0, end: t2} and {start: t2, end: t5}. (With an approach like this, I’ll have eventual consistency even if my events are delivered out of order.)

The bug fix that recently went in makes this possible for a single event update at a time (using that me(){ myvar as sum() } syntax), but I’d like to be able to do batches for performance reasons.
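For reference, the single-event version might look like the sketch below. It reuses the me() workaround from earlier in the thread; the predicate names and the “2005” split point follow the example above, and sum() mirrors the thread’s trick (min()/max() may suit non-numeric values better).

upsert {
  query {
    v as var(func: eq(eventtype, "myevent"), first: 1)
      @filter(lt(starttime, "2005") AND gt(endtime, "2005")) {
      endVar as endtime
    }
    me() {
      endValue as sum(val(endVar))  # hoist the nested value so the mutation can use it
    }
  }
  mutation @if(gt(len(v), 0)) {
    set {
      uid(v) <endtime> "2005" .            # truncate the spanning event
      _:newEvent <eventtype> "myevent" .
      _:newEvent <starttime> "2005" .
      _:newEvent <endtime> val(endValue) . # second half of the split
    }
  }
}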

mangalaman93 commented :

@seanlaff This seems like a good idea. Let me think through this.

mangalaman93 commented :

I just realized that what you are suggesting is already possible. You could do something like:

upsert {
  query {
    .
    .
    .

    var(func: uid(v)) {
      p as starttime
    }
  }

  # Now you can use val(p) along with uid(v) and it'd work fine,
  # something like below:
  mutation {
    set {
      uid(v) <SomePredicate> val(p) .
    }
  }
}

Now, the problem we are trying to solve is when a different set of UIDs needs to be correlated with the UIDs in a variable.
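To make the unsupported case concrete, here is a minimal sketch (predicate names hypothetical):

upsert {
  query {
    v as var(func: has(starttime)) {
      p as starttime
    }
  }
  mutation {
    set {
      # supported: val(p) resolves per matched uid, node by node
      uid(v) <starttimeCopy> val(p) .
      # the open problem: the blank node has no correlation with the
      # uids that p was computed on, so this line is silently dropped
      _:clone <starttime> val(p) .
    }
  }
}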

seanlaff commented :

@mangalaman93 Exactly, it’s possible with one match but does not work with an array of matches. I was thinking perhaps a new func, since the mutation doesn’t exist within the closure of the block in which the predicate var was defined… but on the other hand, my first intuition when writing the query was that I could use a predicate var within the mutation and the planner would “know” which uid to pull it from, so perhaps that is a better goal to strive for.

Sceat commented :

Is this issue related to the fact that properties is an edge, so val(Title) could be many different titles? I have no idea how Dgraph manages values underneath, but I’d expect val(Title) to take the last title found in the sequence of properties nodes, thus I don’t understand why extracting Title is different from extracting Labels or v :thinking:

Imagining it as a pseudo-Cypher query, it would look like

MATCH (movie)-[properties]->(prop)
SET movie.title = prop.title

meaning (unless I’m wrong) that movie.title equals the last matched property title.

MichelDiz commented :

I created an issue requesting a loop feature, and I believe it would solve the problem we have in this issue here.

Here’s an example.

upsert {
  query {
    source as movies(func: has(type), first:100)
    @filter(has(properties) AND not has(migrationDone)){
      qID as id
      Labels as labels
      properties {
        Title as title
        Released as released
        Tagline as tagline
      }
    }
  }

  mutation @exec(foreach(in: source)) {
    set {
      _:New <migrationDone> "True" .
      _:New <id> val(qID) .
      _:New <title> val(Title) .
      _:New <dgraph.type> val(Labels) .
      _:New <released> val(Released) .
      _:New <tagline> val(Tagline) .
    }
  }
}

What does that mean?
For each node found in the query, Dgraph would add a new mutation block to the transaction.
That is, if the query finds 100 nodes, it would create 100 “mutation/set” blocks, execute the mutations, and finally commit the transaction. This is the idea.
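Conceptually, for two matched movies the expansion would look something like this (illustrative values only, not real syntax or output):

# One generated set block per matched node:
set {
  _:New1 <migrationDone> "True" .
  _:New1 <id> "0" .
  _:New1 <title> "The Matrix" .
}
set {
  _:New2 <migrationDone> "True" .
  _:New2 <id> "1" .
  _:New2 <title> "The Matrix Reloaded" .
}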

Sceat commented :

I’m currently working on a rewrite of the GraphQL execute function, so I think I understand the issue better (thanks for the ping)! I’m faced with the same problem, which I plan to solve by treating exported data as-is.

  • a single value is exported as a single value
  • an array is exported as an array
  • a single value inside an array is exported as a sequence

The Neo4j foreach is basically a scoped unwind (#5165) and indeed would totally solve the problem.

In my case I’m going to follow the concept that I find the most intuitive, meaning that if I had to implement a foreach, it would work on an exported array only, but not on your current example, which for me is:

  • source is a sequence of n movies
  • in each source:
    • qID is a single value
    • properties is an array
    • Title is a sequence of n properties

The unwind would work on everything, as everything is a sequence (a single value is a sequence of 1), and each operation definition (the mutation in our case) represents a template of what needs to be done on each sequence (which is most of the time a sequence of 1, unless we are in a bulk situation like here, with a sequence of n source).

PS: I have written an execution process (6.) here, which may help.

MichelDiz commented :

Hey Sceat,

Going back in time a little. Sorry for missing your question.

The answer will be really BIG (sorry also for that), but I will cover everything.

Is this issue related to the fact that properties is an edge so val(Title) could be many different titles?

Not exactly. But also that.

“thus I don’t understand why extracting Title is different from extracting Labels or v”

That’s about levels. I’m going to explain it better below, as far as I know.

Continuing

Internally, Dgraph treats query levels as distinct; that is, level 1 is not related to level 2, and so on. So I created an extra block called “me()” to capture the values from any level and transport them to the mutation block (or any block I want). That is, with the “me()” block I brought the nested values up to level one. That is why my “hacky” query works.

The values need to be at the same level in order for a mutation to work. A mutation does not capture nested levels.

I don’t know how to explain it technically because I’m not a core dev; I claim this based on my experience. But it has nothing to do with the fact that there could be several objects on that edge. This is unexpected behavior. In the end, though, several objects on an edge could lead to even more issues.

Let’s try this out (use the latest release or the master version).

Run this mutation

{
  "set": [
    {
      "name": "MyTest",
      "properties":[
        {"propertie1": "Test 1"},
        {"propertie2": "Test 2"},
        {"propertie3": "Test 3"}
      ]
    }
  ]
}

You can also try a single-property object:

{
  "set": [
    {
      "name": "MyTest",
      "properties":[
        { 
            "propertie1": "Test 1",
            "propertie2": "Test 2",
            "propertie3": "Test 3"
        }
      ]
    }
  ]
}

Run this query

  query {
    movies(func: has(properties), first:1000) @filter(not has(migrationDone)) {
      uid
      name
      properties {
        propertie1
        propertie2
        propertie3
      }
    }
  }

You are going to see the data from the mutation; all normal up to here.

Run this upsert

upsert {
  query {
    movies(func: has(properties), first:1000) @filter(not has(migrationDone)){
      NAME as name #Level 1
      properties {
        propertie1V as propertie1 #Level 2
        propertie2V as propertie2 #Level 2
        propertie3V as propertie3 #Level 2
      }
    }
  }

  mutation {
    set {
      _:New <migrationDone> "True" .
      _:New <name> val(NAME) .
      _:New <propertie1> val(propertie1V) .
      _:New <propertie2> val(propertie2V) .
      _:New <propertie3> val(propertie3V) .
    }
  }
}

Now run this query to check what happened.

  query {
    A1 as var(func: has(propertie1))
    A2 as var(func: has(propertie2))
    A3 as var(func: has(propertie3))
    A4 as var(func: has(migrationDone))
    A5 as var(func: has(name)) @filter(not has(properties))

    q(func: uid(A1, A2, A3, A4, A5), first:1000) {
         name
         propertie1
         propertie2
         propertie3
         migrationDone
      properties { # This won't appear; it's here only to ease the conscience.
        propertie1
        propertie2
        propertie3
      }
    }
}

Results

The single-property object would create one new object and set migrationDone on the previously created node.

With multiple objects you should see something like

{
  "data": {
    "q": [
      {
        "propertie1": "Test 1"
      },
      {
        "propertie2": "Test 2"
      },
      {
        "propertie3": "Test 3"
      },
      {
        "migrationDone": "True"
      }
    ]
  }
}

As you can see, it created 4 new nodes/objects, because it treats each object from the query independently. And that is a problem; it is not the behavior we want. This behavior is intrinsic to the fundamental design of Dgraph’s query system.

But it could get even worse. Imagine that we have 1000 nodes with properties. It would simply create around 4000 new nodes! That’s why I always limit this type of query with first: 1 and add the extra “me()” block.

If we use an existing UID instead of a blank node, the same thing happens. It simply ignores the correlation and creates three extra nodes; the node you used will probably only get the “migrationDone” value.

This is where foreach would come in.

With foreach I can make mutations based on the root map, which in this case is “source”. That is, it will “isolate” the context at each step, iterating over a map from the “source” variable.

Let’s say again that we have 100 nodes with properties.

It will iterate over the “movies” in isolation, one by one, collecting the values and assembling the mutation block for each movie, 1 out of 100.

Back to the last comment

From your last comment, I realize that even with foreach there could still be problems in some use cases.

For example, you expect “properties” to be an array. So far so good. But in practice, Dgraph would need another mechanism to expand this array within the mutation block. That was never supported, and it sounds like something even more complex to implement.

Foreach would work fine when expanding the nested blocks in the query, but not for a nested block held in a variable itself. In that case, you need something that expands it, and that is another issue.

BTW, it is important to emphasize that the Upsert Block was never foreseen in Dgraph’s design. So these things happen, and we need to find ways to make them work. It is not easy to change a wheel while the car is moving.

Cheers.

Sceat commented :

Oh okay, makes sense! That brings an unexpected amount of conceptual problems… Could a core dev bring more internal details on how execution works? Technically, if Dgraph could evaluate the result of the query first and export a kind of flattened var map to pass to the mutation, there would be no problems, I guess.