Feature request: detached delete for node

Moved from GitHub dgraph/3447

Posted by liveforeverx:

What you wanted to do

Possibility without query all related ids/delete this ids to delete nodes without leaving edges, which points to nowhere.

What you actually did

As many people, delete node and forgot to delete edges, this will memory(edge) leak by leaving edges, which shouldn’t exists still existing (may be dgraph should have background workers, which do cleanup of non existing references or support DETACH_DELETE ?).

I couldn’t find any documentation, so I assumed that it willn’t leave non existing data in database.

Any external references to support your case

DETACH_DELETE is supported in neo4j: Neo4j - Delete a Relationship using Cypher

MichelDiz commented :

This issue started after a conversation here
1 - https://dgraph.slack.com/archives/C13LH03RR/p1558273471042900
2 - https://dgraph.slack.com/archives/C13LH03RR/p1558276319045600

Example to reproduce this:

Data used: dgraph/family.json at master · dgraph-io/dgraph · GitHub

set name index to exac

{
  q(func: eq(name, "Bart")){
    uid
    name
  }
}
{
  q(func: eq(name, "Homer")){
    uid
    name
  }
}

Now delete Bart.

{
  "delete": [
    {
      "uid": "0xe"
    }
  ]
}

Now query for Homer and his parent_to

{
  q(func: eq(name, "Homer")){
    uid
    name
    parent_to {
      uid
      name
    }
  }
}
{
  "data": {
    "q": [
      {
        "uid": "0xd",
        "name": "Homer",
        "parent_to": [
          {
            "uid": "0xe"
          },
          {
            "uid": "0xf",
            "name": "Lisa"
          },
          {
            "uid": "0x10",
            "name": "Maggie"
          }
        ]
      }
    ]
  }
}

“Bart ghost still there”

MichelDiz commented :

Temporary way to work around this:

Use Cascade https://docs.dgraph.io/query-language/#cascade-directive

{
  q(func: eq(name, "Homer")) @cascade
  {
    uid
    name
    parent_to {
      uid
      name
    }
  }
}

Count issue

Count don’t solve with cascade, so you have to use filters. Btw you can always use filters to solve this if you don’t wanna use cascade.

{
    q2(func: eq(name, "Homer"))  {
    count(parent_to)@filter(has(name))
  }
}

romshark commented :

I’d also like to add that deleting nodes by hand using Ratel becomes increasingly difficult when there are many outbound and inbound relations without a detach-delete option. It’s just too easy to forget to clean up all of the in- and out-going references and it’s incredibly painful to do by hand.

Even if a detach-delete option would slow the transaction and/or database down significantly I still think it’s absolutely necessary for at least the reasons described above. I’d usually never physically delete data from the database, I’d mark it archived instead, but when I really have to delete it I’d use detach-delete. If I then realize detach-delete is too slow for my needs - I’d optimize it by writing hand-crafted clean-up procedures myself.

detach-delete should automatically delete any outbound and inbound edges of the deleted node.

campoy commented :

This makes total sense as a feature, although it might have some serious implications on performance depending on the implementation we go with.

This would be equivalent to having an option to delete predicates based on their value rather than on their subject. This is explicitly not allowed in the docs: https://docs.dgraph.io/mutations/#delete
I’d argue this is the best place to fix it, rather than adding some new operation.

So for instance, we could do:

delete {
  * * <0x1> .
  <0x1> * * .
}

To delete all predicates with 0x1 as predicate or value, therefore completely removing 0x1 from the database.

pros

No new API added, simply completing the support for delete which so many expected that we added documentation explicitly saying it wasn’t supported for performance reasons.

cons

Not clear how this would apply to JSON mutations, not sure this is an issue at this point, though.
A possible solution would be to indeed add a detach mechanism for JSON in the style:

{
  "detach": [
    {
      "uid": "0x1"
    }
  ]
}

MichelDiz commented :

It is mentioned in the documentation (I think from @pawanrawal ) that * * O or * P O delete are expensive. It costs a lot of performance.

But I found a solution for this. Using the example with Simpson’s family. A detach method with bulk upsert.

As this issue was created before upsert release I couldn’t recommend bulk upsert. But I think this cover it now.

upsert {
  query {
     v as var(func: eq(name, "Bart")){
      p as ~parent_to
    }
  }

  mutation {
    delete {
      uid(p) <parent_to> uid(v) . #detach method
      uid(v) * * .
    }
  }
}

This way you can “detach” Bart from any parent. But need to use reverse in this case.

Note that this is a case where you delete a Child and need to delete edges coming from relatives. Only in this scenario necessary to do like so. Without reverse is possible, but it complicates the query.

Also I did some similar queries to neo4j docs.

In Neo4j

MATCH ()-[r:RELEASED]-() 
DELETE r

In graphql±

upsert {
  query {
    v as var(func: type(Artist)) @filter(has(RELEASED)){ 
       R as RELEASED  
     }
  }

  mutation {
    delete {
      uid(v) <RELEASED> uid(R) . #This will only delete relationships. Not the nodes.
    }
  }
}

From parent is very easy.

In Neo4j

MATCH (:Artist)-[r:RELEASED]-(:Album) 
DELETE r

In graphql±

In this query you need to verify the relationship between two nodes. It gets a little complicated as you will need to check reverse edges.

upsert {
  query {
    var(func: type(Artist)) @filter(has(RELEASED)) {
        R as RELEASED @filter(has(Album)) { 
             v as  ~RELEASED
# By this I am ensuring that I am deleting a relationship that comes from artist and goes to an album.
        }
    }
  }

  mutation {
    delete {
      uid(v) <RELEASED> uid(R) .
    }
  }
}

In Neo4j

MATCH (:Artist {Name: "Strapping Young Lad"})-[r:RELEASED]-(:Album {Name: "Heavy as a Really Heavy Thing"}) 
DELETE r

In graphql±

upsert {
  query {
     var(func: type(Artist)) @filter(eq(Name, "Strapping Young Lad")) {
      R as RELEASED @filter(eq(Name, "Heavy as a Really Heavy Thing")){ 
             v as  ~RELEASED
        }
    }
  }

  mutation {
    delete {
      uid(v) <RELEASED> uid(R) .
    }
  }
}

In Neo4j

MATCH (a:Artist {Name: "Strapping Young Lad"}) DETACH DELETE a

In graphql±

upsert {
  query {
    a as var(func: type(Artist)) @filter(eq(Name, "Strapping Young Lad"))
  }

  mutation {
    delete {
      uid(a) * * .
    }
  }
}

Note that this query deletes the artist node and its nested relations.

campoy commented :

Thanks for the very thoughtful note, @MichelDiz.

I think the upserts you’re providing definitely show that there’s already a way to delete nodes completely.

If we are afraid of the operation being too slow, I wonder whether we could require a reverse index on any predicates for which we want to delete by value.

This would be a good alternative, as it would follow the “add this index to perform that operation” strategy that is already followed for other functionalities.

My proposal, at this point is the following:

Deleting a given predicate by its value

Given a predicate name (e.g. parent) we allow deleting by value if and only if the predicate has a reverse predicate.

{
  delete {
    * <parent> <0x1> .
  }
}

An alternate syntax to this operation, which currently fails silently:

{
  delete {
    <0x1> <~parent> * .
  }
}

This operation should be quite performant since we already have an index.

Deleting all predicates by their value

The second question is what happens when we also have a wildcard (*) for the predicate name.

{
  delete {
    * * <0x1> .
  }
}

I’d argue we could also support this operation by finding the predicates in the same way we’ve done so far, but failing if any of those predicates doesn’t have a reverse index.

Conclusion

I think this change adds support for a feature that users are requesting in a natural way.
We already use delete to delete nodes, we can make it work efficiently as long as we require reverse indices.

That said, I’d like to have @manishrjain’s view on the issue.

pawanrawal commented :

Had a discussion with @manishrjain about this.

I’d like to point out that the reason a normal detached delete isn’t supported right now is that given how data is stored internally in Dgraph, to execute a detached delete

  1. We would need to do a full database scan to find all <subject, predicate> combinations which contain an object (uid) and rewrite data on disk. Hence, this operation won’t scale very well as the size of the database grows.
  2. We’d also have to somehow make sure that all the count indexes are still up to date and have the correct data which is another expensive operation.

A way to get around the problem of ghost nodes while querying would be to use the GraphQL layer of Dgraph (https://graphql.dgraph.io) which ensures that the nodes being returned have the type specified in the schema.


Now coming to * <parent> <0x1> . and * * <0x1> . type of deletions. The way to do the * <parent> <0x1> deletion is what @MichelDiz has already pointed out, that is having a reverse index on the predicates and doing a delete upsert operation.

upsert {
  query {
     v as var(func: eq(name, "Bart")) {
      p as ~parent_to
    }
  }

  mutation {
    delete {
      # delete the entity
      uid(v) * * .
      # Also delete node corresponding to Bart from other parent_to edges.
      uid(p) <parent_to> uid(v) . 
    }
  }
}

What @campoy is suggesting is syntactic sugar around the same functionality which already exists, so I am not sure if we should be doing it now.

Regarding the * * <0x1> delete operation.

I’d argue we could also support this operation by finding the predicates in the same way we’ve done so far, but failing if any of those predicates doesn’t have a reverse index.

We would need to execute this operation for all predicates (or all predicates with a reverse index) in the database to support this as we don’t know what predicates might have a particular object (say 0x1). I’d say its better if the user gives us the predicates using an upsert delete operation.