Data Cleaning Recipes - Call for contributions

GraphQL and DQL interoperability is certainly a great feature of Dgraph: you can expose a strongly typed API to populate and query your data with GraphQL, and you can still access and update the entire graph using DQL to add metadata, create new relationships based on graph analysis (proximity, similarity, rating, …), detect patterns, etc.

We can use DQL to do some data cleaning when needed. For example, when deploying a new version of a GraphQL schema you can leave behind data created by the previous version: if you change a type attribute to mandatory while the graph already contains nodes of this type without that attribute, queries on that type will start failing.

I’d like to collect the DQL recipes the community has found for data cleaning.
The goal is to share this important knowledge and to investigate what should go in the documentation, in the product itself, or in a dedicated data cleaning tool to help with schema migration.

I’m interested in:

  • the situation you encountered
  • the queries ‘detecting’ problematic situations
  • the upsert queries ‘mitigating’ those situations.

Contribute by simply replying to this post.
I’ll compile the recipes in a blog.

Thanks.

Starting with a simple case.

use case

I set name as mandatory in the Location type.
Doing a queryLocation I get:

message": "Non-nullable field 'name' (type String!) was not present in result from Dgraph. GraphQL error propagation triggered.",

because I have old data without the name.

finding issues

{
  missingField(func: type(Location)) @filter(not has(Location.name)) {
    count: count(uid)
  }
}

note: the filter can be extended to check other mandatory fields, as shown below.
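
For instance, assuming Location also had a second mandatory field such as address (hypothetical here), the same query could check both at once:

{
  missingField(func: type(Location)) @filter(NOT has(Location.name) OR NOT has(Location.address)) {
    count: count(uid)
  }
}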

idea: write a Python or Go program that does this automatically from the schema, generating queries for all types and all mandatory fields and reporting the results.

resolving the issue

The mitigation could be to set a default name or to delete the faulty nodes (a sketch of the delete option follows below). Let’s set a default value.

upsert {
  query {
    missingField as var(func: type(Location)) @filter(not  has(Location.name)) 
  }

  mutation {
    set {
      uid(missingField) <Location.name> "default name" .
    }
  }
}
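
For completeness, the other mitigation, deleting the faulty nodes, could look roughly like this. Note that it only removes the nodes and their outgoing edges; edges pointing to them from other nodes would still need to be cleaned up separately.

upsert {
  query {
    missingField as var(func: type(Location)) @filter(not has(Location.name))
  }

  mutation {
    delete {
      uid(missingField) * * .
    }
  }
}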

In order for them to be MORE interoperable (as they are really completely different things), we need a few core features… in this order:

  1. @reverse needs to work more like @hasInverse - this is really what creates the phantom node problem and the biggest headache for mutations. If DQL did this automatically, it would be amazing.
  2. Get rid of RDF and only allow JSON mutations. This is controversial, but I am personally pro this option. We’d need to make sure exporting and importing are in JSON format as well. This makes implementing #1 easier.
  3. Unique keys at the database level.

As far as the docs are concerned, I think they would benefit from two tabs for all mutation code, one for JSON and one for RDF, as this is definitely not clear to new users, and it’s not easy to explain how they work.

You could also add a doc page that explains how the two store data differently. User.likedPosts is not the same thing as User and likedPosts, for example.
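
To illustrate, here is the same mutation in both formats, reusing the Location type from earlier in this thread with a hypothetical value. First as RDF:

{
  set {
    _:loc <dgraph.type> "Location" .
    _:loc <Location.name> "Berlin" .
  }
}

and as the equivalent JSON mutation:

{
  "set": {
    "dgraph.type": "Location",
    "Location.name": "Berlin"
  }
}

Both create the same data; a doc page could use exactly this kind of side-by-side example to show that the GraphQL layer stores a field under the Type.field predicate name.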

I could definitely add way more, but just preliminary thoughts…

J

Thanks. Those valid points will go under GraphQL DQL interoperability evolution.
You’ll notice that we have updated the DQL quick start and illustrated it with JSON. I also found it difficult for new users to have the quick start showing RDF mutations while we are using JSON output. So we definitely want to go JSON first and introduce RDF where it really helps the user: in some export / import cases, RDF makes more sense (at least to me).
Tabs for JSON and RDF in the docs are a good idea. We will try it where it makes sense.

That said, I’d like to refocus this thread on DQL recipes for data cleaning.


So I’m not sure exactly what you mean here, but a few examples come to mind:

Rename a node:
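A minimal upsert sketch, assuming the Location type from earlier, hypothetical old/new values, and an index on Location.name that supports eq:

upsert {
  query {
    node as var(func: eq(Location.name, "Old name"))
  }

  mutation {
    set {
      uid(node) <Location.name> "New name" .
    }
  }
}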

Delete a node (with cascade delete / detach):
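A sketch with the same assumptions as the rename above, plus a hypothetical Experience type whose Experience.location edge points to Location nodes. The first delete line drops the node and all of its outgoing edges, the second detaches it from its parents (uid_in with a uid variable requires a recent Dgraph version):

upsert {
  query {
    node as var(func: eq(Location.name, "Obsolete location"))
    # detach: find the nodes that still point to it
    parents as var(func: type(Experience)) @filter(uid_in(Experience.location, uid(node)))
  }

  mutation {
    delete {
      uid(node) * * .
      uid(parents) <Experience.location> uid(node) .
    }
  }
}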

Copy a node:
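Value variables are awkward when creating brand-new nodes, so here is a two-step sketch with hypothetical names. First read the node to duplicate:

{
  original(func: eq(Location.name, "Original")) {
    dgraph.type
    expand(_all_)
  }
}

then re-insert the returned values as a JSON mutation, omitting the uid so Dgraph assigns a new one:

{
  "set": {
    "dgraph.type": "Location",
    "Location.name": "Original (copy)"
  }
}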

Delete orphan nodes:
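A sketch where “orphan” means a Location that is no longer referenced through a hypothetical Experience.location edge:

upsert {
  query {
    var(func: type(Experience)) {
      referenced as Experience.location
    }
    orphans as var(func: type(Location)) @filter(NOT uid(referenced))
  }

  mutation {
    delete {
      uid(orphans) * * .
    }
  }
}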

Not sure if these help,

J


Thanks @jdgamble555, that is helping. The Migrating Data reference gives good mitigation options when you already know that you have a data migration to do and what should be done. I’d also like to compile queries that identify issues: say someone did a GraphQL schema update without paying attention to the data migration implications and now probably has some odd data in the graph.
The goal is to evaluate a product feature (UI or external tool) that analyzes the graph data knowing the current GraphQL schema, provides a list of potential problems, and offers options to correct them.