GraphQL and DQL interoperability is certainly a great feature of Dgraph: you can expose a strongly typed API to populate and query your data with GraphQL, and you can still access and update the entire graph using DQL to add metadata, create new relationships based on graph analysis (proximity, similarity, rating, …), detect patterns, etc.
We can use DQL to do some data cleaning when needed. For example, when deploying a new version of a GraphQL schema, you can leave behind data created by the previous version: if you change a type attribute to mandatory, you may already have nodes of this type in the graph without this attribute.
I’d like to collect the DQL recipes the community has found for data cleaning.
The goal is to share this important knowledge and to investigate what should be in the documentation, what should be in the product itself, and what should be offered as a data-cleaning tool to help with schema migration.
I’m interested in:
- the situations you have encountered,
- the queries detecting problematic situations,
- the upsert queries mitigating those situations.
Contribute by simply replying to this post.
I’ll compile the recipes in a blog.
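To make the kind of recipe concrete, here is a sketch of a detection query, assuming a hypothetical `Person` GraphQL type with a mandatory `name` field (Dgraph stores GraphQL fields as `Type.field` predicates in DQL):

```
{
  # Find all Person nodes missing the mandatory name predicate
  faulty(func: type(Person)) @filter(NOT has(Person.name)) {
    uid
    dgraph.type
  }
}
```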
Note: the filter can be extended with other mandatory fields.
Idea: create a Python or Go program that does this automatically from the schema, generating queries for all types and all mandatory fields and reporting the results.
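Here is a minimal Python sketch of that idea. It is an assumption-laden illustration, not a finished tool: it parses the GraphQL schema with a naive regex (a real implementation should use a proper GraphQL parser) and emits one DQL detection query per type with mandatory fields, relying on Dgraph's `Type.field` predicate naming.

```python
import re

def detection_queries(schema: str) -> dict:
    """Generate one DQL detection query per GraphQL type that has
    mandatory (non-nullable) fields. Naive regex-based parsing,
    for illustration only."""
    queries = {}
    # Match "type Name { ... }" blocks in the GraphQL schema
    for m in re.finditer(r"type\s+(\w+)\s*\{([^}]*)\}", schema):
        type_name, body = m.group(1), m.group(2)
        # Mandatory scalar fields end with "!"; skip the ID field
        mandatory = [f for f in re.findall(r"(\w+)\s*:\s*\w+!", body)
                     if f != "id"]
        if not mandatory:
            continue
        # GraphQL fields are stored as "Type.field" predicates in DQL
        filters = " OR ".join(f"NOT has({type_name}.{f})" for f in mandatory)
        queries[type_name] = (
            f"{{\n  faulty(func: type({type_name})) "
            f"@filter({filters}) {{\n    uid\n  }}\n}}"
        )
    return queries

schema = """
type Person {
  id: ID!
  name: String!
  nickname: String
}
"""

for type_name, query in detection_queries(schema).items():
    print(f"# {type_name}\n{query}")
```

The generated queries can then be run against the cluster and any non-empty `faulty` result reported.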
Resolving the issue
The mitigation could be to set a default name or to delete the faulty nodes. Let’s set a default value.
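A sketch of such a mitigation as a DQL upsert, again assuming the hypothetical `Person` type with a mandatory `name` field and `"unknown"` as the chosen default:

```
upsert {
  query {
    # Reuse the detection query to capture the faulty nodes
    faulty as var(func: type(Person)) @filter(NOT has(Person.name))
  }
  mutation {
    set {
      # Assign the default value to every matched node
      uid(faulty) <Person.name> "unknown" .
    }
  }
}
```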
In order for them to be MORE interoperable (as they are really completely different things), we need a few core features, in this order:
@reverse needs to work more like @hasInverse - this is really what creates the phantom-node problem and the biggest headache for mutations. If DQL did this automatically, it would be amazing.
Get rid of RDF and only allow JSON mutations. This is controversial, but I am personally in favor of it. We would need to make sure exporting and importing are in JSON format as well. This makes implementing #1 easier.
Unique keys at the database level.
As far as the docs are concerned, I think they would benefit from two tabs for all mutation code, one for JSON and one for RDF, as the difference is definitely not clear for new users and not easy to explain.
You could also add a doc page that explains how the two store data differently: `User.likedPosts` is not the same thing as `User` and `likedPosts`, for example.
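To illustrate the point, here is the same hypothetical mutation sketched in both formats, creating a `User` node with a `User.name` predicate (a blank node `_:alice` is used in both):

```
# JSON mutation
{
  "set": [
    { "uid": "_:alice", "dgraph.type": "User", "User.name": "Alice" }
  ]
}

# Equivalent RDF (N-Quad) mutation
{
  set {
    _:alice <dgraph.type> "User" .
    _:alice <User.name>   "Alice" .
  }
}
```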
I could definitely add way more, but just preliminary thoughts…
Thanks. Those valid points will go under GraphQL and DQL interoperability evolution.
You’ll notice that we have updated the DQL quick start and illustrated it with JSON. I also found it difficult for new users to have the quick start show RDF mutations while we use JSON output. So we definitely want to go JSON first, and we will introduce RDF when it really helps the user: in some export/import cases, RDF makes more sense (at least to me).
Tabs for JSON and RDF in the doc is a good idea. We will try it where it makes sense.
That said, I’d like to refocus this thread on DQL recipes for data cleaning.
Thanks @jdgamble555, that helps. The Migrating Data reference gives good mitigation options when you already know that you have a data migration to do and what should be done. I’d also like to compile queries that identify issues: say someone did a GraphQL schema update without paying attention to the data migration implications, and now probably has some odd data in the graph.
The goal is to evaluate a product feature (UI or external tool) that analyzes the graph data, given the current GraphQL schema, produces a list of potential problems, and offers options to correct them.