To answer the question: I don’t think you need to look for “lonely” nodes, because as far as I can tell a node without any predicates isn’t stored on disk at all. Here’s why:
According to everything I’ve been able to find, no single query returns a list of all UIDs. Likewise, you can’t directly query for the set of nodes that lack some predicate: a not has() condition only works inside a filter, not as the root function of a query. There is a workaround, however.
If we query the Dgraph cluster for UIDs that we don’t think exist, we’ll notice that it always returns a node with at least a UID, whether or not any other data is attached to it. From this we can deduce that Dgraph probably only stores nodes that have predicates on disk. A list of “all UIDs” would therefore literally be a list of every possible UID.
The question then becomes, which UIDs are actually linked with predicates?
To find this out, we can ask Dgraph for our schema, which lists all of the predicates in use. For each of those predicates, we can then query for the UIDs of every node that has it, using has(). This might take a while, depending on how many different predicates you have stored. Once you have the UID list from each query, you’ll need to combine and deduplicate them. Congratulations - this is your master UID list.
Note that the above probably won’t work well on a production database that is in live use, since nodes can be written between one query and the next. (Unless nobody writes to it while you’re doing this.)
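As a rough sketch of the "one has() query per predicate" step, the snippet below builds a single query string from a list of predicate names. The function name `build_has_query` and the generated block names (`p0`, `p1`, ...) are made up for illustration, and it assumes the predicate names are simple identifiers that can be used in has() as-is. It only builds the query text; sending it to the cluster is left to whichever client you use.

```python
def build_has_query(predicates):
    """Build one query containing a has() block for each predicate."""
    blocks = []
    for i, pred in enumerate(predicates):
        # Block names are generated (p0, p1, ...) rather than derived
        # from the predicate, so they are always valid block names.
        blocks.append(f"  p{i}(func: has({pred})) {{\n    uid\n  }}")
    return "{\n" + "\n".join(blocks) + "\n}"


# Predicate names taken from the example schema; in practice they
# would come from the schema that Dgraph reports.
query = build_has_query(["templar", "knight", "horse"])
print(query)
```

Putting every predicate into one query keeps it to a single round trip, but with a very large schema you may prefer to send the blocks in batches.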
Using your master UID list, you can do further data profiling by querying for all nodes with a certain type and comparing the result to your master list to find all nodes without a type. You can also construct a set of queries that collectively return all of your validly structured nodes - comparing that with the master list will yield a set of nodes which might have been added by buggy code, which can help you find the bugs. Your imagination is your friend here!
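The comparisons described here boil down to set differences. Here is a minimal sketch, assuming you have already collected three UID lists yourself: the master list, the UIDs of nodes that have a type, and the UIDs returned by your "validly structured" queries. The function name `profile_uids` and the sample UIDs are hypothetical.

```python
def profile_uids(master, typed, valid):
    """Compare the master UID list against typed and valid UID lists."""
    master, typed, valid = set(master), set(typed), set(valid)
    return {
        # In the master list but not returned by any type query.
        "untyped": master - typed,
        # In the master list but not matched by any "valid" query.
        "suspect": master - valid,
    }


# Made-up sample data, purely for illustration.
report = profile_uids(
    master=["0x1", "0x2", "0x3", "0x4"],
    typed=["0x1", "0x2", "0x3"],
    valid=["0x1", "0x2"],
)
print(sorted(report["untyped"]))
print(sorted(report["suspect"]))
```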
Have fun!
Example Master UID list:
# For the following schema:
<templar>: string .
<knight>: uid .
<horse>: float .
# etc.
# Query the following:
{
  hasTemplar(func: has(templar)) {
    uid
  }
  hasKnight(func: has(knight)) {
    uid
  }
  hasHorse(func: has(horse)) {
    uid
  }
}
# Yielding:
{
  "data": {
    "hasTemplar": [
      {
        "uid": "0x50"
      },
      {
        "uid": "0x53"
      },
      " ... "
    ],
    "hasKnight": [
      {
        "uid": "0x57"
      },
      {
        "uid": "0x6e"
      },
      " ... "
    ],
    "hasHorse": [
      {
        "uid": "0x6e"
      },
      {
        "uid": "0x97"
      },
      " ... "
    ]
  }
}
# Then combine and deduplicate :)
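One possible way to do the combine-and-deduplicate step on the client side, sketched in Python. It assumes the response has already been parsed into a dict with the shape shown above; the function name `master_uid_list` is made up, and the sample data contains only the UIDs visible in the example (the elided entries are left out).

```python
def master_uid_list(response):
    """Union the UIDs from every query block in a parsed response."""
    uids = set()
    for nodes in response["data"].values():
        for node in nodes:
            uids.add(node["uid"])
    # UIDs are hex strings such as "0x6e", so sort them numerically.
    return sorted(uids, key=lambda uid: int(uid, 16))


response = {
    "data": {
        "hasTemplar": [{"uid": "0x50"}, {"uid": "0x53"}],
        "hasKnight": [{"uid": "0x57"}, {"uid": "0x6e"}],
        "hasHorse": [{"uid": "0x6e"}, {"uid": "0x97"}],
    }
}

master = master_uid_list(response)
print(master)  # ['0x50', '0x53', '0x57', '0x6e', '0x97']
```

Note that 0x6e appears under both hasKnight and hasHorse but only once in the result, which is the whole point of the deduplication.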
For a small dataset this will work fine. If you have a huge dataset, you probably have more engineers to help you figure out what needs to be done and how.