Support for Set Operations (useful for implementing security filtering)

We are looking with great interest at the evolution of DGraph as it seems to be a very natural fit to our requirements. It looks like a very solid new entry in the Graph Database space.

Since we can rather easily change databases for our read-models we have no trouble testing out new database implementations. But found one limitation that limits going further.

In our system our security system relies in part on integer value set intersections. Usually we can implement that with features available in the database or using a user-defined-function.

All our data is organized in a tree (though documents in the tree can reference arbitrary other documents). Each document has an array of locks e.g [4,8,9] and also a set of inherited locks from ‘parent’ documents [ [1,8] , [9,72] ]. When a user performs a query we basically retrieve his keys [1,9] and all results are filtered against these by intersecting each existing array of locks with the keys. A non-empty intersection results in inclusion in the result-set.

The actual operation is very cheap. Since the arrays are usually not very big calculating intersections of these integers is easy.

[ [1,8], [9,72], [4,8,9] ] INTERSECT EACH [1,9] => [ [1], [9], [9] ] => PASS
[ [1,8], [9,72], [4,8,9] ] INTERSECT EACH [9] => [ [], [9], [9] ] => FAIL

I am unsure how we could model this requirement given the existing filtering abilities without resorting to client side filtering again.

Optimally I would like to store the locks and inherited locks in one string object (or typed as a multidimensional integer array) and have a filter similar to a geo filter (the exact syntax is not relevant here and should be given some more thought of course).

{
  debug(intersect("all_locks", "{'test':'all_non_empty', 'test': [1,9]}" ) ) {
    type.object.name.en: name
  }
}

Simple set operations like these would offer a lot of value in a number of scenarios we have encountered of the years.

Does anybody know whether this could be implemented or can somebody offer an alternate in the current state of DGraph ?

Hi @matthiasg,

This seems a bit out of scope w.r.t. graph database functionality. More in-line with Redis though.

This is the basic scheme which comes to mind.

  • Doc -lock-> “lockA”
  • Doc -parent-> ID -lock-> “lockB”

In terms of values for lockA and lockB, you’d want them to be individual values, instead of arrays; to allow idempotency of operations. You would then set indexing on these predicates so you can run anyof functions.

Assuming you store arrays, this is one (non-working) solution.

Solution 1

docA -lock-> “4,8,9”
docA -lock-> “1,8”
docA -lock-> “9,72”

Note that the above data model currently won’t work, because we only support one string value for a predicate at the moment.

{
doc(anyof(lock, “1,9”)) { lock } // returns docA and the 3 lock values.
doc(anyof(lock, “9”)) { lock } // returns docA and the 3 lock values.
}

I think if you were looking to find any document which had a lock on [1, 9] or [9], this system would work nicely. But, you’d need to do a client side operation to verify if all the values of lock returned by the query contain that value.

I can’t think of any other solution which would work. And I’m not clear if this fits with the requirements of a graph database. Feel free to file an issue, and if enough people need this, we can figure out how to best support such a use case.

Thanks for the feedback. I will file it then, since i believe it is quite similar to geo filtering which also basically filters based on highly specific scalar value interpretation. It is just in the aforementioned case we do integer intersections (which is easier than geo distance matching). Of course if you do advanced geo indexing the actual implementation might differ.

Note I am not suggesting the database understand locks or keys. Just multi integer set intersections. I cannot imagine this being very difficult to add, but if it is out of scope it would put extra maintenance burden on the project of course.

Alternatively having a way to define filter functions in a script language (similar to elasticsearch or redis) would obviously also work but be more complicated.

We have been thinking about allowing more modularity, to allow plugins and custom functionality; which users can build and use. This might fit well in that.

Not sure what you mean by advanced. I think our implementation of geo covers all the popular use cases. Is there something that we’re missing there?

Not sure whether anything is missing since I don’t use geo searches much, but no I only meant that the actual source code inside of DGraph for geosearching could be implemented by converting the geo data into an indexed intermediary state for quicker retrieval instead of a standard scan filter. In my example a simple scan filter would suffice for smaller datasets.

SInce this feature is not yet supported, we are going to build a secondary read-model with pre-calculated security settings which should allow us to test the implementation,stability and ease-of-operation with minimal risks.

Btw I have yet to find a graph db with a usable security model :slight_smile: .

Keep up the work, this project looks great so far!

EDIT: Looking at dgraph/internal_utilities_geodb_geodb_impl.cc at 3269e3df2582ad06901230c50b38d8745d1d4f73 ¡ dgraph-io/dgraph ¡ GitHub I see the implementation of geodb seems to also be filter based. Once we have our test system up (I will schedule something for later this month) I will have a deeper look into this.

We actually do S2 indexing to efficiently retrieve geolocations; we don’t really scan through everything and match. That’d be too slow.

Here’s the relevant code:
For indexing: dgraph/s2index.go at master ¡ dgraph-io/dgraph ¡ GitHub
For filtering: dgraph/geofilter.go at master ¡ dgraph-io/dgraph ¡ GitHub

Happy to hear your ideas about security, and what are you guys looking for. We have that on our roadmap, but the exact timeframe is undecided.

yeah I just found that too :slight_smile: Thats what I had meant with code differences compared to a very simplistic filter

1 Like

Giving our requirements more thought we could potentially rewrite our requirement as dynamic properties.

document is_locked_with lock-set1 .
document is_locked_with lock-set2 .
document has_totalLockSets "2"@xsd:integer
lock-set1 has_lock_1 "true"@xsd:boolean .
lock-set1 has_lock_8 "true"@xsd:boolean .
lock-set1 is_empty "false"@xsd:boolean .
lock-set2 has_lock_9 "true"@xsd:boolean .
lock-set2 has_lock_72 "true"@xsd:boolean .
lock-set2 is_empty "false"@xsd:boolean .

e.g
document

  -> LOCK-SET (lock-set1)
    -> has_lock_1 = true
    -> has_lock_8 = true
     -> is_empty = false

  -> LOCK-SET (lock-set2)
    -> has_lock_9 = true
    -> has_lock_72 = true
    -> is_empty = false

  -> NUMBER-LOCKSETS = 2

We could then need to do a query like this:

Find all documents where ( document.has_totalLockSets == count of is_locked_with ( lockset is_empty OR has_lock_1==true OR has_lock_9 == true ))

This kind of query would presumably require variables inside queries which I already saw being discussed. Also I could first find all lock-sets matching the requirements and then intersect them with each document. The only requirement is to compare the number of matches with the number of expected matches. basically ALL linked lock-sets need to match a criteria

Sorry if this goes too deep but I am trying to figure our how I could make best use of the graph query features. Maybe it will help thinking about new features or somebody can come up with alternatives.

I think this setup works. We can even further simplify this and make it work with the variable syntax we’re already planning.

document is_locked_with lock-set-id .
lock-set-id has-lock "1"
lock-set-id has-lock "2"

# To represent empty, you might say:
lock-set-id has-lock "_nil_"

This would then allow you to do:
{ ..., count(has-lock) } // to count the number of locks.

Similarly, you don’t need to maintain counts for is_locked_with either. You could now do a query like this:

{
  L: anyof(has-lock, "1 9 _nil_")
  D: docs(id: L) {
    ~is-locked-with @filter(eq(count(has-lock), len(L)))
  }
  docs(id: D) {
    _uid_, other, properties
  }
}

Here note that ~ is used to represent reverse edge, which you can index. So, effectively, we first found all the locks, assigned them to L. Then from these locks we took the reverse edge to get all the documents, whose count of has-lock matched the length of L; there were put to D. Now we hold all the documents D, which we can retrieve the ids of, or other properties of.

looks good but I don’t quite follow yet.

I am confused a little about docs(id:L) and len(L). I initially read L as the set of all lock-sets that match my criteria without these being filtered against documents yet. So lets say this could be 10.

From there I need to find all documents whose associated lock-sets are included in L. But len(L) would be 10 not e.g 2 as in my example. Or is len(L) another invocation of anyof(has-lock, "1 9 _nil_") but limited to current document ?

Also: I didn’t check yet, are variables already supported ?
EDIT: I just saw its still in planning stage, but that’s fine

{
   var(anyof(has-lock, "1 9 _nil_")) {
    D: ~is_locked_with { // ref:a
      is_locked_with @filter(anyof("1 9 _nil_"))  // ref:b
    }
  }

  docs(id: D) {
    _uid_, and, some, other, properties
  }
}

This would give you all the docs which have all of the specified locks. How does this work?

  • We start with getting any lock set which has 1, 9 or nil
  • We then retrieve the documents which have these lock sets [ref:a]
  • For each of these documents, we retrieve their lock sets again (is_locked_with) [ref:b]
  • We then filter these lock sets to check if they contain any of 1, 9 or nil locks.
  • There is where it gets interesting: All the branches from each of the docs must return at least one result for the parent to be selected in the result set.
  • So, each of the lock sets coming out of the documents must return at least one result.
  • Thus, we’ll only pick the documents, ALL of whose lock sets contain at least one of 1, 9 or nil.

I think this should give you what you want.

Just my 2 cents follow (feel free to ignore the following):

Having said that, I feel such things are better taken care of by custom plugins. The main role of a database as I see is to retrieve the smallest set of data that must be used to tackle your query. Any sort of advanced filtering or manipulation should be done outside of the core.

For e.g., a query like this:

{
  locks(anyof(has-lock, "1 9 _nil_")) {
    ~is_locked_with {
      is_locked_with
    }
  }
}

Should give you all the lock sets, their documents, and their corresponding locks sets. You can then easily then write advanced filtering on this in Go language (assuming it’s a DB plugin); or in any language of your choosing (asssuming it’s done at client). I think the overall data set that you’ll get from this query is pretty small; so the client is best suited to do filtering here.

Yeah, that’s key here. Thanks for your thoughts. I will go ahead and integrate DGraph without locking as an alternate read-model for now and then upgrade once variable support is completed.

As for plugins I would quite agree that a DB filter plugin (in-process) would definitely fill that role. That is why we used user-defined functions for classical SQL and LUA in Redis and Groovy for Elasticsearch. In more modern SQL databases array set operations are directly possible though. In testing we found out though that the document sets returned might not be that small relative to the size of the machine. But the performance was more than sufficient nonetheless in all cases for now.

Is writing DB plugins already documented ? I am unsure as to the current state of documentation since I found a few ‘dead’ links (like to unrelated Gru which isn’t open source anymore) and could not find DB plugins yet.

PS: Thanks for all the time discussing my use case. I really appreciate it.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.