Indexes and Transactions

A very common use case is building multi-tenant applications, which segregate data on a per-site, per-team, or per-user basis. The problem is that indexes are always global (and, with transactions, there is now a greater performance penalty for using them globally).

For example, let’s say I’m building a Slack clone and I want each team’s messages to be separated (i.e. you will only ever see messages from your team). There is only one thread, so I build a very simple schema as follows:

team_name: string @index(term) .
team_messages: uid .
message_value: string .
user_teams: uid @reverse .

Let’s say I have the userId (I normally do, assuming the user is logged in), so to get a user’s teams I can simply run the following query:

{
	root(func: uid(<userId>)) {
		user_teams {
			team_name
			team_messages {
				message_value
			}
		}
	}
}

That’s all good, but I also want to be able to search for a message. I will only ever do this on a per-team basis, but I only have the option to add a global index. So I need to add:

message_value: string @index(fulltext) .

Now, for every single message added across the entire platform, the global index is updated (even though I will never use it at the global level) and transactions will fail if the message_value is updated with the same keys. Given that there could be a lot of messages being added across the entire platform, this could cause a real bottleneck.
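
To illustrate the concern, here is a minimal Python sketch of how a global index can make unrelated writes collide. It is a toy model, not Dgraph's actual conflict detection: the whitespace tokenizer, the `index_keys` and `conflicts` helpers, and the rule that two concurrent transactions conflict when their index key sets overlap are all simplifying assumptions.

```python
def tokenize(text):
    """Rough stand-in for a fulltext tokenizer (assumption: lowercase + whitespace split)."""
    return {word.lower() for word in text.split()}


def index_keys(predicate, text):
    """Keys a write would touch in a GLOBAL index: one key per (predicate, token)."""
    return {(predicate, token) for token in tokenize(text)}


def conflicts(keys_a, keys_b):
    """Assumed conflict rule: concurrent transactions clash if they touch a common key."""
    return bool(keys_a & keys_b)


# Two messages written concurrently by two different teams.
team1_write = index_keys("message_value", "nice message")
team2_write = index_keys("message_value", "Another nice day")

# Both writes touch the global ("message_value", "nice") key, so they clash
# even though the teams never share data.
print(conflicts(team1_write, team2_write))  # True
```

Under this model, the more common a token is across the platform, the more often writes from unrelated teams would touch the same key.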

A possible solution

If we could specify in the schema that an index applies only when the predicate is a child of the team_messages predicate (as we will only ever query with a fulltext filter from team_messages), that might help Dgraph to manage performance.

	team_messages > message_value: string  @index(fulltext) .

And then when running the query I can only use it as a filter on team_messages (but that’s the only place I need it):

{
	root(func: uid(<userId>)) {
		user_teams {
			team_name
			team_messages @filter(anyoftext(message_value, "nice message")) {
				message_value
			}
		}
	}
}

Effectively, you would be creating a separate index each time the team_messages predicate is used.
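
As a rough illustration of what "a separate index per team" could mean, here is a minimal Python sketch of an index partitioned by the parent team. `ScopedFulltextIndex`, its methods, and the team and message identifiers are hypothetical names for this example only; nothing here reflects how Dgraph stores indexes.

```python
from collections import defaultdict


class ScopedFulltextIndex:
    """Toy fulltext index partitioned by the parent (team) uid. Hypothetical."""

    def __init__(self):
        # team_uid -> token -> set of message uids
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, team_uid, message_uid, text):
        """Index a message under its own team's partition only."""
        for token in text.lower().split():
            self._postings[team_uid][token].add(message_uid)

    def any_of_text(self, team_uid, query):
        """Return messages in one team that contain any of the query tokens."""
        hits = set()
        team_postings = self._postings.get(team_uid, {})
        for token in query.lower().split():
            hits |= team_postings.get(token, set())
        return hits


index = ScopedFulltextIndex()
index.add("team1", "m1", "Human nice message")
index.add("team2", "m2", "Another nice day")

# Searching team1 never touches team2's postings.
print(index.any_of_text("team1", "nice message"))  # {'m1'}
```

Because each team writes only to its own partition, two teams adding messages with the same words would not touch the same index entries in this model.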

To add a slight variation on your potential solution… Instead of using a user-defined predicate as the index namespace (in this case, team_messages), would it work to have a standard concept of “Namespace” in Dgraph that would subdivide indexes? So your index definition would then be:

Namespace: team1 > message_value: string @index(fulltext) .

I am also trying to work out how to use Dgraph for multi-tenant applications, and need to find a foolproof security solution that provides as many reasonable guarantees as possible that tenants will never be able to access each other’s data. I wonder if this namespace concept could also be a means of subdividing the graph for access control, per “Possible timeline of implementing ACL's in Dgraph” and “Adding security to dgraph - #2 by mrjn”.

Multi-tenancy is something we are planning to build in the enterprise version of Dgraph. That would help solve some of these unique challenges when you want to ensure that each client operates independently of each other, but on a similar dataset.

I think my initial post was somewhat misleading in mentioning multi-tenant applications; it was merely one use case that is impacted (and not necessarily a call to add that functionality).

This is a performance issue that will impact almost all applications running on Dgraph at non-trivial scale. I’m already seeing a reasonably high number of aborts for a relatively low number of transactions.

In effect, we can no longer run an index against a frequently updated predicate, because it would produce an unacceptable number of aborts.

For example, imagine implementing a Twitter clone. You would obviously want to add an index on the hashtag (and possibly a fulltext index on the tweet too), but that would lead to a high number of key clashes/aborts.

Another possible way around this is to make index updates run in the background (as an option), so that although the node data will be correct if queried, the ordering/filtering may lag slightly behind?

I think calummoore’s idea amounts to making Dgraph’s logical structure look like Git, which also fits because we now have transactions. Branches for some nodes would be an interesting idea: each “Node Branch” would be virtually isolated, and branches would relate to each other through a reverse address/UID edge from node_branch to node_branch.

So when someone searches or filters on a “name” predicate for a node in the <0x2TM> branch, Dgraph would not waste time searching globally. It would first locate the branch through its address/UID and then search inside it. That way the search would resemble a “binary search”.

In practice, Dgraph would be the first graph database with branches of graphs. That would improve performance and make business logic easier to write.

Small comment:

PS. This way Dgraph could sell a “Dgraph Private Cloud” version that allows creating cluster-specific branches (“CSB”), where each CSB is unique but still communicates with the default cluster to get relations and information from root. This cluster-to-branch feature would interest anyone who wants to use, for example, more than 32GB of memory and more processing power on a very demanding branch, and/or fewer resources on other branches, to better balance and scale a service. I know a “Dgraph Private Cloud” may sound pointless because Dgraph already promises performance. But I guarantee it would sell, whether or not it makes sense.

Continuing.

user: string @index(exact) . #this edge is in root

team: Branch. > team_name: string @index(term) .
team: Branch. > user_team: root. user > uid @reverse . 
team: Branch. > team_owner: root. user > uid @reverse . 
team_messages: Branch. > message_value: string  @index(fulltext) . 
team_messages: Branch. > message_owner: root. user > uid @reverse . 

#uid @reverse edges to branches don't need to be explicit, but they need to be reverse.
team = <0x44T> #In this example UID would be used as an address for a branch.
team_messages = <0x2TM> #address for team_msgs branch.
<0x0><team_branch><0x44T> #reverse team with Root Branch - In this case you would not need to, because all branch addresses will be listed by default in root.
<0x44T><team_messages><0x2TM> #reverse team with team_messages

<0x234f> <name> "Johnny B. Goode" . 
#For reference, the mutation above would be done in the master/Root.
<0x2TM><_:newMsgUID> <message_value> "Human nice message" .
<0x2TM><_:newMsgUID> <message_owner> <0x234f> (Bot=false).
me : Root (func: uid(0x89f)) {
  #master_branch
  name
  age
  friends {
    name
  }
  Node_Branch_A (func: x(y)) {
    expand(_All_)
  }
  team (func: x(y)) {
    expand(_All_)
    user_team {
      expand(_All_) #a list of users
    }
    team_messages (func: x(y)) @filter (x(y)) AND (...) {
      expand(_All_)
    }
  }
}

PS. “Node Branches” could branch with unique IDs for each node below the node branch.

Very good point. We could add an @async flag for the indices in the schema, which would allow them to be updated once the txn has committed, so that the index updates are not part of the txn itself. That would decrease the txn conflicts, but at the cost of not being able to use upsert-like operations on that index.
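
To show the trade-off being discussed, here is a minimal Python sketch of deferred index updates. It is only a model of the idea: `AsyncIndexedStore` and its methods are hypothetical, and the background worker is simulated by an explicit `apply_pending()` call rather than a real thread.

```python
from collections import deque


class AsyncIndexedStore:
    """Toy store whose index is updated after commit, not as part of it. Hypothetical."""

    def __init__(self):
        self.data = {}          # uid -> message text (updated at commit time)
        self.index = {}         # token -> set of uids (updated later)
        self.pending = deque()  # index updates waiting to be applied

    def commit(self, uid, text):
        """Write the data immediately and queue the index update for later."""
        self.data[uid] = text
        self.pending.append((uid, text))

    def apply_pending(self):
        """Stand-in for the background worker: drain the queue into the index."""
        while self.pending:
            uid, text = self.pending.popleft()
            for token in text.lower().split():
                self.index.setdefault(token, set()).add(uid)

    def search(self, token):
        """Look up uids by token; may lag behind committed data."""
        return set(self.index.get(token.lower(), set()))


store = AsyncIndexedStore()
store.commit("m1", "Human nice message")

print(store.data["m1"])      # Human nice message
print(store.search("nice"))  # set()
store.apply_pending()
print(store.search("nice"))  # {'m1'}
```

The window in which `search` returns nothing for data that is already committed is the reason a check-then-insert (upsert-like) operation could not rely on such an index in this model.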

@janardhan would see if this can be done.

Providing an option to run index updates in the background is a good idea. Can you please file an issue for this?
