Creating a constantly updated Read Replica with dgraph

gregberns · May 2, 2020, 3:45pm

I heard about dgraph from the Changelog podcast a couple years ago and it came to mind now that we’re looking at DBs. We’d like to create a read replica off of our legacy database which you could think of as a document database. I’ve read through the Mutations article and the JSON importer looks useful. I think I need the “External IDs” but the examples aren’t making a lot of sense to me.

I’m struggling to understand how we model and update a row of data, especially since we dont know the internal (dgraph) ID of each row. We’re going to start POC’s next week comparing Mongo, Postgres, and dgraph and any help you could provide to figure out how to do both the initial load and “streaming updates” would be helpful.

Here’s a super simple example to start with. The first data set has two tables with a primary key and then theres a join between the two tables. The second data set has updates and an insert. I think my main question is:

How can I format the JSON so that when the second data set is added, Account 1 and 2 get updated, and Account 3 gets added?

Initial data set

Account CSV file #1 (Initial insert)
ID,Balance,Transactions
1,100,[1,3]
2,250,[2,4]

Transactions CSV file #1 (Initial insert)
ID,AccountId,Amount
1,1,75
2,2,150
3,1,75
4,2,150

Second Dataset

Account CSV file #2 (Update)
ID,Balance,Transactions
1,100,[1,3,5]
2,250,[2,4,6]
3,50,[7]

Transactions CSV file #2 (Update)
ID,AccountId,Amount
5,1,-50
6,2,-50
7,3,50

gregberns · May 3, 2020, 10:15pm

Spent a good chunk of time digging yesterday and got an upsert working for part of the problem. Since most of the time this thing runs its going to be updating (not initializing), so that should be good enough.

Here’s what I came up with to up date the Account node and properties tied directly to it.

upsert {
  query {
    var(func: eq(<account.Id>, "1")) {
      Account as uid
    }
  }
  mutation {
      set {
        uid(Account) <account.Id> "1" .
        uid(Account) <account.Balance> "150" .
        uid(Account) <dgraph.type> "Account" .
      }
  }
}

But! I can’t figure out how to correctly model the edges between Accounts and Transactions. Two problems I see:

All of the updates are going to be coming row by row, so when an account is created, it may have 20 transaction ID’s but those transactions may not have been imported yet. I assume an edge and transaction node can be created when the account is created, but then it will just have incomplete data???
When the upsert for an account is done and it has lots of outgoing edges do all of those edges need to be looked up and verified every time? Meaning if there are 20 transaction egdes to an account, does the upsert query need 20 queries for all the edges and then have 20 edge mutations? I guess the queries would occur in parallel so it could be fast, just seems like as the edges grow the upsert time increases.

I did a very basic test of the upsert defined above using NodeJS and the upsert time was around 40ms. I assume I’ll be able to be throwing 10’s of upserts in parallel at the system without any problems.

Any help or comments would be appreciated.

ppp225 · May 4, 2020, 12:45am

Hi,

I would model your data like this: Account <-- Transaction

Schema:

type Account {
  account.Id
  account.Balance
}

type Transaction {
  transaction.Id
  transaction.Amount
  transaction.Account
}

transaction.Account: uid @reverse .

then I wouldn’t need to care about transaction upserts from the account perspective, as it would be solved through the revese edge.

Upserting Account just like in your post above.
Upserting Transaction with 2 query blocks, one for the Account of the transaction, the other one for the Transaction itself:

upsert {
  query {
    Account as var(func: eq(<account.Id>, "1"))
    Transaction as var(func: eq(<transaction.Id>, "1"))
  }
  mutation {
      set {
        uid(Transaction) <transaction.Id> "1" .
        uid(Transaction) <transaction.Amount> "75" .
        uid(Transaction) <dgraph.type> "Transaction" .
        uid(Transaction) <transaction.Account> uid(Account) .
	# if an Account can not exist, when adding the transaction, add:
	uid(Account) <account.Id> "1" .
        uid(Account) <dgraph.type> "Account" .
      }
  }
}

So, answering questions:

You don’t have to add data bi-directional in a graph. It will work correctly with @reverse directive. Just 2 query blocks for each transactions are ok, as shown above.
To query Account transactions:

  query {
    Account as var(func: eq(<account.Id>, "1")) {
      account.Id
      account.Balance
      transactions: ~transaction.Account {
           transaction.Id
           transaction.Amount
      }
  }

You can just process transactions row by row with this schema.

In summary, you will not need the “Transactions” column from the Account CSV at all.

Hope that helps!

gregberns · May 4, 2020, 2:18am

Thanks so much for the detailed response, that was incredibly helpful. The legacy database had the transactions in the account cause that’s how they wanted to improve query performance but its not at all necessary and definitely makes loading more challenging.

Thanks a ton!!

system · June 3, 2020, 2:18am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Updating Data Dgraph Clients untagged , dgraph-js	7	465	July 11, 2020
Anyone have example of an upsert using python client and js object format? Dgraph Cloud kind:question	23	1749	February 11, 2021
How to update a large amount of data in dgraph every day Dgraph mutation	23	3627	August 10, 2020
Complex Bulk Upsert Howto Dgraph graphql	3	1099	January 5, 2021
Documentation feedback Dgraph	4	634	July 7, 2018

Creating a constantly updated Read Replica with dgraph

Related topics