Counting minutes watched use case?

Hey there,

By the way, I attended GopherCon SG and was impressed by Manish’s talk: https://engineers.sg/video/dgraph-a-distributed-graph-database-written-in-go-gophercon-sg-2017--1756

Lately I’ve been turning my attention to helping our ailing data pipeline at work. To be honest, I’m a “data science” newbie, and given the choice I usually store things in a flat file.

The particular thing that bothers me at work is how we grab data from a couple of different sources when a “progress” event comes in (use case: counting minutes watched) and then dump it all to S3. This results in very large, heavily duplicated files because, as I understand it, we aren’t doing joins at the reporting stage (Elasticsearch); they’re either too expensive or the tools simply can’t do them.

Currently the pipeline looks like this: a POST to an API endpoint; the endpoint joins user and video details onto the event object; the JSON is dumped to Firehose, which loads the data into Elasticsearch and also dumps it to S3/Redshift. In reality it’s a bit more complicated, because we use DynamoDB to buffer the flow (like Firehose) and do the joins at that stage via a Lambda function (you can’t query Firehose, as I understand it, so we have to use DynamoDB) rather than stressing the API too much.
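To make that concrete, here’s a rough sketch of the enrichment step (the field names, lookup tables, and event shape are made up for illustration, not our actual schema; in the real pipeline the lookups would be DynamoDB/API calls):

```python
import json

# Hypothetical lookup tables standing in for the user/video sources.
USERS = {7: {"country": "SG", "device": "ios", "balance": 12.5}}
VIDEOS = {42: {"title": "Intro to Dgraph", "owner": "manish", "price": 3.99}}

def enrich_progress_event(event):
    """Join user and video details onto a raw 'progress' event.

    This denormalisation is what makes the S3 dumps large and
    duplicated: every event carries full copies of both records.
    """
    enriched = dict(event)
    enriched["user"] = USERS[event["user_id"]]
    enriched["video"] = VIDEOS[event["video_id"]]
    return enriched

raw = {"user_id": 7, "video_id": 42, "minutes": 3}
print(json.dumps(enrich_progress_event(raw)))
```

Every event that reaches S3 repeats the full user and video records, which is exactly the duplication problem.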

So after looking at Dgraph again, with its fast joins, perhaps it’s a solution to my problem? You tell me. :laughing:

There are lots of potential obstacles I fear in adopting any new system. For example, our video and user table entries can change: a video may have different pricing from one day to the next, and a user’s balance changes. So we need snapshots, which means maybe there is no escaping the way we do it currently.
Last but not least, I don’t think there is an easy way to put all the data into a Dgraph data store. We have several tables, and it feels like a rather complex migration if it had to happen in one go.

Ultimately we prioritise queries like: minutes watched per:

  • video.title
  • video.owner
  • user.device
  • user.country

And the other way around, e.g. which videos a given user watched, and so forth.
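As a toy illustration of those rollups (made-up field names; this just shows the shape of the aggregation over enriched events, not how we’d actually run it — in Dgraph it would presumably be a single query with a sum aggregation):

```python
from collections import defaultdict

def minutes_per(events, key):
    """Aggregate minutes watched, grouped by a dotted key
    such as 'video.title' or 'user.country'."""
    obj, field = key.split(".")
    totals = defaultdict(float)
    for e in events:
        totals[e[obj][field]] += e["minutes"]
    return dict(totals)

events = [
    {"minutes": 3, "video": {"title": "A", "owner": "x"}, "user": {"country": "SG", "device": "ios"}},
    {"minutes": 5, "video": {"title": "A", "owner": "x"}, "user": {"country": "MY", "device": "web"}},
    {"minutes": 2, "video": {"title": "B", "owner": "y"}, "user": {"country": "SG", "device": "ios"}},
]

print(minutes_per(events, "video.title"))   # {'A': 8.0, 'B': 2.0}
print(minutes_per(events, "user.country"))  # {'SG': 5.0, 'MY': 5.0}
```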

Anyway, your thoughts would be appreciated! Kind regards from Joo Chiat,

Hey @hendry,

Thanks for the explanation. I get a rough sense of what you’re trying to do, but it would be good to start with the sort of data you have and the sort of queries you want to run. How it would fit into your pipeline is hard for me to say.

What videos the user watched – sure, that’s something Dgraph can easily do. It can also aggregate and tell you how many minutes the user watched these videos, and so on. The pricing could change, and that should be fine, because Dgraph handles write-heavy workloads easily.

You could start by duplicating this data into Dgraph and moving your queries over. Once all the queries are moved, you can turn off the older tables. That’s one way to do it.

Thanks for the reply!

> The pricing could change, and that should be fine, because Dgraph handles write-heavy workloads easily.

How does this work? Would Dgraph efficiently save the object referencing a linked object at a particular point-in-time snapshot?

> You could start by duplicating this data into Dgraph and moving your queries over.

Where do I start to try loading this data from MySQL? Google queries for ‘dgraph mysql import’ came up empty!

Ideally Dgraph would connect to a read replica and listen for changes to the video/user tables to stay accurate, whilst ingesting a stream of progress events.

I can imagine doing our reporting pipeline in Dgraph to some extent; I’m taking a leap of faith that it can get data out to some reporting tool like Tableau for the business folks. However, I can’t imagine ever switching away from the Rails backend. :grimacing:

Thanks again!

If you can convert these changes to mutations, Dgraph would execute them.
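For example, a rough sketch of such a conversion (the change-record shape, the `xid` external-id predicate, and the predicate names are just placeholders, not an existing tool):

```python
import json

def change_to_mutation(change):
    """Turn a row-level change from MySQL (e.g. from a binlog reader)
    into a Dgraph set-JSON payload keyed by an external id."""
    xid = f"{change['table']}/{change['id']}"
    node = {"xid": xid}
    for column, value in change["values"].items():
        # column -> predicate, namespaced by table
        node[f"{change['table']}.{column}"] = value
    return {"set": [node]}

change = {"table": "video", "id": 42, "values": {"price": 4.99}}
print(json.dumps(change_to_mutation(change)))
# {"set": [{"xid": "video/42", "video.price": 4.99}]}
```

In practice you’d upsert on the external id so repeated changes update the same node rather than creating new ones.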

That’s something we’ll have to build. But it really shouldn’t be too hard: convert each column to a predicate, each row id to a blank node, and the cell values to values.
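A sketch of that recipe (table and column names are placeholders; a real import would batch rows and link foreign keys as edges):

```python
def row_to_nquads(table, row, id_column="id"):
    """Convert one MySQL row into Dgraph N-Quads:
    the row id becomes a blank node, each column a predicate."""
    subject = f"_:{table}{row[id_column]}"
    quads = []
    for column, value in row.items():
        if column == id_column:
            continue  # the id is encoded in the blank node itself
        quads.append(f'{subject} <{table}.{column}> "{value}" .')
    return quads

row = {"id": 42, "title": "Intro to Dgraph", "owner": "manish"}
for quad in row_to_nquads("video", row):
    print(quad)
# _:video42 <video.title> "Intro to Dgraph" .
# _:video42 <video.owner> "manish" .
```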

Hmm… interesting. I think this is specific to your use case; it’s something you’ll have to build.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.