Btw, I attended GopherCon SG & was impressed by Manish’s talk: https://engineers.sg/video/dgraph-a-distributed-graph-database-written-in-go-gophercon-sg-2017--1756
Lately I have been turning my attention to helping with our ailing data pipeline at work. Tbh I am a “data science” newbie & given the choice I usually just store things in a flat file.
The particular thing that bothers me at work is how, when a “progress” event comes in (use case: counting minutes watched), we grab data from a couple of different sources and then dump it all to S3. This results in very large files full of duplicated data because, as I understand it, we don’t do joins at the reporting stage (Elasticsearch): they are either too expensive or the tools simply can’t do them.
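To make the duplication concrete, here is roughly the shape of the fat event we end up storing (field names made up for illustration, not our real schema): every single progress event carries a full copy of the user and video details.

```go
package pipeline

import "time"

// ProgressEvent is a sketch of our denormalized event. Because we join at
// ingest time, the User and Video details below are copied into every
// event, which is what bloats the files on S3.
type ProgressEvent struct {
	EventID   string    `json:"event_id"`
	Timestamp time.Time `json:"timestamp"`
	Minutes   int       `json:"minutes"`

	User  User  `json:"user"`  // duplicated on every event
	Video Video `json:"video"` // duplicated on every event
}

type User struct {
	ID      string  `json:"id"`
	Email   string  `json:"email"`
	Balance float64 `json:"balance"`
}

type Video struct {
	ID    string  `json:"id"`
	Title string  `json:"title"`
	Price float64 `json:"price"`
}
```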
Currently the pipeline looks like this: a POST hits an API endpoint, the endpoint joins the user and video details onto the event object, and the resulting JSON is dumped to Firehose, which loads it into Elasticsearch and also dumps it to S3 / Redshift. In reality it’s a bit more complicated: to avoid stressing the API too much, we use DynamoDB to buffer the flow (much like Firehose) and do the joins at that stage via a Lambda function (iiuc you can’t query Firehose, hence DynamoDB).
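If it helps to see it, something like this is what the join step boils down to — a minimal sketch assuming the Lambda is triggered off the DynamoDB stream; the stream field names, the delivery stream name, and lookupUser / lookupVideo are all placeholders for our real setup:

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/firehose"
)

var fh = firehose.New(session.Must(session.NewSession()))

// lookupUser / lookupVideo stand in for our real table reads.
func lookupUser(id string) map[string]string  { return map[string]string{"id": id} }
func lookupVideo(id string) map[string]string { return map[string]string{"id": id} }

// handler drains buffered progress events from the DynamoDB stream, joins
// in the user and video details, and forwards the fat JSON record to
// Firehose (which fans out to Elasticsearch, S3 and Redshift).
func handler(ctx context.Context, e events.DynamoDBEvent) error {
	for _, rec := range e.Records {
		enriched := map[string]interface{}{
			"minutes": rec.Change.NewImage["minutes"].Number(),
			"user":    lookupUser(rec.Change.NewImage["user_id"].String()),
			"video":   lookupVideo(rec.Change.NewImage["video_id"].String()),
		}
		data, err := json.Marshal(enriched)
		if err != nil {
			return err
		}
		if _, err := fh.PutRecordWithContext(ctx, &firehose.PutRecordInput{
			DeliveryStreamName: aws.String("progress-events"), // placeholder
			Record:             &firehose.Record{Data: data},
		}); err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```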
So after looking at Dgraph again, with its promise of fast joins, I wonder: could it be a solution to my problem? You tell me.
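To show what I mean by a query-time join, here is a sketch with the dgo client (predicate names made up): the event node would only hold edges to the user and video nodes, and Dgraph would traverse them at read time instead of us copying the rows at ingest.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// The "join" happens here, at query time, by walking the edges.
	const q = `{
	  events(func: has(minutes)) {
	    minutes
	    user { email }
	    video { title price }
	  }
	}`
	resp, err := dg.NewReadOnlyTxn().Query(context.Background(), q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}
```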
That said, there are lots of potential obstacles I fear with any new system. For example, our video and user table entries can change: a video may be priced differently from one day to the next, and a user’s balance changes constantly. So we need snapshots of those values at event time, and maybe there is no escaping the way we do it currently.
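Then again, if I understand facets correctly, one way around the mutable-row problem might be to freeze the volatile values onto the watch edge itself at write time (made-up predicate names again), so the graph keeps the price and balance as they were when the event happened:

```go
// snapshotWatch records a watch event, pinning the volatile values as
// facets on the edge so later changes to the video or user nodes don't
// rewrite history. dg is the client from the previous sketch.
func snapshotWatch(ctx context.Context, dg *dgo.Dgraph) error {
	mu := &api.Mutation{
		CommitNow: true,
		SetNquads: []byte(`
			_:u <email> "viewer@example.com" .
			_:v <title> "Some Video" .
			_:u <watched> _:v (minutes=12, price_at_watch=4.99, balance_at_watch=20.00) .
		`),
	}
	_, err := dg.NewTxn().Mutate(ctx, mu)
	return err
}
```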
Last but not least, I don’t think there is an easy way to put all our data into a Dgraph data store. We have several tables, and it feels like a rather complex migration if it had to happen in one go.
Ultimately we prioritise queries like minutes watched per user or per video, and the other way around: which videos a given user watched, and so forth.
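For what it’s worth, those aggregations look like they map onto Dgraph’s value variables. A sketch of “total minutes watched by one user”, assuming the minutes facet from the mutation above (and eq on email would need an index in the schema, but this is just a sketch):

```go
// Collect the minutes facet into a value variable, then sum it at the
// root. With @reverse on <watched>, querying ~watched would answer the
// opposite direction: "who watched this video".
const totalMinutes = `{
  var(func: eq(email, "viewer@example.com")) {
    watched @facets(m as minutes)
  }

  totals() {
    total_minutes: sum(val(m))
  }
}`
```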
Anyway, your thoughts would be appreciated! Kind regards from Joo Chiat,