Call for Collaboration: Designing a Dgraph offline-first library

Greetings,
I want to invite ideas, wishes, dreams, and suggestions for using Dgraph offline(-first) in web applications, PWAs, and mobile apps.

I am especially curious about perspectives from the Dgraph core team on how to effectively:

  1. represent the Dgraph data structure in IndexedDB
  2. queue mutations (see the sketch after this list)
  3. deal with mutation history (somewhat of a side issue: see this other post for details)
  4. Are there plans and/or early prototypes of an offline-first lib?
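
On point 2, a minimal sketch (in TypeScript) of what queueing mutations in IndexedDB could look like. The database name, object store name, and record shape are assumptions for illustration, not an existing API:

```ts
// Sketch: persist GraphQL mutations in IndexedDB while offline so they
// can be replayed in insertion order once the connection returns.
interface QueuedMutation {
  id?: number;        // auto-increment key assigned by IndexedDB
  query: string;      // the GraphQL mutation document
  variables: unknown; // its variables
  queuedAt: number;   // Date.now() at enqueue time
}

function openQueue(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("offline-dgraph", 1);
    req.onupgradeneeded = () =>
      req.result.createObjectStore("mutations", { keyPath: "id", autoIncrement: true });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function queueMutation(m: QueuedMutation): Promise<void> {
  const db = await openQueue();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("mutations", "readwrite");
    tx.objectStore("mutations").add(m);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```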

For potential consumers / users of a dgraph-offline-first package:

  1. What are the required “must-haves”?
  2. What are “nice-to-haves”?
  3. What would be simply amazing if you dared to dream?
8 Likes

Coincidentally, I was thinking about how I would tackle this problem today. :sweat_smile:

I came to the conclusion that:

  1. At best, a lite version can be made, and keeping up with the Dgraph team might be tough.
  2. The whole point of Dgraph is for GraphQL to be a first-class citizen and not a layer on top of existing DBs; hence IndexedDB should be out of the question! (Except for persistence; maybe RDF or another serialisation can be used!)
  3. Dgraph is way faster than Java, which means JavaScript can’t even scratch that itch for speed. Hence WASM is the only way!

Since Golang can be compiled to WASM, that’s how a Dgraph Lite should be made.
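
For what it is worth, a minimal sketch of how such a Go-compiled module could be loaded in the browser, using the wasm_exec.js support file that ships with the Go toolchain (it defines the global Go class); the dgraph-lite.wasm artifact is hypothetical:

```ts
// Sketch: boot a hypothetical Go-compiled "dgraph-lite.wasm" in the browser.
// wasm_exec.js (from the Go distribution) must be loaded first.
declare const Go: new () => {
  importObject: WebAssembly.Imports;
  run(instance: WebAssembly.Instance): Promise<void>;
};

async function startDgraphLite(): Promise<void> {
  const go = new Go();
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch("/dgraph-lite.wasm"), // hypothetical build artifact
    go.importObject,
  );
  await go.run(instance); // runs the Go main(); resolves when the program exits
}
```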

Nice-to-haves, must-haves, and dreams:

  1. Local development, without separately starting a Docker instance.
  2. An offline-sync kind of thing would be cool, i.e. it would be awesome if the user can just do whatever they want in a nice optimistic-UI fashion and Dgraph Lite syncs in the background asynchronously.
  3. Mock GraphQL API.
  4. State manager: although Apollo GraphQL does this as well, people not using GraphQL may benefit from a WASM-based state manager!
  5. It absolutely must have support for RxJS Observables! (See the sketch after this list.)
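
On the Observables point, a sketch of the surface such a state manager could expose; runLocalQuery and the store-change stream are hypothetical placeholders for whatever the WASM-backed store would provide:

```ts
import { Observable } from "rxjs";

// Hypothetical: runs a query against the local (WASM-backed) store.
declare function runLocalQuery<T>(query: string): Promise<T>;

// Emits the query result immediately, then re-emits on every local write.
function watchQuery<T>(query: string, storeChanges: Observable<void>): Observable<T> {
  return new Observable<T>((subscriber) => {
    const refresh = () =>
      runLocalQuery<T>(query).then(
        (result) => subscriber.next(result),
        (err) => subscriber.error(err),
      );
    refresh(); // initial emission
    const sub = storeChanges.subscribe(refresh); // refresh on local writes
    return () => sub.unsubscribe();
  });
}
```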

P.S. At what point Dgraph stops being Dgraph needs to be considered, just like the Ship of Theseus paradox!

6 Likes

I like the way you are thinking; Dgraph is indeed blazing fast, and WASM is a fascinating tech to consider (and it appears to at least have access to LocalStorage).

However, if I want to create a Vue app or a React app (or a Quasar PWA, or, or…), then I am anyway stuck with JavaScript…
If I am anyway stuck with JavaScript and I want to store offline data, then I am more or less stuck with IndexedDB.

By the way, Dgraph stores its data in a well-indexed key-value store, so I think we had better not discard IndexedDB categorically ;-D

The issue of speed:

  1. Depending on your internet connection, it may be possible to get a parsed JSON object faster from Dgraph than from IndexedDB.
  2. If you are on a mountaintop without any network, it will definitely not be faster to wait until you hike down to access Dgraph than to use a JavaScript-based offline-first data store.

I still want to entertain the idea of “Dgraph Lite”, as you call it:
I was thinking of what I would call a “pseudo Dgraph Alpha” somehow in the browser, with client-side storage of some sort. I wonder if your idea of compiling Go to WASM might actually be able to create a full-fledged Alpha for each client. Now I have no idea if this is at all desirable, but it sounds very interesting to entertain.

1 Like

We can invoke WASM APIs from JS. (Refer: Calls between JS & WASM are fast!, Loading and Running WASM, & Loading WASM module efficiently).

A great example of this is Squoosh, an in-browser image converter & compressor. It downloads the WASM file when you upload an image!

Yup, my bad.

Agreed.

I want to also answer my own questions for must-have, nice-to-haves and dreams:

Must haves:

  1. First Page Load: get the most recent data from the client-side cache to the screen ASAP, then sync new data from the server to the screen without any UI blocking (see the sketch after this list)
  2. Offline Mutations: if the user loses internet connection, all mutations are preserved and synced to Dgraph ASAP
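
For the first must-have, a minimal sketch of that “cache first, then network” flow; readCache, writeCache, and render are hypothetical helpers standing in for IndexedDB access and the UI layer:

```ts
// Hypothetical helpers: IndexedDB-backed cache and a UI render hook.
declare function readCache<T>(key: string): Promise<T | undefined>;
declare function writeCache<T>(key: string, value: T): Promise<void>;
declare function render<T>(data: T): void;

async function loadQuery<T>(key: string, fetchRemote: () => Promise<T>): Promise<void> {
  const cached = await readCache<T>(key);
  if (cached !== undefined) render(cached); // paint stale data immediately
  try {
    const fresh = await fetchRemote(); // no UI blocking: render already happened
    await writeCache(key, fresh);
    render(fresh); // reconcile with server truth
  } catch {
    // offline: the cached render (if any) stays on screen
  }
}
```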

Nice to haves:

  1. Transparent API: it would be great if the API was basically just like the dgraph-js-http client, with some magic behind the scenes so that Dgraph queries return data as quickly and as up to date as possible, given the state of your offline cache and internet connection
  2. Pre-caching / pre-syncing: a way to register queries that are likely to be needed soon and/or “subscription” queries, with a web worker making sure to fill the offline cache and inform the UI of updates; again, zero UI blocking is a high priority here (see the sketch after this list)
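
A sketch of the pre-caching idea, written as code that would run inside the web worker; fetchRemote, writeCache, the message shape, and the channel name are all assumptions:

```ts
// Hypothetical helpers for fetching from Dgraph and writing to the cache.
declare function fetchRemote(query: string): Promise<unknown>;
declare function writeCache(query: string, value: unknown): Promise<void>;

const registered = new Set<string>();
const updates = new BroadcastChannel("precache-updates");

// The page registers likely-needed queries by posting them to the worker.
self.onmessage = (e: MessageEvent<{ register: string }>) => {
  registered.add(e.data.register);
};

// Keep registered queries warm off the main thread and tell the UI.
setInterval(async () => {
  for (const query of registered) {
    const data = await fetchRemote(query);
    await writeCache(query, data);
    updates.postMessage({ query }); // UI listens and re-renders, no blocking
  }
}, 30_000); // refresh interval is arbitrary
```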

Dreaming:
In the direction that I mentioned in my previous reply to @abhijit-kar… It could be really interesting to imagine a sort of client-Alpha instance that handled syncing in the same way as Dgraph (Badger-like key-value store, Raft-based ACID, etc.). If all of that could work in a service worker, then it could provide a very interesting twist on distributed data. A system could emerge where peer-to-peer and server would just play nicely due to the brilliance of Dgraph’s architecture.

1 Like

Thanks for the links (I am starting to understand WASM a tiny bit better).
Still, I am uncertain about what role WASM should play in a more standard web-app or PWA scenario. Image processing is an obviously computation-intensive operation whose performance WASM can boost. But if we are just talking about caching offline data and feeding it to React or Vue, then I am not sure WASM can help (especially as we need native JS data structures for rendering)… It could be that the negotiation between WASM and JS would offset any speed gains.

WebAssembly is designed to be a complement to, not replacement of, JavaScript.
from the WASM FAQ

1 Like

This is an interesting take on things.

Apart from speed, existing code can be cross-compiled rather than created from scratch.

If there’s no compute intensive stuff happening, then yes, there’s no advantage.

Yup.

Just my thoughts… I don’t know if a perfect or even mostly good offline-syncing Dgraph would ever be usable in my situation. It would basically have to be a copy of the complete data set to be good, because with selectable fields and filters on the user side, the queries and depth of data wanted will never be consistent. Even if I get most of the data but leave out a few snippets here and there because I think they are not needed, there will be that one user who wants to filter on exactly the data that is not there. On top of that, we have user-defined sorting and pagination. This would almost require a complete offline version of Dgraph on the client side that synced when online. The two main problems with this are breaking the GraphQL rule of only transporting the data to the client that is needed and wanted (to keep network traffic down), and that every client would need a supercomputer (32 GB+ of RAM dedicated to this offline DB). And this is before even discussing how to keep an offline DB in perfect sync.

What would be nice to offer my advanced clients is an in-house synced DB. Maybe something like an Alpha node on a managed server, allowing advanced businesses/orgs to have full access to their data locally in case of an internet outage. This node would need to be a full copy of the DB and have a way to push/pull changes to sync back if it goes offline and comes back later.

2 Likes

I think this should be pretty reasonable… https://dgraph.io/docs/master/deploy/cluster-setup/

I have also started to think about this dilemma, but I don’t have in mind monster datasets that are needed in their entirety… So I figure that offline mode can cache all data ever accessed (but not all data in existence) and require a connection to make new queries that need more data. The offline-mutation possibility is actually more interesting for me.

Yes, subscriptions / observables are for sure important…

1 Like

@pbassham, @maaft, @uncle_juniper, @marcown, @verneleem Let’s pick up this topic here to figure out how to build a better offline-first Dgraph client.

Some related posts:

Idea Phase

What we want to come up with eventually is a thin, embeddable, syncable client. From what we have so far:

  • The thin client should have a GraphQL endpoint.
  • The thin client should have a local data store of some kind.
  • The thin client should be able to two-way sync data with the main Dgraph database when online, whether by DQL or GraphQL.
  • This will very likely depend upon query server timestamps or some other means of storing last-sync metadata.
  • This thin client should be lite and embeddable (with the lowest memory consumption possible).
  • This thin client should answer queries to the best of its ability when offline and fill in the gaps in its local cache when online.
  • This thin client needs to respect GraphQL authentication rules.
  • TBD: will users be authenticated against the local machine’s credentials with a generated JWT, or will users need to authenticate against an online service that responds with a JWT? This discussion will determine whether users can authenticate when offline, or can only use the last online-authenticated credentials for as long as they last. Maybe allow users to authenticate when online with a long-lived JWT. What happens when a user logs out, or a different user logs in from a different client on the same thin client?
  • In the sync process, it may be possible that the same node was updated by the client while offline and, during that same offline period, also in the online Dgraph db. The sync process needs to account for this. Storing the mutations a user performed while offline and then pushing those mutations in a sync when the user reconnects may be a possible solution, but it is not perfect, as data may have changed on the source of truth and the mutations may no longer have the same effect; also, the JWT may no longer be valid upon reconnection if multiple users share the same thin client. (See the sketch below.)

And that is the tip of the iceberg in my opinion.
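
On the conflict point in that last bullet, a minimal sketch of one possible (and, as noted above, imperfect) last-write-wins policy; the updatedAt/deletedAt fields are assumptions about the node shape:

```ts
// Assumed node shape: every type carries sync metadata timestamps.
interface Versioned {
  id: string;
  updatedAt: string;  // ISO timestamp set on every mutation
  deletedAt?: string; // set instead of hard-deleting, so deletes can sync
}

// Whichever side mutated the node most recently wins; the losing offline
// mutation is silently dropped (this is exactly the imperfection above).
function mergeNode<T extends Versioned>(local: T, remote: T): T {
  const localTime = local.deletedAt ?? local.updatedAt;
  const remoteTime = remote.deletedAt ?? remote.updatedAt;
  return localTime > remoteTime ? local : remote; // ISO strings sort lexically
}
```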

Research Questions:

  • Are there any thin graph DBMSs, like SQLite?
  • What about embeddable key-value stores or NoSQL?
  • Any references from the web to anyone who has made a GraphQL API client that syncs when online?

Possible Usable Tech

Client Language

  • WASM using Go
  • I am a JavaScript developer myself, but this is open to suggestions.

Challenges:

  • Schema Updates between SSOT and client
  • Supporting the same queries and mutations supported on the SSOT.
    Example: when online, the client requests queryContact(first: 100)...; then, when offline, the thin client is queried for getContact(id: "0x1").... This also applies to querying the thin client with filters and arguments not used when online. Apollo state management cannot handle this very well, and any other local DB will require a completely customized GraphQL layer in this thin client. (See the sketch after this list.)
  • Overcoming N+1 without an embeddable graph database. Is this even an issue with a local DBMS, since there are no network round trips?
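
On the queryContact/getContact example, a sketch of the normalization a thin client could do so the point lookup can be answered offline; the in-memory Map stands in for IndexedDB or any other local store:

```ts
// Minimal Contact shape; real fields depend on the schema.
interface Contact {
  id: string;
  [field: string]: unknown;
}

const contactsById = new Map<string, Contact>();

// Normalize list results (e.g. from queryContact(first: 100)) by id.
function ingestQueryContactResult(contacts: Contact[]): void {
  for (const c of contacts) contactsById.set(c.id, c);
}

// Answer an offline getContact(id: "0x1") from the normalized cache. This
// only succeeds if the node was cached with the fields the offline query
// asks for; that is exactly the hard part described above.
function resolveGetContact(id: string): Contact | undefined {
  return contactsById.get(id);
}
```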

Really Advanced Stuff

  • It would be neat if, after all of the above is figured out, local LAN clients could communicate with each other to fill in data from other local thin clients when working together in an offline mode.
    Example: I was working on my desktop and the internet went out; I am now on my laptop and my desktop is on the same LAN. I make a query for data that is not available locally but can be obtained on the local network.

Back to work now on my current tasks at hand, which I can do without needing an offline client at the moment.

5 Likes

At its core, Apollo Client is a state management library that happens to use GraphQL to interact with a remote server. Naturally, some application state doesn’t require a remote server because it’s entirely local. Source: Managing local state - Apollo GraphQL Docs

I still believe that Apollo is the best solution. And it works in JS frameworks like React-Native and React.

There are some decentralized apps, called “DApps”, where there is some experience with this. For example, IPFS is a decentralized application that uses Badger.

I remember a DApp from a few years ago that used some principles from git and a pure JSON DB. And they synced in a P2P manner. I don’t remember what solution they applied for that (for example, proof of work and blockchain are the best for decentralized registers). But they did it. I didn’t follow them all, but IPFS, for example, is growing again.

3 Likes

Hi @amaster507,

thank you once again for taking the lead here and pushing forward to get stuff done!
I’ll just add some requirements, challenges and ideas which you could add to your post if you like.

First, I’d like to clarify what use-cases we are talking about:

  • browser-only pure javascript library
  • database-lite service (e.g. implemented in golang)

I don’t think that the former case is really that important. If you really want offline support, you can always pack your app into Electron and ship and start any needed services on demand (including any database services). If it is only about short internet outages, I think @MichelDiz is correct in pointing to Apollo state management. But this is not what we need here. We’re talking about possibly weeks of downtime.

Also, as you said, WASM might still be an option. Therefore, I’ll concentrate on the latter case in this post.

I’d like to propose the obvious name “Dgraph GraphQLite” (or DGQLite for short) instead of “thin client”. Also, by doing this, I want to emphasize that building a completely new client might be too much work; we could just build a lite version of Dgraph, which is already capable of most requirements we want. Such a service could also be started by any app.

In general, I see DGQLite as very similar to Dgraph itself, except for:

  • clustering etc.
  • high-performance throughput (keeping RAM usage low instead)
  • keeping it simple (to minimize cross-platform effort)
  • anything else that’s not needed offline-first (the target is, again, to keep RAM usage low)

Requirements

  • Full cross-platform support
  • Initialize DGQLite by posting a GQL schema (like with Dgraph)
  • A GQL endpoint serving the generated API (like with Dgraph)
  • Synchronization with a dgraph-alpha is a bonus (it can already be implemented with custom business logic; I will go into detail on this later)

Client
I’d propose to use the same Apollo client everyone is currently using and outsource the “heavy lifting” to DGQLite.

  • Apollo talks to DGQLite on localhost
  • DGQLite knows if the remote endpoint (e.g. Slash GraphQL) is reachable and forwards the request in that case
  • If the remote server is not reachable, DGQLite will handle the request itself and write data to the local Badger DB.

This has the advantage that, from a client perspective, you can use exactly the same queries you are using today.
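
A minimal sketch of that forwarding behaviour in Node.js/TypeScript (the real DGQLite would presumably be Go; the remote endpoint URL and handleLocally are assumptions):

```ts
import http from "node:http";

// Hypothetical remote endpoint and local fallback handler.
const REMOTE = "https://example.cloud.dgraph.io/graphql";
declare function handleLocally(body: string): Promise<string>;

// DGQLite listens on localhost; Apollo points at http://localhost:8080.
http.createServer(async (req, res) => {
  const body = await new Promise<string>((resolve) => {
    let buf = "";
    req.on("data", (chunk) => (buf += chunk));
    req.on("end", () => resolve(buf));
  });
  try {
    // Online: forward the untouched GraphQL request (Node 18+ global fetch).
    const upstream = await fetch(REMOTE, {
      method: "POST",
      body,
      headers: { "Content-Type": "application/json" },
    });
    res.end(await upstream.text());
  } catch {
    // Offline: answer from the local store instead.
    res.end(await handleLocally(body));
  }
}).listen(8080);
```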

Challenges
By using a lite version of Dgraph, most of the challenges you mentioned (@amaster507) are already solved:

  • It would naturally support the same queries and mutations as a Dgraph Alpha.
  • Not sure about N+1, but I guess this is not currently an issue with Dgraph?
  • Schema updates: DGQLite could just introspect the schema of the main server when going online and do the same migration (i.e. posting the schema to itself) that Dgraph does currently (see the sketch after this list).
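
A sketch of that schema-sync step; getGQLSchema and updateGQLSchema do exist in Dgraph’s /admin GraphQL API, but the URLs and the absence of error handling here are simplifications:

```ts
// Copy the main server's GraphQL schema to the local DGQLite instance.
async function syncSchema(remoteAdmin: string, localAdmin: string): Promise<void> {
  // Fetch the current schema from the main server's /admin endpoint.
  const resp = await fetch(remoteAdmin, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: "{ getGQLSchema { schema } }" }),
  });
  const schema: string = (await resp.json()).data.getGQLSchema.schema;

  // Post the same schema to the local instance, mirroring the migration.
  await fetch(localAdmin, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `mutation($sch: String!) {
        updateGQLSchema(input: { set: { schema: $sch } }) { gqlSchema { schema } }
      }`,
      variables: { sch: schema },
    }),
  });
}
```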

Synchronization

It would be nice to leverage Dgraph’s cluster and replication mechanisms here. There, a similar problem needs to be solved, no?

The scope of the synchronization has to be limited, of course, and I see @auth rules as a perfect fit here.

That being said, I’d propose that for an initial DGQLite MVP we shouldn’t pay too much attention to synchronization. We can add this functionality later on.

Anyway, here’s how we currently do synchronization using two Dgraph instances, and it works great:

  • do the synchronization on the application-layer
  • synchronize every type independently, starting with leaf-nodes and working the way up the dependency-tree
  • a custom id field on every type that must be set when using the addFoo(...) mutation (all types implement interface ID { id: String! @id })
  • createdAt and updatedAt fields on every type
  • a list of deleted IDs with a deletedAt timestamp
  • syncedAt timestamps for every user for every type

S := DGQLite (in our case dgraph instance 1)
D := Slash GraphQL (in our case dgraph instance 2)

  1. S: “Hey D, do you have any new, changed, or deleted Foos since syncedAt?”
    → S fetches all Foos with queryFoo(filter:{syncedAt: {gte: time}}) and adds/updates/deletes them locally
  2. S: “Hey D, here are all new, changed, and deleted Foos since syncedAt!”
    → S builds addFoo(input: [ ... ]), updateFoo(...) and deleteID(...) mutations and sends them to D
  3. update the syncedAt timestamps on both sides

When conflicts arise (an object with the same ID was added/updated/deleted on both sides), the object with the newest updatedAt or deletedAt timestamp currently wins.

The big advantage here is that the input and query types of the generated GQL API are the same when using Dgraph. Therefore, the objects obtained by queryFoo can be used directly in the corresponding addFoo and updateFoo mutations.

This makes writing synchronization logic very easy. And because we already use @auth rules on all types, we don’t need to worry about syncing data we don’t own, as Dgraph takes care of this.
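
Under those assumptions, a minimal sketch of the pull half of one sync round (step 1 above); the push half (step 2) is symmetric with S and D swapped. The gql() helper, the updatedAt-based filter, and the generated FooPatch input are assumptions about the schema:

```ts
// Hypothetical helper: POST a GraphQL document to an endpoint.
declare function gql(endpoint: string, query: string, variables?: object): Promise<any>;

async function pullFoos(S: string, D: string, syncedAt: string): Promise<void> {
  // "Hey D, do you have any new or changed Foos since syncedAt?"
  const res = await gql(D, `query($t: DateTime!) {
    queryFoo(filter: { updatedAt: { ge: $t } }) { id name createdAt updatedAt }
  }`, { t: syncedAt });

  for (const foo of res.data.queryFoo) {
    const { id, ...patch } = foo;
    // Changed nodes are updated in place; brand-new nodes would go through
    // addFoo, and deletions via the deleted-ID list (both elided here).
    await gql(S, `mutation($id: String!, $patch: FooPatch!) {
      updateFoo(input: { filter: { id: { eq: $id } }, set: $patch }) { numUids }
    }`, { id, patch });
  }
  // Step 3: both sides bump their syncedAt watermark after a successful round.
}
```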


tl;dr:

Don’t implement a completely new client; use a stripped-down Dgraph instance instead, get most of the features for free, and save time on development.

Further Steps

Maybe we can also get a bit of involvement from the @graphql team here?

Possible next steps:

  1. get an understanding of which features can be stripped out of Dgraph
  2. write “official” requirements for DGQLite MVP
  3. define milestones for DGQLite MVP
  4. ???
  5. let’s get this done!
4 Likes

That’s funny. Several times when Manish was giving a presentation (we have this on video, and here I think there are some very old discussions about it), someone asked for an SQLite-like Dgraph. He gets confused, because his proposal with Dgraph is totally different.

But even if we had a DgraphLite, you would still have to build a way to sync. And if this were built into Dgraph, it would, due to the complexity, for sure be an Enterprise feature. There is too much logic involved to guarantee sync between all clients, make the data contextual based on the user, and prioritize data, since the user doesn’t need a bunch of things; that would be considered a type of “over-fetching”.

That’s hard. Android, iOS, all the BSDs. We need more investors to do so :stuck_out_tongue:

It isn’t at all.

Well, if the community is open to creating a DQLite, it is welcome. DQL, GraphQL, and everything else can be copied/forked (as they are open source) if anyone is interested in doing this. This looks like reinventing the wheel to me. Only the idea of having a “portable GraphQL server” is new to me.

1 Like

I’m with you on this. The complexity is most likely too hard. And if it were supported within Dgraph, it would fit nicely as an enterprise feature.

“Full” was maybe a bit exaggerated. Linux, macOS, and Windows should be enough for most cases.

Nevertheless, I think some support from Dgraph would be very helpful here, just to get us started.

3 Likes

Do you want to fork the dgraph GitHub repo somewhere so we can collaborate on these ideas better? Fork and branch and start a new README, I think, would be best. Leave the master branch of the fork clean so that we can pull from dgraph master and work new features into the lite version.

3 Likes

@amaster507 @maaft

we should do a fork and work on it.

@MichelDiz @mrjn

can we hope for any technical support? Which features can be turned off in the original code because they are only there for high-performance throughput? How can we reduce the RAM usage?

2 Likes

Not sure, I think not.

The EE features are protected by license.

In my view, starting from scratch would be a thousand times better. I would take only the concepts of DQL, of writing data at the predicate level, and the RDF parser. If you just fork it, you will have a lot more work to make it run on many platforms. There is a lot of code that is unnecessary for the proposal of being “SQLite-like” (embeddable).

Start simple and you can go far.

3 Likes

@marcown, we need to collaborate and first come to an agreement on the requirements, then build it in phases.

  1. Proof of concept with very limited features
  2. Essential features for most use cases
  3. Features for the edge cases that the lite DB will support.

I think there is so much in Dgraph that we wouldn’t use that it might be easier to pick what we want, see how they do it in Dgraph, and mimic a handful of features.

Has anybody started a repo or project board somewhere we can collaborate on this? Discuss is not the right format, IMO. We need a project board where we can comment on granular features and move ideas around.

I have created a GitHub repo to kick things off. If you can’t access it, ping me here or in a PM and I will make sure you get added: Idea Phase · GitHub

1 Like

Thanks for creating the repo!

I think I don’t have access currently (I tried to add requirements). My GitHub handle is the same as here (@maaft).

2 Likes