Dgraph, Microservices, Time series, Scalability and CI/CD

Hi. I have been reading about Dgraph for the last few days and have a lot of questions about it, so I am asking them all in this thread. Sorry for making it huge.

  1. In the microservices world, the recommendation is to have one database per service to enable independent scaling. Should we do that with Dgraph as well? Doing so would create isolated graphs with no connections between the different instances.

  2. Let’s say we have isolated instances of Dgraph. Is it possible to refer to nodes living on another Dgraph instance? While this may violate microservices principles, I am wondering how to get a unified view of the data graph across all the services while still giving multiple teams the flexibility to manage their own database pipelines.

  3. Apollo Federation (https://www.apollographql.com/docs/apollo-server/federation/introduction/) takes GraphQL queries and distributes them to multiple GraphQL endpoints (microservices) depending on where they run, providing federated GraphQL. Is something similar possible in Dgraph, where I have multiple Dgraph instances running, each with its own graph, and I issue a single GraphQL± query which Dgraph fans out to the relevant instances, aggregates, and returns as one response? This would make a graph database workable in a microservices setup.

  4. Articles like this one:
    https://tdwi.org/articles/2017/03/14/good-bad-and-hype-about-graph-databases-for-mdm.aspx

suggest that graph databases are not as performant as relational databases for bulk queries and the like, and have their own cons. If I want to understand the cons of Dgraph, what should I look at before using it in production?

  5. I have different kinds of data that I use for my startup, and time series is one of them. As I see from this thread: Time Series in Dgraph

I see that Dgraph is not optimized for time series. Any suggestions on how to handle time-series data when using Dgraph? Should I go for an external time-series store? Or do you have other suggestions?

  6. Pricing for the microservices pattern

Let’s say we use one Dgraph instance per service (not sure about this yet). That would lead to a lot of Dgraph instances, and since the EE pricing is per instance, the cost might shoot up even if all the services run on the same node in the K8s cluster. Do you have any thoughts on this?

  7. One other question, which I mentioned in another thread, is about using Dgraph with data-localization constraints taken into account. Considering the laws in China, Russia, and other places, suppose you wanted a data graph with Dgraph where the data is stored and processed in separate clusters in different regions, but you could still run a query or mutation against all of it from one place (abstracting the complexity from the clients). Is there a reference architecture for this from Dgraph? I see that tools like Vitess (https://vitess.io/) let you geo-shard MySQL. How can I do it with Dgraph?

  8. Another question I had was about CI/CD pipelines. With databases, there are things to consider like schema migrations, rollback, roll-forward, and so on. For relational databases, for instance, if you are using something like Prisma, Prisma Migrate (https://www.prisma.io/docs/reference/tools-and-interfaces/prisma-migrate) lets you version-control your schema changes, roll them back or forward, and apply them in your pipelines as well. How can we do this in Dgraph?

  9. If there is one thing I am always against, it’s vendor lock-in. Though Dgraph supports spec-compliant GraphQL, as I see it, it is still the only database supporting GraphQL±, with Badger as the underlying storage engine. How would you view this? How can you ensure that someone using Dgraph is not locked into the ecosystem, but rather embraces it while keeping the option to move to something else later if needed?

  10. In GraphQL, there are projects like dataloader (https://github.com/graphql/dataloader) that batch queries when multiple queries are fired per call, reducing round trips to the database. Does Dgraph have something built in to solve this problem when you are doing multiple graph queries, or should I still use dataloader to handle it?

While I did search for all of this, I did not find answers to most of these questions in the docs.

Thanks in advance.


I would not personally recommend it. Dgraph is already built to shard and scale horizontally instead of vertically. I would recommend keeping connected data, well… connected as much as possible.

Yes, this is possible with the GraphQL endpoint and the @custom directive. You could even query data from other, non-graph endpoints. Just FYI, though: that data is not expandable beyond the first retrieval, so you have to fetch any nested data you need in a single request.
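To make that concrete, here is a rough sketch of what such a schema could look like. The type names, field names, and URL are invented for illustration, and the exact @custom arguments should be checked against the Dgraph GraphQL docs:

```graphql
type User {
  id: ID!
  username: String!
  # Resolved by calling an external (non-Dgraph) service at query time.
  # As noted above, this remote data cannot be expanded further in the
  # same query, so the service must return everything you need at once.
  reviews: [Review] @custom(http: {
    url: "https://reviews.example.com/byUser/$id",
    method: "GET"
  })
}

# @remote marks a type that is not stored in Dgraph itself.
type Review @remote {
  text: String
  rating: Int
}
```

A query for a User would then pull reviews from the remote service within the same request.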

See above.

This is really hard to answer because it depends a lot on the use case. For me, the biggest challenge is going to be doing client-side custom filtering of deeply nested data at multiple levels. Generally speaking, though, Dgraph has outperformed our previous MySQL 8.0.17 setup on RDS even for simple data lookups on a single table. And if you are comparing relational DBs to graphs and care about the n+1 problem, then graphs will win out every single time!

The query language will probably be most people’s biggest challenge; it involves a different way of thinking than traditional relational DBs. With SQL-like queries you do SELECT [fields] FROM [tables [w/joins]] WHERE [filters] HAVING [special-case filters] [pagination]. This is totally different in graph databases of any kind: you literally select the fields you want and apply filters at every level down through the graph. So the jump will most likely be easier for newcomers without any relational-DB experience than for those coming from relational databases. If you take it for what it is, though, you literally ask for the data you want in the shape that you want it. Designing your schema is of the utmost importance; get it right early in the game.
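For example, the “filters at every level” idea might look like this in GraphQL± (the predicate names here are invented for illustration, and genre is assumed to have an index that supports eq):

```
{
  movies(func: eq(genre, "scifi"), first: 10) {
    title
    release_year
    directed_by @filter(ge(age, 50)) {
      name
    }
  }
}
```

Each nesting level selects its own fields and applies its own filter, instead of one flat WHERE clause over joined tables.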

Not my expertise here, sorry. I will say it works for basic datetime tracking and plays really well with the UI, providing datetimes in the correct format without needing to convert them to a JavaScript Date object like I did with MySQL. I use times for events, tasks, and activity tracking, though there is currently no native support for createdAt and updatedAt fields, which has to be handled in the UI for now.

Sure: do it on Slash and pay as you go.

Umm, I don’t know, sorry. All I know is that Dgraph does automatic sharding, but I’m not sure about any geo-sharding… I guess someone up the chain can give more info here.

I know this is tagged Dgraph, but I foresee most users eventually being mainly on the GraphQL endpoint, so that is where I am answering this from; and I know the Dgraph schema is roughly the same under the hood.

Changing the schema does not change data, though it can change access to that data. I am version-controlling the schema in my GitHub repo alongside my main UI; this handles any kind of schema rollback or change control needed. Any data that needs to change along with a schema change must be handled with custom scripts. Rolling back data changes may not always be possible: if you delete data, it is gone, and a rollback is not available unless you restore from a backup, as far as I know.

Well, seeing that Dgraph made GraphQL± and created Badger from scratch for their own show, they probably view it pretty highly. That said, any graph database worth its salt should be able to import well-constructed GraphQL data with a full schema. I foresee GraphQL becoming the main graph query language in the future, and not just a query language but a fully functional DB language. Dgraph is leaps and bounds ahead of anybody else here, and the rest have a lot of catching up to do. How can Dgraph use GraphQL as a query language? Because they built their whole database structure around it.

Search for the live loader and bulk loader in the Dgraph docs. EDIT: sorry, I misread this as being about imports instead of queries; you are talking about something different here.

  1. This will depend on your architecture. However, with Dgraph this is not necessary: it was made to scale both horizontally and vertically.
  2. Not natively. But you can use an upsert block and create your own pattern with your own IDs manually; your application would handle this communication. Surely you already know some ways to do this. (This is about Dgraph itself; you could also build microservices on top of Dgraph’s GraphQL, but that is a longer story.)
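For illustration, such a manual-ID pattern could use an upsert block like this. The predicate names and the ID format are invented, and external_id would need an index that supports eq:

```
upsert {
  query {
    # Look up the node by our own cross-service ID.
    n as var(func: eq(external_id, "orders:42"))
  }
  mutation {
    set {
      # Creates the node if it does not exist yet, otherwise reuses it,
      # so two services can converge on the same node via the shared ID.
      uid(n) <external_id> "orders:42" .
      uid(n) <referenced_by> "billing-service" .
    }
  }
}
```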

2.1. That’s not possible; Ratel UI is unable to communicate with custom instances, if that’s what you’re suggesting.

  3. Yes, Apollo Federation is a possibility, but all the Dgraph instances would be “virtually separated”. You would not be able to make queries in GraphQL±, only in GraphQL, in the context of the Federation. Perhaps in the future Ratel UI will support GraphQL, and then its microservices context could be viewed there.
  4. I read the article; it is not clear which DBs the author used for the comparisons. Taking the comments related to Neo4j: Dgraph overcame those limitations two years ago. The author probably didn’t even know about Dgraph at the time.

There are some limitations related to bulk queries, but they can be overcome by dropping ACID guarantees with ludicrous mode. In this mode, with a good machine setup, you can write 200k N-quads/s; reads will certainly be monstrous as well (I never had to test reading in bulk). At 200k N-quads/s you can, in theory, ingest 12 million records per minute, on a single high-end machine with two nodes (one Zero and one Alpha).

  5. Yes; my opinion on this is limited to what I wrote in that post. If nobody tests it and knows how to test it, we will never know for sure.
  6. Ping @dereksfoster99 for this question.
  7. No, we have nothing consolidated on this, but there are old discussions around it. It is necessary to understand well what the needs are. Judging from the number of people asking about it, it seems to be of little interest so far. Talk to Derek about this.
  8. I’m not sure, but the EE features have something like that. Perhaps @martinmr could help.
  9. Dgraph has options to export your data in RDF or JSON. Most DBs out there support at least JSON insertion. I don’t know what other approaches could be taken, but I believe JSON is already an interesting standard for this.
  10. Yes, Dgraph has internal mechanisms that avoid the n+1 problem. GraphQL runs natively and follows the same pattern as GraphQL± when executing queries and mutations, so don’t worry about it. In your microservices case, however, this will be different: each of your services will be treated as independent in its own context. But within Dgraph itself, there is no such problem.
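On the client side, the batching idea behind dataloader (from question 10) can be sketched in a few lines. This toy Python version is purely illustrative and not a Dgraph API; it shows how individual lookups queued together get coalesced into one batched fetch:

```python
# Minimal dataloader-style batcher (illustrative only; all names are
# hypothetical). load() calls queued before a flush are coalesced into a
# single batch_fn call, which is the trick dataloader uses to avoid N+1.

class SimpleLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn   # fetches many keys in one round trip
        self.queue = []            # keys awaiting the next flush
        self.cache = {}            # per-loader memoization

    def load(self, key):
        if key not in self.cache:
            self.queue.append(key)
        return lambda: self.cache[key]   # deferred result

    def flush(self):
        keys = [k for k in self.queue if k not in self.cache]
        if keys:
            results = self.batch_fn(keys)   # ONE query instead of N
            self.cache.update(zip(keys, results))
        self.queue.clear()

calls = []
def fetch_users(keys):
    calls.append(list(keys))             # record round trips for the demo
    return [f"user-{k}" for k in keys]

loader = SimpleLoader(fetch_users)
a = loader.load(1)
b = loader.load(2)
loader.flush()
print(a(), b(), len(calls))  # user-1 user-2 1
```

Two loads, one round trip: that is the whole point, and it is also roughly what Dgraph's GraphQL layer does internally when it resolves a nested query in a single pass.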

Thanks a lot @amaster507 and @MichelDiz for your detailed answers. Read through it completely. Makes a lot of sense and that really helps.

After going through your answers and giving it some thought, I am leaning toward this approach:

  1. So, I will go with a single Dgraph DB even for my microservices architecture
  2. I will see if I can use Elasticsearch for time-series data storage and probably link it to Dgraph

Of course, the questions I would still need a bit more clarity on are:

  1. Data localization (Geo-Sharding)
  2. Database migrations, rollbacks, and roll-forwards

Of course, I am concerned about how it will scale in the long run, since it will end up being a huge monolithic database (maybe I can look at partitioning in the future if that happens? Still thinking.)

Maybe this could help: “Add a Bulk Move Tablet” and/or “A deterministic scheme for Tablets”. That feature request is similar; maybe the “geo” logic could be appended to it.

Do you mean migrating to another DB, or Dgraph to Dgraph? If the latter, the backup feature should do the trick.

I think we really don’t have this feature even in EE.

@MichelDiz Thanks for your reply again. By migrations I mean generating schema-level changes for every change that is applied to the db locally.

For example:

  1. A developer works on schema changes and applies them to Dgraph in his local instance
  2. He makes multiple changes (let’s say 5 changes for this case)
  3. He then sends a pull request with all the changes (which includes every migration he applied to the db, in steps)
  4. The CI/CD pipeline now applies the changes in the same sequence the developer did locally, using the available migrations
  5. This way, you always get consistent behavior when applying all state changes to the database
  6. In addition, you can use the migrations to roll the db back or forward to any schema state

These are the strategies that Liquibase (https://www.liquibase.org/), Prisma Migrate, and other providers use.
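The forward/backward sequencing described in the steps above can be sketched as a toy migration runner. This is purely illustrative; Dgraph has no built-in tool like this, and all the names below are made up:

```python
# Toy migration runner (illustrative sketch, not a real Dgraph tool).
# Each migration is (version, up_fn, down_fn). apply_to(target) walks
# forward or backward so the schema always reaches a state through the
# same ordered sequence the developer used locally.

class Migrator:
    def __init__(self, migrations):
        self.migrations = sorted(migrations, key=lambda m: m[0])
        self.version = 0   # assumes contiguous versions 1, 2, 3, ...
        self.log = []      # audit trail of applied steps

    def apply_to(self, target):
        for v, up, down in self.migrations:            # roll forward
            if self.version < v <= target:
                up(); self.version = v; self.log.append(("up", v))
        for v, up, down in reversed(self.migrations):  # roll back
            if target < v <= self.version:
                down(); self.version = v - 1; self.log.append(("down", v))

schema = []  # stand-in for the live Dgraph schema
m = Migrator([
    (1, lambda: schema.append("name: string ."),
        lambda: schema.remove("name: string .")),
    (2, lambda: schema.append("age: int ."),
        lambda: schema.remove("age: int .")),
])
m.apply_to(2)   # forward: v1 then v2
m.apply_to(1)   # rollback: undo v2 only
print(m.version, schema)  # 1 ['name: string .']
```

A real version would persist the current version in the database itself and push each change to Dgraph's schema endpoint; this sketch just shows the ordering contract the CI/CD pipeline would rely on.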

Hope this makes sense.

Not natively possible, but you can build custom approaches for that, e.g. migrating (renaming predicates, and so on). You can create scripts for the 5 changes.

You can manually mark predicates as deprecated, versioned, etc. You can either use the rename process with an upsert block or use facets instead; all your future queries must then take the facets into account (they would work as a kind of “metadata”).

e.g. using facets with the List Type:

username: [string] .

{
  set {
    _:Node0 <username> "Lucas"  (pred_version="1.1") .
    _:Node0 <username> "Lucas Lira"  (pred_version="1.5") .
  }
}

Query and result:

{
  q(func: has(username)){
    username @facets(pred_version)
  }
}
{
  "data": {
    "q": [
      {
        "username|pred_version": {
          "0": "1.5",
          "1": "1.1"
        },
        "username": [
          "Lucas Lira",
          "Lucas"
        ]
      }
    ]
  }
}
{
  q(func: has(username)){
    username @facets(eq(pred_version,"1.1"))
  }
}
{
  "data": {
    "q": [
      {
        "username": [
          "Lucas"
        ]
      }
    ]
  }
}
{
  q(func: has(username)){
    username @facets(eq(pred_version,"1.5"))
  }
}
{
  "data": {
    "q": [
      {
        "username": [
          "Lucas Lira"
        ]
      }
    ]
  }
}

Deleting a specific value from the List Type:

{
  delete {
    <0xd> <username> "Lucas" . # This will also delete the facet.
  }
}

@MichelDiz Thanks again.

While this does make a lot of sense, it seems like a lot of manual work, and you might want to look at this over the long run. Once Dgraph gets good adoption within large organizations, they will want to version-control every migration, roll back/forward, and do all of this automatically with scripts. As it stands, I would have to make quite a few manual changes for this to work.

PS: I am going all in with Dgraph for my startup now. Fingers crossed. Hope I don’t get stuck anywhere along the way. Will share my experience once I am done. Launching next month.


I got it. Feel free to request this feature; I think there is nothing about it in the issues backlog. Due to the complexity, I think it would be an EE feature, but I’m not sure. Maybe an integration with a third-party tool would be fine.