Groupby on multiple fields using DQL

What I want to do

I want to group nodes by multiple fields. Let's consider the following schema as an example:

enum StatusType {
  NOT_REGISTERED
  REGISTERED
}
type User {
  id: ID!
  mobilePhoneNumber: String! @id @search(by: [exact, regexp])
  userStatusHistory: [UserStatus] @hasInverse(field: user)
}
type UserStatus {
  id: ID!
  user: User @hasInverse(field: userStatusHistory)
  type: StatusType! @search
  time: DateTime! @search
}

I want to get the unique user count for each user status on each day. Essentially, this is a group-by clause on (User.id, UserStatus.type, UserStatus.time):

Date        User  Status        Count
2022-09-13  0x1   UNREGISTERED  10
2022-09-13  0x1   REGISTERED    20
2022-09-14  0x2   UNREGISTERED  3
2022-09-14  0x1   REGISTERED    30

What I did

I'm currently querying all UserStatus nodes and applying the group-by aggregation at the application level.
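
Roughly, the application-level part looks like the sketch below. This is only a sketch, not our exact code: pydgraph, the default gRPC port 9080 and the target timezone ("Asia/Karachi") are example assumptions.

# Sketch of the application-level group-by described above.
# Assumptions: pydgraph client, default gRPC port, example timezone.
import json
from collections import Counter
from datetime import datetime
from zoneinfo import ZoneInfo

import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

QUERY = """
query statuses($startTime: string, $endTime: string) {
  statuses(func: type(UserStatus))
      @filter(between(UserStatus.time, $startTime, $endTime)) {
    UserStatus.type
    UserStatus.time
    UserStatus.user { uid }
  }
}
"""

variables = {"$startTime": "2022-09-13T00:00:00Z", "$endTime": "2022-09-15T00:00:00Z"}
resp = client.txn(read_only=True).query(QUERY, variables=variables)
statuses = json.loads(resp.json).get("statuses", [])

tz = ZoneInfo("Asia/Karachi")  # the timezone conversion happens here, in the app
counts = Counter()
for s in statuses:
    # Extract the (timezone-aware) date from the RFC 3339 timestamp.
    ts = datetime.fromisoformat(s["UserStatus.time"].replace("Z", "+00:00"))
    local_date = ts.astimezone(tz).date()
    user = s["UserStatus.user"]
    user_uid = (user[0] if isinstance(user, list) else user)["uid"]
    counts[(local_date, user_uid, s["UserStatus.type"])] += 1

for (day, user_uid, status), count in sorted(counts.items()):
    print(day, user_uid, status, count)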

Dgraph metadata

dgraph version
[
  {
    "instance": "alpha",
    "address": "localhost:7080",
    "status": "healthy",
    "group": "1",
    "version": "v21.03.0",
    "uptime": 31252,
    "lastEcho": 1663099970,
    "ongoing": [
      "opRollup"
    ],
    "max_assigned": 1190020
  }
]

What have you tried?

groupby works great on edges; for value edges, you're going to have a hard time using it from the GraphQL side, and the response will need some cleaning.

query queryUserStatusAggregation($eventFilter: string, $startTime: string, $endTime: string) {
  filteredStatuses as var(func: type(UserStatus))
    @cascade(UserStatus.event)
    @filter(between(UserStatus.time, $startTime, $endTime))
  {
    UserStatus.event @filter(uid($eventFilter)) {
      uid
    }
  }

  queryUserStatusAggregation(func: uid(filteredStatuses))
    @groupby(key: UserStatus.type)
  {
    count(uid)
  }
}

Response

{
  "data": {
    "queryUserStatusAggregation": [
      {
        "@groupby": [
          {
            "key": "UNREGISTERED",
            "count": 2
          },
          {
            "key": "REGISTERED",
            "count": 3
          }
        ]
      }
    ]
  }
}

The above query works fine, but I want to group by multiple fields, something like:

@groupby(key: [UserStatus.type, UserStatus.time, User.id])

Also, I'm curious about:

  1. How can I extract the date from the DateTime field, the way SQL's EXTRACT function does, and then use it inside the groupby?
  2. How can we convert the timezone at runtime while querying the data?

As far as I remember, you can add several predicates, separated by commas.

But you can't do this; a single alias for all the predicates is not possible.

You can’t. That would be a new feature request.

Not possible; again, that would be a new feature request.

There are several feature requests related to datetime. For now we only have since, and indexing.

Thanks for the quick response.

So, multiple predicates do work, as follows.

query queryUserStatusAggregation($eventFilter: string, $startTime: string, $endTime: string) {
  filteredStatuses as var(func: type(UserStatus))
    @cascade(UserStatus.event)
    @filter(between(UserStatus.time, $startTime, $endTime))
  {
    UserStatus.event @filter(uid($eventFilter)) {
      uid
    }
  }

  queryUserStatusAggregation(func: uid(filteredStatuses))
    @groupby(UserStatus.type, UserStatus.time, UserStatus.user)
  {
    count(uid)
  }
}

Response

{
  "data": {
    "queryUserStatusAggregation": [
      {
        "@groupby": [
          {
            "UserStatus.type": "UNREGISTERED",
            "UserStatus.time": "2021-05-30T11:02:05Z",
            "UserStatus.user": "0x684ac",
            "count": 1
          },
          {
            "UserStatus.type": "UNREGISTERED",
            "UserStatus.time": "2022-09-11T13:50:00Z",
            "UserStatus.user": "0x684a8",
            "count": 1
          },
          {
            "UserStatus.type": "REGISTERED",
            "UserStatus.time": "2022-09-11T13:50:58Z",
            "UserStatus.user": "0x684b6",
            "count": 1
          }
        ]
      }
    ]
  }
}

But it doesn't produce the final result I want, which requires converting the timezone at runtime.

I assume the only option for now is to pre-calculate the group-by key during the mutation.

In the above scenario, that would mean adding an extra, timezone-aware date field that gets populated during the mutation. I'm not sure how well that scales, since this approach requires a new predicate per timezone we support in our application, plus a data migration whenever we add support for a new timezone.
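
For illustration, a sketch of that idea follows. UserStatus.localDate is a hypothetical extra predicate (it is not in the schema above), and pydgraph plus the example timezone are assumptions.

# Sketch of pre-computing a timezone-aware date at mutation time.
# "UserStatus.localDate" is a hypothetical predicate added for this purpose.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

event_time = datetime(2022, 9, 11, 13, 50, 58, tzinfo=timezone.utc)
# Convert once, at write time, instead of at query time.
local_date = event_time.astimezone(ZoneInfo("Asia/Karachi")).date().isoformat()

txn = client.txn()
try:
    txn.mutate(set_obj={
        "dgraph.type": "UserStatus",
        "UserStatus.type": "REGISTERED",
        "UserStatus.time": event_time.isoformat(),
        "UserStatus.localDate": local_date,  # pre-computed group-by key
        "UserStatus.user": {"uid": "0x684b6"},
    })
    txn.commit()
finally:
    txn.discard()

With such a predicate in place, the multi-predicate groupby above could group on UserStatus.localDate instead of UserStatus.time, at the cost of one such predicate (and a backfill) per supported timezone.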

Do you think this is a feasible approach, or is there a better way to solve this problem, i.e. grouping by a timezone-aware date?

Thanks!

I think you're going to need multiple blocks, because I think a multi-predicate groupby uses "AND" logic, not "OR" logic.

If we had loops, like a foreach, this would produce a better result.

There's no such thing in Dgraph. You have to handle that logic at the application level.

I think you're going to need multiple blocks, because I think a multi-predicate groupby uses "AND" logic, not "OR" logic.

How will multiple blocks help?
Actually, I do need the AND logic. It’s totally fine.

If we had loops, like a foreach, this would produce a better result.

Can we expect this feature in the near future?

Thanks!

There's no such thing in Dgraph. You have to handle that logic at the application level.

Just a quick question about Dgraph's performance when we fetch nodes into our application to handle logic that isn't supported yet.

Let's say we have 10k nodes of the UserStatus type. We have to fetch them all to apply the aggregation at the application level. How will that impact Dgraph's overall performance if the number of nodes grows to 100k? Would it still be feasible to fetch them all at once?

OR

Would it be recommended to fetch the 100k nodes in batches of 10k? That would require 10 queries before we do the aggregation.
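
For context, the batched version I have in mind would look roughly like the sketch below, paging through the nodes with uid-ordered pagination (first/after). pydgraph and the batch size are assumptions.

# Sketch of fetching UserStatus nodes in batches and streaming them to the
# application-level aggregation. Uses DQL's first/after pagination on uid.
import json

import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

def fetch_user_statuses(batch_size=10_000):
    last_uid = "0x0"
    while True:
        query = f"""
        {{
          statuses(func: type(UserStatus), first: {batch_size}, after: {last_uid}) {{
            uid
            UserStatus.type
            UserStatus.time
          }}
        }}
        """
        resp = client.txn(read_only=True).query(query)
        batch = json.loads(resp.json).get("statuses", [])
        if not batch:
            return
        yield from batch
        last_uid = batch[-1]["uid"]  # results are uid-ordered, so resume here

# The group-by then runs over the generator instead of one huge result set.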

Thanks!

And that is the deal breaker for almost any serious use case of Dgraph that needs analytics on the graph. Traversing is great and easy, but once you need to analyze the data, you get stuck in the mud and regret using Dgraph.

Again I ask: if you have to do everything in the client, then why have a database on a server? Hopeful for v23+.

This is a topic I discussed a lot with Daniel a few years ago, when we saw many requests for things that are in fact naturally part of the client layer (business logic) and not of the database.

It's nice to have different features in a DB, but the real duty of a database is to store things and fight to keep queries fast. In the application we only have ephemeral state and sessions ("cookies, Redux, state, MobX"); DBs were created to store long-term information, not for data mining. For that we have specific tools, for example Python, Power BI, OpenRefine and others.

I think it's cool to have fancy features, but don't expect a DB to be responsible for everything. Otherwise nobody would need to invent anything, since the DB would do everything. Right?

For this kind of data analysis, people usually use a library in Python, R or something similar. But I agree on improving DQL, and I will fight for DQL to have cool features that make people addicted to DQL and never want anything else :stuck_out_tongue:

PS. If you want to do many things at once, you will inevitably be bad at all of them.

@Asim_Khan Are you able to create issues for the feature requests? https://github.com/dgraph-io/dgraph/issues/new/choose Try to find duplicates first; if you don't find anything covering the things you mentioned, you can open an issue with examples, etc.

If you don't want to do it, let me do it: I'm going to take some time next week to review the things you said and write the feature requests myself.

@MichelDiz Sure, I will open up an issue maybe on the weekend.

Can you please comment on this part?

Let's say we have 10k nodes of the UserStatus type. We have to fetch them all to apply the aggregation at the application level. How will that impact Dgraph's overall performance if the number of nodes grows to 100k? Would it still be feasible to fetch them all at once?
OR
Would it be recommended to fetch the 100k nodes in batches of 10k? That would require 10 queries before we do the aggregation.

I don't want to rely on application logic while assuming it's fine to query a large number of nodes when I'm not sure about the limits. So I'm just confirming some metrics to foresee bad situations as the data for that query grows.

Question (Big O related): who is your client that needs 100k nodes, each node with N predicates, each predicate with N characters? I think no DB can handle requests like that at 100k per second, 24 hours a day, 7 days a week, with no cool-down. Such a query has to be occasional, not constant. Right? Obviously I'm exaggerating to stretch our interpretation of things a bit; I know you're not at that level of resource consumption, but in part your question implies resource usage.

If it's a big query done occasionally, Dgraph handles it perfectly. If it's a series of concurrent queries (in this case, 10k or 100k nodes divided into many requests over the hours), Dgraph handles them perfectly. And it can go further if you have a well-planned cluster along with well-planned clients and applications.

But 10k nodes per request, per client, is very complicated. If you have 1 million customers each making a single request per second, their average consumption would certainly be between 1% and 10%, that is, between 10 thousand and 100 thousand nodes per second. In that context you will already have a successful project (with a lot of money), your cluster will certainly be very robust, and Dgraph will support that many requests.

Application logic gives you freedom: to not rely only on the DB in order to grow, and to change quickly when you need to. Also, let's say you get bored with Dgraph. What's your plan to move away if you are completely dependent on us? That change would be painful for you. Never depend on a single thing; always have plans B, C, D, E… for everything in life. I love Dgraph, but I would be a fool to tell myself I can rely on only a single thing instead of creating a mix of things and building my own. I see no self-growth in thinking like that. IMHO.

By now I think I'm just stating the obvious, right? I think everyone thinks like that. Right?

Thanks for the detailed explanation and the overall context you've built up. I agree with most of your points.

Aggregate queries are mostly needed for reporting, so that query might not be required all the time. The point I was making is that a SQL database supports aggregate queries and can execute them over a 100k-row dataset pretty easily. If I have to do the same thing in Dgraph, then because of the current limitations of groupby I have to fetch all those nodes and aggregate at the application level, which means fetching 100k nodes just to calculate group-by stats for a particular schema.

Asim, rest assured that I will advocate for those things at the DQL level at Dgraph. I track everything I can that is of community interest, and I will always be poking the engineers, the CTO and even the CEO about them. However, today it is not possible, and we have to live with what is possible.

I have plans to push for changes in groupby so it works the way it does in RDBMSs. This feature works differently because relationships in graph DBs are different; it was created with edges in mind, not the way we know it from SQL. But it is possible to do something similar, or better.

I hope we include loops in the next coding season, and that this groupby response becomes friendlier in the future.

Totally understandable. I greatly appreciate your support.

Thanks!

So Python would definitely be middleware, not client side. I really do feel like you sometimes want to move more things to the client side than should be there.

Real database use cases do much more than "store data." You can do analytics when core features like math equations are there.

This particular issue would one day be solved by plugins/adapters, where people could write their own algorithms (hopefully in something like custom DQL with all the Neo4j features). This could NOT be done efficiently on the client side, and it is the reason Dgraph has a GraphQL layer: to avoid the middleware altogether.

Databases need queries, and great queries, to look at the data.

J

PS. This is getting quite off-topic; feel free to skip it. Dgraph still doesn't address what was asked, but it will in the near future.

When I said "client" I was referring to the end customer. No application will have a customer consuming that much data per second directly from the DB. That level of data can be surpassed if your user base is massive, but that's another context, not data dumping.

Database => Business logic/middleware/client => API => Application => Customer

Understand that these layers exist:

Backend
0 - Database level. It takes care of storage-level writes, and its job is to make sure the data is safe. Nothing more than that. Anything else that comes in the package is a bonus, gladly welcome.
1 - Business logic (your product should rely on this, not on your DB, IMHO). This is the client (Python, JS, Go, Rust) level. The thick part of the job has to be done here, not in the DB.
1.1 - Some data mining (big-data activities). Python mining and the like go here. This can optionally serve the frontend, or just dump the mined data for use in Power BI and other kinds of plotting to present in research or marketing. It is impossible for a DB to specialize in such activities.

Frontend
2 - State
3 - Application logic. (Some synonyms used: "client, customer, end consumer". The frontend is a reflection of your business logic.)

You won't be doing any mining or heavy activity on the frontend, just minor validations and state management. Most of the work has to be on the backend, because that's where you know there are more resources. You're not going to ask a $100 smartphone to do data analysis, nor your DB, and risk it going OOM. You have to balance the job. Be in control of your ship, captain!

I really don’t understand what you mean.

I really don't think DBs should be low-code/no-code platforms. You can build such a platform on top of one, but it's not fundamentally part of a DB's core.

Just because something has been added to a DB doesn't mean you should rest your whole business on it. Too much abstraction and too much outsourcing can become a serious problem in your final product. As I said before, you create too much dependency on a single thing, and a strategist doesn't think that way. Think about time; time is all you have in life. You may save a lot of time with a DB doing everything for you, but at what price, in terms of time lost if something goes wrong?

I'm all in if you want to give your soul to Dgraph's product. I would love having you around forever. :stuck_out_tongue:

Really, we do seem to have a very passionate community indeed.

This is something that the CTO has also already agreed to in conversations. Let’s consolidate this idea.

I do not agree. We have GraphQL because the community was relentlessly asking for it. It wasn't meant to solve anything; it was just to have a different language supported, in my opinion. Read the GraphQL-related issues; I participated in them myself. Today I see that I was partly wrong about what I wanted. At the time I was just a user studying GraphQL and excited about it. I wanted everyone to support GraphQL. Foolish. GraphQL is awesome, but it's not for everyone.

What the engineers have done with GraphQL in Dgraph is really impressive. But nothing can be set in stone; everything in life is changeable. Every river changes its course over the years, and something that is inflexible begins to show signs of fatigue, almost screaming for middleware.

Question: what algorithms can you implement with GraphQL if it isn't running with your own custom resolver?

Yes. Let’s be an Eagle, not a Duck.

Just a reminder: I'm not against GraphQL. I love that piece of technology; I just think it is not as flexible as we'd like it to be. DQL will grow a lot, you can bet on it.

Cheers!

So let's agree that "client side" and "client" are not the same thing here. Middleware is NOT on the client side, as it can't be for security reasons. As you stated, your CTO wants to add plugins/adapters (actually at the database level) for analytics and custom algorithms, which will be great. So there's no point in hashing that out.

All I ask for is the same features as Neo4j and Postgres, nothing more. We are 80-90% of the way there. My arguments have mainly been about security, then queries. Your CTO wants to add things like unique IDs and other constraints from the database level up, which I think is amazing. I don't think we are arguing there either.

So this is only partly true. You bet your butt I have read all of them. The problem here is a fundamental misunderstanding of GraphQL. DQL (even as GraphQL+-) was already very similar to GraphQL, but it isn't GraphQL. People wanted it to be GraphQL because it almost was. With GraphQL come standardizations that people know and, MOST IMPORTANTLY, security. I think both the devs here and the users have been ignorant of the usefulness of GraphQL. We both agree it is just not flexible enough to be a query language. In fact, GraphQL is NOT a query language; it is an API. The spec simply does not support the kind of queries we want in DQL. It was created by Facebook to replace multiple REST API endpoints so you code less. Those endpoints MUST be secure and, in fact, ARE middleware.

So, let's agree we need to make our GraphQL API middleware layer more secure. Some of the security, like constraints and keys, can be written from the database level up. In order for us to use the GraphQL API Dgraph already has, it needs to be more flexible as well, or we should get rid of it. It needs better queries and standardized mutations (nested filters, deep mutations, etc.). I think we both agree it will never replace DQL, nor do we want it to.

J

Exactly to the point. That's what I meant when I said I don't want to rely on application logic to accomplish something that might eventually be available in Dgraph, as it is in other DBs.

Another use case for native support of such queries would be removing duplicated application logic from multiple services/client-side apps.

Let's say I implement the groupby at the application level and later need the same groupby in another service: I would have to duplicate the same logic. Whereas if DQL supported it, I could put that query in a @custom directive and use it in multiple places.
