Running total (cumulative sum) by date


(TW) #1

I’m still trying to wrap my head around Dgraph. Given the following schema:

<asset>: default .
<a.acquired>: datetime @index(day) .

<user>: default .
<u.name>: string .
<u.asset>: uid @count .

How do I get a daily running total?

{
  user(func: has(user)) {
    stuff_over_time: ???
  }
}

(Michel Conrado) #2

Hi wiradikusuma,

Your question is quite unclear. Do you want to know the sum of time across a chain of nodes, or a sum of some count ordered by date?

Can you share a structured example? Have you tried something? Please also share the kind of result you want. The schema alone doesn’t help much.

Cheers.


(TW) #3

Given this sample data:

{
  set {
    _:u <user> "" .
    _:u <u.name> "Warren Buffett" .
    _:u <u.asset> _:Duracell .
    _:u <u.asset> _:Geico .
    _:u <u.asset> _:NetJets .
    _:u <u.asset> _:Brooks .
    _:u <u.asset> _:DairyQueen .
    
    _:Duracell <asset> "" .
    _:Duracell <a.acquired> "2018-01-28" .

    _:Geico <asset> "" .
    _:Geico <a.acquired> "2018-01-28" .
    
    _:NetJets <asset> "" .
    _:NetJets <a.acquired> "2018-02-14" .
    
    _:Brooks <asset> "" .
    _:Brooks <a.acquired> "2018-03-01" .
   
    _:DairyQueen <asset> "" .
    _:DairyQueen <a.acquired> "2018-06-04" .
  }
}

I would like to get something like:

  {
    "data": {
      "user": [
        {
          "stuff_over_time": [
            {
              "date": "2018-01-28",
              "count": "2"
            },
            {
              "date": "2018-02-14",
              "count": "3" // 2 + 1
            },
            {
              "date": "2018-03-01",
              "count": "4" // 2 + 1 + 1
            },
            {
              "date": "2018-06-04",
              "count": "5" // 2 + 1 + 1 + 1
            }
          ]
        }
      ]
    }
  }

Probably similar to https://stackoverflow.com/questions/664700/calculate-a-running-total-in-mysql (SQL).

Unfortunately, none of my attempts are runnable, as I still haven’t fully grasped the concepts.


(Michel Conrado) #4

Got it! Thanks for the example, it helped me grasp the whole idea quickly. Unfortunately, I think that’s not possible. It could be done with GroupBy, but you would need to create a kind of calendar-like edge structure for it to work.

As for the accumulation, you would have to do that in your application. It is not possible to do cumulative aggregation with time parameters; you can get there, but each period would have to be handled manually.

In fact, you can do something with GroupBy:

{
  q(func: has(a.acquired)) @groupby(a.acquired) {
      Total : count(uid)
    }
}

Result

{
  "data": {
    "q": [
      {
            "@groupby": [
              {
                "a.acquired": "2018-02-14T00:00:00Z",
                "Total": 1
              },
              {
                "a.acquired": "2018-03-01T00:00:00Z",
                "Total": 1
              },
              {
                "a.acquired": "2018-06-04T00:00:00Z",
                "Total": 1
              },
              {
                "a.acquired": "2018-01-28T00:00:00Z",
                "Total": 2
              }
            ]
      }
    ]
  }
}

But I do not think it’s ideal. You can’t do much with this grouping in other blocks. It may be useful for the final application, but not for applying further GraphQL+- logic.
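As a minimal sketch of the application-level accumulation described above, assuming the client receives the @groupby result as a list of dicts (the shape below mirrors the result JSON, but the variable and function names are hypothetical):

```python
# Assumed client-side shape of the "@groupby" result shown above.
groups = [
    {"a.acquired": "2018-02-14T00:00:00Z", "Total": 1},
    {"a.acquired": "2018-03-01T00:00:00Z", "Total": 1},
    {"a.acquired": "2018-06-04T00:00:00Z", "Total": 1},
    {"a.acquired": "2018-01-28T00:00:00Z", "Total": 2},
]

def running_total(groups):
    """Sort the per-date counts chronologically, then accumulate them."""
    out, acc = [], 0
    for g in sorted(groups, key=lambda g: g["a.acquired"]):
        acc += g["Total"]
        # Keep only the date part of the RFC 3339 timestamp.
        out.append({"date": g["a.acquired"][:10], "count": acc})
    return out

print(running_total(groups))
# [{'date': '2018-01-28', 'count': 2}, {'date': '2018-02-14', 'count': 3},
#  {'date': '2018-03-01', 'count': 4}, {'date': '2018-06-04', 'count': 5}]
```

The sort works because RFC 3339 timestamps order lexicographically; the database only supplies the per-date counts, and the cumulative sum happens in the app.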


(TW) #5

Thanks for your reply!

If it’s not possible, do you have any suggestions (on the schema, the query, or both) for showing a running total for the “most recent period” (e.g. the last 3 months)?

“Do it in the app” doesn’t sound scalable: imagine loading 5 years’ worth of data (to accumulate the count) only to display the last 3 months.

Also, since we’re on this topic, how do I change the grouping granularity? E.g. instead of @groupby(a.acquired), I want @groupby(a.acquired.month) (i.e. group by month).


(Michel Conrado) #6

That’s easy. Just use gt and lt, e.g.:

@filter(gt(date, "1980-01-01") and lt(date, "1999-01-01"))

I meant the other things, like the “cumulative sum”: in this specific case of yours it should be done at the application level. I believe outsourcing several large-scale jobs to a DB is a bad strategy, because you need the DB to stay available. Imagine thousands of people putting this same workload on the DB. You need to orchestrate the loads and not let the DB do all the work, IMHO.

That’s not possible: GroupBy only supports grouping by edge relationships. This means you need a structure designed in advance for your case if you want to use GroupBy. As I mentioned previously, you need a structure of dates, and your objects will relate to this structure. You will then use this structure to group as you need.

I’m going to work on some examples today to explain it better.


(TW) #7

I guess the most performant way is to precalculate on update, then? In SQL-speak, there would be “tables” that keep a running total for each granularity, e.g. total_in_month:

User  Month    Count  Total
John  2019-01  3      3
John  2019-02  2      5 (3+2)
Mary  2019-01  2      2

Will there be any (performance) advantage to putting this kind of data in Dgraph as opposed to SQL?
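The precalculate-on-update idea above can be sketched in-memory like this (the table, function names, and sample users are hypothetical; a real implementation would persist the counts in whichever DB is chosen):

```python
from collections import defaultdict

# Hypothetical in-memory version of the "total_in_month" table:
# every acquisition bumps a per-(user, month) count, and the running
# total is derived by walking that user's months in order.
counts = defaultdict(int)  # (user, "YYYY-MM") -> count in that month

def record_acquisition(user, month):
    counts[(user, month)] += 1

def totals_for(user):
    """Return (month, count, running_total) rows for one user."""
    rows, acc = [], 0
    for m in sorted(m for (u, m) in counts if u == user):
        acc += counts[(user, m)]
        rows.append((m, counts[(user, m)], acc))
    return rows

for _ in range(3):
    record_acquisition("John", "2019-01")
for _ in range(2):
    record_acquisition("John", "2019-02")
for _ in range(2):
    record_acquisition("Mary", "2019-01")

print(totals_for("John"))  # [('2019-01', 3, 3), ('2019-02', 2, 5)]
print(totals_for("Mary"))  # [('2019-01', 2, 2)]
```

Fetching the latest total for a user is then just the last row, which matches the SELECT … ORDER BY Month DESC LIMIT 1 idea mentioned later in the thread.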

Regarding GroupBy for date: Thanks for clearing it up!

All this while, I thought @index(day) meant “index down to day granularity”, which would mean I could query/group by year and month as well, such as @filter(gt(date, "1980")).


(Michel Conrado) #8

I didn’t get it. In Dgraph we do calculations (computing values or edges). You can use @count indexing too if you wish. I do not believe adding values manually is safe. But perhaps I don’t really understand your question.

As I said, GroupBy was created with edge relationships in mind. That is, you group objects that share the same relation. What I did up there with @groupby(a.acquired) is almost a “hack”. It is not supported; I’m taking advantage of an ability of this function to do an arbitrary grouping.

So any other function or feature you know from GraphQL+- does not work in GroupBy (except @filter). In the case of my “hacky way”, if you have two nodes “1980-01-02” and “1980-05-12”, they will not be grouped as 1980 but by their exact dates.

To work the way you need, GroupBy would have to be improved a lot: not only accepting values as grouping keys, but also extra functions, such as a grouping logic that is not just “group by exactly the same value” but “group by this expression”.

I’ve opened a discussion about this internally, but feel free to open a request.
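Since @groupby matches edge values exactly, one workaround is to regroup the per-day totals client-side by truncating the ISO date string. A minimal sketch (the variable and function names are hypothetical, and the daily totals mirror the earlier result):

```python
from collections import Counter

# Per-day totals as returned by the exact-value grouping above.
daily_totals = {
    "2018-01-28": 2,
    "2018-02-14": 1,
    "2018-03-01": 1,
    "2018-06-04": 1,
}

def regroup(daily, width):
    """Truncate ISO dates to the given prefix width and re-sum.

    width=7 keeps "YYYY-MM" (group by month); width=4 keeps "YYYY"
    (group by year).
    """
    grouped = Counter()
    for date, n in daily.items():
        grouped[date[:width]] += n
    return dict(grouped)

print(regroup(daily_totals, 4))  # {'2018': 5}
print(regroup(daily_totals, 7))
# {'2018-01': 2, '2018-02': 1, '2018-03': 1, '2018-06': 1}
```

This relies on ISO 8601 dates, where the year and month are fixed-width prefixes of the day string.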


(Michel Conrado) #9

As promised, here is a basic example. The structure itself is pretty basic indeed: each date is a node (uid). That way you can use GroupBy perfectly.

With the next version of Dgraph approaching, the new upsert transaction will make this structure easier to use: you can check whether the date already exists, create it if it doesn’t, and attach both objects in the same operation.

BTW, using this structure does not prevent you from using datetime indexing; they are two different things. The structure will serve for you to “visually” analyze your stats and help with groupby().

You do not need an elaborate structure at first, because you just need dates to be nodes. But I find it convenient to have a structure, and you can use the same logic for other things.

Mutation in JSON and Schema

date: datetime .
acquired_when: [uid] @reverse .
name: string @index(exact) .

[
  {
    "name": "Warren Buffett",
    "user": "",
    "u.asset": [
      {
        "asset": "Duracell",
        "acquired_when": [
          {
            "uid": "_:2018-01-28"
          }
        ]
      },
      {
        "asset": "Geico",
        "acquired_when": [
          {
            "uid": "_:2018-01-28"
          }
        ]
      },
      {
        "asset": "NetJets",
        "acquired_when": [
          {
            "uid": "_:2018-02-14"
          }
        ]
      },
      {
        "asset": "Brooks",
        "acquired_when": [
          {
            "uid": "_:2018-03-01"
          }
        ]
      },
      {
        "asset": "DairyQueen",
        "acquired_when": [
          {
            "uid": "_:2018-06-04"
          }
        ]
      }
    ]
  },
  {
    "year": "2018",
    "month": [
      {
        "month_name": "jan",
        "day": [
          {
            "date": "2018-01-28",
            "uid": "_:2018-01-28"
          }
        ]
      },
      {
        "month_name": "feb",
        "day": [
          {
            "date": "2018-02-14",
            "uid": "_:2018-02-14"
          }
        ]
      },
      {
        "month_name": "mar",
        "day": [
          {
            "date": "2018-03-01",
            "uid": "_:2018-03-01"
          }
        ]
      }
    ]
  }
]

Queries

{
  var(func: has(acquired_when)) @groupby(acquired_when) {
      A as  count(uid)
    }
    q(func: uid(A), orderdesc: val(A)){
      uid
      date
      total : val(A)
      acquired_in_this_date : ~acquired_when{
        uid
        asset
      }
    }
}

Result:

{
  "data": {
    "q": [
      {
        "uid": "0x6",
        "date": "2018-01-28",
        "total": 2,
        "acquired_in_this_date": [
          {
            "uid": "0x1",
            "asset": "Duracell"
          },
          {
            "uid": "0x9",
            "asset": "Geico"
          }
        ]
      },
      {
        "uid": "0x2",
        "date": "2018-03-01",
        "total": 1,
        "acquired_in_this_date": [
          {
            "uid": "0x7",
            "asset": "Brooks"
          }
        ]
      },
      {
        "uid": "0x3",
        "total": 1,
        "acquired_in_this_date": [
          {
            "uid": "0x8",
            "asset": "DairyQueen"
          }
        ]
      },
      {
        "uid": "0xa",
        "date": "2018-02-14",
        "total": 1,
        "acquired_in_this_date": [
          {
            "uid": "0xd",
            "asset": "NetJets"
          }
        ]
      }
    ]
  }
}

Query to view your structure

{
  q(func: has(year)) {
      uid
      month {
        uid
        month_name
        day {
          date
          uid
        }
      }
    }
}

(TW) #10

Hi Michel,

Thanks for your example. When I saw it, I thought: wow, that’s clever :smiley:

But the result is exactly like your first answer: it doesn’t accumulate the total. You said it “should be done at the application level”, so if I want to know the latest total, I need to retrieve everything, iterate over the result, and accumulate the total, right?

That’s why I said, “I guess the most performant way is to precalculate on update”, so I can just (using the table in my previous post as a reference): SELECT Total FROM total_in_month WHERE User = 'John' ORDER BY Month DESC LIMIT 1. And I asked whether there’s any performance advantage to storing this “table” in Dgraph instead of SQL.

What do you think?


(Michel Conrado) #11

It does, in “total : val(A)”. But not “by cumulative period”.

No; you can manually filter down to the latest date and do the counting.

q(func: uid(A), orderdesc: val(A)) @filter(eq(date, "2018-01-28"))

Or maybe this:

{
  var(func: eq(name, "Warren Buffett")) {
    u.asset @groupby(acquired_when){
      A as  count(uid)
    }
  }
  var(func: uid(A), orderdesc: val(A)){
    d as date 
    #### get dates
  }
  var() {
    vMax as max(val(d)) 
    #### do some aggregation. This case I wanna the "latest".
  }
    q(func: uid(A), orderdesc: val(A)) @filter(eq(date, val(vMax))){
      uid
      date
      total : val(A)
    }
}

{
  "data": {
    "q": [
      {
        "uid": "0x2",
        "date": "2018-03-01T00:00:00Z",
        "total": 1
      }
    ]
  }
}

acquired_when: [uid] @reverse .
name: string @index(exact) .
date: datetime @index(day) .

About what you said in terms of performance and SQL: I have to say that personally I have little experience with the SQL world. I know the basics and have some models in mind, but not to the point of directly comparing two different paradigms.

Also, personally I feel more lost and stressed in the SQL world than with graphs when it comes to complexity, as things get bigger and bigger. So I’ve gone “all in” on graph technologies.

Technically speaking, Dgraph can be used the way you need and can completely replace a SQL DB, with all of Dgraph’s advantages. So you can do whatever you want; you can invent your own way of modeling data.


(TW) #12

I apologize for the confusion. Let me rephrase my post:

But the result is exactly like your first answer: it doesn’t accumulate the total by cumulative period. You said it “should be done at the application level”, so if I want to know the latest total by cumulative period, I need to retrieve everything, iterate over the result, and accumulate the total, right?

Which means that if I want to efficiently get the latest total by cumulative period, it should be precalculated on update (my previous post with the table). What do you think?


(Michel Conrado) #13

Yep, but I don’t know what’s best here. Precalculating and adding it manually seems okay, but you’ll need to deal with it somewhere anyway.


(TW) #14

I’m fine with that, I just want to know what’s (not) possible. Thank you!


(yyyguy) #15

Thanks for this very useful back-and-forth dialog.

For a Dgraph newbie, this clarified a few things that I wanted answers on.

Cheers