Standard Deviation

I have been attempting to figure out how to calculate standard deviation but am having some issues. This example is calculating the std dev for the length of all tracks. The use case may not make much sense, but I am just experimenting with how to accomplish this.

Here is what I have:

{
  var(func: type(Track)){
    len as length    
    c as count(length)  
  }
 
  var(){         
    mean as avg(val(len))   
    count as sum(val(c))   
    # x as math(pow(len - 387, 2)) => correct result; 
    # mean is not making it here so it subtracts 0
    x as math(pow(len - mean, 2))
  }
      
  q() {
    count: sum(val(count))
    mean: sum(val(mean))
    sum: sum as sum(val(x))          
    variance: v as math(sum/count)
    stdev: math(sqrt(v))
  }               
}

Returns:

"data": {
    "q": [
      {
        "count": 1064618
      },
      {
        "mean": 387.436584
      },
      {
        "sum": 1426996742338.472
      },
      {
        "variance": 1340383.820618
      },
      {
        "stdev": 1157.749464
      }
    ]

In which the stdev is wrong. The correct answer is 1090 and as noted in the comment in the query if I replace:

 x as math(pow(len - mean, 2))

with:

 x as math(pow(len - 387, 2))

I get the right result. So for some reason the value of mean, which as shown above is 387 not making it to this calculation. I have tried a few things including calculating mean in itā€™s own method but that returns the same results.

Thoughts?

I gave this a try and you are right it doesnā€™t work as expected. The problem here is that it doesnā€™t use mean in the math block as you found out. I think we should be able to use the result of an aggregation like sum, avg, min, max etc in other math blocks in the same or as part of the same block. If you could open an issue for this on GitHub, we can look into supporting this. There might be other ways of doing this, tagging @MichelDiz if he has some ideas here.

1 Like

Created a GitHub ticket: Standard deviation does not work as expected Ā· Issue #5339 Ā· dgraph-io/dgraph Ā· GitHub

Since I didnā€™t find any dataset samples, I decided to create one based on this wikihow. How to Calculate Standard Deviation: 12 Steps (with Pictures) See the dataset at the end of this comment.

Side comment

If you have time, read this comment Bug with multi-level (nested bocks) in Upsert Block Ā· Issue #4779 Ā· dgraph-io/dgraph Ā· GitHub it is the same issue, but in a different situation.

Also, Custom Block - Discussion would help and maybe Feature: Add foreach() function. Ā· Issue #5335 Ā· dgraph-io/dgraph Ā· GitHub. Cuz I think it would be much harder to fix this changing the design of the query system.

My query

{
  var(func: eq(dgraph.type, "Track")){
    a as count(uid)
    c as count(length)
  }
  f(){
    n: n as sum(val(a))
    sum : S as sum(val(c))
    mean : mean as math(S / n)
    #result : math(c - mean) #ignore this
    x : math(pow(n - mean, 2))
    variance: v as math(S/n) #is it right? this is the same as mean
    stdev: math(sqrt(v))
  }
}

Results

{
  "data": {
    "f": [
      {
        "n": 6
      },
      {
        "sum": 48
      },
      {
        "mean": 8
      },
      {
        "x": 4
      },
      {
        "variance": 8
      },
      {
        "stdev": 2.828427
      }
    ]
  }
}

About the Results

I managed to find N and Sum to find the mean. That is, the results ā€œnā€, ā€œsumā€ and ā€œmeanā€ are correct. However, I found problems to continue. BTW, ā€œxā€, ā€œvarianceā€ and ā€œstdevā€ are wrong.

To continue this calculation, I would need to apply the ā€œmeanā€ against the variable ā€œaā€ (which is ā€˜nā€™) in a subtraction for each value (as you can see on the wikihow - in part 2 ā€œFinding the Variance In Your Sampleā€ ). But it is not possible to do this in an empty block. It was not designed for this. So the idea would be to create a new block with repeating the var block query. And apply the subtraction.

e.g: (this wonā€™t work)

{
  T as var(func: eq(dgraph.type, "Track")){
    a as count(uid)
    c as count(length)
  }
  f(){
    n: n as sum(val(a))
    sum : S as sum(val(c))
    mean : mean as math(S / n)
  }
  q(func: uid(T)){ #That was the idea
    c2 as count(length)
    test : math(c2 - mean) #but mean var is empty.
  }
}

But there is another problem. The empty/aggregation block will not pass the value of the variables according to the ā€œmapā€ of previous queries. That is, this is not possible to do.

e.g:

{
  T as var(func: eq(dgraph.type, "Track")){
    a as count(uid)
    c as count(length)
  }
  f(){
    n: n as sum(val(a))
    sum : S as sum(val(c))
    mean : math(S / n)
  }
  q(func: uid(T)){
    c2 as count(length)
    test : math(c2 - 8) # "8" Thats the value that should come from f() block
  }
}

This example bellow is the desirable result

{
  T as var(func: eq(dgraph.type, "Track")){
    a as count(uid)
    c as count(length)
  }
  f(){
    n: n as sum(val(a))
    sum : S as sum(val(c))
    mean : math(S / n)
  }
  q(func: uid(T)){
    c2 as count(length)
    fv : fv as  math(c2 - 8) # "8" Thats the value that should come from f() block
    SquareAll : SA as  math(pow(fv, 2))
  }
  FINAL(){
    SquareAll_sum : FA as sum(val(SA))
    VARIANCE : VA as math(FA / (n - 1))
    Standard_Deviation : math(sqrt(VA))
  }
}

Desireble Result

{
  "data": {
    "f": [
      {
        "n": 6
      },
      {
        "sum": 48
      },
      {
        "mean": 8
      }
    ],
    "q": [
      {
        "count(length)": 8,
        "fv": 0,
        "SquareAll": 0
      },
      {
        "count(length)": 10,
        "fv": 2,
        "SquareAll": 4
      },
      {
        "count(length)": 8,
        "fv": 0,
        "SquareAll": 0
      },
      {
        "count(length)": 4,
        "fv": -4,
        "SquareAll": 16
      },
      {
        "count(length)": 8,
        "fv": 0,
        "SquareAll": 0
      },
      {
        "count(length)": 10,
        "fv": 2,
        "SquareAll": 4
      }
    ],
    "FINAL": [
      {
        "SquareAll_sum": 24
      },
      {
        "VARIANCE": 4.8
      },
      {
        "Standard_Deviation": 2.19089
      }
    ]
  }
}

As you can see, the end result is perfect if we get the value right.

Dataset

{
   "set": [
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      },
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      },
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      },
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      },
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      },
      {
         "dgraph.type": "Track",
         "length": [
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            },
            {
               "dgraph.type": "any"
            }
         ]
      }
   ]
}