Incremental Algorithms

Mean and variance

To: numpy-discussion
From: Robert Kern
Date: Wed, 20 Sep 2006 03:01:18 -0500

Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python:

def mean(a):
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)
    return m

def var(a):
    m = a[0]
    t = a.dtype.type(0)
    for i in range(1, len(a)):
        q = a[i] - m
        r = q / (i + 1)
        m += r
        t += i * q * r
    t /= len(a)
    return t
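As a quick sanity check (my addition, not part of the thread), the two incremental formulas can be compared against a straightforward two-pass computation; plain Python floats stand in for the array dtype here:

```python
def mean_incremental(a):
    # Running mean: m_k = m_{k-1} + (a[k] - m_{k-1}) / (k + 1)
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)
    return m

def var_incremental(a):
    # One-pass population variance built on the same running mean
    m = a[0]
    t = 0.0
    for i in range(1, len(a)):
        q = a[i] - m
        r = q / (i + 1)
        m += r
        t += i * q * r
    return t / len(a)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m = sum(data) / len(data)                        # two-pass mean: 5.0
v = sum((x - m) ** 2 for x in data) / len(data)  # two-pass population variance: 4.0
print(mean_incremental(data), var_incremental(data))  # agree up to rounding error
```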

Alternatively, from Knuth:

def var_knuth(a):
    m = a.dtype.type(0)
    variance = a.dtype.type(0)
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / (i + 1)
        variance += delta * (a[i] - m)
    variance /= len(a)
    return variance
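For comparison (again my addition, not from the thread), Knuth's one-pass update can be checked the same way on plain Python floats; note that the accumulator multiplies `delta` by the residual against the freshly *updated* mean:

```python
def var_knuth(a):
    # Knuth/Welford one-pass population variance
    m = 0.0
    m2 = 0.0
    for i, x in enumerate(a):
        delta = x - m
        m += delta / (i + 1)
        m2 += delta * (x - m)  # (x - m) here uses the updated mean
    return m2 / len(a)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var_knuth(data))  # population variance of this data is 4.0
```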

David Cooke replies:

Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation:

def mean(a):
    s = e = a.dtype.type(0)
    for i in range(len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)
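To see the effect (my own illustration, not from the thread): repeatedly adding 0.1 in double precision drifts away from the correctly rounded sum, while a compensated loop of the same shape as the code above stays essentially exact. `math.fsum` serves as the correctly rounded reference:

```python
import math

def kahan_sum(values):
    # Compensated (Kahan) summation: carry the rounding error of each
    # addition in e and feed it back into the next term.
    s = e = 0.0
    for x in values:
        y = x + e        # add the carried-over error first
        t = s + y
        e = (s - t) + y  # what was lost when computing s + y
        s = t
    return s

values = [0.1] * 10**6
naive = sum(values)
compensated = kahan_sum(values)
exact = math.fsum(values)  # correctly rounded reference sum
print(abs(naive - exact), abs(compensated - exact))  # naive error is far larger
```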

Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a); both are quite poor in comparison to the Kahan summation.

Upon which Tim Hochberg provides an implementation:

Here's a version of mean based on the Kahan sum, including the compensation term, that seems to be consistently much better than the builtin mean. It is typically 4 orders of magnitude closer to the "correct" answer than the builtin mean, and 2 orders of magnitude better than just naively using the Kahan sum. I'm using the sum performed with dtype=float64 as the "correct" value.

def rawKahanSum(values):
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    c = type(total)(0.0)
    for x in values[1:]:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total, c

def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total / n - c / n

from numpy import random, float32, float64

for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
        print(">>>", end=" ")
    print(delta_builtin_mean, delta_kahan_mean, delta_compensated_mean)

Tim Hochberg mentions:

Again, my tests show otherwise for float32. I'll condense my ipython log into a
module for everyone's perusal. It's possible that the Kahan summation of the
squared residuals will work better than the current two-pass algorithm and the
implementations I give above.

> This is what my tests show as well: var_knuth outperformed any simple two-pass algorithm I could come up with, even ones using Kahan sums.

David Cooke concludes:

- If you're going to calculate everything in single precision, use Kahan summation. Using it in double precision also helps.

- If you can use a double-precision accumulator, it's much better than any of the techniques in single precision only.

- For speed and precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-)
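As a sketch of the recommended combination (my own illustration, not code from the thread): a two-pass variance where both the mean and the sum of squared residuals use compensated summation. Plain Python floats stand in for a single-precision accumulator:

```python
def kahan_sum(values):
    # Compensated (Kahan) summation
    s = e = 0.0
    for x in values:
        y = x + e
        t = s + y
        e = (s - t) + y
        s = t
    return s

def var_two_pass(a):
    # Pass 1: compensated mean; pass 2: compensated sum of squared residuals
    n = len(a)
    m = kahan_sum(a) / n
    return kahan_sum([(x - m) ** 2 for x in a]) / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var_two_pass(data))  # population variance of this sample is 4.0
```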
