Debugging Statistical Methods in JavaScript Part 2: Adventures in Numerical Programming

javascript

In a previous post I decided to try and fix a bug in the open source jStat statistical library. This bug is a challenge. Here’s the long and the short of it.

jStat has a function to for the probability distribution function of the central F distribution. The function returns NaN for large input values of its second parameter. The reason this occurs is because the function as written (shown below) uses the traditional, mathematical definition of the function (see equation / line 3 here), which involves calling Math.pow(df2, df2) - correct, but this will overflow a JavaScript number and return Infinity for any input larger than roughly 100.

1
2
3
4
5
6
7
8
pdf: function pdf(x, df1, df2) {
  if (x < 0)
    return undefined;
  return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));

},

I’ve already generated a failing test, and found that the R project has an implementation that avoids calling any exponentials.

C isn’t one of my strong suits, so I tried to find some more supporting documentation before really trying to dive into the code; the comments however state that the code uses the binomial distribution to calculate the F distribution’s pdf.

JSTOR has an article on “The Relationship Between the Binomial and F Distributions”, but, being JSTOR, I couldn’t get access to it even with my university account. I additionally found a reference to section 6.4 of Numerical Recipes in Fortran 77 which has a recipe to calculate the test statistic of the F distribution, but no pdf calculation. So, let’s analyze the df.c function written by Catherine Loader:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
double df(double x, double m, double n, int give_log)
{
    double p, q, f, dens;

#ifdef IEEE_754
    if (ISNAN(x) || ISNAN(m) || ISNAN(n))
  return x + m + n;
#endif
    if (m <= 0 || n <= 0) ML_ERR_return_NAN;
    if (x < 0.)  return(R_D__0);
    if (x == 0.) return(m > 2 ? R_D__0 : (m == 2 ? R_D__1 : ML_POSINF));
    if (!R_FINITE(m) && !R_FINITE(n)) { /* both +Inf */
  if(x == 1.) return ML_POSINF;
  /* else */  return R_D__0;
    }
    if (!R_FINITE(n)) /* must be +Inf by now */
  return(dgamma(x, m/2, 2./m, give_log));
    if (m > 1e14) {/* includes +Inf: code below is inaccurate there */
  dens = dgamma(1./x, n/2, 2./n, give_log);
  return give_log ? dens - 2*log(x): dens/(x*x);
    }

    f = 1./(n+x*m);
    q = n*f;
    p = x*m*f;

    if (m >= 2) {
  f = m*q/2;
  dens = dbinom_raw((m-2)/2, (m+n-2)/2, p, q, give_log);
    }
    else {
  f = m*m*q / (2*p*(m+n));
  dens = dbinom_raw(m/2, (m+n)/2, p, q, give_log);
    }
    return(give_log ? log(f)+dens : f*dens);
}

This looks intimidating, especially as a rubyist who generally likes to see six line or less functions. But, we can break it down. First, note that the function takes a give_log parameter that indicates whether the probabilities were passed as logs. jStat isn’t going to start passing probabilities as logs any time soon, so let’s strip that out. Plus, we’re just trying to understand the way that the calculation works conceptually, so we can ditch the macros that are concerned with how numbers are represented.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
double df(double x, double m, double n)
{
    double p, q, f, dens;

    if (m <= 0 || n <= 0) ML_ERR_return_NAN;
    if (x < 0.)  return(R_D__0);
    if (x == 0.) return(m > 2 ? R_D__0 : (m == 2 ? R_D__1 : ML_POSINF));
    if (!R_FINITE(m) && !R_FINITE(n)) { /* both +Inf */
      if(x == 1.) return ML_POSINF;
      /* else */  return R_D__0;
    }
    if (!R_FINITE(n)) /* must be +Inf by now */
      return(dgamma(x, m/2, 2./m, 0));
    if (m > 1e14) {/* includes +Inf: code below is inaccurate there */
      dens = dgamma(1./x, n/2, 2./n, 0);
      return dens/(x*x);
    }

    f = 1./(n+x*m);
    q = n*f;
    p = x*m*f;

    if (m >= 2) {
      f = m*q/2;
      dens = dbinom_raw((m-2)/2, (m+n-2)/2, p, q, 0);
    }
    else {
      f = m*m*q / (2*p*(m+n));
      dens = dbinom_raw(m/2, (m+n)/2, p, q, 0);
    }
    return f*dens;
}

I don’t know what all of those C preprocessor definitions for ML_ERR_return_NAN, ML_POSINF, R_D__0, &c. mean, but based on the conditionals they’re attached to, the first half of this function is entirely concerned with checking the bounds of the inputs and handling some special cases. Let’s ditch that. And, while we’re at it, we can replace the less-informative m and n with the variables we’ll be using, df1 and df2.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
double df(double x, double df1, double df2)
{
    double p, q, f, dens;

    f = 1./(df2+x*df1);
    q = df2*f;
    p = x*df1*f;

    if (df1 >= 2) {
      f = df1*q/2;
      dens = dbinom_raw((df1-2)/2, (df1+df2-2)/2, p, q, 0);
    }
    else {
      f = df1*df1*q / (2*p*(df1+df2));
      dens = dbinom_raw(df1/2, (df1+df2)/2, p, q, 0);
    }
    return f*dens;
}

This is readable even if you don’t know any C at all. If we read the source of dbinom.c we can see that dbinom_raw just calculates the density of a binomial random variable - something we can do quite nicely in jStat.

First, note that f gets assigned twice; as far as I can tell there’s no reason to use the same variable for both parts except to save a variable assignment and memory allocation. In fact, the only reason to use the first assignment of f at all appears to be to save a division operation - something I’m not going to worry about in jStat.

To make sure I understand what is going on, I fool around in the R interface for a little bit to make sure I can reproduce the calculations. Sure enough, this recipe works great for a variety of inputs in R.

Time to translate it to JavaScript. I’m not sure why the f variable is re-used; might just be a C memory allocation optimization, but I’m going to keep the two variables separate. Additionally, this algorithm depends critically on the validity of the dbinom_raw function - which is binomial.pdf in jStat. So, I confirm that there are tests for binomial.pdf in the existing test suite, then remember that just last night I submitted a PR to remove duplication of those tests. Oh well, they’re there, I feel good about making centralF.pdf dependent on binomial.pdf.

The new function in JavaScript, after I tried it once without the jStat. on the binomial.pdf calls and got a no method error:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
pdf: function pdf(x, df1, df2) {
  if (x < 0) {
    return undefined;
  }

  var p = df2 / (df2 + x * df1);
  var f = df1 * (1 - p) / 2;
  if(df1 >= 2) {
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return f * jStat.binomial.pdf( df1 / 2, (df1 + df2) / 2, p);
  }

},

This makes the tests pass! But, then I decide to add another test case for that pesky df1 < 2 case - so, the test suite gains:

1
2
3
  var zeroth = jStat.centralF.pdf(0.2, 1, 3);
  assert.isNumber(zeroth);
  assert.epsilon(tol, zeroth, 0.722349);

This fails. There was a mistake; I didn’t calculate f separately in each of the cases, so I try:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  var p = df2 / (df2 + x * df1);

  if(df1 >= 2) {
    var f = df1 * (1 - p) / 2;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    var f = df1 * df1 * (1 - p) / (2 * p * (df1 + df2));
    return f * jStat.binomial.pdf( df1 / 2, (df1 + df2) / 2, p);
  }
}

No dice - looking at df.c again, it appears that if m == 2 something fancy is going on. Without wanting to look up every C macro in the df.c file, I’ll choose to make the new function a wrapper of the old, so that we can keep the working behavior for small values. Hence:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  if(df1 >= 2) {
    var p = df2 / (df2 + (x * df1));
    var f = df1 * (1 - p) / 2;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));
  }
}

This makes my zeroth test pass. But, just to be sure, I add another test case with df1 = 2, which I’m not showing here, but as it turns out it doesn’t work for df1 = 2 either, so I replace the if(df1 >= 2) { line with if(df1 > 2)… and the tests fail.

At this point, I spent a little while, perhaps 20 minutes trying to figure out what was wrong, then eventually decided to log each step, and determined that where I had calculated p, I entered the formula for q. The tests passed for my centralF.pdf(1, 100, 100) case because if df1 === df2, then p = q. Here’s the final, correct code that passes all my tests:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  if(df1 > 2) {
    var p = (df1 * x) / (df2 + x * df1);
    var q = df2 / (df2 + x * df1);
    var f = df1 * q / 2.0;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));
  }
},

This probably isn’t as clear as it could be, but it works great! I included a comment with a link to the R source for future maintainer’s reference in the actual pull request. Thanks to Catherine Loader of Bell Labs for writing the C code into the R project that made this possible - I tried to find her online to thank her myself, but she seems to have disappeared.


I'm looking for better ways to build software businesses. Find out if I find something.