Debugging Statistical Methods in JavaScript Part 2: Adventures in Numerical Programming
In a previous post I decided to try and fix a bug in the open source jStat statistical library. This bug is a challenge. Here’s the long and the short of it.
jStat has a function to for the probability distribution function of the central F distribution. The function returns NaN
for large input values of its second parameter. The reason this occurs is because the function as written (shown below) uses the traditional, mathematical definition of the function (see equation / line 3 here), which involves calling Math.pow(df2, df2)
- correct, but this will overflow a JavaScript number and return Infinity
for any input larger than roughly 100.
1 2 3 4 5 6 7 8 |
|
I’ve already generated a failing test, and found that the R project has an implementation that avoids calling any exponentials.
C isn’t one of my strong suits, so I tried to find some more supporting documentation before really trying to dive into the code; the comments however state that the code uses the binomial distribution to calculate the F distribution’s pdf.
JSTOR has an article on “The Relationship Between the Binomial and F Distributions”, but, being JSTOR, I couldn’t get access to it even with my university account. I additionally found a reference to section 6.4 of Numerical Recipes in Fortran 77 which has a recipe to calculate the test statistic of the F distribution, but no pdf calculation. So, let’s analyze the df.c function written by Catherine Loader:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
|
This looks intimidating, especially as a rubyist who generally likes to see six line or less functions. But, we can break it down. First, note that the function takes a give_log
parameter that indicates whether the probabilities were passed as logs. jStat isn’t going to start passing probabilities as logs any time soon, so let’s strip that out. Plus, we’re just trying to understand the way that the calculation works conceptually, so we can ditch the macros that are concerned with how numbers are represented.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
|
I don’t know what all of those C preprocessor definitions for ML_ERR_return_NAN
, ML_POSINF
, R_D__0
, &c. mean, but based on the conditionals they’re attached to, the first half of this function is entirely concerned with checking the bounds of the inputs and handling some special cases. Let’s ditch that. And, while we’re at it, we can replace the less-informative m
and n
with the variables we’ll be using, df1
and df2
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
This is readable even if you don’t know any C at all. If we read the source of dbinom.c we can see that dbinom_raw
just calculates the density of a binomial random variable - something we can do quite nicely in jStat.
First, note that f
gets assigned twice; as far as I can tell there’s no reason to use the same variable for both parts except to save a variable assignment and memory allocation. In fact, the only reason to use the first assignment of f
at all appears to be to save a division operation - something I’m not going to worry about in jStat.
To make sure I understand what is going on, I fool around in the R interface for a little bit to make sure I can reproduce the calculations. Sure enough, this recipe works great for a variety of inputs in R.
Time to translate it to JavaScript. I’m not sure why the f
variable is re-used; might just be a C memory allocation optimization, but I’m going to keep the two variables separate. Additionally, this algorithm depends critically on the validity of the dbinom_raw
function - which is binomial.pdf
in jStat. So, I confirm that there are tests for binomial.pdf
in the existing test suite, then remember that just last night I submitted a PR to remove duplication of those tests. Oh well, they’re there, I feel good about making centralF.pdf
dependent on binomial.pdf
.
The new function in JavaScript, after I tried it once without the jStat.
on the binomial.pdf
calls and got a no method error:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
This makes the tests pass! But, then I decide to add another test case for that pesky df1 < 2
case - so, the test suite gains:
1 2 3 |
|
This fails. There was a mistake; I didn’t calculate f
separately in each of the cases, so I try:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
No dice - looking at df.c again, it appears that if m == 2
something fancy is going on. Without wanting to look up every C macro in the df.c file, I’ll choose to make the new function a wrapper of the old, so that we can keep the working behavior for small values. Hence:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
This makes my zeroth
test pass. But, just to be sure, I add another test case with df1 = 2
, which I’m not showing here, but as it turns out it doesn’t work for df1 = 2
either, so I replace the if(df1 >= 2) {
line with if(df1 > 2)
… and the tests fail.
At this point, I spent a little while, perhaps 20 minutes trying to figure out what was wrong, then eventually decided to log each step, and determined that where I had calculated p
, I entered the formula for q
. The tests passed for my centralF.pdf(1, 100, 100)
case because if df1 === df2
, then p = q
. Here’s the final, correct code that passes all my tests:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
This probably isn’t as clear as it could be, but it works great! I included a comment with a link to the R source for future maintainer’s reference in the actual pull request. Thanks to Catherine Loader of Bell Labs for writing the C code into the R project that made this possible - I tried to find her online to thank her myself, but she seems to have disappeared.