Debugging Statistical Methods in JavaScript

javascript

I occasionally contribute to the jStat JavaScript library, which aims to provide a decent suite of statistical functions in native JavaScript. A recent bug report (helpfully filed by akrawitz - thanks!) reported that the jStat.centralF.pdf(x, df1, df2) function was returning NaN instead of the proper value for large values of df1 and df2.

I decided to fix this bug and journal the process here, as an experiment in self-reflective coding. I’m writing in the present tense, because I am writing this post as I go through the bug fix. I understand if first person present tense is bothersome; apologies to those readers.

First, the bug report helpfully describes the problem and gives two failing cases, with the correct value even! This is quite helpful. I have a degree in statistics and am certainly familiar with the F distribution but haven’t touched this part of the jStat code. I don’t know what to expect. I know there is division in the formula for the probability distribution function of a F random variable; perhaps that’s where the NaN comes from.

This is all a hunch, though. To begin fixing the issue, first I make sure I have the newest copy of the jStat master to work off of. Which means forgetting how to set an upstream branch to track the jstat/jstat project instead of my fork; a little searching (this article was helpful) and I have my local repository working.

I confirm that the existing tests pass. Then I write some tests that fail, which follow in the next code block. In the process, I check to see if there are any existing tests of the centralF distribution; there aren’t, but in scanning the appropriate file I find that there are some duplicated tests. I go ahead and remove them, and commit.

Given that there are no tests for the centralF distribution, it is time to add some. I fire up R to get some values to test against, and copy the structure of one of the other test groups. The code I added is below.

1
2
3
4
5
6
7
8
9
10
11
12
  'central F distribution': {
    'topic': function() {
      return jStat;
    },
    'check pdf calculation': function(jStat) {
      var tol = 0.0000001;
      assert.epsilon(tol, jStat.centralF.pdf(1, 5, 5), 0.4244132);
      assert.epsilon(tol, jStat.centralF.pdf(1, 100, 5), 0.5954544);
      assert.epsilon(tol, jStat.centralF.pdf(1, 100, 100), 1.989731);
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
    }
  }

The last two values were given to me from the bug report, I know they fail so I include them. The first is a sanity test - there are no tests for this function, so if it fails on the first one I know that it is totally broken. The second tests my theory that it is large denominator degrees of freedom - which correspond to small denominator values in the F pdf calculation - that is causing the problem.

I run the tests and… they pass?

Add a test sure to fail to make sure I’m not insane:

1
2
3
4
5
6
7
  /* ... */
    'check pdf calculation': function(jStat) {
      /* assertions ... */
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
      assert.epsilon(tol, 1, 10000000);
    }
  }

Sure enough, I get a broken message with the offending line. I remove it. I’m still a doubting Thomas; I open the node.js REPL and try to see for myself. I’m not a node expert so it takes a couple tries to get the file require’d correctly; I’m used to using Ruby’s reuqire_relative and didn’t realize I could just list a file path in require(). Sure enough jStat.centralF.pdf(1, 100, 100) returns NaN.

Some logging and further investigation is in order:

1
2
3
4
5
6
7
8
9
  /* ... */
    'check pdf calculation': function(jStat) {
      /* assertions ... */
      console.log(jStat.centralF.pdf(1, 4, 200));
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
      assert.epsilon(tol, NaN, 1);
      assert.epsilon(tol, 1, NaN);
    }
  }

The tests pass, and the log statement dutifully shows NaN. Apparently, I was wrong to assume that assert.epsilon would return false when comparing a number to a NaN. A quick grep through my source directory shows that assert.epsilon is defined in the vows node module, in the lib/assert/macros.js file.

Here’s the function:

1
2
3
4
5
assert.epsilon = function (eps, actual, expected, message) {
    if (Math.abs(actual - expected) > eps) {
        assert.fail(actual, expected, message || "expected {expected} \u00B1"+ eps +", but was {actual}");
    }
};

Playing around in the REPL, I re-discover that NaN > x returns false for all x, so if NaN is passed, the function can never fail the assertion. This means that manual NaN checks are going to be required everywhere. There are a few other options; the two logical ones would be to wrap assert.epsilon in a custom, jStat only function that automates the NaN check or to monkey-patch the vows library to make assert.epsilon have the requested behavior.

Since I’m not the project maintainer, I’ll bring it up in the pull request and discuss it with the maintainer, Trevor Norris. As an aside, Trevor is absolutely wonderful - he spent way more time than I deserve helping me learn how to contribute to jStat and FLOSS. For now, I’ll stay on track fixing this function. Here’s a fixed iteration of the test suite, which fails as expected.

(I removed the #1 and #2 test case from above after testing them manually in the REPL; they worked, so the remaining two examples were sufficient.)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  'central F distribution': {
    'topic': function() {
      return jStat;
    },
    'check pdf calculation': function(jStat) {
      var tol = 0.0000001;

      var first = jStat.centralF.pdf(1, 100, 100);
      assert.isNumber(first);
      assert.epsilon(tol, first, 1.989731);

      var second = jStat.centralF.pdf(1, 100, 100);
      assert.isNumber(second);
      assert.epsilon(tol, second, 1.989731);
    }
  },

At this point, I commit, with a note about the assert.epsilon problem. Time to actually fix the function, which is shown here:

1
2
3
4
5
6
7
8
  pdf: function pdf(x, df1, df2) {
    if (x < 0)
      return undefined;
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                     (Math.pow(df1 * x + df2, df1 + df2))) /
                     (x * jStat.betafn(df1/2, df2/2));

  },

Checking against one of my old statistics textbooks, I see that this is totally correct. The problem is numerical. Time to verify:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
  pdf: function pdf(x, df1, df2) {
    if (x < 0)
      return undefined;
    var sqrtNumerator = Math.pow(df1 * x, df1) * Math.pow(df2, df2);
    var sqrtDenominator = Math.pow(df1 * x + df2, df1 + df2);
    var denominator = x * jStat.betafn(df1 / 2, df2 / 2);
    var numerator = Math.sqrt(sqrtNumerator / sqrtDenominator);
    var result = numerator / denominator;

    console.log(sqrtNumerator);
    console.log(sqrtDenominator);
    console.log(denominator);
    console.log(numerator);
    console.log(result);

    return result;

  },

Sure enough, the sqrtNumerator and sqrtDenominator are returning Infinity - not surprising when one calls Math.pow(df2, df2)!

How to fix this is another issue. There isn’t a clear option in JavaScript to just declare the numerator or denominator as a bigger type of number, as all numbers are represented the same way. There are some big number libraries around, but adding a dependency like that to the project isn’t a decision to be made lightly, and evaluating big number libraries is certainly outside the scope of this blog post.

The only other idea I have is to see if someone else has a cleverer way to calculate the central F distribution pdf. I know this is pretty common; oftentimes the way that a statistical value is calculated is quite different from the mathematical definition found in a textbook. If there is a clever way to do it, it is a pretty sure bet that R will be using it.

The R project exposes its SVN repository as HTML; I searched it just by typing (“df site:df site:https://svn.r-project.org/R/trunk”) into a search engine, where “df” is the name of the function I was looking for. df.c ends up being the sixth or seventh result, but still a lot easier than navigating the source.

Good news : it appears that Catherine Loader from Bell Labs wrote a super quick implementation that can handle reasonably large numbers, but understanding the implementation is nontrivial. I think that a better understanding of how Catherine Loader’s implementation works is required - time to do some research. In the mean time, I submitted a PR for the double test.

UPDATE: I fixed the bug! Read about the process in Part 2.


I'm looking for better ways to build software businesses. Find out if I find something.