jStat.centralF.pdf(x, df1, df2) function was returning
NaN instead of the proper value for large values of
I decided to fix this bug and journal the process here, as an experiment in self-reflective coding. I’m writing in the present tense, because I am writing this post as I go through the bug fix. I understand if first person present tense is bothersome; apologies to those readers.
First, the bug report helpfully describes the problem and gives two failing cases, with the correct value even! This is quite helpful. I have a degree in statistics and am certainly familiar with the F distribution but haven’t touched this part of the jStat code. I don’t know what to expect. I know there is division in the formula for the probability distribution function of a F random variable; perhaps that’s where the NaN comes from.
This is all a hunch, though. To begin fixing the issue, first I make sure I have the newest copy of the jStat master to work off of. Which means forgetting how to set an upstream branch to track the jstat/jstat project instead of my fork; a little searching (this article was helpful) and I have my local repository working.
I confirm that the existing tests pass. Then I write some tests that fail, which follow in the next code block. In the process, I check to see if there are any existing tests of the
centralF distribution; there aren’t, but in scanning the appropriate file I find that there are some duplicated tests. I go ahead and remove them, and commit.
Given that there are no tests for the
centralF distribution, it is time to add some. I fire up R to get some values to test against, and copy the structure of one of the other test groups. The code I added is below.
1 2 3 4 5 6 7 8 9 10 11 12
The last two values were given to me from the bug report, I know they fail so I include them. The first is a sanity test - there are no tests for this function, so if it fails on the first one I know that it is totally broken. The second tests my theory that it is large denominator degrees of freedom - which correspond to small denominator values in the F pdf calculation - that is causing the problem.
I run the tests and… they pass?
Add a test sure to fail to make sure I’m not insane:
1 2 3 4 5 6 7
Sure enough, I get a
broken message with the offending line. I remove it. I’m still a doubting Thomas; I open the node.js REPL and try to see for myself. I’m not a node expert so it takes a couple tries to get the file
require’d correctly; I’m used to using Ruby’s
reuqire_relative and didn’t realize I could just list a file path in
require(). Sure enough
jStat.centralF.pdf(1, 100, 100) returns
Some logging and further investigation is in order:
1 2 3 4 5 6 7 8 9
The tests pass, and the log statement dutifully shows NaN. Apparently, I was wrong to assume that
assert.epsilon would return
false when comparing a number to a
NaN. A quick
grep through my source directory shows that
assert.epsilon is defined in the vows node module, in the lib/assert/macros.js file.
Here’s the function:
1 2 3 4 5
Playing around in the REPL, I re-discover that
NaN > x returns
false for all
x, so if
NaN is passed, the function can never fail the assertion. This means that manual
NaN checks are going to be required everywhere. There are a few other options; the two logical ones would be to wrap
assert.epsilon in a custom, jStat only function that automates the
NaN check or to monkey-patch the vows library to make
assert.epsilon have the requested behavior.
Since I’m not the project maintainer, I’ll bring it up in the pull request and discuss it with the maintainer, Trevor Norris. As an aside, Trevor is absolutely wonderful - he spent way more time than I deserve helping me learn how to contribute to jStat and FLOSS. For now, I’ll stay on track fixing this function. Here’s a fixed iteration of the test suite, which fails as expected.
(I removed the #1 and #2 test case from above after testing them manually in the REPL; they worked, so the remaining two examples were sufficient.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
At this point, I commit, with a note about the
assert.epsilon problem. Time to actually fix the function, which is shown here:
1 2 3 4 5 6 7 8
Checking against one of my old statistics textbooks, I see that this is totally correct. The problem is numerical. Time to verify:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Sure enough, the
sqrtDenominator are returning
Infinity - not surprising when one calls
The only other idea I have is to see if someone else has a cleverer way to calculate the central F distribution pdf. I know this is pretty common; oftentimes the way that a statistical value is calculated is quite different from the mathematical definition found in a textbook. If there is a clever way to do it, it is a pretty sure bet that R will be using it.
The R project exposes its SVN repository as HTML; I searched it just by typing (“df site:df site:https://svn.r-project.org/R/trunk”) into a search engine, where “df” is the name of the function I was looking for. df.c ends up being the sixth or seventh result, but still a lot easier than navigating the source.
Good news : it appears that Catherine Loader from Bell Labs wrote a super quick implementation that can handle reasonably large numbers, but understanding the implementation is nontrivial. I think that a better understanding of how Catherine Loader’s implementation works is required - time to do some research. In the mean time, I submitted a PR for the double test.
UPDATE: I fixed the bug! Read about the process in Part 2.