SecondForge

Bloom's Taxonomy for Programmers

Oct 31st, 2014 2:00 am programming

Last night’s (10/29/2014) Code Newbie twitter chat focused on the topic of teaching (and hence, learning) technical concepts. It was, as always, a great way to spend an hour. Saron Yitbarek (@saronyitbarek) always does a great job of hosting. If you are a new or novice programmer, or once were a new or novice programmer, it’s worth checking out.

After the chat, I recalled Bloom’s Taxonomy from my days in high school. Bloom’s Taxonomy is a theoretical construct that is used to “distinguish the fundamental questions in the education system”. That is, Bloom’s Taxonomy breaks down the different ways of learning, and what they mean. Here are the types of understanding Bloom identified (loosely paraphrased from Wikipedia):

Remember: Knowing specific details or facts, knowing vocabulary words. Knowledge of classifications, conventions.
Comprehension: Demonstrating understanding of the facts by organizing them, comparing, translating, &c.
Application: Use acquired knowledge (remembering & comprehending) to apply facts in previously-unseen situations.
Analyze: Break new information into parts by identifying motives or causes; make inferences.
Synthesis: Compile disparate information together in a new way; propose alternative patterns or solutions.
Evaluate: (Last time I looked at the taxonomy, this one was called “Judgement”) Present and defend opinions by making judgement on validity, or quality of ideas.

Some examples as applied to programming:

Remember: Remembering syntax rules and function names.
Comprehension: Understanding the setup of your core library and its inheritance trees.
Application: Writing lines of codes, or methods.
Analyze: Break down existing software to determine how it is organized.
Synthesis: Put together disparate libraries into a new pieces of software.
Evaluate: Argue about TDD on Twitter.

The more recent versions of the taxonomy put the ways of knowing into four tiers: remembering is most fundamental, then comprehension, then application, and then Analysis, Synthesis, and Evaluation are all in the same tier as sibling concepts. This has changed from what I remember in elementary school in the early 2000’s: the version of the taxonomy that I remember seeing was a strict pyramid, with the tiers in reverse order of what I listed above (Evaluation as the ‘highest’; remembering as the ‘lowest’).

But the subtle changes that education researchers make aren’t relevant to the discussion here. How can programmers guide their learning by understanding Bloom’s Taxonomy?

The act of programming, line by line, mostly occurs at the level of application. However, writing software in generally - being a ‘software developer’ or ‘software engineer’ - mostly occurs at the level of Synthesis and Analysis, while arguments on twitter about testing strategies and design principles (in theory) occur at the level of evaluation. Everything before that - remembering, comprehending, and applying - is there to build up to analysis or synthesis.

This is an important to understand. Learning to program the first time, in addition to learning each new tool, library, method, strategy, pattern, and so on - has to follow through this hierarchy. You must remember the words and the names, so that you can understand how each one works. Then you must be able to apply each piece of the puzzle - for loops, control structures, and so on. Finally, when you get to that point, you can begin to analyze existing software, then synthesize your own new software. Only then can you really start evaluating practices - i.e, arguing about them on twitter.

Hopefully, thinking about your learning process in this context will enable you to overcome obstacles more quickly, and turbo-charge your learning.

Happy Coding!

Installing Clojure With Leningen on Ubuntu 14.04 (October 2014)

Oct 30th, 2014 10:56 am

This post is as much reference for myself as for anybody else. Installing Clojure and Leningen has a tiny bit of subtlety to it, so I am documenting the process here.

First, download leningen, copy it to /bin/, and make it executable:

$ wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
$ sudo cp lein /bin/lein
$ sudo chmod a+x /bin/lein

You don’t actually have to install to /bin/lein, but /bin/ is pretty much guaranteed to be on your PATH.

Now, the Leningen website says to “Run it (lein) and it will download the self-install package”. When I ran lein, I saw a list of available tasks, and there was no indication that anything had been downloaded. I think this instruction might be slightly out of date.

Instead, just go ahead and open a REPL; Leningen will download a clojure package at that time:

$ lein repl
Downloading: org/clojure/clojure/1.3.0/....

This worked great.

If you want to create an actual project, just run lein new project_name; lein won’t install anything when the project is created, but will download the needed dependencies if and when you run lein install.

You’ll still need to add code to your project before you can run anything useful with lein run and add some tests before you can test with lein test, but you’ll then be good to go!

Happy Coding!

Debugging Statistical Methods in JavaScript Part 2: Adventures in Numerical Programming

Oct 29th, 2014 2:45 am javascript

In a previous post I decided to try and fix a bug in the open source jStat statistical library. This bug is a challenge. Here’s the long and the short of it.

jStat has a function to for the probability distribution function of the central F distribution. The function returns NaN for large input values of its second parameter. The reason this occurs is because the function as written (shown below) uses the traditional, mathematical definition of the function (see equation / line 3 here), which involves calling Math.pow(df2, df2) - correct, but this will overflow a JavaScript number and return Infinity for any input larger than roughly 100.

pdf: function pdf(x, df1, df2) {
  if (x < 0)
    return undefined;
  return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));

},

I’ve already generated a failing test, and found that the R project has an implementation that avoids calling any exponentials.

C isn’t one of my strong suits, so I tried to find some more supporting documentation before really trying to dive into the code; the comments however state that the code uses the binomial distribution to calculate the F distribution’s pdf.

JSTOR has an article on “The Relationship Between the Binomial and F Distributions”, but, being JSTOR, I couldn’t get access to it even with my university account. I additionally found a reference to section 6.4 of Numerical Recipes in Fortran 77 which has a recipe to calculate the test statistic of the F distribution, but no pdf calculation. So, let’s analyze the df.c function written by Catherine Loader:

double df(double x, double m, double n, int give_log)
{
    double p, q, f, dens;

#ifdef IEEE_754
    if (ISNAN(x) || ISNAN(m) || ISNAN(n))
  return x + m + n;
#endif
    if (m <= 0 || n <= 0) ML_ERR_return_NAN;
    if (x < 0.)  return(R_D__0);
    if (x == 0.) return(m > 2 ? R_D__0 : (m == 2 ? R_D__1 : ML_POSINF));
    if (!R_FINITE(m) && !R_FINITE(n)) { /* both +Inf */
  if(x == 1.) return ML_POSINF;
  /* else */  return R_D__0;
    }
    if (!R_FINITE(n)) /* must be +Inf by now */
  return(dgamma(x, m/2, 2./m, give_log));
    if (m > 1e14) {/* includes +Inf: code below is inaccurate there */
  dens = dgamma(1./x, n/2, 2./n, give_log);
  return give_log ? dens - 2*log(x): dens/(x*x);
    }

    f = 1./(n+x*m);
    q = n*f;
    p = x*m*f;

    if (m >= 2) {
  f = m*q/2;
  dens = dbinom_raw((m-2)/2, (m+n-2)/2, p, q, give_log);
    }
    else {
  f = m*m*q / (2*p*(m+n));
  dens = dbinom_raw(m/2, (m+n)/2, p, q, give_log);
    }
    return(give_log ? log(f)+dens : f*dens);
}

This looks intimidating, especially as a rubyist who generally likes to see six line or less functions. But, we can break it down. First, note that the function takes a give_log parameter that indicates whether the probabilities were passed as logs. jStat isn’t going to start passing probabilities as logs any time soon, so let’s strip that out. Plus, we’re just trying to understand the way that the calculation works conceptually, so we can ditch the macros that are concerned with how numbers are represented.

double df(double x, double m, double n)
{
    double p, q, f, dens;

    if (m <= 0 || n <= 0) ML_ERR_return_NAN;
    if (x < 0.)  return(R_D__0);
    if (x == 0.) return(m > 2 ? R_D__0 : (m == 2 ? R_D__1 : ML_POSINF));
    if (!R_FINITE(m) && !R_FINITE(n)) { /* both +Inf */
      if(x == 1.) return ML_POSINF;
      /* else */  return R_D__0;
    }
    if (!R_FINITE(n)) /* must be +Inf by now */
      return(dgamma(x, m/2, 2./m, 0));
    if (m > 1e14) {/* includes +Inf: code below is inaccurate there */
      dens = dgamma(1./x, n/2, 2./n, 0);
      return dens/(x*x);
    }

    f = 1./(n+x*m);
    q = n*f;
    p = x*m*f;

    if (m >= 2) {
      f = m*q/2;
      dens = dbinom_raw((m-2)/2, (m+n-2)/2, p, q, 0);
    }
    else {
      f = m*m*q / (2*p*(m+n));
      dens = dbinom_raw(m/2, (m+n)/2, p, q, 0);
    }
    return f*dens;
}

I don’t know what all of those C preprocessor definitions for ML_ERR_return_NAN, ML_POSINF, R_D__0, &c. mean, but based on the conditionals they’re attached to, the first half of this function is entirely concerned with checking the bounds of the inputs and handling some special cases. Let’s ditch that. And, while we’re at it, we can replace the less-informative m and n with the variables we’ll be using, df1 and df2.

double df(double x, double df1, double df2)
{
    double p, q, f, dens;

    f = 1./(df2+x*df1);
    q = df2*f;
    p = x*df1*f;

    if (df1 >= 2) {
      f = df1*q/2;
      dens = dbinom_raw((df1-2)/2, (df1+df2-2)/2, p, q, 0);
    }
    else {
      f = df1*df1*q / (2*p*(df1+df2));
      dens = dbinom_raw(df1/2, (df1+df2)/2, p, q, 0);
    }
    return f*dens;
}

This is readable even if you don’t know any C at all. If we read the source of dbinom.c we can see that dbinom_raw just calculates the density of a binomial random variable - something we can do quite nicely in jStat.

First, note that f gets assigned twice; as far as I can tell there’s no reason to use the same variable for both parts except to save a variable assignment and memory allocation. In fact, the only reason to use the first assignment of f at all appears to be to save a division operation - something I’m not going to worry about in jStat.

To make sure I understand what is going on, I fool around in the R interface for a little bit to make sure I can reproduce the calculations. Sure enough, this recipe works great for a variety of inputs in R.

Time to translate it to JavaScript. I’m not sure why the f variable is re-used; might just be a C memory allocation optimization, but I’m going to keep the two variables separate. Additionally, this algorithm depends critically on the validity of the dbinom_raw function - which is binomial.pdf in jStat. So, I confirm that there are tests for binomial.pdf in the existing test suite, then remember that just last night I submitted a PR to remove duplication of those tests. Oh well, they’re there, I feel good about making centralF.pdf dependent on binomial.pdf.

The new function in JavaScript, after I tried it once without the jStat. on the binomial.pdf calls and got a no method error:

pdf: function pdf(x, df1, df2) {
  if (x < 0) {
    return undefined;
  }

  var p = df2 / (df2 + x * df1);
  var f = df1 * (1 - p) / 2;
  if(df1 >= 2) {
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return f * jStat.binomial.pdf( df1 / 2, (df1 + df2) / 2, p);
  }

},

This makes the tests pass! But, then I decide to add another test case for that pesky df1 < 2 case - so, the test suite gains:

  var zeroth = jStat.centralF.pdf(0.2, 1, 3);
  assert.isNumber(zeroth);
  assert.epsilon(tol, zeroth, 0.722349);

This fails. There was a mistake; I didn’t calculate f separately in each of the cases, so I try:

pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  var p = df2 / (df2 + x * df1);

  if(df1 >= 2) {
    var f = df1 * (1 - p) / 2;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    var f = df1 * df1 * (1 - p) / (2 * p * (df1 + df2));
    return f * jStat.binomial.pdf( df1 / 2, (df1 + df2) / 2, p);
  }
}

No dice - looking at df.c again, it appears that if m == 2 something fancy is going on. Without wanting to look up every C macro in the df.c file, I’ll choose to make the new function a wrapper of the old, so that we can keep the working behavior for small values. Hence:

pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  if(df1 >= 2) {
    var p = df2 / (df2 + (x * df1));
    var f = df1 * (1 - p) / 2;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));
  }
}

This makes my zeroth test pass. But, just to be sure, I add another test case with df1 = 2, which I’m not showing here, but as it turns out it doesn’t work for df1 = 2 either, so I replace the if(df1 >= 2) { line with if(df1 > 2)… and the tests fail.

At this point, I spent a little while, perhaps 20 minutes trying to figure out what was wrong, then eventually decided to log each step, and determined that where I had calculated p, I entered the formula for q. The tests passed for my centralF.pdf(1, 100, 100) case because if df1 === df2, then p = q. Here’s the final, correct code that passes all my tests:

pdf: function pdf(x, df1, df2) {
 if (x < 0) {
    return undefined;
  }

  if(df1 > 2) {
    var p = (df1 * x) / (df2 + x * df1);
    var q = df2 / (df2 + x * df1);
    var f = df1 * q / 2.0;
    return f * jStat.binomial.pdf( (df1 - 2) / 2, (df1 + df2 - 2) / 2, p);
  } else {
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                   (Math.pow(df1 * x + df2, df1 + df2))) /
                   (x * jStat.betafn(df1/2, df2/2));
  }
},

This probably isn’t as clear as it could be, but it works great! I included a comment with a link to the R source for future maintainer’s reference in the actual pull request. Thanks to Catherine Loader of Bell Labs for writing the C code into the R project that made this possible - I tried to find her online to thank her myself, but she seems to have disappeared.

Debugging Statistical Methods in JavaScript

Oct 28th, 2014 12:06 am javascript

I occasionally contribute to the jStat JavaScript library, which aims to provide a decent suite of statistical functions in native JavaScript. A recent bug report (helpfully filed by akrawitz - thanks!) reported that the jStat.centralF.pdf(x, df1, df2) function was returning NaN instead of the proper value for large values of df1 and df2.

I decided to fix this bug and journal the process here, as an experiment in self-reflective coding. I’m writing in the present tense, because I am writing this post as I go through the bug fix. I understand if first person present tense is bothersome; apologies to those readers.

First, the bug report helpfully describes the problem and gives two failing cases, with the correct value even! This is quite helpful. I have a degree in statistics and am certainly familiar with the F distribution but haven’t touched this part of the jStat code. I don’t know what to expect. I know there is division in the formula for the probability distribution function of a F random variable; perhaps that’s where the NaN comes from.

This is all a hunch, though. To begin fixing the issue, first I make sure I have the newest copy of the jStat master to work off of. Which means forgetting how to set an upstream branch to track the jstat/jstat project instead of my fork; a little searching (this article was helpful) and I have my local repository working.

I confirm that the existing tests pass. Then I write some tests that fail, which follow in the next code block. In the process, I check to see if there are any existing tests of the centralF distribution; there aren’t, but in scanning the appropriate file I find that there are some duplicated tests. I go ahead and remove them, and commit.

Given that there are no tests for the centralF distribution, it is time to add some. I fire up R to get some values to test against, and copy the structure of one of the other test groups. The code I added is below.

  'central F distribution': {
    'topic': function() {
      return jStat;
    },
    'check pdf calculation': function(jStat) {
      var tol = 0.0000001;
      assert.epsilon(tol, jStat.centralF.pdf(1, 5, 5), 0.4244132);
      assert.epsilon(tol, jStat.centralF.pdf(1, 100, 5), 0.5954544);
      assert.epsilon(tol, jStat.centralF.pdf(1, 100, 100), 1.989731);
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
    }
  }

The last two values were given to me from the bug report, I know they fail so I include them. The first is a sanity test - there are no tests for this function, so if it fails on the first one I know that it is totally broken. The second tests my theory that it is large denominator degrees of freedom - which correspond to small denominator values in the F pdf calculation - that is causing the problem.

I run the tests and… they pass?

Add a test sure to fail to make sure I’m not insane:

  /* ... */
    'check pdf calculation': function(jStat) {
      /* assertions ... */
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
      assert.epsilon(tol, 1, 10000000);
    }
  }

Sure enough, I get a broken message with the offending line. I remove it. I’m still a doubting Thomas; I open the node.js REPL and try to see for myself. I’m not a node expert so it takes a couple tries to get the file require’d correctly; I’m used to using Ruby’s reuqire_relative and didn’t realize I could just list a file path in require(). Sure enough jStat.centralF.pdf(1, 100, 100) returns NaN.

Some logging and further investigation is in order:

  /* ... */
    'check pdf calculation': function(jStat) {
      /* assertions ... */
      console.log(jStat.centralF.pdf(1, 4, 200));
      assert.epsilon(tol, jStat.centralF.pdf(1, 4, 200), 0.5359988);
      assert.epsilon(tol, NaN, 1);
      assert.epsilon(tol, 1, NaN);
    }
  }

The tests pass, and the log statement dutifully shows NaN. Apparently, I was wrong to assume that assert.epsilon would return false when comparing a number to a NaN. A quick grep through my source directory shows that assert.epsilon is defined in the vows node module, in the lib/assert/macros.js file.

Here’s the function:

assert.epsilon = function (eps, actual, expected, message) {
    if (Math.abs(actual - expected) > eps) {
        assert.fail(actual, expected, message || "expected {expected} \u00B1"+ eps +", but was {actual}");
    }
};

Playing around in the REPL, I re-discover that NaN > x returns false for all x, so if NaN is passed, the function can never fail the assertion. This means that manual NaN checks are going to be required everywhere. There are a few other options; the two logical ones would be to wrap assert.epsilon in a custom, jStat only function that automates the NaN check or to monkey-patch the vows library to make assert.epsilon have the requested behavior.

Since I’m not the project maintainer, I’ll bring it up in the pull request and discuss it with the maintainer, Trevor Norris. As an aside, Trevor is absolutely wonderful - he spent way more time than I deserve helping me learn how to contribute to jStat and FLOSS. For now, I’ll stay on track fixing this function. Here’s a fixed iteration of the test suite, which fails as expected.

(I removed the #1 and #2 test case from above after testing them manually in the REPL; they worked, so the remaining two examples were sufficient.)

  'central F distribution': {
    'topic': function() {
      return jStat;
    },
    'check pdf calculation': function(jStat) {
      var tol = 0.0000001;

      var first = jStat.centralF.pdf(1, 100, 100);
      assert.isNumber(first);
      assert.epsilon(tol, first, 1.989731);

      var second = jStat.centralF.pdf(1, 100, 100);
      assert.isNumber(second);
      assert.epsilon(tol, second, 1.989731);
    }
  },

At this point, I commit, with a note about the assert.epsilon problem. Time to actually fix the function, which is shown here:

  pdf: function pdf(x, df1, df2) {
    if (x < 0)
      return undefined;
    return Math.sqrt((Math.pow(df1 * x, df1) * Math.pow(df2, df2)) /
                     (Math.pow(df1 * x + df2, df1 + df2))) /
                     (x * jStat.betafn(df1/2, df2/2));

  },

Checking against one of my old statistics textbooks, I see that this is totally correct. The problem is numerical. Time to verify:

  pdf: function pdf(x, df1, df2) {
    if (x < 0)
      return undefined;
    var sqrtNumerator = Math.pow(df1 * x, df1) * Math.pow(df2, df2);
    var sqrtDenominator = Math.pow(df1 * x + df2, df1 + df2);
    var denominator = x * jStat.betafn(df1 / 2, df2 / 2);
    var numerator = Math.sqrt(sqrtNumerator / sqrtDenominator);
    var result = numerator / denominator;

    console.log(sqrtNumerator);
    console.log(sqrtDenominator);
    console.log(denominator);
    console.log(numerator);
    console.log(result);

    return result;

  },

Sure enough, the sqrtNumerator and sqrtDenominator are returning Infinity - not surprising when one calls Math.pow(df2, df2)!

How to fix this is another issue. There isn’t a clear option in JavaScript to just declare the numerator or denominator as a bigger type of number, as all numbers are represented the same way. There are some big number libraries around, but adding a dependency like that to the project isn’t a decision to be made lightly, and evaluating big number libraries is certainly outside the scope of this blog post.

The only other idea I have is to see if someone else has a cleverer way to calculate the central F distribution pdf. I know this is pretty common; oftentimes the way that a statistical value is calculated is quite different from the mathematical definition found in a textbook. If there is a clever way to do it, it is a pretty sure bet that R will be using it.

The R project exposes its SVN repository as HTML; I searched it just by typing (“df site:df site:https://svn.r-project.org/R/trunk”) into a search engine, where “df” is the name of the function I was looking for. df.c ends up being the sixth or seventh result, but still a lot easier than navigating the source.

Good news : it appears that Catherine Loader from Bell Labs wrote a super quick implementation that can handle reasonably large numbers, but understanding the implementation is nontrivial. I think that a better understanding of how Catherine Loader’s implementation works is required - time to do some research. In the mean time, I submitted a PR for the double test.

UPDATE: I fixed the bug! Read about the process in Part 2.

Grouping Items for Display in Ember

Oct 27th, 2014 3:41 pm ember

I’m working on an Ember.js project that requires grouping a list of objects by one of their properties for display. I’ve got one model - call it a plan - which is joined to a bunch of actions by a join model. Each action has a year attribute defined on it.

When a user views the plan, I’d like to make a view that shows all of the actions, grouped by year. This would be trivial if the actions belongedTo some sort of year model, which would belongTo the plan - in that case, something along the following lines in the Handlebars template would work great:

template

  <h1>  </h1>
  
    <!-- some display logic for the action -->

However, we can’t do that, for a variety of back end / performance reasons - the two-tiered fetch from the REST API becomes a problem.

Here’s what I originally wanted to do. In the controller:

controller

groupedActions: function() {
  //First we need to get an array of the years
  var years = this.get('actions').map(function(act) {
    return act.get('year');
  }).uniq();

  var grouped = years.forEach(function(year) {
    var filteredActions = this.get('actions').filter(function(action) {

    });

    return [year, filteredActions];
  });
}.property('actions.@each')

Then, in the view, we could display the grouped records as so:

template

  <h1>  </h1>
  
    <!-- some display logic for the action -->

Ember.js won’t let you do this, which might be frustrating for those of us who are used to less pervasive frameworks. There is a simple solution, though: just replace the array with a non-persisted Ember.Object!

The following solution works. In the controller:

controller

groupedActions: function() {
  //First we need to get an array of the years
  var years = this.get('actions').map(function(act) {
    return act.get('year');
  }).uniq();

  var grouped = years.forEach(function(year) {
    var filteredActions = this.get('actions').filter(function(action) {
      return action.year === year;
    });

    //Todo
    return Ember.Object.create({
      year: year,
      actions: filteredActions
    });
  });
}.property('actions.@each')

This, when used with the template code shown below, works splendidly.

template

  <h1>  </h1>
  
    <!-- some display logic for the action -->

The moral of the story is thus: remember that when workin with Ember.js objects that are to be displayed in Handlebars templates, your life will be made easier by defaulting to returning Ember objects, even if those objects have no associated model.

One word of caution: I don’t actulaly recommend making the groupedActions function observe actions.@each via the .property() call. In my experience, it was somewhat difficult to get reliable performance in this way. I created another property that determined if all the actions were loaded, and then made the wrapper function observer that, so that the function would only be called after a completed reload from the server.