Using pdf2htmlEX on Heroku

Want to use the awesome pdf2htmlEX on Heroku? You’re not alone. At Quottly, we do quite a bit of PDF processing: it turns out that a lot of colleges and universities like to publish information in PDF format. We always try to use the pdf-reader Ruby gem if we can, since it’s easy to deploy and maintain, but sometimes pdf-reader just doesn’t have enough power for what we’re trying to do.

We recently got pdf2htmlEX running on our Heroku app. Here’s how.

apt buildpack

pdf2htmlEX is distributed either as source or as a Linux package. To install the Debian package for pdf2htmlEX on Heroku, we first added heroku-buildpack-apt to our application’s buildpacks.

Some older sources (including the README.md on heroku-buildpack-apt) say that the best way to do this is to create a .buildpacks file in your project. However, Heroku now recommends adding buildpacks from the command line and/or using an app.json for reproducible deploys.

We added the following to our app.json:

  ...
  "buildpacks": [
    {
      "url": "https://github.com/heroku/heroku-buildpack-ruby.git"
    },
    {
      "url": "https://github.com/ddollar/heroku-buildpack-apt.git"
    }
  ]
  ...
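
For the command-line route mentioned earlier, a minimal sketch might look like the following. It assumes the Heroku CLI is installed and logged in; add_buildpacks is a hypothetical helper name, and the two URLs are simply the ones from the app.json listing.

```shell
# Sketch: register both buildpacks on an existing app from the command line.
# add_buildpacks is a hypothetical helper name, not part of the Heroku CLI.
add_buildpacks() {
  app="$1"
  # buildpacks:add appends to the app's buildpack list, so the order of these
  # two calls mirrors the order used in the app.json listing.
  heroku buildpacks:add --app "$app" https://github.com/heroku/heroku-buildpack-ruby.git
  heroku buildpacks:add --app "$app" https://github.com/ddollar/heroku-buildpack-apt.git
}

# Usage (requires the Heroku CLI):
#   add_buildpacks YOURAPP
#   heroku buildpacks --app YOURAPP   # list the configured buildpacks in order
```

Wrapping the two calls in one function keeps the order explicit, which matters because buildpacks run in the order they are listed.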

Then, add an Aptfile for heroku-buildpack-apt to pull from. Each line in the Aptfile is either the name of an apt package, in which case the package will be installed from the standard source archives available on Heroku, or is a link to a specific .deb package.

Either by running apt show on the pdf2htmlEX package, or by referencing this Stack Overflow post, you might come up with the following dependency list:

libc6
libcairo2
libfontforge1
libfreetype6
libpoppler44
libgcc1
libstdc++6
ttfautohint
https://launchpad.net/~coolwanglu/+archive/ubuntu/pdf2htmlex/+files/pdf2htmlex_0.12-1~git201411121058r1a6ec-0ubuntu1~trusty1_amd64.deb
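
If you go the apt show route, the package names can be pulled out of its Depends: line rather than copied by hand. This is only a sketch: depends_to_aptfile is a hypothetical helper name, it assumes the Depends: field sits on a single line, and for alternatives such as "a | b" it keeps only the first option.

```shell
# Sketch: turn the "Depends:" line of `apt show` output (read from stdin)
# into one package name per line, in the shape an Aptfile expects.
# depends_to_aptfile is a hypothetical helper name.
depends_to_aptfile() {
  sed -n 's/^Depends: //p' |
    tr ',' '\n' |
    sed -e 's/([^)]*)//g' \
        -e 's/|.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' |
    grep -v '^$'
}

# Usage, on an Ubuntu machine where the package is known to apt:
#   apt show pdf2htmlex 2>/dev/null | depends_to_aptfile
```

The first sed keeps only the Depends: field, tr splits it on commas, and the second sed drops version constraints in parentheses and trims whitespace. The result still needs a manual review before it goes into the Aptfile.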

Note that a .deb listed on its own line is installed without automatic dependency resolution, so the build will not fail even if pdf2htmlEX installs but is unusable. The only way to confirm that pdf2htmlEX is installed correctly is to run:

$ heroku run bash --app YOURAPP
$ pdf2htmlEX --version

and confirm that the output is correct.
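
When the version check fails, it can help to ask the dynamic linker which shared libraries it cannot resolve. The sketch below assumes ldd is available on the dyno; missing_libs is a hypothetical helper name.

```shell
# Sketch: print the shared libraries that the dynamic linker cannot resolve
# for a given binary. missing_libs is a hypothetical helper name.
missing_libs() {
  # ldd reports unresolved libraries as "libfoo.so.N => not found".
  ldd "$1" | awk '/not found/ { print $1 }'
}

# Usage, inside the shell opened by `heroku run bash --app YOURAPP`:
#   missing_libs "$(command -v pdf2htmlEX)"
```

An empty result only means every library was located. It does not rule out a library that is present but too old, so running pdf2htmlEX --version is still the real test.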

After deploying with the Aptfile above, you will likely run into an error about a missing libpoppler.so.57. I believe this is because the listed .deb file was built against a different libpoppler than the one installed here: in this case, libpoppler57 rather than libpoppler44.

To fix this, replace the libpoppler44 reference with an explicit link to the correct .deb file, which I found by looking up libpoppler on the Ubuntu archive website:

libc6
libfontforge1
libgcc1
libjs-pdf
libstdc++6
http://mirrors.kernel.org/ubuntu/pool/main/p/poppler/libpoppler57_0.38.0-0ubuntu1_amd64.deb
https://launchpad.net/~coolwanglu/+archive/ubuntu/pdf2htmlex/+files/pdf2htmlex_0.12-1~git201411121058r1a6ec-0ubuntu1~trusty1_amd64.deb
ttfautohint

This should resolve the libpoppler error. However, after deploying this, I still ran into the same problem described in that Stack Overflow post:

pdf2htmlEX: /app/.apt/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by pdf2htmlEX)
pdf2htmlEX: /app/.apt/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by pdf2htmlEX)
pdf2htmlEX: /app/.apt/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /app/.apt/usr/lib/x86_64-linux-gnu/libpoppler.so.57)

The issue here is that the version of libstdc++6 being installed doesn’t provide GLIBCXX_3.4.20 or GLIBCXX_3.4.21, so we just need a newer version of libstdc++6. A quick upgrade:

libc6
libfontforge1
libgcc1
libjs-pdf
http://mirrors.kernel.org/ubuntu/pool/main/g/gcc-5/libstdc++6_5.3.1-5ubuntu2_amd64.deb
http://mirrors.kernel.org/ubuntu/pool/main/p/poppler/libpoppler57_0.38.0-0ubuntu1_amd64.deb
https://launchpad.net/~coolwanglu/+archive/ubuntu/pdf2htmlex/+files/pdf2htmlex_0.12-1~git201411121058r1a6ec-0ubuntu1~trusty1_amd64.deb
ttfautohint

And this should work!
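
To check which GLIBCXX versions a given libstdc++ provides, you can scan the library file for its version strings. This is a rough sketch: glibcxx_versions is a hypothetical helper name, and it greps the raw file rather than reading the symbol table properly.

```shell
# Sketch: list the GLIBCXX_x.y.z version strings embedded in a libstdc++ file.
# glibcxx_versions is a hypothetical helper name.
glibcxx_versions() {
  # -a treats the binary as text; -o prints only the matching strings.
  grep -ao 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Usage on the dyno, with the path taken from the error message above:
#   glibcxx_versions /app/.apt/usr/lib/x86_64-linux-gnu/libstdc++.so.6
```

If GLIBCXX_3.4.20 and GLIBCXX_3.4.21 appear in the output, the library should satisfy the versions named in the error.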

A few caveats: I’m not entirely familiar with how linking on mirrors.kernel.org works, so these links may break at some point in the future. Additionally, I would feel more comfortable if every one of the dependencies were locked down to a specific .deb, since I’m concerned that a version bump on e.g. libgcc1 may break this build.

However, I think that it shouldn’t be too difficult to cross that bridge if and when we come to it: all that is needed is to determine which version of libgcc1 is installed on a working system, and then link directly to that .deb.
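
On a working Debian or Ubuntu system, finding that version and a candidate .deb link might look something like this. It is a sketch under assumptions: installed_version and deb_url are hypothetical helper names, and apt-get reports the URI of the version it would download, which is not necessarily the version currently installed.

```shell
# Sketch: look up the installed version of a package and the URL apt would
# fetch it from. installed_version and deb_url are hypothetical helper names.
installed_version() {
  dpkg-query -W -f='${Version}\n' "$1"
}

deb_url() {
  # apt-get prints the URI in single quotes as the first field of each line.
  apt-get download --print-uris "$1" | cut -d "'" -f 2
}

# Usage on the working system:
#   installed_version libgcc1
#   deb_url libgcc1    # candidate line for the Aptfile
```

Compare the two outputs before pinning: if the version embedded in the URL differs from the installed one, look up the matching .deb on the Ubuntu archive website instead.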

Happy deploying!