Using pdf2htmlEX on Heroku
Want to use the awesome pdf2htmlEX on Heroku? You’re not alone. For Quottly, we do quite a bit of PDF processing - turns out, a lot of colleges and universities like to publish information in PDF format. We always try to use the pdf-reader Ruby gem if we can, since it’s easy to deploy and maintain, but sometimes pdf-reader just doesn’t have enough power for what we’re trying to do.
We recently got pdf2htmlEX running on our Heroku app. Here’s how.
apt buildpack
pdf2htmlEX is distributed either as source or as a Linux package. To install the Debian package for pdf2htmlEX on Heroku, we first added heroku-buildpack-apt to our application’s buildpacks.
Some old sources (including the README.md on heroku-buildpack-apt) will indicate that the best way to do this is to create a .buildpacks file in your project. However, Heroku now recommends adding the buildpacks from the command line, and/or using an app.json for reproducible deploys.
We went the app.json route, declaring heroku-buildpack-apt in its buildpacks list.
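As a minimal sketch of what such an app.json could contain, the snippet below prints one with the apt buildpack listed ahead of the language buildpack. The Ruby buildpack entry and the helper name render_app_json are assumptions for illustration, not taken from our actual file; swap in whatever buildpack your app uses.

```shell
#!/bin/sh
# Sketch only: print a minimal app.json with the apt buildpack listed first.
# The Ruby buildpack entry is an assumption about the app's language.
render_app_json() {
  cat <<'EOF'
{
  "buildpacks": [
    { "url": "https://github.com/heroku/heroku-buildpack-apt" },
    { "url": "https://github.com/heroku/heroku-buildpack-ruby" }
  ]
}
EOF
}

# Write the file:
#   render_app_json > app.json
# Command-line alternative for an existing app:
#   heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-apt
```

The apt buildpack goes first so that the packages it installs are already in place when the language buildpack runs.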
Then, add an Aptfile for heroku-buildpack-apt to pull from. Each line in the Aptfile is either the name of an apt package, in which case the package will be installed from the standard source archives available on Heroku, or is a link to a specific .deb package.
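To make the two kinds of Aptfile lines concrete, here is a small sketch that reads an Aptfile on stdin and labels each line. The function name classify_aptfile is hypothetical, the example.com URL is a placeholder, and skipping blank lines and "#" comments is an assumption about the file format rather than something verified against the buildpack.

```shell
#!/bin/sh
# Label each Aptfile line read from stdin as a package name or a .deb URL.
# Blank lines and "#" comments are skipped (an assumption about the format).
classify_aptfile() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
      *.deb)   printf 'deb-url %s\n' "$line" ;;
      *)       printf 'package %s\n' "$line" ;;
    esac
  done
}

# Example with one line of each kind (the URL is a placeholder):
classify_aptfile <<'EOF'
libpoppler44
https://example.com/pool/example.deb
EOF
```

Running it against your own file (classify_aptfile < Aptfile) is a quick way to see which entries apt will resolve for you and which are pinned to an exact .deb.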
Either by running apt show on the pdf2htmlEX package, or by referencing this Stack Overflow post, you can put together the list of dependencies to add to the Aptfile, alongside the pdf2htmlEX .deb itself.
It’s worth noting that listing a .deb on its own line installs it without automatically resolving its dependencies, so you will not get a build error if pdf2htmlEX installs but is unusable. The only way to confirm that pdf2htmlEX is installed correctly is to run it on the deployed app and confirm that the output is correct.
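One way to make that check more specific is to ask the dynamic linker which shared libraries it cannot resolve. The sketch below is an assumption-laden helper, not part of our deploy: missing_libs and check_binary are hypothetical names, and it presumes ldd is available on the dyno.

```shell
#!/bin/sh
# Reduce `ldd` output (read from stdin) to the libraries that could not be
# resolved. Prints one library name per line.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Return 0 only if the named binary is on PATH and every shared library it
# links against resolves; otherwise report the problem on stderr.
check_binary() {
  bin_path=$(command -v "$1") || { echo "$1: not on PATH" >&2; return 1; }
  missing=$(ldd "$bin_path" | missing_libs)
  if [ -n "$missing" ]; then
    echo "unresolved libraries for $1:" >&2
    echo "$missing" >&2
    return 1
  fi
}

# On a one-off dyno (heroku run bash), after defining the functions:
#   check_binary pdf2htmlEX && pdf2htmlEX --version
```

This only shows that the libraries load; converting a real PDF is still the final word on whether the install works.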
After deploying with that Aptfile, you will likely run into an error about a missing libpoppler.so.57. I believe this is because the .deb file that is listed was built against a different libpoppler than the one that is installed here: in this case, libpoppler57 vs. libpoppler44.
To fix this, replace the libpoppler44 line in the Aptfile with an explicit link to the correct .deb file. I found it by looking up libpoppler on the Ubuntu archive website.
This should resolve the libpoppler error. However, after deploying this, I still ran into the same problem listed on that Stack Overflow post: an error about GLIBCXX_3.4.20 not being found.
The issue here is that the version of libstdc++6 being installed doesn’t include GLIBCXX_3.4.20, so we just need a newer version of libstdc++6. After swapping a newer libstdc++6 into the Aptfile, this should work!
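If you want to check which GLIBCXX versions a given libstdc++ provides before deploying, something like the following sketch can help. The function names are hypothetical, it assumes strings (from binutils) is installed, and the library path in the usage comment is an assumption that may differ on your stack.

```shell
#!/bin/sh
# List the GLIBCXX_<version> tags embedded in a libstdc++ shared object.
glibcxx_tags() {
  strings "$1" | grep -o 'GLIBCXX_[0-9][0-9.]*' | sort -u
}

# Succeed if the tag "$1" appears on stdin as a whole line. The whole-line
# match keeps GLIBCXX_3.4.2 from being mistaken for GLIBCXX_3.4.20.
has_version_tag() {
  grep -qxF "$1"
}

# Usage (path is an assumption; adjust for where the library lives):
#   glibcxx_tags /usr/lib/x86_64-linux-gnu/libstdc++.so.6 \
#     | has_version_tag GLIBCXX_3.4.20 && echo "new enough"
```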
A few caveats: I’m not entirely familiar with how linking on mirrors.kernel.org works, so it is possible that these links will break at some point in the future. Additionally, I would feel more comfortable if every one of the dependencies were locked down to a specific .deb, since I’m concerned that a version bump on, e.g., libgcc1 may break this build.
However, I don’t think it will be too difficult to cross that bridge if and when we come to it: all that’s needed is to determine which version of libgcc1 is installed on a working system, and then link directly to that .deb.
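A rough sketch of that lookup, assuming dpkg-query is available on the working system: the helper below (deb_filenames is a hypothetical name) turns "name version arch" lines into the conventional .deb file name, which you can then search for on the archive. The version in the example comment is made up for illustration.

```shell
#!/bin/sh
# Turn "name version arch" lines on stdin into conventional .deb file names
# (name_version_arch.deb). Any epoch prefix such as "1:" is dropped.
deb_filenames() {
  while read -r name version arch; do
    [ -n "$name" ] || continue
    printf '%s_%s_%s.deb\n' "$name" "${version#*:}" "$arch"
  done
}

# On the working system:
#   dpkg-query -W -f='${Package} ${Version} ${Architecture}\n' libgcc1 | deb_filenames
# With a made-up version, "libgcc1 1:4.9.1-0ubuntu1 amd64" becomes
#   libgcc1_4.9.1-0ubuntu1_amd64.deb
```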
Happy deploying!