Unit Test Generated PDFs with PHPUnit and PDFBox

Generating PDF documents is among the features that are hard to cover with unit tests.

The PDFBox command line tool comes in handy here, specifically its ExtractText command:

This application will extract all text from the given PDF document.

This lets us test the textual content of the document or search for specific strings in it.
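As a minimal sketch of such a check: extracted text tends to contain line breaks wherever the PDF layout wraps a line, so it can help to collapse whitespace before searching. The helper name normalizeText and the sample string are made up for illustration; neither is part of PDFBox.

```php
<?php
/**
 * Collapse all runs of whitespace (including line breaks) into single spaces.
 * Hypothetical helper, not part of PDFBox.
 */
function normalizeText(string $text): string
{
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Stand-in for the output of ExtractText; a real test would read the extracted file.
$extracted = "Invoice No. 2012-0042\nTotal amount:\n  119.00 EUR\n";

$normalized = normalizeText($extracted);

// True if the phrase occurs, no matter how the PDF layout wrapped the lines.
var_dump(strpos($normalized, 'Total amount: 119.00 EUR') !== false);
```

In a PHPUnit test case the same condition could be passed to assertTrue().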

It gets more interesting with the -html option, which converts the PDF to HTML instead of plain text. This makes structure and formatting at least roughly testable.

Unfortunately, the tool does not work with streams, so we have to use temporary files. Here is a simple example of a function that receives a PDF document as a string, converts it to HTML with PDFBox, and returns the HTML string:

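/**
 * @param string $streamIn binary string with generated PDF
 * @return string HTML string
 */
function htmlFromPdf($streamIn)
{
  // tempnam() requires a directory and a file name prefix
  $pdf = tempnam(sys_get_temp_dir(), 'pdf');
  file_put_contents($pdf, $streamIn);
  $txt = tempnam(sys_get_temp_dir(), 'txt');
  // escapeshellarg() protects against spaces and special characters in the paths
  exec('java -jar pdfbox-app-x.y.z.jar ExtractText -encoding UTF-8 -html '
    . escapeshellarg($pdf) . ' ' . escapeshellarg($txt));
  $streamOut = file_get_contents($txt);
  unlink($pdf);
  unlink($txt);
  return $streamOut;
}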
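
Once the HTML is available as a string, its structure can be inspected with PHP's DOM extension. The following is only a sketch: the sample markup is invented for illustration, and the markup PDFBox actually emits may differ, so the XPath expressions would need to be adapted to real output.

```php
<?php
// Hypothetical sample resembling converter output; real PDFBox markup may differ.
$html = '<html><head><title>Invoice</title></head><body>'
      . '<div><p><b>Invoice No. 2012-0042</b></p><p>Total amount: 119.00 EUR</p></div>'
      . '</body></html>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of emitting PHP warnings
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Bold elements whose text contains the invoice label
$boldHits = $xpath->query('//b[contains(., "Invoice No.")]');

// All paragraphs in the document
$paragraphs = $xpath->query('//p');

printf("bold matches: %d, paragraphs: %d\n", $boldHits->length, $paragraphs->length);
```

Counting elements like this is coarser than a visual comparison, but it catches missing or duplicated blocks.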

For regression tests or refactoring, it is sometimes enough to check that the generated PDF has not changed in comparison to a reference PDF. A hash value would be the obvious way to do this, but the PDF itself is not byte-for-byte identical on every run, probably due to timestamps. A hash of the converted HTML, however, is sufficient:

        // In PHPUnit test case:
        $converter = new PdfBox();
        $html = $converter->htmlFromPdfStream($pdf);
        $this->assertEquals('336edd9ee49b57e6dba5dc04602765056ce05b91', sha1($html), 'Hash of PDF content');
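
Should the hash prove too brittle, for example across platforms with different line endings, one option is to normalize the HTML before hashing. This is a sketch under that assumption; htmlFingerprint is a hypothetical helper and not part of the PdfBox class.

```php
<?php
/**
 * Hash HTML after normalizing line endings and trailing whitespace.
 * Hypothetical helper for illustration.
 */
function htmlFingerprint(string $html): string
{
    // Unify Windows and old Mac line endings
    $html = str_replace(["\r\n", "\r"], "\n", $html);
    // Strip trailing spaces and tabs on every line
    $html = preg_replace('/[ \t]+$/m', '', $html);
    return sha1(trim($html));
}

$unix    = "<p>Hello</p>\n<p>World</p>\n";
$windows = "<p>Hello</p>  \r\n<p>World</p>\r\n";

// Both variants yield the same fingerprint after normalization.
var_dump(htmlFingerprint($unix) === htmlFingerprint($windows));
```

The trade-off is that the reference hash then has to be generated with the same helper.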

In this example I use a self-written class, PdfBox, which encapsulates the call to Apache PDFBox. The code is available under the BSD licence on GitHub: PHP PdfBox

PHP PdfBox

Requirements

  1. Java Runtime Environment, with “java” in the system path. To test this, run java -version on the command line. If you see information about the Java version, everything is fine.
  2. Apache PDFBox as executable JAR file. You can download it here: http://pdfbox.apache.org/downloads.html
  3. The PHP function exec() for executing system commands must not be disabled. On shared hosts it is often disabled for security reasons; for running unit tests locally, allowing exec() should not be a problem. PHP-CLI, i.e. PHP on the command line, usually uses a different php.ini configuration file than PHP-CGI for the web. The command php --ini shows which INI files are loaded in CLI mode. If necessary, edit these to remove exec from the disable_functions list.
  4. A PSR-0 compatible autoloader, as shipped with most frameworks. Otherwise you will need to include the individual PHP files yourself.
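
Whether exec() is usable can also be checked from PHP itself, for example to skip the PDF tests instead of letting them fail. A minimal sketch: isFunctionDisabled is a made-up helper name, while function_exists() and ini_get('disable_functions') are standard PHP.

```php
<?php
/**
 * Check whether a function name appears in a disable_functions style list.
 * Hypothetical helper for illustration.
 */
function isFunctionDisabled(string $name, string $disableList): bool
{
    // The INI value is a comma-separated list, possibly with spaces
    $disabled = array_map('trim', explode(',', strtolower($disableList)));
    return in_array(strtolower($name), $disabled, true);
}

$execUsable = function_exists('exec')
    && !isFunctionDisabled('exec', (string) ini_get('disable_functions'));

echo $execUsable ? "exec() is available\n" : "exec() is disabled\n";
```

In a PHPUnit test case, the same condition could be used in setUp() to call $this->markTestSkipped().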

Usage

First you’ll have to specify the full path to the PdfBox JAR. Afterwards you can call the conversion methods, for example:

use SGH\PdfBox;

//$pdf = GENERATED_PDF;
$converter = new PdfBox;
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf);
$html = $converter->htmlFromPdfStream($pdf);
$dom  = $converter->domFromPdfStream($pdf);

The following conversion methods exist:

  • string textFromPdfStream($content, $saveToFile = null)
  • string htmlFromPdfStream($content, $saveToFile = null)
  • DomDocument domFromPdfStream($content, $saveToFile = null)
  • string textFromPdfFile($fileName, $saveToFile = null)
  • string htmlFromPdfFile($fileName, $saveToFile = null)
  • DomDocument domFromPdfFile($fileName, $saveToFile = null)

The first parameter is either the PDF as a binary string ($content) or the file name of a PDF ($fileName). The second parameter, if provided, is the name of an output file in which the text or HTML will be saved.

A few additional PDFBox options can be useful as well:

// Only extract pages 2-5
$converter->getOptions()
    ->setStartPage(2)
    ->setEndPage(5);

// ignore corrupt PDF objects
$converter->getOptions()
    ->setForce(true);

Everything else should be clear from the PhpDoc comments. Happy Testing!