Collection Pipelines in PHP

If you read the book “Refactoring to Collections” or saw screencasts and talks by Adam Wathan about collection pipelines, but do not work with Laravel, you might have asked yourself how to apply these techniques to other PHP applications, for example Magento.

If you did not, let me explain collection pipelines first. Better, I’ll let Martin Fowler give the definition:

A collection pipeline lays out a sequence of operations that pass a collection of items between them. Each operation takes a collection as an input and emits another collection (except the last operation, which may be a terminal that emits a single value). The individual operations are simple, but you can create complex behavior by stringing together the various operations, much as you might plug pipes together in the physical world, hence the pipeline metaphor.


Source and full article

This is very similar to Unix pipes, where the individual commands are the operations and the lines passed through the input and output streams are the collection. For example, if we want to find the latest 5 different “undefined variable” messages in our logs, we might use:

grep "undefined variable" var/log/system.log | uniq | tail -n5

Now let’s look at collection pipelines in object-oriented languages. Ruby is a nice example, because its built-in Enumerables have all the methods you need. Arrays and Ranges are Enumerable, but you can also build your own. So let’s say we want to calculate the sum of squares for all even numbers of a given input:

input = (1..10)
puts input
    .select(&:even?)
    .map{|n| n ** 2}
    .reduce :+

There is some Ruby-specific syntactic sugar in this example, but the important parts are “select”, “map” and “reduce”. Each takes a callable parameter, and together they constitute the operations of the pipeline:

  • Filter: .select() in Ruby is more commonly known as “filter”. It passes each item of the collection to the given function and returns a new collection with only the items where the result was true. The line from above can be expressed more verbosely as:
    .select { |n| n.even? }
  • Map: .map() returns a new collection of the same size, where each item is replaced with the result of the given function, called with this item. So [2,4,6,8,10] becomes [4,16,36,64,100] in this line:
    .map { |n| n ** 2 }
  • Reduce: .reduce() combines the items of the collection into one single value. It passes a running result and the next item to the given function and repeats this until all items are consumed. The line from above, more verbose:
    .reduce { |x, y| x + y }

    Here, [4,16,36,64,100] gets reduced to (((4 + 16) + 36) + 64) + 100, which is 220
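
To make the order of the reduce step visible, here is a minimal PHP sketch using the built-in array_reduce(). The $steps log and the variable names are only there for illustration. The callback receives the running result (the “carry”) and the next item:

```php
<?php
$squares = [4, 16, 36, 64, 100];
$steps = [];

$sum = array_reduce(
    $squares,
    function ($carry, $item) use (&$steps) {
        // Record every call so we can see how the items are combined.
        $steps[] = "$carry + $item";
        return $carry + $item;
    },
    0 // initial value of the carry
);

echo implode(', ', $steps), PHP_EOL; // 0 + 4, 4 + 16, 20 + 36, 56 + 64, 120 + 100
echo $sum, PHP_EOL;                  // 220
```

Each call adds one more item to the result so far, starting from the left.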

There are more collection related methods, but in the end everything can be translated to filter, map and reduce. These are the building blocks of collection pipelines.

Why not just loops?

With loops, the Ruby example from above would look like this:

output = 0
for n in input
  if n.even? then
    output += n ** 2
  end
end
puts output

Using the collection pipeline with filter -> map -> reduce has several advantages over traditional loops with temporary variables:

  • more declarative: the code says what you want, not how to get it
  • encourages a different, more natural way of approaching problems
  • less cognitive load when reading the code: you don’t have to keep temporary state in mind
  • easier to scale: operations can be parallelized (perfectly demonstrated by Google’s MapReduce model)

If you are a PHP developer, go read the aforementioned Refactoring to Collections to get convincing real life examples! It is centered around Laravel, which provides the necessary collection methods, but you will see that it is just as valuable if you don’t work with Laravel.

What about PHP array functions?

PHP has array_filter(), array_map() and array_reduce(), which provide exactly the same functionality. The example from above in PHP with array functions:

$input = range(1, 10);

echo array_reduce(
  array_map(
    function($n) { return $n ** 2; },
    array_filter($input, function($n) { return $n % 2 == 0; })
  ),
  function($x, $y) { return $x + $y; }
);

I always liked array_map() and the other array functions in PHP, but also found them a bit clumsy and would not overuse them. Do you see what’s wrong with this example? It is harder to read, because you have to start in the middle to understand what’s going on:

echo array_reduce(                                             // 3. reduce
  array_map(                                                   // 2. map
    function($n) { return $n ** 2; },                          // 2. map
    array_filter($input, function($n) { return $n % 2 == 0; }) // 1. filter
  ),                                                           // 2. map
  function($x, $y) { return $x + $y; }                         // 3. reduce
);

One day a junior developer came to a colleague and me with a solution using array_map() and array_filter() instead of loops. He was very excited that this was possible in PHP and wanted to convince us that it was superior to loops and conditions. The code in question looked like this (here, to prepare customer data for a newsletter email template):

$fields = [
    'title',
    'first_name',
    'last_name',
    'country',
    'currency',
];
$idsByField = unserialize(Mage::getStoreConfig('newsletter/sendeffect/fields'));
$fieldsForEmailTemplate = array_filter(
    array_map(
        function($field) use ($customer, $idsByField) {
            return new Varien_Object([
                'id' => $idsByField[$field],
                'value' => $customer->getData($field),
            ]);
        }, $fields
    ), function($field) {
        return $field->getId() && $field->getValue();
    }
);

Guess what we told him: yes, that’s cool, but look, one simple loop is way easier to understand than THIS. And code being easy to read is more important than being clever.
Had I thought of collection pipelines at that time, my advice might have been different. Take the example with a collection pipeline (collect() is the collection constructor in Laravel):

$idsByField = unserialize(Mage::getStoreConfig('newsletter/sendeffect/fields'));
collect(
    [
        'title',
        'first_name',
        'last_name',
        'country',
        'currency',
    ]
)->map(
    function($field) use ($customer, $idsByField) {
        return new Varien_Object([
            'id' => $idsByField[$field],
            'value' => $customer->getData($field),
        ]);
    }
)->filter(
    function($field) {
        return $field->getId() && $field->getValue();
    }
);

The important difference is, it reads ‘take the collection, map to this, then filter items like that’, in the right order. You don’t have to read from inside to outside, and it is always the same level of indentation.

This gets even more relevant if you have more than two operations. Here’s the main “loop” of a CSV export I wrote with collection pipelines:

collect($productIds)
    ->map(function($id) {
        return $this->loadStoreAttributes($id);
    })
    ->filter(function(Collection $row) {
        return count($row) > 1;
    })
    ->map(function(Collection $row) {
        return $this->columnTemplate->merge($row);
    })
    ->prepend($this->columnTemplate->keys())
    ->each(function(Collection $row) {
        fputcsv(STDOUT, $row->toArray());
    });

Each row is a collection of its own. Here is how those are initialized:

private function loadStoreAttributes($id)
{
    return collect([['attribute_code' => 'id', 'value' => $id]])
        ->merge($this->getStoreValues($id))
        ->merge($this->getMediaValues($id))
        ->merge($this->getPriceValues($id))
        ->merge($this->getCategories($id))
        ->each(function($row) {
            $this->columnTemplate[$row['attribute_code']] = '';
        })
        ->pluck('value', 'attribute_code');
}

No loops. No temporary variables (except the column template because we did not know the columns beforehand). And one level of indentation. With the native PHP array functions this would have been a big mess.

Collection libraries in PHP

But should we include the Laravel collection component in Magento projects? That was my first approach, but I have since found a better alternative for frameworks that do not have collections (no, Magento “collections” are different – no pipelines for them): Knapsack.

It is a standalone library, and it uses lazy loading wherever possible (with generators, which PHP has supported since version 5.5). This is especially interesting if you work with a non-trivial amount of data. The Laravel collection will process the whole collection for each step before passing the results to the next step, so memory usage increases with each additional step. Knapsack will pass the items one after another through the whole pipeline, or until a step requires all data at once.
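
The difference between the two strategies can be sketched with plain PHP generators. This is not how Knapsack or Laravel are implemented. lazyMap() and lazyFilter() are hypothetical helpers made up for this illustration, and the $log array only records the order in which the callbacks run:

```php
<?php
// Hypothetical helpers; a sketch of the idea, not Knapsack's implementation.
function lazyMap(callable $fn, $items)
{
    foreach ($items as $key => $item) {
        yield $key => $fn($item);
    }
}

function lazyFilter(callable $fn, $items)
{
    foreach ($items as $key => $item) {
        if ($fn($item)) {
            yield $key => $item;
        }
    }
}

$log = [];
$double = function ($n) use (&$log) {
    $log[] = "map $n";
    return $n * 2;
};
$isLarge = function ($n) use (&$log) {
    $log[] = "filter $n";
    return $n > 2;
};

// Eager: array_map() handles every item before array_filter() sees any.
$eager = array_filter(array_map($double, [1, 2, 3]), $isLarge);
$eagerLog = $log;

// Lazy: nothing runs until we iterate. Then each item passes through
// map and filter before the next item is touched.
$log = [];
$lazy = iterator_to_array(lazyFilter($isLarge, lazyMap($double, [1, 2, 3])));
$lazyLog = $log;

echo implode(', ', $eagerLog), PHP_EOL; // map 1, map 2, map 3, filter 2, filter 4, filter 6
echo implode(', ', $lazyLog), PHP_EOL;  // map 1, filter 2, map 2, filter 4, map 3, filter 6
```

Both variants produce the same result, but the eager one holds a complete intermediate array after every step, while the lazy one only ever holds the current item.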

It also provides different ways to use collections: pipelines with the collection class, pipelines with a custom collection class that uses the CollectionTrait (similar to the Enumerable mixin in Ruby), and last but not least, functions (similar to the PHP array functions, but more powerful). Let’s take the first example from above and see how it looks, implemented with Knapsack:

use DusanKasan\Knapsack\Collection;

echo Collection::from($input)
    ->filter(function($n) { return $n % 2 == 0; })
    ->map(function($n) { return $n ** 2; })
    ->reduce(function($x, $y) { return $x + $y; }, 0);

Much better!

Homegrown collections

But maybe you don’t even need a full-fledged collection library. To get started, this simple class is already useful:

<?php
class ArrayCollection extends \ArrayIterator
{
    public static function fromArray(array $array)
    {
        return new static($array);
    }

    public static function fromTraversable(\Traversable $traversable)
    {
        return new static(\iterator_to_array($traversable));
    }

    /**
     * @param callable $callback
     * @return static
     */
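    public function map(callable $callback)
    {
        // The callback receives value and key. Note that array_map() with
        // two arrays returns a reindexed list, so the original keys are lost.
        $items = $this->getArrayCopy();
        return new static(\array_map($callback, $items, \array_keys($items)));
    }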

    /**
     * @param callable $callback
     * @return static
     */
    public function filter(callable $callback)
    {
        return new static(\array_filter($this->getArrayCopy(), $callback));
    }

    /**
     * @param callable $callback
     * @param mixed $initial
     * @return mixed the reduced value, which is usually not a collection
     */
    public function reduce(callable $callback, $initial = null)
    {
        return \array_reduce($this->getArrayCopy(), $callback, $initial);
    }

}
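
A quick usage sketch with the sum of squares example from the beginning. The class is repeated here in condensed form so that the snippet runs on its own. It assumes that reduce() returns the plain reduced value and that map() only passes the value to the callback:

```php
<?php
// Condensed copy of the class, reduced to what this example needs.
class ArrayCollection extends \ArrayIterator
{
    public function map(callable $callback)
    {
        return new static(\array_map($callback, $this->getArrayCopy()));
    }

    public function filter(callable $callback)
    {
        return new static(\array_filter($this->getArrayCopy(), $callback));
    }

    public function reduce(callable $callback, $initial = null)
    {
        // Assumption: return the reduced value itself, not a new collection.
        return \array_reduce($this->getArrayCopy(), $callback, $initial);
    }
}

$sumOfSquares = (new ArrayCollection(range(1, 10)))
    ->filter(function ($n) { return $n % 2 == 0; })
    ->map(function ($n) { return $n ** 2; })
    ->reduce(function ($x, $y) { return $x + $y; }, 0);

echo $sumOfSquares, PHP_EOL; // 220
```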

You can then extend it as needed, for example with a merge() method:

    /**
     * @param ArrayCollection $other
     * @return static
     */
    public function merge(ArrayCollection $other)
    {
        return new static(\array_merge($this->getArrayCopy(), $other->getArrayCopy()));
    }

And if you need more advanced features such as lazy loading, it will be easy to replace with Knapsack, for example. The signatures of map(), filter() and reduce() are relatively universal.

Magento Integration

I thought about methods to integrate the CollectionTrait into Magento collections and while it is possible, I came to the conclusion that it is not a good idea in the first place. Magento collections are a means to retrieve objects from the database. Their “filter” methods, for example, have completely different semantics than the filter method I discussed above. Of course I am referring to the db resource collections only, but other instances of Varien_Collection are rare and do not concern me here.

But what can make sense is to create a collection type that can convert from a Magento collection. Like this:

use DusanKasan\Knapsack\CollectionTrait;
use Magento\Framework\Model\ResourceModel\Db\Collection\AbstractCollection;

class Collection extends ArrayIterator
{
    use CollectionTrait;

    public static function fromMagentoCollection(AbstractCollection $magentoCollection)
    {
        return new static($magentoCollection->getIterator()->getArrayCopy());
    }
}

Functional Programming

Although we use objects here, this approach is closer to functional programming than to object-oriented programming, and we are lucky that PHP is a language that allows mixing different paradigms. As you see in the example above, the code can easily be translated into using only functions. I believe that functional programming is useful in the lower levels of web application development and object-oriented programming more useful in the higher levels, hiding the functional parts as an implementation detail.

If you are not familiar with functional programming (FP), it is important to understand that it is completely different from procedural programming. Some characteristics of FP are:

  • higher order functions, i.e. functions that take other functions as parameters or return other functions
  • side effect free functions
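
Both characteristics can be sketched in a few lines of PHP. The function names multiplyBy() and compose() are made up for this illustration:

```php
<?php
// Higher-order function: takes a number and returns a new function.
function multiplyBy($factor)
{
    return function ($n) use ($factor) {
        return $n * $factor;
    };
}

// Higher-order function: takes two functions and returns their composition.
function compose(callable $f, callable $g)
{
    return function ($x) use ($f, $g) {
        return $g($f($x));
    };
}

$double = multiplyBy(2);
$doubleThenIncrement = compose($double, function ($n) { return $n + 1; });

// Side effect free: the same input always gives the same output,
// and nothing outside the functions is changed.
$result = array_map($doubleThenIncrement, [1, 2, 3]);

echo implode(', ', $result), PHP_EOL; // 3, 5, 7
```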

By the way, even functional languages have syntax for collection pipelines, to avoid the “inside out” problem and the kind of indentation and brace hell that Lisp gets made fun of for so often.

Here is the example in Clojure (I had to look it up; I have not worked with an FP language since university, 13 years ago):

(->> [1,2,3,4,5,6,7,8,9,10]
    (filter even?)
    (map (fn[x] (* x x)))
    (reduce +))

The ->> macro creates the pipeline; it is followed by the input and all operations.

The hard part

But let’s be honest: there are also drawbacks. If you are used to step-by-step debugging, this is going to be different – the code is not executed line by line anymore. Execution jumps back and forth between your functions and the library, which is not immediately obvious. For reading and writing the code this should not matter much, as we are not thinking procedurally anymore. But when debugging an issue you usually want to know what is going on under the hood. I use more conditional breakpoints when debugging collection pipelines, to skip the code of the underlying library and inspect the input and output of my functions as well as the order of operations.

But this should be enough. I don’t need to worry about temporary state. Debugging collection pipelines step by step is confusing, but if you move away from that and use the debugger to only inspect your mapping, filtering and reducing functions, i.e. their input and output, it’s satisfying again. And it makes sense to change the mindset for debugging as well, when changing the mindset for coding.
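
If you prefer a log over breakpoints, the same idea of looking only at the input and output of your own functions can be sketched with a small wrapper. traced() is a hypothetical helper, not part of any library mentioned here:

```php
<?php
// Hypothetical helper: wraps a callback and records input and output of each call.
function traced($label, callable $fn, array &$trace)
{
    return function ($item) use ($label, $fn, &$trace) {
        $result = $fn($item);
        $trace[] = sprintf(
            '%s(%s) => %s',
            $label,
            var_export($item, true),
            var_export($result, true)
        );
        return $result;
    };
}

$trace = [];
$isEven = traced('isEven', function ($n) { return $n % 2 == 0; }, $trace);
$square = traced('square', function ($n) { return $n ** 2; }, $trace);

$evens = array_filter([1, 2, 3, 4], $isEven);
$squares = array_map($square, $evens);

echo implode(PHP_EOL, $trace), PHP_EOL;
```

The trace then shows every call in the order it happened, without any of the library internals in between.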

Give it a try, you might be surprised!

3 Replies to “Collection Pipelines in PHP”

  1. The problem with inventing new facilities is everyone ends up either making their own, or using one of many different available libraries – you end up with code that isn’t very idiomatic, you increase complexity, introduce new dependencies, add to the learning curve, etc.

    The horrible inside-out examples you show with the array_() functions of course isn’t the only possible workflow with these – you can have a linear workflow by introducing an intermediary variable, I posted an example and some quick notes here:

    https://gist.github.com/mindplay-dk/4ef61fd5c0a35e5aa8fc699febb86487

    1. Thank you for the example, I like it. It’s probably the best you can get with “idiomatic PHP” and I get your point of not introducing new dependencies or reinventing “foreign” concepts. For this example I cannot even argue against the temporary variable.

      It is a fine line between actual improvement and unnecessary complexity.

  2. Just to add yet another library in the mix, have a look at my enumerable lib ( https://packagist.org/packages/lasso3000/enumerable ). It tries to stay close to ruby’s Enumerable module and the main difference from Knapsack is that the functionality is attached using a trait instead of a specialized class. The main benefit of that approach is that *any* class can be an Enumerable (just like in ruby).
