This post is largely my thoughts on Andy G's Algorithms as Objects, specifically in relation to simulation algorithms. The original article makes a case for transforming complicated algorithms into objects in order to deal with five code smells:

  • Code that is long or deeply nested
  • Section comments
  • Excessive closures
  • Single-purpose helper functions polluting the namespace
  • Lots of state being passed between functions

The solution presented to these issues is to refactor the algorithm into an object, allowing helper functions (now private methods) to be extracted without having to pass the state to them explicitly, since that state is stored on the object instance.

As with all code paradigms, the proposed pattern has its advantages and disadvantages and it is important to be cautious of taking any particular methodology as gospel. (Beware the AbstractSingletonProxyFactoryBean!) On the other hand, it is important not to dismiss new patterns before considering the cases in which they may prove useful. In particular, there is a strong case for applying this pattern in the case of simulations.

The smells and their solutions

1, 2, 4: Long code, section comments, helper functions which should be contained.

These smells are likely the weakest arguments for objectifying an algorithm. They are all genuine code smells, but they argue for refactoring logic into closures just as well as into an object. The closure approach is also slightly neater in that it avoids the need for an extra run call. For example:

def algorithm_with_closures(*args):
    """An algorithm to do a thing."""
    bar = args[0]

    def sub_logic():
        """Some sub-logic in a closure, capturing bar from the enclosing scope."""
        do_something(bar)
        ...

    ...
    return sub_logic()

class AlgorithmAsObject:
    """The same logic as an object."""
    def __init__(self, *args):
        self.bar = args[0]
        ...

    def _sub_logic(self):
        """The same sub-logic as a non-public method."""
        do_something(self.bar)
        ...

    def run(self):
        """Execute the algorithm."""
        ...
        return self._sub_logic()

The way of calling each of these is, respectively:

output = algorithm_with_closures(*args)

algo = AlgorithmAsObject(*args)
output = algo.run()

Python does allow us to emulate a function by changing run to __call__, which would let us run the algorithm with algo() or just AlgorithmAsObject(*args)(), though the latter is likely to cause confusion.
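For illustration, a minimal sketch of the __call__ variant (with __init__ and _sub_logic unchanged from the class above):

class AlgorithmAsObject:
    ...  # __init__ and _sub_logic as before

    def __call__(self):
        """Execute the algorithm."""
        ...
        return self._sub_logic()

algo = AlgorithmAsObject(*args)
output = algo()                      # reads like an ordinary function call
output = AlgorithmAsObject(*args)()  # also works, but hides the object entirely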

Ultimately, whilst each of these code smells may indeed signal poor code and an object can be a solution to them, they do not, by themselves, suggest that an object is the correct solution. Indeed, for the toy example above, closures seem cleaner.

3: "Helper functions as nested closures, but it's still too long"

For the previous three smells, we saw that an object could clean them up, but was not necessarily the correct solution. For this smell (if it even is one), there seems to be no huge benefit at all in refactoring the algorithm into an object. The code will still be around the same length and the nesting is likely to be largely identical. I'm honestly not sure how the change helps with readability here.

5: Passing state between helper functions

This smell, finally, gives us a compelling reason to refactor an algorithm into an object. Often in large algorithms many sections of logic will require access to the same or overlapping sets of state. This leads either to very long function calls (themselves a code smell) or to the passing around of a state collection, which often contains more information than any individual function needs and makes the functions harder to decipher.

Objects are a natural choice for the management of this problem, as they are inherently stateful. In addition, since all of an object's variables should be pre-defined within its __init__, we have a natural place to define (and explain if necessary) the elements of state needed by the algorithm.

Individual methods can access the state they need without it being passed explicitly, and unlike with closures, there should be no confusion between variables used solely within a helper function and variables that are being accessed from the shared state. (In the object case, all state will be prefixed with self.)
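As a toy sketch (the names here are hypothetical, loosely foreshadowing the scheduling example below), compare passing state explicitly with holding it on the instance:

# Shared state passed explicitly: the signature grows with the algorithm...
def start_task(time, task_queue, current_task, work_limit):
    ...

# ...or everything is bundled into one opaque collection.
def start_task_using(state):
    ...

# As a method, each piece of shared state is read from self and is visibly shared.
class Simulator:
    def _start_task(self):
        task_size, task_id = self.task_queue.pop()
        self.current_task = task_id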

This smell also leads into the other half of the post: simulations. Some simulations, such as simple Monte Carlo volume calculations or models of time-invariant processes, can have little to no state. In such cases it is generally worth considering whether direct simulation is even optimal, as closed-form solutions to such questions can often (though not always) be found.
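For instance, here is a minimal sketch of a Monte Carlo estimate of the unit sphere's volume; it carries essentially no state between samples, and the closed-form answer (4π/3) is of course already known:

import random

def sphere_volume_estimate(n_samples=1_000_000):
    """Estimate the volume of the unit sphere by sampling the enclosing cube."""
    cube_volume = 8  # the [-1, 1]^3 cube around the unit sphere
    hits = sum(
        1
        for _ in range(n_samples)
        if sum(random.uniform(-1, 1) ** 2 for _ in range(3)) <= 1
    )
    return cube_volume * hits / n_samples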

As the process being simulated becomes more complicated, it will often do so by increasing the amount of state it carries. For example, even a relatively simple simulation of a task scheduling strategy will generally require us to keep track of at least

  • the current (in-simulation) time,
  • the current tasks being completed, their sizes and projected end times,
  • the queue of incoming tasks.

A simulation object will often provide the simplest and clearest way to manage this state.

Example

As an example, let's consider a fairly simple scheduling simulation.

Suppose we have a worker and a series of tasks to complete with format (task_length, time_scheduled). If our worker always chooses the shortest available task and will only actively work for a limited amount of time, which tasks will be completed at a given point in time?

Here is an object to perform this simulation:

from heapq import heapify, heappop, heappush

class TaskSimulator:
    def __init__(self, work_limit, tasks):
        self.work_limit = work_limit
        self.event_queue = [
            (time_scheduled, 2, i, length)
            for i, (length, time_scheduled) in enumerate(tasks)
        ]
        heapify(self.event_queue)

        self.time = 0
        self.current_task = None
        self.task_queue = []
        self.completed = []

    def __call__(self, end_time):
        """Run the simulation until the specified time.
        Return all tasks completed."""
        if end_time < self.time:
            raise ValueError(
                f"`end_time` cannot be less than current simulation time: {self.time}."
            )
        heappush(self.event_queue, (end_time, 1))

        while True:
            time, event, *args = heappop(self.event_queue)
            self.time = time

            if event == 0:  # Current task finished
                self._finish_task()
            elif event == 1:  # end_time reached, terminate simulation
                return self.completed
            elif event == 2:  # New task requested
                self._add_task(*args)
            elif event == 3:  # Start work on a task. Successor event to 0 and 2.
                if self.current_task is None and self.task_queue:
                    self._start_task()
            else:
                raise ValueError(f"Unrecognised event code: {event}")

    def _add_task(self, task_id, task_size):
        """A new task is requested."""
        if task_size > self.work_limit:
            return

        heappush(self.task_queue, (task_size, task_id))

        if self.current_task is None:
            # Don't immediately start a task as a shorter one may be added simultaneously
            heappush(self.event_queue, (self.time, 3))

    def _finish_task(self):
        """A task is completed."""
        self.completed.append(self.current_task)
        self.current_task = None
        if self.task_queue:
            heappush(self.event_queue, (self.time, 3))

    def _start_task(self):
        """The worker chooses and starts work on a task."""
        task_size, task_id = heappop(self.task_queue)
        if task_size > self.work_limit:
            return

        self.current_task = task_id
        self.work_limit -= task_size
        heappush(self.event_queue, (self.time + task_size, 0))

To run the simulation, we first initialize an object with work_limit and tasks:

tasks = [(5, 1), (10, 3), (4, 6), (8, 7), (3, 8)]
simulation = TaskSimulator(20, tasks)

The simulation is then called with the given end_time. So

simulation(15)

will return [0, 2, 4].

Code Notes:

The linked blog post suggests the use of a Fluent Interface for interacting with these objects. I haven't used this as a matter of personal preference.

I am unhappy with the event codes used for sequencing simultaneous events here. They are functional, but not very self-explanatory. A better approach would be to use an IntEnum from the Python standard library's enum module; however, IntEnum only arrived in Python 3.4, so it may not be available on all systems. (The post is also getting long enough as it is.)
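For reference, a minimal sketch of what that might look like (not used in the code above):

from enum import IntEnum

class Event(IntEnum):
    """Event codes; their integer ordering still decides which simultaneous event runs first."""
    FINISH_TASK = 0
    END_SIMULATION = 1
    ADD_TASK = 2
    START_TASK = 3

# e.g. heappush(self.event_queue, (self.time, Event.START_TASK))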

It is possible to improve memory efficiency here by not recreating the task list in state, but the saving should be marginal, barring a very large input.

Tangential Benefits

An immediate difference from an algorithm written as a function is persistence: the object and its state survive after execution. This has both positives and negatives.

If we wish to run an object-based simulation again from scratch, with the same or different parameters, the object must first be re-initialized. (In this particular case a repeated call with the same end_time happens to return the correct result, but that is the exception rather than the rule - especially for stochastic simulations.)

On the other hand, the persistence of state after execution allows us to continue the same simulation to multiple different end points without having to repeat earlier calculations. In the example above, if we make a subsequent call of simulation(1000), we get [0, 2, 4, 3] and can see that task 1 is never going to be completed.
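Concretely, the follow-up call picks up from the state left behind by simulation(15):

simulation(1000)  # [0, 2, 4, 3]; resumes from t = 15 rather than replaying earlier events
simulation(10)    # raises ValueError: the simulation cannot be rewound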

We can also use this persistence to inspect the final state for the purpose of analysis. Obviously we could achieve this with a function, by returning the relevant state along with the output, but this would require prior knowledge of what we would like to inspect and we probably won't want the extra return value most of the time.
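For instance, after the simulation(1000) call above, the relevant state is simply sitting on the object:

simulation.time        # 1000
simulation.work_limit  # 0: the work budget is exhausted, so task 1 can never start
simulation.completed   # [0, 2, 4, 3]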

Conclusion

Overall, I would say that whilst many long algorithms are perfectly fine staying as functions, there is definitely a case that for some classes of algorithms (simulations in particular), refactoring into an object can be helpful. The key, as always, is to consider patterns on a case-by-case basis and not fall prey to Maslow's hammer.