This post is largely my thoughts on Andy G's Algorithms as Objects, specifically in relation to simulation algorithms. The original article makes a case for transforming complicated algorithms into objects in order to deal with five code smells:
- Code that is long or deeply nested
- Section comments
- Excessive closures
- Single-purpose helper functions polluting the namespace
- Lots of state being passed between functions
The solution presented to these issues is to refactor the algorithm into an object, allowing helper functions (now private methods) to be extracted without having to pass the state to them (that being stored in the object instance).
As with all code paradigms, the proposed pattern has its advantages and disadvantages and it is important to be cautious of taking any particular methodology as gospel. (Beware the AbstractSingletonProxyFactoryBean!) On the other hand, it is important not to dismiss new patterns before considering the cases in which they may prove useful. In particular, there is a strong case for applying this pattern in the case of simulations.
The smells and their solutions
1, 2, 4: Long code, section comments, helper functions which should be contained.
These smells are likely the weakest arguments for objectifying an algorithm. They are all definitely code smells, but all work in favor of refactoring logic into closures just as well as into an object. The closure approach is also slightly neater in that it avoids the need for an extra run call. For example:
    def algorithm_with_closures(*args):
        """An algorithm to do a thing."""

        def sub_logic(bar):
            """Some sub-logic in a closure."""
            do_something(bar)
            ...

        ...
        return sub_logic(bar)


    class AlgorithmAsObject:
        """The same logic as an object."""

        def __init__(self, *args):
            self.bar = args[0]
            ...

        def _sub_logic(self):
            """The same sub-logic as a non-public method."""
            do_something(self.bar)
            ...

        def run(self):
            """Execute the algorithm."""
            ...
            return self._sub_logic()
The way of calling each of these is, respectively:
    output = algorithm_with_closures(*args)

    algo = AlgorithmAsObject(*args)
    output = algo.run()
Python does allow us to emulate a function by changing run to __call__, which would let us run the algorithm with algo() or just AlgorithmAsObject(*args)(), though the latter is likely to cause confusion.
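As a minimal sketch of the __call__ variant, here is a toy algorithm object. The Doubler class and its helper are hypothetical names invented purely for illustration; they are not from the original article.

```python
class Doubler:
    """Hypothetical toy algorithm object, used only to show `__call__`."""

    def __init__(self, values):
        # State the algorithm works on, stored once on the instance.
        self.values = list(values)

    def _double(self, value):
        """Non-public helper, standing in for extracted sub-logic."""
        return 2 * value

    def __call__(self):
        """Execute the algorithm when the instance itself is called."""
        return [self._double(value) for value in self.values]


algo = Doubler([1, 2, 3])
output = algo()                     # two-step form
same_output = Doubler([1, 2, 3])()  # one-line form; easy to misread
```

Both forms give the same result. The one-line form hides the fact that an object is created and then immediately discarded, which is part of why it can confuse readers.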
Ultimately, whilst each of these code smells may indeed signal poor code and an object can be a solution to them, they do not, by themselves, suggest that an object is the correct solution. Indeed, for the toy example above, closures seem cleaner.
3: "Helper functions as nested closures, but it's still too long"
For the previous three smells, we saw that an object could clean them up, but was not necessarily the correct solution. For this smell (if it even is one), there seems to be no real benefit in refactoring the algorithm into an object. The code will still be around the same length and the nesting is likely to be largely identical. I'm honestly not sure how the change will help with readability here.
5: Passing state between helper functions
This smell, finally, gives us a compelling reason to refactor an algorithm into an object. Often in large algorithms many sections of logic will require access to the same or overlapping sets of state. This leads to either very long function calls (themselves a code smell) or the passing around of a state collection, which often contains more information than any individual function needs and makes functions harder to decipher.
Objects are a natural choice for the management of this problem, as they are inherently stateful. In addition, since all of an object's variables should be pre-defined within its __init__, we have a natural place to define (and explain if necessary) the elements of state needed by the algorithm.
Individual methods can access the state they need without needing it to be passed explicitly and, unlike with closures, there should be no confusion between variables used solely within a helper function and variables that are being accessed from the state. (In the object case, all state will be prefixed with self.)
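To make the contrast concrete, here is a small sketch of the same toy calculation written both ways. The settlement example, and every name in it, is a hypothetical illustration rather than anything from the original article.

```python
# Function style: every helper needs the shared state passed in explicitly.
def _apply_deposit(balance, history, amount):
    history.append(("deposit", amount))
    return balance + amount


def _apply_fee(balance, history, fee_rate):
    fee = balance * fee_rate
    history.append(("fee", fee))
    return balance - fee


def settle_with_functions(deposits, fee_rate):
    balance, history = 0.0, []
    for amount in deposits:
        balance = _apply_deposit(balance, history, amount)
    balance = _apply_fee(balance, history, fee_rate)
    return balance, history


# Object style: state is declared once in __init__ and read via self.
class Settlement:
    def __init__(self, deposits, fee_rate):
        self.deposits = deposits  # input: amounts to add
        self.fee_rate = fee_rate  # input: fraction charged at the end
        self.balance = 0.0        # running total, updated by the helpers
        self.history = []         # audit trail of (kind, amount) pairs

    def _apply_deposit(self, amount):
        self.history.append(("deposit", amount))
        self.balance += amount

    def _apply_fee(self):
        fee = self.balance * self.fee_rate  # `fee` is local; `self.*` is state
        self.history.append(("fee", fee))
        self.balance -= fee

    def run(self):
        for amount in self.deposits:
            self._apply_deposit(amount)
        self._apply_fee()
        return self.balance, self.history
```

With only two pieces of shared state the function version is still manageable, but each additional piece lengthens every helper signature, whereas the object version only gains a line in __init__.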
This smell also leads into the other half of the post: simulations. Some simulations, such as simple Monte Carlo volume calculations or models for time-invariant processes, can have little to no state. In such cases it is generally worth considering whether direct simulation is even optimal, as closed form solutions to such questions can often (though not always) be found.
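As a sketch of such a near-stateless simulation, the following estimates the volume of the unit ball by Monte Carlo sampling and compares it with the closed-form value. The function name, sample count and seed are arbitrary choices made for illustration.

```python
import math
import random


def monte_carlo_ball_volume(samples, seed=0):
    """Estimate the unit ball's volume by sampling the cube [-1, 1]^3."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    hits = 0
    for _ in range(samples):
        x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
        if x * x + y * y + z * z <= 1.0:
            hits += 1
    cube_volume = 8.0  # volume of the enclosing cube
    return cube_volume * hits / samples


estimate = monte_carlo_ball_volume(100_000)
exact = 4.0 / 3.0 * math.pi  # closed-form volume of the unit ball
```

The only state here is a hit counter, so a plain function is entirely adequate, and the closed-form answer makes the simulation unnecessary in the first place.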
As the process being simulated becomes more complicated, it will often do so by increasing the amount of state. For example, even a relatively simple simulation of a task scheduling strategy will generally require us to keep track of at least
- the current (in-simulation) time,
- the current tasks being completed, their sizes and projected end times,
- the queue of incoming tasks.
A simulation object will often provide the simplest and clearest way to manage this state.
Example
As an example, let's consider a fairly simple scheduling simulation.
Suppose we have a worker and a series of tasks to complete with format (task_length, time_scheduled). If our worker always chooses the shortest available task and will only actively work for a limited amount of time, which tasks will be completed at a given point in time?
Here is an object to perform this simulation:
    from heapq import heapify, heappop, heappush


    class TaskSimulator:
        def __init__(self, work_limit, tasks):
            self.work_limit = work_limit
            self.event_queue = [
                (time_scheduled, 2, i, length)
                for i, (length, time_scheduled) in enumerate(tasks)
            ]
            heapify(self.event_queue)
            self.time = 0
            self.current_task = None
            self.task_queue = []
            self.completed = []

        def __call__(self, end_time):
            """Run the simulation until the specified time. Return all tasks completed."""
            if end_time < self.time:
                raise ValueError(
                    f"`end_time` cannot be less than current simulation time: {self.time}."
                )
            heappush(self.event_queue, (end_time, 1))
            while True:
                time, event, *args = heappop(self.event_queue)
                self.time = time
                if event == 0:
                    # Current task finished
                    self._finish_task()
                elif event == 1:
                    # end_time reached, terminate simulation
                    return self.completed
                elif event == 2:
                    # New task requested
                    self._add_task(*args)
                elif event == 3:
                    # Start work on a task. Successor event to 0 and 2.
                    if self.current_task is None and self.task_queue:
                        self._start_task()
                else:
                    raise ValueError(f"Unrecognised event code: {event}")

        def _add_task(self, task_id, task_size):
            """A new task is requested."""
            if task_size > self.work_limit:
                return
            heappush(self.task_queue, (task_size, task_id))
            if self.current_task is None:
                # Don't immediately start a task as a shorter one may be added simultaneously
                heappush(self.event_queue, (self.time, 3))

        def _finish_task(self):
            """A task is completed."""
            self.completed.append(self.current_task)
            self.current_task = None
            if self.task_queue:
                heappush(self.event_queue, (self.time, 3))

        def _start_task(self):
            """The worker chooses and starts work on a task."""
            task_size, task_id = heappop(self.task_queue)
            if task_size > self.work_limit:
                return
            self.current_task = task_id
            self.work_limit -= task_size
            heappush(self.event_queue, (self.time + task_size, 0))
To run the simulation, we first initialize an object with work_limit and tasks:
    tasks = [(5, 1), (10, 3), (4, 6), (8, 7), (3, 8)]
    simulation = TaskSimulator(20, tasks)
The simulation is then called with the given end_time. So simulation(15) will return [0, 2, 4].
Code Notes:
The linked blog post suggests the use of a Fluent Interface for interacting with these objects. I haven't used this as a matter of personal preference.
I am unhappy with the event codes used for sequencing of simultaneous events here. They are functional, but not very self-explanatory. A better approach would be to use an IntEnum from the Python standard library's enum module; however, these were only added in Python 3.4, so may not be available on all systems. (The post is also getting long enough as it is.)
It is possible to improve memory efficiency here by not recreating the task list in state, but this should be marginal, barring a very large input.
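For what the IntEnum idea in the notes above might look like, here is a minimal sketch. The member names are my own invention; only the integer values 0 to 3 and their tie-breaking order are taken from the example code.

```python
from enum import IntEnum
from heapq import heappop, heappush


class Event(IntEnum):
    """Event codes; lower values are processed first when times are equal."""

    TASK_FINISHED = 0
    END_OF_RUN = 1
    TASK_REQUESTED = 2
    START_WORK = 3


# IntEnum members compare as integers, so they can sit in heap tuples unchanged.
queue = []
heappush(queue, (6, Event.TASK_REQUESTED, 2, 4))
heappush(queue, (6, Event.TASK_FINISHED))
heappush(queue, (6, Event.START_WORK))

# Pop the three simultaneous events and record the order they come out in.
order = [heappop(queue)[1] for _ in range(3)]
```

Because the members are still integers, the dispatch in __call__ could compare against Event.TASK_FINISHED and friends without any change to how the heap orders simultaneous events.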
Tangential Benefits
An immediate difference from an algorithm written as a function is persistence. This has both positives and negatives.
If we wish to run an object-based simulation again with the same or different parameters, the object must be re-initialized. (In this case running with the same parameters actually works, but this is the exception, rather than the rule - especially with stochastic simulations.)
On the other hand, the persistence of state after execution allows us to continue the same simulation to multiple different end points without having to repeat earlier calculations. In the example above, if we make a subsequent call of simulation(1000), we get [0, 2, 4, 3] and can see that task 1 is never going to be completed.
We can also use this persistence to inspect the final state for the purpose of analysis. Obviously we could achieve this with a function, by returning the relevant state along with the output, but this would require prior knowledge of what we would like to inspect and we probably won't want the extra return value most of the time.
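Both effects of persistence can be seen in a much smaller stochastic example. The sketch below is a hypothetical seeded random walk, not part of the scheduling example; its class and attribute names are invented for illustration.

```python
import random


class RandomWalk:
    """Hypothetical one-dimensional random walk that can be resumed."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # generator state persists between calls
        self.step = 0                   # number of steps taken so far
        self.position = 0               # current position
        self.max_position = 0           # extra state we may want to inspect later

    def __call__(self, end_step):
        """Advance the walk to `end_step` and return the position there."""
        if end_step < self.step:
            raise ValueError(
                f"`end_step` cannot be less than current step: {self.step}."
            )
        while self.step < end_step:
            self.position += self.rng.choice((-1, 1))
            self.max_position = max(self.max_position, self.position)
            self.step += 1
        return self.position


walk = RandomWalk(seed=42)
at_10 = walk(10)
at_20 = walk(20)  # resumes from step 10 rather than starting over
fresh_at_20 = RandomWalk(seed=42)(20)  # a new object is needed to start again
peak = walk.max_position  # inspected after the fact, never returned
```

Continuing the existing object to step 20 gives the same position as a fresh run to step 20 with the same seed, and max_position is available for analysis even though the call never returns it.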
Conclusion
Overall, I would say that whilst many long algorithms are perfectly fine staying as functions, there is definitely a case that, for some classes of algorithms (and simulations in particular), refactoring into an object can be helpful. The key, as always, is to consider patterns on a case-by-case basis and not fall prey to Maslow's hammer.