Pythonics - Elemental - Writing with Chemistry

n-Holmes 2019-07-21

Elemental - Writing with Chemistry

Project code: https://github.com/n-Holmes/elemental

Project and Goals

This was a fun little side project I started to try and get to grips with argparse. The idea is to take a word or string of text and rewrite it using only the element symbols from the periodic table of elements.

Periodic Table

As an example, the longest english word that can be written in this way is nonrepresentationalism, which can be written NoNRePReSeNTaToPNaLiSm, combining the symbols for Nobelium, Nitrogen, Rhenium, Phosphorus and so on.

The end goal is to create a simple command line app that can be called with either a string or the location of a text file and re-case the provided text to match the element symbols as well as outputting the elements used.

Code

Since the app is going to consist of a thin command line wrapper around a single algorithm, it makes sense to get the algorithmic part up and running to at least a minimum-viable-solution stage before starting on the interface.

Encoding algorithm

The first complication that becomes evident when trying to find a solution for the encoding problem is that a greedy algorithm is insufficient. Take for example the word Calypso.

For the first symbol in the encoding of this word, we could choose either Carbon or Calcium. Calcium may seem like the better choice as it gives us the first two letters rather than just one, but leaves us with the problem that neither L or Ly is a symbol for an element. If we instead choose to start with C, we can use Aluminium for the next symbol and the rest of the word can be encoded.

In this case, the problem with a choice appeared straight away. This is quite likely due to the limited choices of symbols, but in theory there is nothing stopping a bad choice from only becoming evident after several further decisions. To get around this problem, we need to implement some form of backtracking.

In this case I went for a simple depth-first search, using the fact that Python lists function roughly as a stack (O(1) append and pop methods). In addition, since some strings have multiple options, I opted to find them all for the time being.

In this code message is the string to be encoded and symbols is a dictionary of symbol:full_name pairs (all lower case).

msg_len = len(message)
stack = [("", [], 0)] # Working list of partial encodings
encodings = [] # Full encodings found

while stack:
    partial, symlist, score = stack.pop()
    part_len = len(partial)
    if part_len == msg_len:
        encodings.append((partial, symlist, score))
        continue

    # Find the next letter
    for i, char in enumerate(message[part_len:]):
        if char.isalpha():
            break
    else:
        encodings.append((partial + message[part_len:], symlist, score))
        continue

    partial += message[part_len:part_len + i]
    part_len += i

    # Find possible next symbols
    options = [message[part_len].lower()]
    if msg_len - part_len >= 2:
        options.append(message[part_len : part_len + 2].lower())

    # Check if each option is actually a symbol
    for option in options:
        if option in symbols:
            stack.append(
                (
                    partial + option.capitalize(),  # Proper element format
                    symlist + [option],
                    score + symlist.count(option),
                )
            )

for-else

Possibly the most interesting section of this code and likely the most unfamiliar to many is the inclusion of an else clause to the first for loop. This creates a branch that is reached only when the loop terminates without reaching a break clause. The same structure can be used with while or for the full try-except-else-finally conditional.

The use of else in this way is somewhat controversial within the Python community, primarily because of a slightly confusing choice of naming for historical reasons. (nobreak as a better name has been suggested by Raymond Hettinger.) However, regardless of the naming, this structure provides a succinct way to deal with a common problem: searching for an object in a collection, and differentiating behaviour based on whether or not the object was found.

scores

The scores being stored along with each encoding are a measurement of how often the same symbols are being reused. This will be useful for choosing a "best" encoding down the line.

Command Line Interface

The command line section of the project turned out to be remarkably simple, thanks to the standar library's argparse module. Initially I had only planned on having two modes (string as an argument or path to a text file), however after realizing quite how hard it is to find common words which will encode at all, I decided to add a third, interactive mode, to allow for faster trial and error. Since no information needs to be provided in this case, this can be set to the default mode, with flags to activate the other two.

The final code for parsing arguments ended up being only five lines:

parser = argparse.ArgumentParser(
    description="Rewrites strings as combinations of element symbols."
)
mode = parser.add_mutually_exclusive_group()
mode.add_argument("-f", "--file", help="Specify a file to encode the contents of.")
mode.add_argument("-s", "--string", help="Specify a string to encode.")
args = parser.parse_args()

Draw the rest of the owl

The remaining work consists of arranging the code into a package, loading in the list of elements from an easily editable .csv file and handling the arguments collected in args. None of this was particularly difficult or interesting, but the whole of the code can be viewed at the github repo for those interested.

Testing

There isn't a huge amount of complication in this project to test on, but for the sake of good practice, I implemented some unit tests around the encoding algorithm to catch any problems with changing code. The three things we can easily check are:

The encoder can encode some valid string.
The encoder throws the desired exception when attempting to encode an invalid string.
The encoder will prioritize less repetitive encodings for longer strings.

My preferred testing framework is pytest for its ease of use and versatility. The test functions for each of these checks are:

def test_valid_encoding():
    """Ensure that the encoder is creating a string with characters matching the input.
    Also make sure that a valid amount of symbols are being used.
    """
    test_string = "NatIon"
    encoded, elements = encoder.encode(test_string)
    assert encoded.lower() == test_string.lower()
    assert len(test_string) / 2 <= len(elements) <= len(test_string)


def test_invalid_encoding():
    """Ensure that the encoder throws the correct error when a string cannot be
    encoded.
    """
    test_string = "nathan"
    with pytest.raises(encoder.SymbolError):
        assert encoder.encode(test_string)


def test_avoid_repetitions():
    """Ensure that the encoder uses different substitutions where possible."""
    short_string = "fer"
    encoded_short, _ = encoder.encode(short_string)
    encoded_long, _ = encoder.encode(short_string * 3)
    assert encoded_long != encoded_short * 3

Here encoder is a module containing a function encode which finds all valid encodings using the code we discussed earlier and then returns the least repetitive one.

I particularly appreciate the context manager provided by pytest for exception checking. Previously I have used

try:
    function_raising_exception()
except WantedException:
    pass
else:
    assert False

But this is quite clunky in comparison.

Conclusion

The project ended up being fairly simple and resulted in a fun tool to play around with. argparse was less complicated than expected and I got to try out some different test functionality.

Some simple next steps might be:

wrap the app in a GUI (gooey seems like a good candidate)
package as an executable
use the tool with some large body of English text to see exactly how often words can be encoded (on first run this seems to hover around 20%)