Project code: https://github.com/n-Holmes/elemental
Project and Goals
This was a fun little side project I started to try and get to grips with argparse
.
The idea is to take a word or string of text and rewrite it using only the element
symbols from the periodic table of elements.
As an example, the longest english word that can be written in this way is nonrepresentationalism, which can be written NoNRePReSeNTaToPNaLiSm, combining the symbols for Nobelium, Nitrogen, Rhenium, Phosphorus and so on.
The end goal is to create a simple command line app that can be called with either a string or the location of a text file and re-case the provided text to match the element symbols as well as outputting the elements used.
Code
Since the app is going to consist of a thin command line wrapper around a single algorithm, it makes sense to get the algorithmic part up and running to at least a minimum-viable-solution stage before starting on the interface.
Encoding algorithm
The first complication that becomes evident when trying to find a solution for the encoding problem is that a greedy algorithm is insufficient. Take for example the word Calypso.
For the first symbol in the encoding of this word, we could choose either Carbon or Calcium. Calcium may seem like the better choice as it gives us the first two letters rather than just one, but leaves us with the problem that neither L or Ly is a symbol for an element. If we instead choose to start with C, we can use Aluminium for the next symbol and the rest of the word can be encoded.
In this case, the problem with a choice appeared straight away. This is quite likely due to the limited choices of symbols, but in theory there is nothing stopping a bad choice from only becoming evident after several further decisions. To get around this problem, we need to implement some form of backtracking.
In this case I went for a simple depth-first search, using the fact that Python lists function roughly as a stack (O(1) append
and pop
methods). In addition, since some strings have multiple options, I opted to find them all for the time being.
In this code message
is the string to be encoded and symbols
is a dictionary of symbol
:full_name
pairs (all lower case).
msg_len = len(message) stack = [("", [], 0)] # Working list of partial encodings encodings = [] # Full encodings found while stack: partial, symlist, score = stack.pop() part_len = len(partial) if part_len == msg_len: encodings.append((partial, symlist, score)) continue # Find the next letter for i, char in enumerate(message[part_len:]): if char.isalpha(): break else: encodings.append((partial + message[part_len:], symlist, score)) continue partial += message[part_len:part_len + i] part_len += i # Find possible next symbols options = [message[part_len].lower()] if msg_len - part_len >= 2: options.append(message[part_len : part_len + 2].lower()) # Check if each option is actually a symbol for option in options: if option in symbols: stack.append( ( partial + option.capitalize(), # Proper element format symlist + [option], score + symlist.count(option), ) )
for-else
Possibly the most interesting section of this code and likely the most unfamiliar to many is the inclusion of an else
clause to the first for
loop. This creates a branch that is reached only when the loop terminates without reaching a break
clause. The same structure can be used with while
or for the full try-except-else-finally
conditional.
The use of else
in this way is somewhat controversial within the Python community, primarily because of a slightly confusing choice of naming for historical reasons. (nobreak
as a better name has been suggested by Raymond Hettinger.) However, regardless of the naming, this structure provides a succinct way to deal with a common problem: searching for an object in a collection, and differentiating behaviour based on whether or not the object was found.
scores
The scores being stored along with each encoding are a measurement of how often the same symbols are being reused. This will be useful for choosing a "best" encoding down the line.
Command Line Interface
The command line section of the project turned out to be remarkably simple, thanks to the standar library's argparse
module. Initially I had only planned on having two modes (string as an argument or path to a text file), however after realizing quite how hard it is to find common words which will encode at all, I decided to add a third, interactive mode, to allow for faster trial and error. Since no information needs to be provided in this case, this can be set to the default mode, with flags to activate the other two.
The final code for parsing arguments ended up being only five lines:
parser = argparse.ArgumentParser( description="Rewrites strings as combinations of element symbols." ) mode = parser.add_mutually_exclusive_group() mode.add_argument("-f", "--file", help="Specify a file to encode the contents of.") mode.add_argument("-s", "--string", help="Specify a string to encode.") args = parser.parse_args()
Draw the rest of the owl
The remaining work consists of arranging the code into a package, loading in the list of elements from an easily editable .csv file and handling the arguments collected in args
. None of this was particularly difficult or interesting, but the whole of the code can be viewed at the github repo for those interested.
Testing
There isn't a huge amount of complication in this project to test on, but for the sake of good practice, I implemented some unit tests around the encoding algorithm to catch any problems with changing code. The three things we can easily check are:
- The encoder can encode some valid string.
- The encoder throws the desired exception when attempting to encode an invalid string.
- The encoder will prioritize less repetitive encodings for longer strings.
My preferred testing framework is pytest
for its ease of use and versatility. The test functions for each of these checks are:
def test_valid_encoding(): """Ensure that the encoder is creating a string with characters matching the input. Also make sure that a valid amount of symbols are being used. """ test_string = "NatIon" encoded, elements = encoder.encode(test_string) assert encoded.lower() == test_string.lower() assert len(test_string) / 2 <= len(elements) <= len(test_string) def test_invalid_encoding(): """Ensure that the encoder throws the correct error when a string cannot be encoded. """ test_string = "nathan" with pytest.raises(encoder.SymbolError): assert encoder.encode(test_string) def test_avoid_repetitions(): """Ensure that the encoder uses different substitutions where possible.""" short_string = "fer" encoded_short, _ = encoder.encode(short_string) encoded_long, _ = encoder.encode(short_string * 3) assert encoded_long != encoded_short * 3
Here encoder
is a module containing a function encode
which finds all valid encodings using the code we discussed earlier and then returns the least repetitive one.
I particularly appreciate the context manager provided by pytest
for exception checking. Previously I have used
try: function_raising_exception() except WantedException: pass else: assert False
But this is quite clunky in comparison.
Conclusion
The project ended up being fairly simple and resulted in a fun tool to play around with. argparse
was less complicated than expected and I got to try out some different test functionality.
Some simple next steps might be:
- wrap the app in a GUI (gooey seems like a good candidate)
- package as an executable
- use the tool with some large body of English text to see exactly how often words can be encoded (on first run this seems to hover around 20%)