
What Software Should I Learn? Pt. II
General Purpose Languages

In a previous post we talked about how learning Linux can make your life in graduate school a lot easier, particularly if you have to process large amounts of text. As we discussed, the shell is a text-processing machine. However, it quickly becomes cumbersome once you start dealing with other types of data such as numbers or arrays. If you’re going to be working with many different types of data in graduate school, it’s a good idea to learn a general-purpose language.
As the name suggests, general purpose languages can be used in many different applications. These include, but are not limited to:
- String manipulation and parsing (though often not as fast as using the shell’s built-in tools)
- Analyzing images
- Processing biological sequence data
- Numerical analysis, for instance solving systems of linear equations
- Machine learning
- Geospatial data analysis
- Analyzing tabular data
- Statistical analysis
We’ll go into more detail on the last two applications in a future post when we talk about the more specialized language R.
What options are out there?
As of the time of this writing, the two most popular languages in the world according to the TIOBE index are the general-purpose languages C and Java. However, neither of these is particularly easy to learn for newcomers. If you’re working in the life sciences it’s unlikely that you’ll need to learn either unless you want to develop more polished software that other people use. In this post, we’ll be focusing on Python, the currently dominant general purpose language used in the life sciences (and in many other fields). In a future post we’ll do a brief comparison between Python and a relatively new language: Julia. Here I’ll just say that Julia is a young, rapidly-evolving and well-thought out language that has a lot going for it.
When I started graduate school, there was more or less an even split between Python and Perl users – Perl was what many people knew, and Python was what they wanted to learn. Since then Python has rapidly ascended in popularity, possibly due to its adoption by Google and its resulting high level of adoption in machine learning applications. Python is currently ranked third on the TIOBE index, while Perl is 19th place. This is a stark reversal from Perl’s heyday in the mid 90’s, when it held the 3rd place ranking.
It seems that it’s case closed then – if you’re looking to learn a general purpose, first programming language in 2021, you can’t really go wrong with Python. That’s not to say that there aren’t any other alternative trajectories to consider. Many graduate students (myself included) end up learning R before Python, because they frequently work with the types of data that R excels at handling. However, the R language has many idiosyncrasies and is more specialized, so I wouldn’t necessarily recommend learning it first. R isn’t such a great choice once you have to work with non-tabular data. It’s also not a great choice when you need to process a file line-by-line, and it de-emphasizes iteration using loops, both of which I consider critical learning blocks of coding. Python and Julia can now also do many dataframe-oriented tasks reasonably well, albeit in a somewhat more clunky manner than the R ecosystem. At this point though, maybe it would be best to just look at a brief example.
Example – Reverse Complementation
A common task when working with DNA data is finding the reverse complement of a sequence. We typically write DNA sequences in the 5′-3′ (five prime to three prime) direction. The 5′-3′ orientation runs in opposite directions on the two DNA strands, so if we want to translate a sequence to its usual representation on the opposite strand, we must perform two tasks:
- Reverse the sequence
- Substitute each nucleotide with its complement (A for T, T for A, C for G, and G for C)
This gets a little complicated using the shell (although the bioawk program can do it out of the box). However, it would still be advantageous to be able to stream the file one line at a time instead of reading it all into memory, in case we encounter a really large file. This requires a general purpose language and largely precludes using a more specialized language such as R. I should point out that Perl, Python and Julia all have packages that can process sequence data and find reverse complements. However, it’s easy enough to try out the process for ourselves. First we can write a tiny Python script to generate a fasta file with some random 20-nucleotide sequences:
import random n_seqs = 5 n_bases = 20 out_file = '5random_20mers.fa' nucleotides = ('A', 'C', 'G', 'T') with open(out_file, 'w') as f: for i in range(n_seqs): seq_list = random.choices(nucleotides, k = n_bases) seq = ''.join(seq_list) print('>seq_' + str(i), file = f) print(seq, file = f)
In this script, we open our output file for writing using open()
with the ‘w’ option. This assigns a file object, or handle, which we can use as convenient shorthand for the file name. Then we sequentially loop from 1 to our number of desired sequences (sort of – we’ll get to that below), and the choices()
function from the random
module is used to randomly pick one of the four nucleotides 20 times. This gives us a list of 20 nucleotides, which then must be joined together without any delimiter (''.join()
) in order to form a string to output to our file. Then we just print our sequence identifier followed by the sequence itself to the output file. Normally we would also have to close our file, but using the open()
statement automatically closes the file when we’re done.
When we run this we get back a fasta file containing five random DNA sequences of 20 nucleotides each:
>seq_0
TCTGCAGCGACTCGGGGGTG
>seq_1
ACAGAAATTCATGTGAGAAA
>seq_2
GTCGTTCTAATATGGATGCG
>seq_3
GGCTCAAAAAACCGTCGCTG
>seq_4
TGGCAAGGGGCAGACTAAGG
This file nicely demonstrates that in Python, as in most languages, indices start at 0. So we asked for five sequences, and we get sequences labeled 0 through 4. In Julia and R, counting starts at 1, so if we had created this same file using one of those languages we would have sequences labeled 1 through 5. I don’t personally see this as a big problem, but as with anything you’ll find people on the internet who are really mad about languages in which indices start at 1.
Now let’s write a short Python script to generate the reverse complement of each sequence:
input_fasta = "5random_20mers.fa" comp_dict = {'A':'T', 'T':'A', 'C':'G', 'G':'C'} with open(input_fasta, 'r') as fa: for line in fa: line = line.rstrip() if line.startswith('>'): line = line + '_revcomp' else: line = line[::-1] line_list = list(line) for index, base in enumerate(line_list): line_list[index] = comp_dict[base] line = ''.join(line_list) print(line)
Here we use the time-honored tradition of lazy script coding by setting the path to our input file as a constant inside the script. We then define a dictionary containing each of the four complementary nucleotide pairings. Finally we loop through the file. If a line starts with “>”, that signifies that it is a sequence identifier. In that case we just append “_revcomp” to it and print it out. Otherwise we first reverse it, and then replace all the letters in it using our lookup dictionary. To do this we break apart the string into a list of individual characters, make our substitutions one at a time, and then reassemble our complemented string and print it out. The syntax to reverse the sequence is a little strange: line = line[::-1]
. This is related to Python’s syntax for slicing (i.e. subsetting) parts of strings. When we call print()
without any further arguments, we just print to standard out (the screen).
The snippet of code above demonstrates one property of Python code – that it is partially object-oriented. In order to remove trailing whitespace from the right-hand side of each line that the script reads in, we called line.rstrip()
. Here our line of text is an object (of type string), and the function rstrip()
is one of its methods. In general we can call an object’s method using the syntax object.method
. Whether you prefer object-oriented code vs. the more functional code of languages like R, Julia or Matlab is a matter of personal preference. One thing I personally don’t like about Python is that it’s partially object-oriented. This creates some inconsistent syntax. For instance, to split our line into a list of characters, we called list(line)
, not line.list()
. This leads to a lot of confusion (at least on my part), as it isn’t always clear when to call a base function, versus a method of a package, versus a method of an object. In addition, the syntax is sometimes non-intuitive. For instance, you might think that join would be a method of a list, to join it together into a string. However, above we call ''.join()
– join is a method of the delimiter (a string, in this case an empty one), and we supply it with the list.
Python + Shell
The code above works fine, and as I mentioned we used the time-honored lazy tradition of defining our input file inside the script. However, we could be lazier. If we are working in a Unix-like environment, we can use the shell to supply our input and output files. The script below has a few improvements over the previous one. Here we read our input from stdin using fileinput.input()
. Also, previously we had split apart our sequences, performed our complementation, and then reassembled them. However, since we don’t have to perform any additional tasks with the reverse-complement sequences, this is unneccessary. Instead we can just loop through the bases in the reversed sequence, find their complement, and print each one out, without any space between them. We just have to be sure to call print()
without any arguments at the end of each sequence – by default it will print out a newline:
#!/usr/bin/env python3 import fileinput comp_dict = {'A':'T', 'T':'A', 'C':'G', 'G':'C'} for line in fileinput.input(): line = line.rstrip() if line.startswith('>'): print(line + '_revcomp') else: for base in line[::-1]: print(comp_dict[base], end = '') print()
The funny looking comment at the beginning of the script is important. This is the shebang. It tells the shell which program to use to run the code, so that we don’t always need to type “python3” in front of our script to get it to run. We just need to make sure that this script has the ability to execute, by typing chmod +x revcomp.py
in a terminal. Then, we can run it like so (assuming the input fasta file and script are both in our current working directory):
cat 5random_20mers.fa | ./revcomp.py > 5random_20mers_revcomp.fa
If we look at the output file, we see the following:
>seq_0_revcomp
CACCCCCGAGTCGCTGCAGA
>seq_1_revcomp
TTTCTCACATGAATTTCTGT
>seq_2_revcomp
CGCATCCATATTAGAACGAC
>seq_3_revcomp
CAGCGACGGTTTTTTGAGCC
>seq_4_revcomp
CCTTAGTCTGCCCCTTGCCA
Nice. Now we can easily run our script on any fasta file without having to modify any text inside the script itself. This is key when you need to write some code and get results fast. To test out our new script, we can run our reverse complementation twice:
cat 5random_20mers.fa | ./revcomp.py | ./revcomp.py > 5random_20mers_2RC.fa
Then we can compare the two files on the command line using diff
. If everything worked as expected, we should see that only the ID lines differ. Specifically, the ID lines in the output file should have “_revcomp_revcomp” appended:
$ diff 5random_20mers.fa 5random_20mers_2RC.fa
1c1
< >seq_0
---
> >seq_0_revcomp_revcomp
3c3
< >seq_1
---
> >seq_1_revcomp_revcomp
5c5
< >seq_2
---
> >seq_2_revcomp_revcomp
7c7
< >seq_3
---
> >seq_3_revcomp_revcomp
9c9
< >seq_4
---
> >seq_4_revcomp_revcomp
Going Further
That’s about all there is to it for our simple reverse complementation script. However, there are a few further improvements that could be performed. For one, longer sequences in fasta files may be wrapped at a certain number of characters (often 80), so that sequences are listed across multiple lines. Parsing this kind of file is much more challenging then what we did here. In addition, real fasta files may contain ambiguity characters in addition to the four bases, which would require a larger replacement dictionary. I’ll leave these as a challenge for you to explore if you so choose.