
Julia vs. Python

  • October 17, 2021
  • by brian_ward

Previously, we went through an example of performing reverse complementation of nucleotide sequences using Python. Python is a very popular language, and for good reason. It’s easy to learn (relatively speaking), is used widely in industry and academia, and boasts hundreds of thousands of user-developed packages. In the previous post I briefly mentioned a relatively new language – Julia. Recently I’ve been reaching for Julia more and more frequently for tasks that I once used Python for.

Julia, daughter of Julius Caesar and fourth wife of Pompey the Great

For me, the decision to transition towards Julia comes down to one main factor – I find it easier to read and write than Python (more on that below). I also don’t need many of the more specialized packages available for Python, the exception possibly being some machine learning packages such as scikit-learn. Of course the merits of each of these two languages are somewhat subjective, and the best way to learn which you prefer is to try doing a bit of programming in both.

However, before we get too invested in comparisons, I should point out that it probably doesn’t matter too much which one you learn first. Perl, Python, and Julia each have their pros and cons, but unless you happen to be developing more professional code, it’s more important to focus on the problems that you want to solve. The arcane details of the inner workings of different languages are not likely to be very important when you’re just starting out. If you happen to be working in the life sciences, each of these languages also has its own “Bio” ecosystem – BioPerl, Biopython, and BioJulia. These all contain various modules or packages that are designed to handle the biological data files that you are likely to work with.

Reverse Complementation Script

Before proceeding, let’s create a new and improved reverse complementation script in Julia:

#!/usr/bin/env julia


"""
Nucleotide complement function

Takes as input one character corresponding to a base or IUPAC ambiguity code
and returns its complement character
"""
function complement(b::Char)

    ## Define complementation dictionary
    comp_dict = Dict(
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        'U' => 'A',
        'W' => 'W',
        'S' => 'S',
        'M' => 'K',
        'K' => 'M',
        'R' => 'Y',
        'Y' => 'R',
        'B' => 'V',
        'V' => 'B',
        'D' => 'H',
        'H' => 'D',
        'N' => 'N'
    )

    ## Test if the input base is lowercase
    ## using the ternary condition ? then : else syntax
    lower = islowercase(b) ? true : false

    ## Convert to uppercase if required; attempt to perform complementation
    b = uppercase(b)
    if b ∈ keys(comp_dict)
        c = comp_dict[b]
    else
        error("Input should be an IUPAC nucleotide code\n
               see https://www.insdc.org/documents/feature_table.html#7.4.1")
    end

    ## Return complement (convert back to lowercase if input was lowercase)
    ## semicolons are used to place the if block on a single line
    if lower; c = lowercase(c); end
    return c
end


#### Executable ##########

for line in eachline(stdin)
    if startswith(line, ">")
        println(line * "_revcomp")
    else
        for base in reverse(line)
            print(complement(base))
        end
        print("\n")
    end
end

In general the process is the same as in our Python script. However, there are a few improvements. We have written a function, complement(), which handles the input more robustly than our Python script’s bare lookup dictionary did. While we still use a dictionary to perform the complementation, this time we include all the various IUPAC ambiguity codes. In addition, we can handle the RNA-specific base uracil (U), and the function will throw an exception if it encounters input that isn’t a valid IUPAC code.

Syntax

Beyond that, there are some obvious syntax differences compared to the Python script; for instance, Julia requires the keyword end to signify the end of a code block such as an if or for statement, while Python relies only on indentation. In Python you can concatenate strings together using “+”, while “*” does the equivalent in Julia. In Julia, print() writes its arguments without appending a newline, while println() appends one. Julia’s syntax for defining a dictionary also differs somewhat from Python’s, using the “=>” arrow to denote each key/value pairing.
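
To make these differences concrete, here is a small illustrative snippet (not taken from the script above):

## String concatenation uses *, and code blocks are closed with `end`
greeting = "Hello" * ", " * "world"

if startswith(greeting, "Hello")
    println(greeting)    ## prints the string followed by a newline
end

## Dictionaries pair each key with its value using the => arrow
codons = Dict("ATG" => "Met", "TGG" => "Trp")
print(codons["ATG"])     ## prints "Met" with no trailing newline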

One nice feature included in Julia is support for many Unicode characters. At first I thought this was a bit gimmicky, but it’s actually really great. It’s much easier to read code when you can write π instead of pi, or λ instead of lambda. You can see a bit of this in the code above where we use the set membership operator, so that we write: if b ∈ keys(comp_dict) instead of if b in keys(comp_dict).
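
For example, in the REPL (output shown for illustration):

julia> π
π = 3.1415926535897...

julia> 'G' ∈ "ACGT"
true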

There is one subtle Julia-specific gotcha present in the code above. In Python single quotes or double quotes can both be used to denote strings. In Julia double quotes denote strings, while single quotes denote characters, so:

julia> "A" == "A"
true

julia> 'A' == 'A'
true

julia> "A" == 'A'
false

‘A’ is a character, while “A” is a string of length one. Looping through a string yields characters, and a dictionary can’t be used to replace characters with strings, or vice versa, so the dictionary above wouldn’t work for our case if we defined the letters with double quotes. We can specify that the complement() function only accept a character as input when we write ::Char to limit the type of input. In general characters can be concatenated together into strings, for instance using the join() function, but again in our case we can merely print out our complemented characters one at a time.
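
For instance, in the REPL, join() can assemble a vector of characters back into a string:

julia> join(['G', 'C', 'A', 'T'])
"GCAT"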

Another not-so-small difference here is that Julia is not object-oriented. In general you won’t see the object.method syntax that is common in Python, except when explicitly specifying which module or package a particular function belongs to. Rather, Julia uses a sophisticated multiple dispatch system as one of its core paradigms, so functions can perform different actions depending on the types of the arguments they are called with. For instance, here we use reverse() to reverse a string, but we could also call reverse() on a vector of integers: reverse([1, 2, 3, 4]), which yields [4, 3, 2, 1]. The lack of object orientation makes Julia’s syntax somewhat more similar to R’s. I personally find this very convenient, as I also use R quite extensively.
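
As a small illustration of dispatch (a sketch, not part of the script above), we could add a second method of complement() that operates on whole strings, and Julia would choose between the two methods based on the type of the argument:

## A hypothetical second method: complement an entire sequence at once
## (relies on the complement(b::Char) method defined earlier)
function complement(seq::AbstractString)
    return join(complement(b) for b in seq)
end

complement('A')       ## calls the Char method, returning 'T'
complement("GATC")    ## calls the String method, returning "CTAG"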

Speed

The fastest languages are all compiled – this means that you write your code, and then run it through a program called a compiler, which translates it into machine code. Machine code is low-level, not human-interpretable, and tailored to the computer’s hardware. Machine code compiled from good higher-level code can be extremely fast. Modern compilers are highly optimized and can even catch and improve some forms of poor coding. In contrast, Python and R are both interpreted languages. In an interpreted language, a program called an interpreter directly executes code, without first converting it to machine code. Interpreted languages tend to be slower but “snappier” for interactive work.

You might be thinking to yourself that Python and R don’t seem that slow. This is partially true – a lot of functions in popular packages run very well. However, if you take a look under the hood, you’ll see something surprising: many of these functions are written in compiled languages such as C, C++, or Fortran. This has a few consequences. For one, it means that if you want to become a good Python or R developer, you often must, somewhat counterintuitively, become proficient in these other languages. In addition, if you as an end user want to dig into a package’s functions and see how they work, you’ll be out of luck unless you’re familiar with these languages.

Julia is interesting in that it is sort of a hybrid of the two approaches. It uses a just-in-time (JIT) compiler, so when you run any code, it is first compiled and then run immediately afterwards. This has a few ramifications. First, a Julia script will typically run substantially slower the first time it is run, since it must also be compiled. If you run it again in the REPL (the read-evaluate-print loop user interface), it will be much, much faster. Second, Julia can be used interactively, but in this application it definitely feels a little more clunky than a language such as Python or R (though there are some workarounds to this). Julia’s developers claim that it can match the speed of older compiled languages such as C and Fortran. I don’t doubt this is the case in some scenarios. I have a harder time believing that it can generally match the speed of these languages over a wide range of applications, but maybe I’ll be proven wrong. Recently Julia became the newest language to successfully perform calculations on a supercomputer at the petaflop scale – until then the only other languages used for this feat were C, C++ and Fortran.
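
You can see the warm-up effect in the REPL with the @time macro. After defining complement() from the script above, a session would look something like the following (a sketch; the actual numbers depend entirely on your machine, so I've omitted them):

julia> @time complement('A')    ## first call: the reported time includes compiling complement()

julia> @time complement('A')    ## second call: already compiled, so much faster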

Below I’ll try timing our two reverse complementation scripts head-to-head. However, we’ll up the stakes a little bit by giving them 10 million 20-mers to reverse complement, instead of just five. Here are the timings on my four-year-old Toshiba laptop:

$ time cat 10million_random_20mers.fa | ./revcomp_std.py > 10M_20mers_RC.fa

real	1m41.934s
user	1m39.669s
sys	0m1.689s

$ time cat 10million_random_20mers.fa | ./revcomp_std.jl > 10M_20mers_RC.fa

real	3m2.536s
user	2m58.627s
sys	0m2.696s

Here the Julia script takes about twice as long to run as the Python script. Of course, the Julia script we wrote is doing many more operations to check the input. What happens if we run a Julia script that mirrors the Python script as closely as possible?

#!/usr/bin/env julia

comp_dict = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')

#### Executable ##########

for line in eachline(stdin)
    if startswith(line, ">")
        println(line * "_revcomp")
    else
        for base in reverse(line)
            print(comp_dict[base])
        end
        print("\n")
    end
end

We’ll call this script revcomp_simp.jl and test its running time on the same input file:

$ time cat 10million_random_20mers.fa | ./revcomp_simp.jl > 10M_20mers_RC.fa 

real	0m50.844s
user	0m49.598s
sys	0m1.018s

As you can see, the Julia script runs roughly twice as fast as its Python counterpart. This is highly surprising to me. I was expecting rather modest differences between the two since this is a relatively simple task. Note that because we are calling the Julia script directly, and not running it from within a REPL session, it will have to recompile every time. Therefore, the timing for the Julia script contains the compilation time.

Development Environments

Integrated Development Environments (IDEs) are programs that function as one-stop shops for writing code in a particular language. Typically they include a text editor with syntax awareness for the particular language being used, a way to run code interactively, and often ways to plot output graphics. Working with R, I became spoiled by RStudio, one of the best IDEs around. In contrast, I’ve never been able to find a Python IDE that I was really comfortable with. I’ve probably come closest with Spyder, the IDE that comes bundled with Anaconda. Some other IDEs like PyCharm are overkill for my purposes, while browser-based notebooks such as Jupyter are a little too bare-bones for my needs.

Julia has the advantage that its help documentation and package management system are very well integrated into the basic REPL. Open a terminal, type “julia” to start the REPL, and you can then type ? to open the help prompt, ] to open the package manager prompt, or ; to open the shell prompt. In general Julia’s package management system is very well thought out, and supports the creation and use of separate, self-contained environments out of the box. On top of the REPL’s strengths, Julia has two very good development environments in Juno (a plugin for the Atom text editor) and the Julia plugin for Visual Studio Code.
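
For example, a short package-manager session to set up a self-contained environment might look something like this (the prompts and the project name are purely illustrative):

julia> ]                              ## typing ] at the julia> prompt enters Pkg mode
(@v1.6) pkg> activate my_project      ## create/switch to a project-specific environment
(my_project) pkg> add BioSequences    ## install a package into that environment
(my_project) pkg> status              ## list the packages in the active environment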

Conclusion

As I mentioned earlier, the choice of which general-purpose language to learn really comes down to personal preference, as well as its adoption by the people you’ll be working with. For me, Julia ticks off all the boxes, combining ease of coding with speed when it’s needed. However, you can’t go wrong with Python, particularly if you think you’ll be relying heavily on pre-existing packages. Ultimately, the best way to figure out which language you prefer is to try your hand at coding similar projects in each, to get a hands-on evaluation of each language’s strengths and weaknesses.


What Software Should I Learn? Pt. II

  • March 28, 2021
  • by brian_ward

General Purpose Languages

A reticulated python (Malayopython reticulatus). Photo by Tomáš Malík from Pexels.

In a previous post we talked about how learning Linux can make your life in graduate school a lot easier, particularly if you have to process large amounts of text. As we discussed, the shell is a text-processing machine. However, it quickly becomes cumbersome once you start dealing with other types of data such as numbers or arrays. If you’re going to be working with many different types of data in graduate school, it’s a good idea to learn a general-purpose language.

As the name suggests, general purpose languages can be used in many different applications. These include, but are not limited to:

  • String manipulation and parsing (though often not as fast as using the shell’s built-in tools)
  • Analyzing images
  • Processing biological sequence data
  • Numerical analysis, for instance solving systems of linear equations
  • Machine learning
  • Geospatial data analysis
  • Analyzing tabular data
  • Statistical analysis

We’ll go into more detail on the last two applications in a future post when we talk about the more specialized language R.

What options are out there?

As of the time of this writing, the two most popular languages in the world according to the TIOBE index are the general-purpose languages C and Java. However, neither of these is particularly easy to learn for newcomers. If you’re working in the life sciences it’s unlikely that you’ll need to learn either unless you want to develop more polished software that other people use. In this post, we’ll be focusing on Python, the currently dominant general-purpose language used in the life sciences (and in many other fields). In a future post we’ll do a brief comparison between Python and a relatively new language: Julia. Here I’ll just say that Julia is a young, rapidly evolving, and well-thought-out language that has a lot going for it.

When I started graduate school, there was more or less an even split between Python and Perl users – Perl was what many people knew, and Python was what they wanted to learn. Since then Python has rapidly ascended in popularity, possibly due to its adoption by Google and its resulting prominence in machine learning applications. Python is currently ranked third on the TIOBE index, while Perl sits in 19th place. This is a stark reversal from Perl’s heyday in the mid-’90s, when it held the third-place ranking.

It seems that it’s case closed then – if you’re looking to learn a general-purpose, first programming language in 2021, you can’t really go wrong with Python. That’s not to say there aren’t alternative paths to consider. Many graduate students (myself included) end up learning R before Python, because they frequently work with the types of data that R excels at handling. However, the R language has many idiosyncrasies and is more specialized, so I wouldn’t necessarily recommend learning it first. R isn’t such a great choice once you have to work with non-tabular data. It’s also not a great choice when you need to process a file line-by-line, and it de-emphasizes iteration using loops, both of which I consider critical building blocks of learning to code. Python and Julia can now also do many dataframe-oriented tasks reasonably well, albeit in a somewhat more clunky manner than the R ecosystem. At this point though, maybe it would be best to just look at a brief example.

Example – Reverse Complementation

A common task when working with DNA data is finding the reverse complement of a sequence. We typically write DNA sequences in the 5′-3′ (five prime to three prime) direction. The 5′-3′ orientation runs in opposite directions on the two DNA strands, so if we want to translate a sequence to its usual representation on the opposite strand, we must perform two tasks:

  1. Reverse the sequence
  2. Substitute each nucleotide with its complement (A for T, T for A, C for G, and G for C)
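
For example, take the short sequence 5′-ATGGC-3′: reversing it gives CGGTA, and complementing each base then gives GCCAT, so the reverse complement is 5′-GCCAT-3′.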

This gets a little complicated using the shell (although the bioawk program can do it out of the box). However, it would still be advantageous to be able to stream the file one line at a time instead of reading it all into memory, in case we encounter a really large file. This requires a general purpose language and largely precludes using a more specialized language such as R. I should point out that Perl, Python and Julia all have packages that can process sequence data and find reverse complements. However, it’s easy enough to try out the process for ourselves. First we can write a tiny Python script to generate a fasta file with some random 20-nucleotide sequences:

import random

n_seqs = 5
n_bases = 20
out_file = '5random_20mers.fa'

nucleotides = ('A', 'C', 'G', 'T')

with open(out_file, 'w') as f:
    for i in range(n_seqs):
        seq_list = random.choices(nucleotides, k = n_bases)
        seq = ''.join(seq_list)
        print('>seq_' + str(i), file = f)
        print(seq, file = f)

In this script, we open our output file for writing using open() with the ‘w’ option. This assigns a file object, or handle, which we can use as a convenient shorthand for the file name. Then we sequentially loop from 1 to our number of desired sequences (sort of – we’ll get to that below), and the choices() function from the random module is used to randomly pick one of the four nucleotides 20 times. This gives us a list of 20 nucleotides, which then must be joined together without any delimiter (''.join()) in order to form a string to output to our file. Then we just print our sequence identifier followed by the sequence itself to the output file. Normally we would also have to close our file, but calling open() inside a with statement (a context manager) automatically closes the file when we’re done.

When we run this we get back a fasta file containing five random DNA sequences of 20 nucleotides each:

>seq_0
TCTGCAGCGACTCGGGGGTG
>seq_1
ACAGAAATTCATGTGAGAAA
>seq_2
GTCGTTCTAATATGGATGCG
>seq_3
GGCTCAAAAAACCGTCGCTG
>seq_4
TGGCAAGGGGCAGACTAAGG

This file nicely demonstrates that in Python, as in most languages, indices start at 0. So we asked for five sequences, and we get sequences labeled 0 through 4. In Julia and R, counting starts at 1, so if we had created this same file using one of those languages we would have sequences labeled 1 through 5. I don’t personally see this as a big problem, but as with anything you’ll find people on the internet who are really mad about languages in which indices start at 1.
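
For instance, in Julia (output shown for illustration):

julia> seq = "ACGTT";

julia> seq[1]      ## Julia indexes from 1, so this is the first character
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

julia> seq[end]    ## `end` refers to the last index
'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)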

Now let’s write a short Python script to generate the reverse complement of each sequence:

input_fasta = "5random_20mers.fa"

comp_dict = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

with open(input_fasta, 'r') as fa:
    for line in fa:
        line = line.rstrip()
        if line.startswith('>'):
            line = line + '_revcomp'
        else:
            line = line[::-1]
            line_list = list(line)
            for index, base in enumerate(line_list):
                line_list[index] = comp_dict[base]
            line = ''.join(line_list)
        print(line)

Here we use the time-honored tradition of lazy script coding by setting the path to our input file as a constant inside the script. We then define a dictionary containing each of the four complementary nucleotide pairings. Finally we loop through the file. If a line starts with “>”, that signifies that it is a sequence identifier. In that case we just append “_revcomp” to it and print it out. Otherwise we first reverse it, and then replace all the letters in it using our lookup dictionary. To do this we break apart the string into a list of individual characters, make our substitutions one at a time, and then reassemble our complemented string and print it out. The syntax to reverse the sequence is a little strange: line = line[::-1]. This is related to Python’s syntax for slicing (i.e. subsetting) parts of strings. When we call print() without any further arguments, we just print to standard out (the screen).

The snippet of code above demonstrates one property of Python code – that it is partially object-oriented. In order to remove trailing whitespace from the right-hand side of each line that the script reads in, we called line.rstrip(). Here our line of text is an object (of type string), and the function rstrip() is one of its methods. In general we can call an object’s method using the syntax object.method. Whether you prefer object-oriented code vs. the more functional code of languages like R, Julia or Matlab is a matter of personal preference. One thing I personally don’t like about Python is that it’s partially object-oriented. This creates some inconsistent syntax. For instance, to split our line into a list of characters, we called list(line), not line.list(). This leads to a lot of confusion (at least on my part), as it isn’t always clear when to call a base function, versus a method of a package, versus a method of an object. In addition, the syntax is sometimes non-intuitive. For instance, you might think that join would be a method of a list, to join it together into a string. However, above we call ''.join() – join is a method of the delimiter (a string, in this case an empty one), and we supply it with the list.
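
For comparison, the equivalent operations in Julia are all plain function calls rather than methods (a minimal sketch):

line = "ACGTT\n"
line = rstrip(line)      ## cf. line.rstrip() in Python
chars = collect(line)    ## split into a vector of characters, cf. list(line)
seq = join(chars)        ## reassemble into a string, cf. ''.join(line_list)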

Python + Shell

The code above works fine, and as I mentioned we used the time-honored lazy tradition of defining our input file inside the script. However, we could be lazier. If we are working in a Unix-like environment, we can use the shell to supply our input and output files. The script below has a few improvements over the previous one. Here we read our input from stdin using fileinput.input(). Also, previously we had split apart our sequences, performed our complementation, and then reassembled them. However, since we don’t have to perform any additional tasks with the reverse-complement sequences, this is unnecessary. Instead we can just loop through the bases in the reversed sequence, find each one's complement, and print it out, without any space between them. We just have to be sure to call print() without any arguments at the end of each sequence – by default it will print out a newline:

#!/usr/bin/env python3

import fileinput

comp_dict = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

for line in fileinput.input():
    line = line.rstrip()
    if line.startswith('>'):
        print(line + '_revcomp')
    else:
        for base in line[::-1]:
            print(comp_dict[base], end = '')
        print()

The funny-looking comment at the beginning of the script is important. This is the shebang. It tells the shell which program to use to run the code, so that we don’t always need to type “python3” in front of our script to get it to run. We just need to make sure that this script has permission to execute, by typing chmod +x revcomp.py in a terminal. Then, we can run it like so (assuming the input fasta file and script are both in our current working directory):

cat 5random_20mers.fa | ./revcomp.py > 5random_20mers_revcomp.fa

If we look at the output file, we see the following:

>seq_0_revcomp
CACCCCCGAGTCGCTGCAGA
>seq_1_revcomp
TTTCTCACATGAATTTCTGT
>seq_2_revcomp
CGCATCCATATTAGAACGAC
>seq_3_revcomp
CAGCGACGGTTTTTTGAGCC
>seq_4_revcomp
CCTTAGTCTGCCCCTTGCCA

Nice. Now we can easily run our script on any fasta file without having to modify any text inside the script itself. This is key when you need to write some code and get results fast. To test out our new script, we can run our reverse complementation twice:

cat 5random_20mers.fa | ./revcomp.py | ./revcomp.py > 5random_20mers_2RC.fa

Then we can compare the two files on the command line using diff. If everything worked as expected, we should see that only the ID lines differ. Specifically, the ID lines in the output file should have “_revcomp_revcomp” appended:

$ diff 5random_20mers.fa 5random_20mers_2RC.fa
1c1
< >seq_0
---
> >seq_0_revcomp_revcomp
3c3
< >seq_1
---
> >seq_1_revcomp_revcomp
5c5
< >seq_2
---
> >seq_2_revcomp_revcomp
7c7
< >seq_3
---
> >seq_3_revcomp_revcomp
9c9
< >seq_4
---
> >seq_4_revcomp_revcomp

Going Further

That’s about all there is to it for our simple reverse complementation script. However, there are a few further improvements that could be made. For one, longer sequences in fasta files may be wrapped at a certain number of characters (often 80), so that sequences are listed across multiple lines. Parsing this kind of file is much more challenging than what we did here. In addition, real fasta files may contain ambiguity characters in addition to the four bases, which would require a larger replacement dictionary. I’ll leave these as a challenge for you to explore if you so choose.
