Previously, we went through an example of performing reverse complementation of nucleotide sequences using Python. Python is a very popular language, and for good reason. It’s easy to learn (relatively speaking), is used widely in industry and academia, and boasts hundreds of thousands of user-developed packages. In the previous post I briefly mentioned a relatively new language – Julia. Recently I’ve been reaching to Julia more and more frequently for performing tasks that I once used Python for.
For me personally, my decision to transition towards Julia comes down to one main factor – I find it easier to read and write than Python (more on that below). I also don’t need many of the more specialized packages available for Python, the exception possibly being some machine learning packages such as scikit-learn. Of course the merits of each of these two languages are somewhat subjective, and the best way to learn which you prefer is to try doing a bit of programming in both.
However, before we get too invested in comparisons, I should point out that it probably doesn’t matter too much which one you learn first. Perl, Python, and Julia each have their pros and cons, but unless you happen to be developing more professional code, it’s more important to focus on the problems that you want to solve. The arcane details of the inner workings of different languages are not likely to be very important when you’re just starting out. If you happen to be working in the life sciences, each of these languages also have their own “Bio” ecosystems – BioPerl, Biopython, and BioJulia. These all contain various modules or packages that are designed to handle the files of biological data that you are likely to work with.
Reverse Complementation Script
Before proceeding, let’s create a new and improved reverse complementation script in Julia:
#!/usr/bin/env julia """ Nucleotide complement function Takes as input one character corresponding to a base or IUPAC ambiguity code and returns its complement character """ function complement(b::Char) ## Define complementation dictionary comp_dict = Dict( 'A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C', 'U' => 'A', 'W' => 'W', 'S' => 'S', 'M' => 'K', 'K' => 'M', 'R' => 'Y', 'Y' => 'R', 'B' => 'V', 'V' => 'B', 'D' => 'H', 'H' => 'D', 'N' => 'N' ) ## Test if input base is lowercase ## using tertiary condition ? then : else syntax islowercase(b) ? lower = true : lower = false ## Convert to uppercase if required; attempt to perform complementation b = uppercase(b) if b ∈ keys(comp_dict) c = comp_dict[b] else error("Input should be an IUPAC nucleotide code\n see https://www.insdc.org/documents/feature_table.html#7.4.1") end ## Return complement (convert back to lowercase if input was lowercase) ## semicolons are used to place the if block on a single line if lower; c = lowercase(c); end return(c) end #### Executable ########## for line in eachline(stdin) if startswith(line, ">") println(line * "_revcomp") else for base in reverse(line) print(complement(base)) end print("\n") end end
In general the process is the same as in our Python script. However, there are a few improvements. We have written a function,
complement(), which does some more sophisticated handling of the input than what our Python script’s lookup dictionary did. While we still use a dictionary to perform the complementation, this time we include all the various IUPAC ambiguity codes. In addition, we can handle the RNA-specific base uracil (U), and the function will throw an exception if it encounters input that isn’t a valid IUPAC code.
Otherwise, there are some obvious syntax differences compared to the Python script; for instance, Julia requires the keyword
end to signify the end of a code block such as an if or for statement, while Python only relies on indentation. In Python you can concatenate strings together using “+”, while “*” does the equivalent in Julia. In Julia,
print() will print without any delimiter by default, while
println() prints and appends a newline. The syntax for defining a dictionary in Julia is also somewhat different from the syntax used to define a dictionary in Python, using the “=>” arrow symbol to denote each key/value pairing.
One nice feature included in Julia is support for many Unicode characters. At first I thought this was a bit gimmicky, but it’s actually really great. It’s much easier to read code when you can write
π instead of
λ instead of
lambda. You can see a bit of this in the code above where we use the set membership operator, so that we write:
if b ∈ keys(comp_dict) instead of
if b in keys(comp_dict).
There is one subtle Julia-specific gotcha present in the code above. In Python single quotes or double quotes can both be used to denote strings. In Julia double quotes denote strings, while single quotes denote characters, so:
julia> "A" == "A" true julia> 'A' == 'A' true julia> "A" == 'A' false
‘A’ is a character, while “A” is a string of length one. Looping through a string yields characters, and a dictionary can’t be used to replace characters with strings, or vice versa, so the dictionary above wouldn’t work for our case if we defined the letters with double quotes. We can specify that the
complement() function only accept a character as input when we write
::Char to limit the type of input. In general characters can be concatenated together into strings, for instance using the
join() function, but again in our case we can merely print out our complemented characters one at a time.
Another not so small difference here is that Julia is not object-oriented. In general you won’t see the
object.method syntax that is common in Python, except when explicitly specifying which module or package a particular function belongs to. Rather, Julia uses a sophisticated multiple dispatch system as one of its core paradigms. Therefore functions will perform different actions based on the type of data they are called on. For instance, here we use
reverse() to reverse a string, but we could also call
reverse() on a vector of integers:
reverse([1, 2, 3, 4]), which yields
[4, 3, 2, 1]. The lack of object orientation makes Julia’s syntax somewhat more similar to R’s. I personally find this very convenient, as I also use R quite extensively.
The fastest languages are all compiled – this means that you write your code, and then run it through a program called a compiler, which translates it into machine code. Machine code is low-level, not human-interpretable, and tailored to the computer’s hardware. Machine code compiled from good higher-level code can be extremely fast. Modern compilers are highly optimized and can even catch and improve some forms of poor coding. In contrast, Python and R are both interpreted languages. In an interpreted language, a program called an interpreter directly executes code, without first converting it to machine code. Interpreted languages tend to be slower but “snappier” for interactive work.
You might be thinking to yourself that Python and R don’t seem that slow. This is partially true – a lot of functions in popular packages run very well. However, if you take a look under the hood, you’ll see something surprising: most of these functions are written in compiled languages such as C, C++, or Fortran. This has a few consequences. For one, it means that if you want to become a good Python or R developer, you must counterintuitively become proficient at these other languages. In addition, if you as an end user want to dig into a package’s functions and see how they work, you’ll be out of luck unless you’re familiar with these languages.
Julia is interesting in that it is sort of a hybrid of the two approaches. It uses a just-in-time (JIT) compiler, so when you run any code, it is first compiled and then run immediately afterwards. This has a few ramifications. First, a Julia script will typically run substantially slower the first time it is run, since it must also be compiled. If you run it again in the REPL (the read-evaluate print loop user interface), it will be much, much faster. Second, Julia can be used interactively, but in this application it definitely feels a little more clunky than a language such as python or R (though there are some workarounds to this). Julia’s developers claim that it can match the speed of older compiled languages such as C and Fortran. I don’t doubt this is the case in some scenarios. I have a harder time believing that it can generally match the speed of these languages over a wide range of applications, but maybe I’ll be proven wrong. Recently Julia became the newest language to successfully perform calculations on a supercomputer at the petaflop scale – until then the only other languages used for this feat were C, C++ and Fortran.
Below I’ll try timing our two reverse complementation scripts head to head. However, we’ll up the stakes a little bit by giving them 10 million 20-mers to reverse complement, instead of just five. Here are the timings on my four year old Toshiba laptop:
$ time cat 10million_random_20mers.fa | ./revcomp_std.py > 10M_20mers_RC.fa real 1m41.934s user 1m39.669s sys 0m1.689s $ time cat 10million_random_20mers.fa | ./revcomp_std.jl > 10M_20mers_RC.fa real 3m2.536s user 2m58.627s sys 0m2.696s
Here the Julia script takes about twice as long to run as the Python script. Of course the Julia script we wrote is doing many more operations to check the input. What happens if we run a script that is identical to the Python script in Julia (or as close to identical as possible)?
#!/usr/bin/env julia comp_dict = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C') #### Executable ########## for line in eachline(stdin) if startswith(line, ">") println(line * "_revcomp") else for base in reverse(line) print(comp_dict[base]) end print("\n") end end
We’ll call this script revcomp_simp.jl and test its running time on the same input file:
$ time cat 10million_random_20mers.fa | ./revcomp_simp.jl > 10M_20mers_RC.fa real 0m50.844s user 0m49.598s sys 0m1.018s
As you can see, the Julia script runs roughly twice as fast as its Python counterpart. This is highly surprising to me. I was expecting rather modest differences between the two since this is a relatively simple task. Note that because we are calling the Julia script directly, and not running it from within a REPL session, it will have to recompile every time. Therefore, the timing for the Julia script contains the compilation time.
Integrated Development Environments (IDEs) are programs that function as one stop shops for writing code in a particular language. Typically they include a text editor with syntax awareness for the particular language being used, a way to run code interactively, and often ways to plot output graphics. Working with R, I became spoiled by Rstudio, one of the best IDEs around. In contrast, I’ve never been able to find a python IDE that I was really comfortable with. I’ve probably come closest with Spyder, the IDE that comes bundled with Anaconda. Some other IDEs like PyCharm are overkill for my purpose, while browser-based notebooks such as Jupyter are a little too barebones for my needs.
Julia has the advantage that its help documentation and package management system are very well integrated into the basic REPL. Open a terminal, type “julia” to start the REPL, and you can then type
? to open the help prompt,
] to open the package manager prompt, or
; to open the shell prompt. In general Julia’s package management system is very well thought out, and supports the creation and use of separate, self-contained environments out of the box. On top of the REPL’s strengths, Julia has two very good development environments in Juno (a plugin for the Atom text editor) and the Julia plugin for Visual Studio Code.
As I mentioned earlier, the choice of which general purpose language to learn really comes down to personal preference, as well as its adoption by the people you’ll be working with. For me, Julia ticks off all the boxes, combining ease of coding with speed when its needed. However, you can’t go wrong with Python, particularly if you think you’ll be relying heavily on pre-existing packages. Ultimately, the best way to figure out which language you prefer is to try your hand at coding similar projects in each, to get a hands-on evaluation of each language’s strengths and weaknesses.