Linux and Shell Scripting
If you are preparing to enter graduate school in the sciences, or have recently started a graduate program, you have probably become keenly aware that working with modern datasets requires the use of tools that are more powerful and more esoteric than Excel. One way or another, you will have to learn how to code, and it probably behooves you to learn this skill sooner rather than later. You might start off graduate school like me, having taken one intro to computer science class as an undergraduate, where I learned to write rudimentary programs in C. On the other end of the spectrum, you may come to graduate school with a fair amount of coding experience under your belt, looking to refine your skills and learn how to apply them to your chosen research subject.
Whatever the case, I think it is helpful to review the various applications for coding that you’re likely to run into in graduate school. My advice here is geared towards the life sciences, as my field of study falls within this domain. I won’t offer any specific advice on how exactly to learn a new language, because there isn’t any magical, easy way to go about it. You should learn how to code the same way you learn everything in school – by taking a class, either in-person or online. More importantly, during or after that class you should then learn the new coding language for real by working on a project that requires you to use it. At first you will stumble about, writing terrible code that gives you grotesque, misshapen output. Eventually you’ll progress to writing terrible code that gives you the correct output. Someday down the road you may write very good code that other people use in their research. Otherwise you’ll be like me, writing mediocre code that gets the job done. Either way, you will have learned a valuable skill that will make your future academic life easier.
What is Linux?
If your research project(s) will involve any degree of bioinformatic work, in particular the processing of sequencing data, learning how to use Linux will be absolutely essential. It’s very unlikely that you haven’t heard of Linux, but if that is the case, Linux is one of several “Unix-like” operating systems. Also included among the ranks of Unix-like operating systems is one you may not have heard of, BSD, and one you have heard of – MacOS. The Unix-like OSes are, of course, like Unix, which is an operating system that was first developed at Bell Labs starting in the 1970s. Linux itself is named after its developer, Linus Torvalds, and was first released in 1991. Since then, many thousands of different versions, or distributions of Linux have been developed. Whatever distribution you choose (more on that below), you can open a terminal (actually technically a terminal emulator) and begin typing in commands by hand. The interface between the user and the operating system is called a shell. There are several different shells out there – sh, zsh, fish, and the most popular, bash. The magic of Linux and other Unix-like operating systems is that the commands that you would normally type out by hand can be combined into scripts – shell scripts – to automate the process.
What should I use the Linux shell for?
Why bother with learning how to use Linux, when Windows and MacOS dominate the personal computer market? In your mind, Linux and shell scripting should be associated with one type of data – large text files. This speaks both to the strengths and weaknesses of the shell. You see, the shell’s greatest weakness is that it treats everything it sees as strings (i.e. text). That means that if you type in something that you think of as a number, like “12,” the shell will treat it as text, unless you specifically tell it to interpret it as a number using some syntax that is frankly rather clunky. On the other hand, the shell, along with all its associated utilities, does its job of text processing very, very well.
It just so happens that many of the files containing biological data that you encounter will be large text files. This isn’t by coincidence. These include:
- Files of unaligned sequences, either with associated quality information (FASTQ) or without (FASTA). These are typically referred to as “formats” or “specifications,” but that is technically incorrect. There is no formal specification for how these files should be formatted. Rather, they represent sets of informal conventions that have been adopted by the community over time.
- Sequence alignment files (SAM). Typically you would generate a SAM file by aligning the reads in a FASTQ or FASTA file against a reference sequence.
- Genetic variant data stored in variant call format (VCF) files. These are typically generated by comparing multiple sequence alignment files and seeing where they differ.
You may have already encountered these types of files. They usually are compressed (often with the extension “.gz”), and if you try to open one with a text editor on your typical laptop it may promptly crash. Why does this happen? Because the file when uncompressed could easily be larger than the amount of memory on your computer. For instance, you may encounter a 4GB file with the extension “.fastq.gz.” However, when uncompressed this file could actually be 15GB. If you try to open it on a computer with 8GB of RAM, it will completely fill all available memory, bringing everything to a standstill. In Linux, many tools avoid this problem by streaming – reading an input file one line at a time, doing some operation on each line, and then writing it to an output file. Since these utilities only hold one line in memory at a time, they can process files of arbitrary size. It’s important to note that they are also quite fast – likely far faster than anything you could write yourself, since they have been developed over the course of several decades. The primary limiting factor for file operations becomes the read/write speed of the storage medium.
The ability of most Linux command-line utilities to handle very large files is one reason that the large computers you will encounter in your research – servers and high performance computing clusters (i.e. supercomputers) – will almost invariably run some kind of Linux distribution. They will also almost invariably have no graphical user interface. So when I talk about “learning how to use Linux,” I specifically mean learning how to work with the command line. If your experience is like that of many thousands of graduate students around the world, you may get some general guidance from your university’s IT department on how to log in to some particular server using a secure shell (ssh) interface. Upon logging in, you will be confronted by an empty black screen with a blinking cursor. What now?
Some Simple Examples
Linux, like Unix, uses the input redirect (<), pipe (|), and output redirect (>) operators. Most command line utilities will by default read from standard input (the keyboard) and write to standard output (the screen), but these operators allow you to read input from a file, use the output of one command as the input of another, and send output to a file instead of the screen. The input redirect operator is used less frequently than the other two, because many command line utilities can be pointed directly to an input file to read from. For instance, say we had a gzip-compressed fasta file in the typical format of sequence IDs preceded by “>”, sequence descriptions appearing after the first space in ID lines, and sequences as everything not proceeded by “>”. For instance:
>sequence1_id sequence1_description AGTCCGCATAG >sequence2_id sequence2_description GGTCACTTTGA >sequence3_id sequence3_description TTACTGGCAAA ...
If we wanted to, say, isolate the first 10,000 sequence IDs (without the “>” preceding them) and place them into a text file, we could write:
gunzip -c my_fasta_file.fa.gz | grep ">" -m 10000 | sed 's/>//' | sed 's/ .*$//' > my_fasta_ids.txt
Here we perform the following steps, using pipes to supply the output of one step as the input for the next:
- decompress the gzipped fasta file using gunzip -c. The “-c” option stands for “console,” and would normally write the output to the screen, except here we redirect using the pipe operator
- Use grep, which pulls out any line containing “>”. Using the -m (for max) option limits the output to the first 10,000 matches.
- Use the Unix stream editor program (sed) to reformat our sequence ID lines. Here we are using find/replace commands, written in the strange-looking nomenclature ‘s/find/replace/’. Here “s” stands for “string,” so in the first line we’re replacing the string “>” with nothing (that is, just removing it).
- Perform a second sed replacement using a regular expression to replace everything after the first space (including the space) with nothing.
- Redirect our final output to my_fasta_ids.txt.
Notably, you can use this short command to isolate the first 10,000 sequence IDs from a fasta file containing 100,000 sequences, or 100,000,000 (as long as it can fit on your hard drive). Some, but not all Linux systems will have the program zgrep installed, which is a version of grep that can directly read gzip-compressed files. In addition, we can supply sed with multiple find/replace expressions by using the -e option multiple times, so we could condense the above command to:
zgrep ">" -m 10000 my_fasta.fa.gz | sed -e 's/>//' -e 's/ .*$//' > my_fasta_ids.txt
Many bioinformatics softwares are designed to process specific types of the files listed above, using syntax that is more or identical to general shell syntax. So we have the program samtools for working with SAM files, and the program bcftools (or the older vcftools) for working with VCF files. We can do various piping/redirection operations using these, so for instance to output the chromosome, position, reference allele and alternate allele for all SNPs on chromosome 2 in a given VCF file, we could write:
bcftools view my_input.vcf.gz --regions chr2 | bcftools query - '%CHROM\t%POS\t%REF\t%ALT\n' > chr2_snp_info.txt
Here we are using
bcftools view to isolate just chromosome 2, and then passing that off to
bcftools query (the lonesome “-” here means “read from standard input,” i.e. the pipe’s output). Then we have a very literal statement showing
bcftools query how to write the output, separating our desired output fields by tabs (“
\t“) and signifying that we want each SNP to be listed on a separate line (“
Hardware for Learning Linux
You always have the option of learning Linux without any graphical interface, as you would on a university server or cluster, or else through a virtual machine. However, it can be a hassle to have to deal with these systems before you have the basics down. If you want to go a more local route, you have a few options:
- A PC running a Linux distribution. This is by far the best option. There are a huge amount of distributions to choose from – some popular ones are Ubuntu (though it faces some criticism due to corporate practices of its parent company Canonical), Linux Mint (a very popular community-driven fork of Ubuntu), Debian, Zorin OS and Elementary OS (both good for people used to Windows or Mac), Manjaro, Gentoo, and the list goes on. I personally prefer Linux Mint. Working on a Linux computer will allow you to try out various commands and scripts on small test datasets before running larger jobs on servers or clusters. It’s also much easier to learn how to use the command line locally, where you don’t have to worry about constantly logging in to a remote computer, making other users mad by hogging resources, or else scheduling jobs like you would on a cluster. It’s also probably easier than you think to obtain a computer running Linux. If you have an old laptop or desktop lying around that struggles to run Windows or MacOS, you can often install a lightweight Linux distribution and magically resurrect it. Some good distributions to consider in this case would be those using the Xfce desktop environment, such as Linux Mint Xfce edition, or those using the LXDE desktop environment, such as Lubuntu.
- A dual-boot computer. This is when you have two operating systems installed side-by-side on the same machine. This way you can have both Windows or MacOS installed alongside a Linux OS. This might seem like the best of both worlds, but there are some caveats. Switching between the two OSes requires restarting the machine, which can become tedious. Also, setting up dual boot will immediately void your warranty if it is still active.
- Use Windows 10 or MacOS. Traditionally, Macs were considered superior to Windows PCs if you wanted a popular, commercial operating system that somewhat resembled Linux. Recall that both Linux and MacOS are Unix-like. However, MacOS more closely resembles BSD than Linux, and it requires some tweaking to truly get it working more like Linux. Even then, it’s never quite the same. By comparison, Windows was the odd one out, excepting a few solutions like Cygwin or virtual machines (VMs). However, that changed somewhat with the release of Windows 10 and the Windows Subsystem for Linux (WSL). This is a VM that allows you to install one of several Linux distributions inside of Windows. The first version of WSL was quite a bit faster than a traditional VM, though still somewhat clunky. The new WSL 2 has vastly improved performance over the original WSL. However, there is a catch – WSL runs with its own filesystem, and it can be quite slow when working with anything in the Windows filesystem on the computer. With the release of WSL 2, Windows has become more competitive with Mac when it comes to Linux functionality. However, if you’re looking for a seamless experience for learning Linux, Both Windows and Mac will still give you some headaches from time to time.
Whatever route you choose, the first bit of learning might be a little bit tough and frustrating. However, now that you can look up or ask any question on the internet, the task of learning Linux has become a lot easier. Soon enough you’ll be breezing through the command line and processing enormous files, surprising your advisor by getting through huge computational projects with relative ease.