Everyday Annoyances: String Formatting
This post is intended to be the first in a series of everyday annoyances that one typically runs into when working with data. Today’s example should be familiar to anyone who has ever worked with data recorded by multiple people or groups of people. Let me give a little vignette to illustrate the problem. I’ll assume that you are working with data that contains a column with some sort of identifier for research subjects, genotypes, etc.
The Tyranny of Identifier Formatting
You have just received a set of data from one of your collaborators. Excited to see what it contains, you get to work compiling it with your own data. However, within a couple of minutes you realize that something is wrong. For instance, you try to perform a merge using the identifier columns – and find that nothing matches between the two sets of data. Or, you try binding the rows of the two data tables together, and you suddenly end up with about twice as many different IDs as you had anticipated. Your spirits begin to sink as you experience an acute wave of panic.
After taking a minute to cool off, you look at both sets of data side by side. You look up one particular ID – “OR_2016_47_3.” When you find the same ID in your collaborator’s data, you see the problem – its listed there as “OR-2016-47-3.” The mystery is solved – at some time in the past, some person or script decided to take matters into their own hands, deviating from the identifier formatting that everyone involved in the project (hopefully) agreed on at the outset.
The situation above is just one of those things that happens – more often than not – when you deal with data that has been worked over by somebody else. Luckily there’s an easy fix. We simply need a sort of disciplinarian function that we can sic on any vector or list of strings to enforce some order. Below are some examples written in both R and Python. I’m sure there might be some versions of these functions written into this or that R package or Python module that I’m not aware of, but it’s easy enough to make our own versions.
R example
The full R code snippet is available here.
Note that this function requires R version > 3.2.0, as it uses the trimws() function internally. It simply takes in a vector of strings, and for each element converts everything to uppercase, and then removes any leading and/or trailing whitespace. The alien-looking syntax on line 16 is using regular expressions to convert several common word delimiters (space, dash, underscore, period) into one delimiter of the user’s choosing.
stand_str <- function(str_in, word_delim = "-") { ## Check for appropriate word delimiter selection if(!word_delim %in% c(" ", "-", "_", ".")) { stop('Please select one of the following for the output word delimiter: " " [space], ".", "-", "_"') } ## Convert everything to uppercase str_out <- toupper(str_in) ## Remove leading/trailing whitespace str_out <- trimws(str_out, which = "both") ## Swap out whitespace, dash, period, underscore for selected delimiter str_out <- gsub("\\s|-|\\.|_", word_delim, str_out) return(str_out) }
So, running it on a vector consisting of some variations on the theme of “New York City”:
ny <- c(" New York City ", "New-York_City", "new.york.city", "New_York.city") stand_str(ny, "_") [1] "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY"
Nice – our messy list of New York City strings are now all converted to a common format.
Python Example
Below is the same function written in Python, and the full code snippet that it comes from is available here.
def stand_strings( str_list, delim ): ## Check for appropriate word delimiter selection if delim not in [' ', '-', '_', '.']: sys.exit('''Please select one of the following for the output word delimiter: " " (space), "-", "_", "." ''') ## Convert to uppercase, strip leading/trailing whitespace str_list = [x.upper().strip() for x in str_list] ## Swap out whitespace, dash, period, underscore for selected delimiter str_list = [re.sub('\s|\.|-|_', delim, x) for x in str_list] return str_list
And here is the same example (though note that we have to import a couple modules for the function to work properly):
import re import sys ny = [' New York City ', 'New-York_City', 'new.york.city', 'New_York.city'] print(stand_strings(ny, "_")) ['NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY']
Conclusion – and Caution for R
That’s all there is to it. These functions aren’t anything fancy, but like I said I do end up using one or the other almost every time I receive data from somebody else. Getting in the habit of using these on all my data files has certainly saved me from many headaches. It would also be easy to create a version of these functions that takes user-inputted search and replace patterns, but 98% of the time, you’ll probably be dealing with one of the four different common word delimiters used in these functions.
Do keep in mind that if your data is set up with identifiers as column headers, R will try to do its own formatting upon import if these names don’t live up to its standards. Specifically, it will add an “X” to the front of any column name that starts with a number, and it will convert spaces or dashes to periods. For instance, if we read in a .csv file with a column named “2017-New-York-City”, R will convert this to “X2017.New.York.City” unless we set check.names = FALSE in the call to the read.csv() function.
In a future post, I may get a little more sophisticated by going over approximate or “fuzzy” string matching.