Skip to content
Brian Ward, Ph.D.
  • Home
  • Blog
  • CV
  • Contact

Everyday Annoyances: String Formatting

  • November 12, 2017January 9, 2018
  • by brian_ward

This post is intended to be the first in a series of everyday annoyances that one typically runs into when working with data. Today’s example should be familiar to anyone who has ever worked with data recorded by multiple people or groups of people. Let me give a little vignette to illustrate the problem. I’ll assume that you are working with data that contains a column with some sort of identifier for research subjects, genotypes, etc.

The Tyranny of Identifier Formatting

You have just received a set of data from one of your collaborators. Excited to see what it contains, you get to work compiling it with your own data. However, within a couple of minutes you realize that something is wrong. For instance, you try to perform a merge using the identifier columns – and find that nothing matches between the two sets of data. Or, you try binding the rows of the two data tables together, and you suddenly end up with about twice as many different IDs as you had anticipated. Your spirits begin to sink as you experience an acute wave of panic.

After taking a minute to cool off, you look at both sets of data side by side. You look up one particular ID – “OR_2016_47_3.” When you find the same ID in your collaborator’s data, you see the problem – its listed there as “OR-2016-47-3.” The mystery is solved – at some time in the past, some person or script decided to take matters into their own hands, deviating from the identifier formatting that everyone involved in the project (hopefully) agreed on at the outset.

The situation above is just one of those things that happens – more often than not – when you deal with data that has been worked over by somebody else. Luckily there’s an easy fix. We simply need a sort of disciplinarian function that we can sic on any vector or list of strings to enforce some order. Below are some examples written in both R and Python. I’m sure there might be some versions of these functions written into this or that R package or Python module that I’m not aware of, but it’s easy enough to make our own versions.

R example

The full R code snippet is available here.

Note that this function requires R version > 3.2.0, as it uses the trimws() function internally. It simply takes in a vector of strings, and for each element converts everything to uppercase, and then removes any leading and/or trailing whitespace. The alien-looking syntax on line 16 is using regular expressions to convert several common word delimiters (space, dash, underscore, period) into one delimiter of the user’s choosing.

stand_str <- function(str_in, word_delim = "-") {
  
  ## Check for appropriate word delimiter selection
  if(!word_delim %in% c(" ", "-", "_", ".")) {
    stop('Please select one of the following for the output word delimiter:
         " " [space], ".", "-", "_"')
  }
  
  ## Convert everything to uppercase
  str_out <- toupper(str_in)
  
  ## Remove leading/trailing whitespace
  str_out <- trimws(str_out, which = "both")
  
  ## Swap out whitespace, dash, period, underscore for selected delimiter
  str_out <- gsub("\\s|-|\\.|_", word_delim, str_out)
  
  return(str_out)
}

So, running it on a vector consisting of some variations on the theme of “New York City”:

ny <- c("  New York City  ", "New-York_City", "new.york.city", "New_York.city")
stand_str(ny, "_")

[1] "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY"

Nice – our messy list of New York City strings are now all converted to a common format.

Python Example

Below is the same function written in Python, and the full code snippet that it comes from is available here.

def stand_strings( str_list, delim ):
    
    ## Check for appropriate word delimiter selection
    if delim not in [' ', '-', '_', '.']:
        sys.exit('''Please select one of the following for the output word delimiter:
            " " (space), "-", "_", "." ''')
    
    ## Convert to uppercase, strip leading/trailing whitespace        
    str_list = [x.upper().strip() for x in str_list]
    
    ## Swap out whitespace, dash, period, underscore for selected delimiter
    str_list = [re.sub('\s|\.|-|_', delim, x) for x in str_list]
    return str_list

And here is the same example (though note that we have to import a couple modules for the function to work properly):

import re
import sys

ny = ['  New York City  ', 'New-York_City', 'new.york.city', 'New_York.city']
print(stand_strings(ny, "_"))

['NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY']

Conclusion – and Caution for R

That’s all there is to it. These functions aren’t anything fancy, but like I said I do end up using one or the other almost every time I receive data from somebody else. Getting in the habit of using these on all my data files has certainly saved me from many headaches. It would also be easy to create a version of these functions that takes user-inputted search and replace patterns, but 98% of the time, you’ll probably be dealing with one of the four different common word delimiters used in these functions.

Do keep in mind that if your data is set up with identifiers as column headers, R will try to do its own formatting upon import if these names don’t live up to its standards. Specifically, it will add an “X” to the front of any column name that starts with a number, and it will convert spaces or dashes to periods. For instance, if we read in a .csv file with a column named “2017-New-York-City”, R will convert this to “X2017.New.York.City” unless we set check.names = FALSE in the call to the read.csv() function.

In a future post, I may get a little more sophisticated by going over approximate or “fuzzy” string matching.

Share this:

  • Click to share on LinkedIn (Opens in new window)
  • Click to share on Twitter (Opens in new window)
  • Click to share on Facebook (Opens in new window)
  • Click to share on Pocket (Opens in new window)
  • More
  • Click to share on Reddit (Opens in new window)
  • Click to share on Tumblr (Opens in new window)
  • Click to share on Pinterest (Opens in new window)
  • Click to print (Opens in new window)
  • Click to email this to a friend (Opens in new window)

Related

Multi-Panel Figures in R
Tissue Grinding for 96-well DNA Extraction
format Python R standardize strings

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • What Software Should I Learn? Pt. I – Linux and Shell Scripting
  • Misconceptions About Genetics and Evolution in Pop Culture, Part II
  • Investigating COVID-19 From Your Couch
  • Frequent Misconceptions About Genetics and Evolution in Pop Culture
  • Multi-Environment Genomic Prediction Paper

Recent Comments

    Archives

    • January 2021
    • June 2020
    • May 2020
    • January 2020
    • October 2019
    • February 2019
    • June 2018
    • February 2018
    • December 2017
    • November 2017
    • October 2017

    Categories

    • Everyday Annoyances
    • Lab Protocols
    • Observations
    • R coding
    • shell scripting
    • Uncategorized

    Meta

    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org
    LinkedIn   ResearchGate   ORCID

    © 2017 Brian Ward
    Theme by Colorlib Powered by WordPress
    loading Cancel
    Post was not sent - check your email addresses!
    Email check failed, please try again
    Sorry, your blog cannot share posts by email.