Tidying messy Excel data (tidyxl)

May 8, 2018
By

(This article was originally published at Stat Bandit, and syndicated at StatsBlogs.)

Reposted from Abhijit’s blog. Some <- have been replaced by = due to idiosyncracies of the WordPress platform.


Well, here’s what I was dealing with:

Exemplar Excel file from collaborator

Exemplar Excel file from collaborator

Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).

Goal

A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.

Attempt 1 (readxl)

d1 <- readxl::read_excel('spreadsheet1.xlsx')
head(d1)

## # A tibble: 6 x 26
## X__1 patient `44` `44__1` `10` `10__1` `3` `3__1` `53` `53__1`
##
## 1 IDS T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6
## 2 angles A B A B A B A B
## 3 60 31.83… 1 31.52… 1 32.9… 0 31.8… 0
## 4 65 31.66… 1 31.33… 1 32.2… 0 32.3… 0
## 5 70 31.45… 1 31.09… 0.20200… 31.7… 0 32.5… 0
## 6 75 31.08… 1 30.96… 0.44831… 31.2… 8.641… 32.3… 1
## # … with 16 more variables: `2` , `2__1` `8` ,
## # `8__1` , `6` , `6__1` , `43` , `43__1` ,
## # `48` , `48__1` , `46` , `46__1` , `4` ,
## # `4__1` , `9` , `9__1`

This strategy gives us funky column names, and pushes two of the headers into data rows. Since the headers are in rows, they’re a little harder to extract and work with. More worrisome is the fact that since the headers leaked into the data rows, the columns are all of type character rather than type numeric, which would now require further careful conversion after cleaning. So I don’t think readxl is the way to go here, if there’s a better solution.

Attempt 2 (tidyxl)

d2 <- tidyxl::xlsx_cells('spreadsheet1.xlsx')
head(d2)

## # A tibble: 6 x 21
## sheet address row col is_blank data_type error logical numeric
##
## 1 T5T6 B1 1 2 FALSE character NA NA
## 2 T5T6 C1 1 3 FALSE numeric NA 44.
## 3 T5T6 D1 1 4 FALSE numeric NA 44.
## 4 T5T6 E1 1 5 FALSE numeric NA 10.
## 5 T5T6 F1 1 6 FALSE numeric NA 10.
## 6 T5T6 G1 1 7 FALSE numeric NA 3.
## # … with 12 more variables: date , character ,
## # character_formatted , formula , is_array ,
## # formula_ref , formula_group , comment , height ,
## # width , style_format , local_format_id

The xlsx_cells captures the data in a tidy fashion, explicitly calling out rows and columns and other metadata within each cell. We can clean up this data using tidyverse functions:

library(tidyverse)
cleanData1 = function(d) {
  angle = d %>% filter(row >= 4, col == 1) %>% pull(numeric)
  name = d %>% filter(row %in% c(1,3), col >= 3) %>%
    mutate(character = ifelse(is.na(character),
                              as.character(numeric),
                              character)) %>%
    select(row, col, character) %>%
    filter(!is.na(character)) %>%
    spread(row, character) %>%
    unite(ID, `1`:`3`, sep = '_') %>%
    pull(ID)
  data = d  %>% filter(row >= 4, col >= 3) %>%
    filter(!is.na(numeric)) %>%
    select(row, col, numeric) %>%
    spread(col, numeric) %>%
    select(-row) %>%
    set_names(name) %>%
    cbind(angle) %>%
    gather(variable, value, -angle) %>%
    separate(variable, c('ID','Measure'), sep = '_') %>%
    spread(Measure, value) %>%
    select(ID, angle, A, B) %>%
    arrange(ID, angle)
  return(data)
}

head(cleanData1(d2))
##   ID angle        A        B
## 1 10    60 31.52867 1.000000
## 2 10    65 31.33477 1.000000
## 3 10    70 31.09272 0.202002
## 4 10    75 30.96078 0.448317
## 5 10    80 30.79397 0.670876
## 6 10    85 30.52185 0.461406

This is a lot of data munging, and though dplyr is powerful, it took a lot of trial and error to get the final pipeline done.Nonetheless, I was really psyched about tidyxl, since it automated a job that would have taken manual manipulation (I had 12 spreadsheets like this to process). I was going to write a blog post on this cool package that made my life dealing with messy Excel file a piece of cake. But wait, there’s more…

Attempt 3 (tidyxl + unpivotr)

I didn’t know about unpivotr until this post:


Subscribe

Email:

  Subscribe