I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds.One thing nobody admits to is: Excel data is hard to read reliably into R. Admitting this means admitting there is something Excel is better at than R is (even if it is just reading its own data). Cran suggests several methods to read data into R, but most of them fail on non-trivial (but not even large) data. For example cran suggests:
- Methods that depend on connecting to a Windows service in a 32 bit environment (like RODBC).
The xlsx package which even when you add more memory to the required Java dependency (
.jinit(parameters="-Xmx6G")) runs out of memory parsing a 44 megabyte file.
- The gdata package which uses Perl for the Excel parsing.
- The XLConnect package which seems to not install on R2.15.0
- The Omegahat package RExcelXML (not part of cran).
Of these packages we have had the best luck with gdata (and already this is not simple). However we have seen the package silently truncate data on read. Also, strings are converted to factors (though this can be fixed by passing the appropriate read.table() settings through the read.xls() call). Digging into the gdata package we see it uses a Perl script to try and export Excel data into a quoted format- but the package does not get all of the various possibilities of embedded line-breaks, embedded quotes, escapes and comments correct. The wrong string-valued field can silently poison the results.
Additionally we have found is R’s read.table() and read.xls() command do not fully understand all the complexities of Excel export quoting (which allows actual quotes, fields separators and line separators inside fields, relying on a non-Unix escape sequence of doubling quotes). This is a problem for any data set which includes text extracts (bringing in line breaks, false field separators and problematic punctuation). To add insult to injury R’s read table is in no way guaranteed to be able to read the output of R’s own write.table(), unless you remember to set enough options about quoting, escaping, line breaks and comments both during writing and reading. Finally Excel’s native save as text or save as CSV has the amazing bug that any fields that is too big to display (and thus displayed as hash-marks) is in fact exported as hash-marks! So even having a copy of Excel isn’t enough to solve the Excel export problem.
If needless complexity is causing problems- let us try avoiding the complexity. I have modified the Perl script found in the gdata package to a new version that exports to what I call “strong TSV.” In strong TSV fields never contain the fields separator (tab) or a line-separator (CR or LF) and there is no quoting, escaping or comments (quoting is no longer needed, so we just treat quotes as a standard character like any other). To export to strong TSV we just use a simple Perl regexp to replace all field whitespace with standard space (slightly damaging the field, but saving a lot of down-stream complexity). The file converted in this manner can then be read reliably using R’s read.table command:
d <- read.table('file.tab', quote='',comment.char='', allowEscapes=F,sep='\t', header=T,as.is=T, stringsAsFactors=F)
The modified perl-script ( xls2goodTSV.pl ) is placed in the Perl section of the gdata library (in my case
Library/R/2.15/library/gdata/perl) and run in that environment.
Some of these weaknesses are flaws in the cran-suggested R libraries. But some of the problems are due to the complexity of Excel and even quoting in exported Excel documents (and the hash-bug). All of these steps could be skipped if the data were available in a true machine readable format like strong TSV, JSON or SQL dump (all of which get quoting correct). My suggestion is to try strong TSV as it is very simple and interoperates well with most tools (R, linux command line tools and even Excel itself). The time you save may be your own.
Update: forgot to mention a cool example where insisting on a simple format that avoids too much quoting and escaping has a big payoff: Apache Hadoop Map Reduce TextInputFormat.
Because Apache Hadoop Map Reduce TextInputFormat format is simple the Hadoop master node can split a large file of records by just specifying seek locations (see org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits()) and then letting the reader skip one (usually partial) line to re-align the reader to valid record boundaries (see org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize()). This allows file splitting that does not stress the master node (as the master node does not need to scan the file) and efficient job distribution as each split destination only needs to look at the file interval it is responsible for plus at most one more contiguous line. If the master node had to scan the how file for record breaks (which formats that allow general quoting and escaping would entail) there would be little point in the split and distribute parallelism- you might as well complete all your work on the master node as you have already forced it to scan the whole input file.
The details being: if Apache allowed a file format here that required a complicated state machine to determine line breaks (tracking things like are we in a quoted or commented environment) then you could not tell where record boundaries are without scanning from the start of the file every time! Because even if you see properly formatted records you need to know that there is not in an outer quoting context that says all of your records are just a single spreadsheet included with a single cell of an outer spreadsheet (so it is not just details of any one quoting mechanism that is working against you, it is the desire to faithfully quote; which makes some desired information like true record boundary locations essentially non-local).
The take-away is if you are going to dedicate all of your effort to supporting complicated quoting and legacy formats: that is all you will ever achieve. You will be held hostage to the considerations and trade-offs that went into those formats and not be able to achieve new things (like nearly free parallelism).
Also I would like to re-emphasize that a lot of tools that claim to correctly read complicated formats often only work correctly on the simple or limited cases that the original tool builder needed during original tool construction. These libraries often fail to appreciate the large amount of complexity quickly and correctly parsing formats with many interacting features (quoting, escaping, comments, field separators, line separators, character encodings, localization and many more unrewarding features). So you can try libraries, just be prepared to waste a lot of time being disappointed.
Please comment on the article here: Win-Vector Blog » Statistics