qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.
Getting qdap to work may take some finagling because the package relies on the openNLP package, so you have to make sure you have the correct version of Java installed. The package can be installed on all three major operating systems. You'll also notice quickly that qdap relies on the tm, ggplot2, and wordcloud packages as well.
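If you're missing any of those dependencies, something like the following should pull in the CRAN packages (a minimal sketch; it assumes a working Java installation for openNLP, and qdap itself may need to be installed from its development repository rather than CRAN):

install.packages(c("openNLP", "tm", "ggplot2", "wordcloud"))  # CRAN dependencies
library(qdap)  # load qdap once it and its dependencies are installed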
Note: I display the graphics here as .png files but recommend .pdf or .svg, as those formats are much clearer. For a combined pdf version of the graphics in this post click here.
Getting and cleaning transcripts of the debate
library(qdap)
url_dl("pres.deb1.docx")  # downloads a docx file of the debate to the working directory

# the read.transcript function allows reading in of a docx file
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names = c("person", "dialogue"))
truncdf(dat)
left.just(dat)

# qprep is a wrapper for several lower level qdap functions:
# removes brackets & dashes; replaces numbers, symbols & abbreviations
dat$dialogue <- qprep(dat$dialogue)

# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col = FALSE)
htruncdf(dat2)  # view a truncated version of the data (see also truncdf)
Wordclouds (relies on Ian Fellows’ wordcloud package)
# first put a unique character between words we want to keep together
dat2$dia2 <- space_fill(dat2$dialogue, c("Governor Romney", "President Obama",
    "middle class", "The President", "Mister President"))

# generate target words to color by
tw <- list(
    health = c("health", "insurance", "medic", "obamacare", "hospital"),
    economic = c("econom", "jobs", "unemploy", "business", "banks", "budget",
        "market", "paycheck"),
    foreign = c("war ", "terror", "foreign"),
    class = c("middle~~class", "poor", "rich"),
    opponent = c("romney ", "obama", "the~~president", "mister~~president")
)

# create a stop word list from the qdap data set Top25Words, excluding "he" and "I"
sw <- exclude(Top25Words, "he", "I")

# the word cloud by grouping variable function
with(dat2, trans.cloud(dia2, person, proportional = TRUE, target.words = tw,
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = names(tw), stopwords = sw, max.word.size = 4, char2space = "~~"))
Visuals of the trans.cloud function
Gantt Plot of the dialogue over time
As you'll see from the output, this uses Hadley Wickham's ggplot2.
# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person, xlab = "duration (words)",
    x.tick = TRUE, minor.line.freq = NULL, major.line.freq = NULL,
    rm.horiz.lines = FALSE))
Visualization of the Gantt Plot
Formality scores (how formal a person’s language is)
This concept comes from:
Heylighen, F., & Dewaele, J.-M. (2002). Variation in the contextuality of language: An empirical measure. Foundations of Science, 7(3), 293–340. doi:10.1023/A:1019661126744
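For reference, the F-measure from the paper (as I understand it) is computed from part-of-speech frequencies, each expressed as a percentage of total words:

F = (noun + adjective + preposition + article − pronoun − verb − adverb − interjection + 100) / 2

Higher scores indicate more formal, noun-heavy language; lower scores indicate more contextual, verb- and pronoun-heavy language.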
Because this is a slower function, the code can be run in parallel. It uses openNLP to first tag the part of speech of every word.
# parallel: about 1:20 on an 8 GB RAM, 8 core i7 machine
v1 <- with(dat2, formality(dialogue, person, parallel = TRUE))
plot(v1)

# about 4 minutes on an 8 GB RAM i7 machine
v2 <- with(dat2, formality(dialogue, person))
plot(v2)

# note you can resupply the output from formality back
# to formality and change arguments; this avoids the need for
# openNLP, saving time
v3 <- with(dat2, formality(v1, person))
plot(v3, bar.colors = c("Dark2"))
Output and plot from the formality function
  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3 OBAMA        3595     58.30
Afterthought: I neglected to mention that the word clouds are proportional (argument proportional = TRUE) to all words spoken rather than to raw frequency per person. This enables comparison across clouds.
Please comment on the article here: TRinker's R Blog