Presidential Debates with qdap-beta

October 4, 2012

(This article was originally published at TRinker's R Blog, and syndicated at StatsBlogs.)

qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.

Getting qdap to work may take some finagling because the package relies on the openNLP package. You have to make sure you have the correct version of Java installed. I know the package can be installed on all three major operating systems. You’ll also notice quickly that qdap relies on the tm, ggplot2, and wordcloud packages as well.
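Before installing, it can help to see which of the dependencies named above are already present and whether a Java executable is visible to R. The following is only a sketch in base R; it checks for the packages mentioned in this post and does not verify that the Java version found is the one openNLP needs.

```r
# Packages this post names as dependencies of qdap
deps <- c("openNLP", "tm", "ggplot2", "wordcloud")

# requireNamespace() returns FALSE (instead of erroring) when a package is absent
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Sys.which() returns "" when no java executable is on the PATH
java_path <- Sys.which("java")
if (nzchar(java_path)) {
  system2("java", "-version")  # prints the version of the java that was found
} else {
  message("No java executable found on the PATH")
}
```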

Note: I display the graphics here with .png files but recommend .pdf or .svg as the image is much clearer. For a combined pdf version of the graphics in this post click here.

Getting and cleaning transcripts of the debate

library(qdap)
url_dl("pres.deb1.docx")  #downloads a docx file of the debate to wd
# the read.transcript function allows reading in of docx file 
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names=c("person", "dialogue"))
truncdf(dat)
left.just(dat)
# qprep wrapper for several lower level qdap functions
# removes brackets & dashes; replaces numbers, symbols & abbreviations
dat$dialogue <- qprep(dat$dialogue)  
# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col=FALSE)  
htruncdf(dat2)   #view a truncated version of the data(see also truncdf)
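To make the sentence-splitting step concrete, here is a minimal base R sketch of the idea on a made-up two-turn transcript. It is not qdap's implementation of sentSplit; the splitting rule (break after ., ? or ! followed by whitespace) and the column name tot are assumptions chosen for illustration.

```r
# Toy transcript: one row per turn of talk (made-up text, not the debate data)
toy <- data.frame(
  person   = c("A", "B"),
  dialogue = c("Hello there. How are you?", "Fine, thanks!"),
  stringsAsFactors = FALSE
)

# Split each turn at whitespace that follows an end mark (., ? or !)
pieces <- strsplit(toy$dialogue, "(?<=[.?!])\\s+", perl = TRUE)

# One row per sentence; keep the speaker and an index of the original turn
toy2 <- data.frame(
  person   = rep(toy$person, lengths(pieces)),
  tot      = rep(seq_along(pieces), lengths(pieces)),
  dialogue = unlist(pieces),
  stringsAsFactors = FALSE
)
toy2
```

Keeping the turn index means the sentences can always be regrouped into the original turns of talk.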

Wordclouds (relies on Ian Fellows’ wordcloud package)

#first put a unique character between words we want to keep together
dat2$dia2 <- space_fill(dat2$dialogue, c("Governor Romney", "President Obama", 
    "middle class", "The President", "Mister President"))

#Generate target words to color by
tw <- list(
        health=c("health", "insurance", "medic", "obamacare", "hospital"), 
        economic = c("econom", "jobs", "unemploy", "business", "banks", 
            "budget", "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~~class", "poor", "rich"),
        opponent = c("romney ", "obama", "the~~president", "mister~~president")
)

#create stop word list from qdap data set Top25Words but exclude he and I
sw <- exclude(Top25Words, "he", "I")

#the word cloud by grouping variable function
with(dat2, trans.cloud(dia2, person, 
    proportional = TRUE,
    target.words = tw,
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = names(tw),
    stopwords=sw, 
    max.word.size = 4,
    char2space = "~~"))
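The trick of joining words with a filler and converting it back to a space at plotting time can be sketched in base R. The helper fill_phrases below is hypothetical and written only for illustration; it is not how space_fill is implemented, and it rewrites a matched phrase using the capitalization given in phrases.

```r
# Hypothetical helper: join multi-word phrases with a filler so that a
# word tokenizer treats each phrase as a single "word"
fill_phrases <- function(text, phrases, sep = "~~") {
  for (p in phrases) {
    filled <- gsub(" ", sep, p, fixed = TRUE)           # "middle class" -> "middle~~class"
    text   <- gsub(p, filled, text, ignore.case = TRUE) # swap it into the text
  }
  text
}

x <- "The middle class matters, Governor Romney."
y <- fill_phrases(x, c("Governor Romney", "middle class"))
y

# At display time the filler is turned back into a space
gsub("~~", " ", y, fixed = TRUE)
```

This is why the target words above are written as "middle~~class" and "the~~president": at matching time the phrases still carry the filler.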

Visuals of the trans.cloud function
wordcloud 1
wordcloud 2
wordcloud 3

Gantt Plot of the dialogue over time
Obviously (when you see the output), this uses Hadley Wickham’s ggplot2.

# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person,  xlab = "duration(words)", x.tick=TRUE,
    minor.line.freq = NULL, major.line.freq = NULL, rm.horiz.lines = FALSE))
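The data behind a plot like this is one start and one end position per turn, measured in words. Here is a small base R sketch on made-up data; it assumes duration is a simple running word count (as the x-axis label suggests) and is not taken from the gantt_plot source.

```r
# Toy turns of talk (made-up text)
toy <- data.frame(
  person   = c("A", "B", "A"),
  dialogue = c("one two three", "four five", "six"),
  stringsAsFactors = FALSE
)

# Words per turn, then cumulative positions along the "time" axis
n_words <- lengths(strsplit(trimws(toy$dialogue), "\\s+"))
end     <- cumsum(n_words)
start   <- end - n_words

spans <- data.frame(person = toy$person, start = start, end = end)
spans
```

Each row of spans can then be drawn as a horizontal segment on the speaker's line, for example with geom_segment in ggplot2.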

Visualization of the Gantt Plot
Gantt Plot

Formality scores (how formal a person’s language is)
This concept comes from:

Heylighen, F., & Dewaele, J.-M. (2002). Variation in the 
    contextuality of language: An empirical measure. Foundations 
    of Science, 7(3), 293–340. doi:10.1023/A:1019661126744

The code can be run in parallel because this is a slower function. It uses openNLP to first map parts of speech for every word.

#parallel about 1:20 on 8 GB ram 8 core i7 machine
v1 <- with(dat2, formality(dialogue, person, parallel=TRUE))
plot(v1)
#about 4 minutes on 8GB ram i7 machine
v2 <- with(dat2, formality(dialogue, person)) 
plot(v2)
# note you can resupply the output from formality back
# to formality and change arguments.  This avoids the need for
# openNLP, saving time.
v3 <- with(dat2, formality(v1, person))
plot(v3, bar.colors=c("Dark2"))
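For readers curious about the score itself, Heylighen and Dewaele define it as F = (noun + adjective + preposition + article - pronoun - verb - adverb - interjection + 100) / 2, where each term is that category's frequency as a percentage of words. The sketch below computes it from made-up part of speech counts. The function f_score is hypothetical, and using the sum of the supplied counts as the denominator is an assumption; it may not match what formality does internally.

```r
# Hypothetical helper: F-score from a named vector of part of speech counts
f_score <- function(counts) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  pct <- 100 * counts / sum(counts)  # counts -> percentages of all tagged words
  (sum(pct[formal]) - sum(pct[contextual]) + 100) / 2
}

# Made-up counts that sum to 100 so the arithmetic is easy to follow
counts <- c(noun = 30, adjective = 10, preposition = 12, article = 8,
            pronoun = 10, verb = 18, adverb = 6, interjection = 1,
            conjunction = 5)

f_score(counts)  # (60 - 35 + 100) / 2 = 62.5
```

Scores run from 0 to 100, with higher values indicating more formal language.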

Output and plot from the formality function

  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3  OBAMA       3595     58.30

formality

Afterthought: I neglected to mention that the word clouds are proportional (argument proportional = TRUE): word sizes are scaled against all words spoken rather than against each person’s own word frequencies. This enables comparison across clouds.
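The difference between the two scalings is easy to see on a toy example. This base R sketch shows one plausible reading of the idea (scaling by the largest count overall versus the largest count within each speaker); it is an illustration on made-up data, not the computation trans.cloud performs.

```r
# Toy word list: speaker A says "jobs" twice and "tax" once; B says each once
words <- data.frame(
  person = c("A", "A", "A", "B", "B"),
  word   = c("jobs", "jobs", "tax", "jobs", "tax"),
  stringsAsFactors = FALSE
)
freq <- table(words$person, words$word)

# Per-person scaling: every speaker's most frequent word gets size 1
per_person <- freq / apply(freq, 1, max)

# Global scaling: sizes are relative to the most frequent word of any speaker
global <- freq / max(freq)

per_person
global
```

Under per-person scaling B's "jobs" looks as large as A's, while under global scaling it is half the size, which is what makes the clouds comparable.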



