Text Mining the Listserve Emails
The Listserve is an email list where people sign up for a chance to send an email out to the entire list to discuss whatever they want. About 20,000 people are currently enrolled, and one email has gone out per day since 16 April 2012. Since the project has been running for about a year, I thought it would be a nice opportunity to learn a little more about text mining.
In this first part I will discuss how I fetched and parsed all those emails, and in a second blog post I will talk about what I found.
The first issue was how to get the emails off the server. After trying a few solutions, I ended up using Python's imaplib, a library for connecting to an IMAP4 email server; IMAP is supported by the major providers such as Yahoo and Google. After connecting, I used the Python email library, which made it easier to select specific parts of each email. I relied heavily on the function email.message_from_string() to parse each message so I could read attributes such as Message-ID or Sender. I took all these emails and dumped them into a SQLite database to later parse with R.
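As a rough illustration of the parse-and-store step, the sketch below runs a hand-written sample message through email.message_from_string() and writes a few headers into an in-memory SQLite database. The sample message, the table name, and the column names are assumptions made for this example, not the schema used in the project; in practice the raw text would come from the IMAP server rather than a string literal.

```python
import email
import sqlite3

# Hypothetical raw message; in the real pipeline this text would be
# fetched from the IMAP server instead of being written by hand.
RAW_MESSAGE = (
    "Message-ID: <abc123@example.com>\n"
    "Sender: listserve@example.com\n"
    "Subject: Hello from the list\n"
    "Date: Mon, 16 Apr 2012 12:00:00 +0000\n"
    "\n"
    "This is the body of the email.\n"
)


def parse_message(raw):
    """Parse raw email text and pull out a few attributes."""
    msg = email.message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "sender": msg["Sender"],
        "subject": msg["Subject"],
        "body": msg.get_payload(),  # plain string for a non-multipart message
    }


def store_message(conn, row):
    """Insert one parsed message into an illustrative 'emails' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails "
        "(message_id TEXT, sender TEXT, subject TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO emails VALUES (:message_id, :sender, :subject, :body)",
        row,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would be used for real data
row = parse_message(RAW_MESSAGE)
store_message(conn, row)
```

Swapping ":memory:" for a file path would give a database file that other tools can read afterwards.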
I use R almost daily for work, so it was nice to tackle this part of the project with tools I knew pretty well. I mostly used strsplit() to parse out parts of various email attributes and then used the tm package to handle all of the text processing. The tm package makes it easier to get all the emails into a term-document matrix, which is much easier to work with than raw text for a large corpus such as this. I used an English dictionary with the tm package to remove stop words and for stemming (reducing each word to its base form). There have been two emails so far in Portuguese, but the rest are all in English.
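To make the idea of a term-document matrix concrete, here is a small Python stand-in (not the tm package, and not the R code used in this project). It counts how often each term appears in each document after dropping stop words and applying a deliberately crude suffix-stripping stemmer. The stop word list, the stemmer, and the two sample documents are all assumptions for illustration; a real stemmer such as Porter's handles far more cases.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "it"}


def crude_stem(word):
    """Naive suffix stripping for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (terms, matrix) with one row per term and one column per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = ["The cat is sleeping", "Cats and dogs played in the garden"]
terms, matrix = term_document_matrix(docs)

for term, row in zip(terms, matrix):
    print(term, row)
```

Each row is a term and each column is an email, so questions like "which words show up in the most emails" become simple row operations.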
Initially I thought I could track all the emails by date, but this proved difficult because of the nuances of email and of when each message actually got sent off the server. Instead, I ended up using the Message-ID to make sure that I did not duplicate emails in the analysis.
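A minimal sketch of that deduplication idea in Python is shown below. It keeps the first message seen for each Message-ID and drops later copies, even when their Date headers differ. The sample messages and the function name are hypothetical, and keeping messages that lack a Message-ID is a choice made for this example rather than something taken from the project.

```python
import email


def dedupe_by_message_id(raw_messages):
    """Keep the first message for each Message-ID, preserving order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        msg_id = email.message_from_string(raw)["Message-ID"]
        if msg_id is None:
            # No identifier to compare on; keep the message (assumption).
            unique.append(raw)
            continue
        if msg_id in seen:
            continue  # duplicate copy of a message already kept
        seen.add(msg_id)
        unique.append(raw)
    return unique


# Hypothetical messages: the first two share a Message-ID but differ in Date.
raw_messages = [
    "Message-ID: <one@example.com>\nDate: Mon, 16 Apr 2012 12:00:00 +0000\n\nFirst\n",
    "Message-ID: <one@example.com>\nDate: Tue, 17 Apr 2012 01:30:00 +0000\n\nFirst\n",
    "Message-ID: <two@example.com>\nDate: Tue, 17 Apr 2012 12:00:00 +0000\n\nSecond\n",
]

unique = dedupe_by_message_id(raw_messages)
```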
I put up all the source code in a GitHub repo.