Analysis of the Listserve Emails

The Listserve is an email lottery: you sign up, and once a day one subscriber gets the chance to send the entire list an email. My previous post covered how I fetched these emails; this post discusses the statistics I obtained from them.

To:

The Listserve website mentions the countries of subscribers, but that's about it. As of today, there are 21,402 subscribers. I fetched all the archival data I could from the Internet Archive and plotted enrollment over time, which has stayed fairly consistent at around 20,000.

From:

The Listserve allows you to use any name you want as the sender of the email. Here are the ones that occurred more than once:

Name Occurrence
Anonymous 12
Laura 3
The Listserve 2
Ben 2
Beth 2
David 2
Sam 2
Michelle Huang 2
T. 2

It is interesting that Michelle Huang had two entries. What happens if we look at first names only?

Name Occurrence
Anonymous 12
Chris 8
David 7
Jordan 4
Michelle 4
Alex 3
Andy 3
Ben 3
Brian 3
Daniel 3
James 3
Laura 3
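
Both tables come down to tallying strings. Here is a minimal Python sketch of that kind of tally. The `senders` list is made-up sample data, and treating the first whitespace-separated token as the first name is my assumption for illustration, not necessarily how the tables above were produced.

```python
from collections import Counter

# Hypothetical sample of sender names; the real input would be the
# "From" display name of every Listserve email.
senders = ["Anonymous", "Michelle Huang", "Chris P.", "Michelle Huang",
           "Chris", "Anonymous", "Laura"]


def repeated(names, minimum=2):
    """Return (name, count) pairs seen at least `minimum` times, most common first."""
    counts = Counter(name.strip() for name in names)
    return [(name, n) for name, n in counts.most_common() if n >= minimum]


def first_name(name):
    """Assumption: the first whitespace-separated token is the first name."""
    parts = name.split()
    return parts[0] if parts else ""


full_name_table = repeated(senders)
first_name_table = repeated(first_name(s) for s in senders)
```

In this sample "Chris" and "Chris P." only show up as a repeat once the names are reduced to first names, which is the same effect that moves Chris and David up the second table.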

What about time of day sent?

I took all the timestamps from the emails and plotted when they were sent, in GMT. This was mostly out of personal curiosity, but it is interesting nonetheless. The red line in the plot is the mean time, which ended up being 17:19:15 GMT. The large drops are likely due to some nuances in email dates: for example, I got two emails on 23 June 2012 and none on 22 June.
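
For anyone who wants to try the mean send time themselves, here is a rough Python sketch. The `date_headers` values are invented examples, and taking a plain arithmetic mean of seconds since midnight is my assumption about what "mean time" means here.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical Date headers; real ones would come from the fetched emails.
date_headers = [
    "Sat, 23 Jun 2012 12:00:00 -0400",
    "Sat, 23 Jun 2012 18:30:00 +0000",
    "Sun, 24 Jun 2012 17:00:30 +0000",
]


def seconds_since_midnight_gmt(header):
    """Parse an RFC 2822 Date header and return seconds past midnight GMT."""
    sent = parsedate_to_datetime(header).astimezone(timezone.utc)
    return sent.hour * 3600 + sent.minute * 60 + sent.second


def mean_time_of_day(headers):
    """Arithmetic mean of the send times, formatted as HH:MM:SS."""
    seconds = [seconds_since_midnight_gmt(h) for h in headers]
    mean = round(sum(seconds) / len(seconds))
    hours, rest = divmod(mean, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Converting everything to GMT first matters because a header's local date can differ from its GMT date, which may be one source of the date quirks mentioned above. A plain mean also treats the day as a line rather than a circle, so it would mislead if the send times straddled midnight.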

Subject:

I took all the subject lines and created a word frequency table showing how often each word occurred:

Word Occurrence
life 9
world 9
day 8
little 8
love 7
story 7
advice 6
time 6
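
A word frequency table like this can be sketched in a few lines of Python. The subject lines and the tiny stop-word list below are placeholders, and dropping stop words at this step is my assumption, based on the table containing none.

```python
import re
from collections import Counter

# Hypothetical subject lines and a deliberately tiny stop-word list.
subjects = ["A little advice about life", "Love the world", "My life, my story"]
STOP_WORDS = {"a", "about", "the", "my", "and", "but"}


def word_frequencies(lines, stop_words=STOP_WORDS):
    """Count lowercase words across all lines, skipping stop words."""
    counts = Counter()
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


subject_counts = word_frequencies(subjects)
```

Calling `subject_counts.most_common(8)` would then give the top rows of a table like the one above.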

Body:

For the body of the email I created a term-document matrix, which records how often each word occurs in each email and so shows which words tend to occur together. This makes it possible to pick out themes and trends in a body of work, or corpus, which in this case is The Listserve emails. I took all the emails, removed punctuation and stop words such as “and” or “but”, and built a matrix based on how often the most common words occurred together. I then created a dendrogram of all the words showing how they clustered with each other.
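
To make the idea concrete, here is a small pure-Python sketch of a term-document matrix and a distance between word rows. The three "emails", the stop-word list, and the choice of cosine distance are all assumptions for illustration; this is not the code or the distance measure behind the dendrogram.

```python
import math
import re
from collections import Counter

# Hypothetical miniature corpus and stop-word list.
emails = [
    "Love your life, and love the people in it.",
    "This email is my listserve email.",
    "Life is short but people make time for love.",
]
STOP_WORDS = {"and", "but", "the", "in", "it", "is", "my", "this", "your", "for"}


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    return terms, [[c[term] for c in counts] for term in terms]


def cosine_distance(row_a, row_b):
    """1 - cosine similarity; near 0 when two terms share the same documents."""
    dot = sum(a * b for a, b in zip(row_a, row_b))
    norm = math.sqrt(sum(a * a for a in row_a)) * math.sqrt(sum(b * b for b in row_b))
    return 1 - dot / norm


terms, tdm = term_document_matrix(emails)
row = dict(zip(terms, tdm))  # look up a term's row by name
```

Here `cosine_distance(row["email"], row["listserve"])` comes out at zero because the two words only ever appear in the same email, while "email" and "love" never share an email and sit at distance 1. Pairwise distances like these are the sort of input a hierarchical clustering routine needs in order to draw a dendrogram.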

The majority of the words are pretty evenly clustered, and it's difficult to determine any trends. However, there is a cluster on the far left side of the tree, which I zoomed in on:

This cluster includes word pairs such as “email” and “listserve”, “love” and “time”, and “life” and “people”. While it's not surprising to see these words occurring together so often, it is interesting to see that a majority of people use this email to dispense wisdom or advice to the masses.

I have not yet been selected for The Listserve, but I am sure these findings will strongly influence what I write. In the meantime, I want to learn more about text processing, since I found it pretty interesting.