At Luminoso, we sometimes combine our serious analytics with something a bit more fanciful.
We were pitching to a company that deals with internal corporate documents. There are lots of fun things we can do with text analytics and a lot of internal documents -- but where on earth were we going to get our hands on secret, internal corporate documents of any kind? Nobody's handing them out on the street.
Rather than the oft-proposed method called "theft," we did what most email researchers do: we took advantage of the fact that Enron had all of its email subpoenaed. Hence the Enron Corpus, consisting of about half a million emails between about 150 top Enron executives, running from late 1999 to early 2002, from slightly before the actual price fixing to slightly after the bankruptcy. Then we did what we usually do.
Here's our map of the concepts in the Enron corpus. We're showing the ten most interesting dimensions here in a nice two-dimensional projection. Color represents proximity -- points that are close to each other in the true space appear in similar colors, and the more different the colors, the farther apart the points. This helps to keep things clear when we manipulate the projection. But let's start with a few observations:
That nice orange swath up at the top is emails between traders about trades -- asking for money to be moved, futures to be cleared, and so on. Since energy trading was the principal business of Enron, it's not surprising that the topics related to energy trading are so important in this space. Down towards the lower left, in purple, we have an area that seems related to fun and recreation, with concepts like "fly kite," "girlfriend," "karaoke," and, especially, "fantasy football." But by rearranging the space a little, we can zoom in on our fifth principal component.
Actually, this isn't quite the fifth principal component but it's very close. The origin is offscreen to the northeast and along the axis we have "air pollution," "threat," "catastrophe," and "blackout." Farther out in the same direction, we see "overpay," "power outage," and "go bankrupt." It's not just counting negativity; this direction is about energy-related badness. All that good price-fixing stuff is in here. As such, we're calling it the Axis of Evil.
But that's not all. Emails have timestamps, so we can, for instance, chart evil in the email stream over time:
And we can see almost exactly what we expect: lots of evilness in summer 2000, while blackouts are being engineered, a big drop in the winter, and a resurgence in summer 2001 corresponding to the traders emailing each other newspaper articles uncovering Enron's misdeeds. Finally, after the bankruptcy, evilness slowly dies away. But could this just be the bankruptcy causing a more general drop in all correspondence that isn't about the bankruptcy itself? As it turns out, no:
As alluded to earlier, the Recreation axis is mostly football related, so we can see its relevance to the corpus increasing during football season and dropping away afterwards. This pattern persists even through the bankruptcy. When your company is swarming with feds, you may not be able to be evil, but at least you can talk sports.
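For the curious, charts like the ones above can be sketched in a few lines. This is only a rough illustration, not our actual pipeline: it assumes each email has already been turned into a vector in the concept space, and that an axis such as the Axis of Evil is just a direction in that space. The function names (`unit`, `axis_score`, `monthly_axis_scores`) and the toy three-dimensional data are made up for the example. Each email's score is its dot product with the unit-length axis, and scores are averaged per calendar month.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt


def unit(vector):
    """Scale a vector to length 1 so projections are comparable."""
    norm = sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("axis must be a non-zero vector")
    return [x / norm for x in vector]


def axis_score(doc_vector, axis):
    """Project one document vector onto a unit-length axis (dot product)."""
    return sum(d * a for d, a in zip(doc_vector, axis))


def monthly_axis_scores(emails, axis):
    """Average the axis projection of every email, grouped by (year, month).

    `emails` is an iterable of (timestamp, vector) pairs. Both the pairing
    and the vectors are hypothetical stand-ins for whatever a real
    analysis pipeline would produce.
    """
    axis = unit(axis)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, vector in emails:
        key = (timestamp.year, timestamp.month)
        totals[key] += axis_score(vector, axis)
        counts[key] += 1
    # Sort the keys so the result reads in chronological order.
    return {key: totals[key] / counts[key] for key in sorted(totals)}


# Toy data: three-dimensional vectors invented purely for illustration.
toy_axis = [0.0, 2.0, 0.0]
toy_emails = [
    (datetime(2000, 7, 3), [0.1, 0.9, 0.0]),
    (datetime(2000, 7, 19), [0.0, 0.7, 0.2]),
    (datetime(2000, 12, 5), [0.5, 0.1, 0.3]),
]
scores = monthly_axis_scores(toy_emails, toy_axis)
```

Averaging rather than summing is a choice made for this sketch: it keeps a month with fewer emails from looking less evil (or less sporty) simply because less mail was sent. Summing would instead track the total volume of talk along the axis.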
The example is certainly a little silly, but we can of course deploy the same techniques on non-ridiculous subject matter. We might risk coming off a touch unserious by posting this, but we felt it was too much fun not to share.
(Thanks to Simon Thompson, who finally gets his pictures.)