Fuzzy Link Bot is an implementation of network analysis of entities based on co-occurrence in unstructured text. It's fuzzy because it's often wrong. It makes all kinds of mistakes and forms links based on very simple criteria, but because it deals with such a large amount of data, it can present associations that you might not have found otherwise.
How It's Made
There is a Groovy application that runs every hour to check a series of RSS feeds from music news sites. It downloads the text of new articles and tries to remove ads, scripts, boilerplate, and other fluff to get just the text of the article. Then it submits the text to the Calais web service for entity extraction. The Calais service parses the text and returns info on what appeared where in the document. It categorizes things as people, places, companies, music groups, etc. and tries to determine which entities pronouns refer to. This might sound like a small thing, but parsing language is extremely difficult for a robot. Things are often misidentified or missed altogether, and this is part of what earns the robot the Fuzzy moniker.
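To give a feel for the cleanup step, here is a minimal sketch in Python (not the Groovy code the robot actually runs). It only drops the contents of a few tags that never hold article text; the tag list and the names `SKIPPED_TAGS`, `ArticleTextExtractor`, and `extract_text` are assumptions made up for this illustration, and real ad and boilerplate removal takes more than this.

```python
from html.parser import HTMLParser

# Assumed, incomplete list of tags whose contents are never article text.
SKIPPED_TAGS = {"script", "style", "noscript", "iframe"}


class ArticleTextExtractor(HTMLParser):
    """Collects text while ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML page with skipped-tag contents removed."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text` on a downloaded page gives a plain string that could then be handed to an entity extraction service.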
The results of the entity extraction are stored in a MySQL database. This data is served by Java applets for the HTML application, and by BlazeDS to the Flash application - Fuzzy Link Bot. The Flash application is written using Flex 4 and the Flare library. The HTML application uses d3.js.
Entities are drawn as circles connected to other entities based on how often they appear near each other in the articles. This is also fuzzy. Things appear near other things in the text for any number of reasons. Fuzzy tries to ignore ads, but they still slip in. Things that aren't really related to each other will get linked while other things that are related won't be. The curious thing is that if you can throw enough data at a fairly stupid process, it starts to produce results that are smarter than they should be.
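The linking idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the robot's actual code: it treats "near" as "within 200 characters", which is a made-up threshold, and the names `WINDOW` and `count_links` are hypothetical.

```python
from collections import Counter
from itertools import combinations

WINDOW = 200  # assumed: max character distance for two mentions to count as "near"


def count_links(mentions, window=WINDOW):
    """Count nearby mention pairs in one article.

    mentions: list of (entity_name, character_offset) tuples.
    Returns a Counter keyed by an alphabetically sorted pair of entity names.
    """
    links = Counter()
    for (name_a, pos_a), (name_b, pos_b) in combinations(mentions, 2):
        # Skip an entity paired with itself; count only close mentions.
        if name_a != name_b and abs(pos_a - pos_b) <= window:
            links[tuple(sorted((name_a, name_b)))] += 1
    return links


# Made-up mentions for one article.
article = [("Band A", 10), ("Singer B", 50), ("Festival C", 900), ("Band A", 950)]
links = count_links(article)
```

Adding up the counters from every article gives a weight for each link. With enough articles, the pairs that keep showing up together outweigh the accidental ones.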
When you first load the page, the entity shown as an example is the trendiest one of the moment. A trendy thing is something that has been mentioned a lot in the last seven days compared to the last year and has appeared in more than one source. When this entity changes, the robot sends a Tweet - @FuzzyLinkBot.
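One way to turn that definition into a number is sketched below in Python. The exact formula is an assumption: the score here is the share of the past year's mentions that fall in the last seven days, and the "more than one source" rule is applied to that same week. The names `trend_score` and `trendiest` are hypothetical.

```python
from datetime import datetime, timedelta


def trend_score(mentions, now):
    """Score one entity from its mentions.

    mentions: list of (timestamp, source_name) tuples.
    Returns the fraction of the past year's mentions made in the last seven
    days, or 0.0 if those recent mentions come from fewer than two sources.
    """
    week_ago = now - timedelta(days=7)
    year_ago = now - timedelta(days=365)
    in_year = [(t, s) for t, s in mentions if year_ago <= t <= now]
    in_week = [(t, s) for t, s in in_year if t >= week_ago]
    if len({s for _, s in in_week}) < 2:
        return 0.0
    return len(in_week) / len(in_year)


def trendiest(entities, now):
    """entities: dict of entity name -> mentions. Returns the top-scoring name."""
    return max(entities, key=lambda name: trend_score(entities[name], now))
```

A plain ratio like this favors entities with very few mentions overall, so a real version would probably also want a minimum mention count.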