Copied and pasted from web resources, kept for backup and reference


8 May 2009

The Business of Mining the Twitter Stream

(http://datamining.typepad.com/data_mining/2009/02/the-business-of-mining-the-twitter-stream.html)

February 19, 2009

While mining Twitter data for business and marketing intelligence (trend/buzz analysis, sentiment/opinion mining, authority/influence analysis) looks like a compelling path to explore for a business model, it is important to consider the proposition from the point of view of the customer. Enterprises have been working with vendors in this space (mining social media content for BI) for well over five years and already have expectations regarding the features and quality of reports that these analytics need to deliver to be useful (actionable).

  • Domain coverage: how broad is the topical space available in the solution? Crawling all data sources is the way to win here.
  • Demographic coverage: the broader the demographic coverage (and the accuracy with which the demographic features of the content authors can be determined) the better.
  • Content Analysis/Text Mining: how well does the solution take all the unstructured content and deliver structured interpretations that can then act as the input for further data mining? This is generally a matter of applied research (taking the current state of the art in text mining and making it work with the greater variety and complexity of social media content).
  • Timeliness: how timely is the analysis? This is generally a function of how promptly the data is collected. Blog data, for example, can be gathered in a very timely manner thanks to the ping/feed mechanism. However, the reality of real-time mining is that the consumer of the data is the real calibrator: real time may mean every four hours, not second by second.
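
To make the timeliness point concrete, here is a minimal sketch of rolling individual message timestamps up into four-hour reporting windows. The function names, the sample timestamps, and the choice of UTC are illustrative assumptions, not part of any vendor's product.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed reporting interval, taken from the "every four hours" example above.
WINDOW_HOURS = 4


def window_start(ts: datetime, hours: int = WINDOW_HOURS) -> datetime:
    """Round a timestamp down to the start of its reporting window."""
    return ts.replace(hour=ts.hour - ts.hour % hours,
                      minute=0, second=0, microsecond=0)


def mentions_per_window(timestamps):
    """Count how many messages fall into each reporting window."""
    return Counter(window_start(ts) for ts in timestamps)


# Hypothetical message timestamps, for illustration only.
sample = [
    datetime(2009, 2, 19, 1, 15, tzinfo=timezone.utc),
    datetime(2009, 2, 19, 3, 59, tzinfo=timezone.utc),
    datetime(2009, 2, 19, 4, 0, tzinfo=timezone.utc),
]
counts = mentions_per_window(sample)
for start, n in sorted(counts.items()):
    print(start.isoformat(), n)
```

A customer who wants a report every four hours only needs the data to arrive before each window closes, which is a much looser requirement than second-by-second streaming.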

If the business model for Twitter is going to be mining the Twitter stream for BI/MI, then they will be competing with companies that gather very large data sets (weblogs, usenet, message boards, reviews, groups, mailing lists, etc.). Seth Grimes suggested that the short texts of the Twitter stream may make hard problems like sentiment mining simpler, as the limited space requires the author to be concise. However, this is a double-edged sword, as it means that the depth of analysis will be far shallower.

I believe that mining Twitter data will be a very exciting experiment, but I think that if Twitter goes down this path, it will have to either provide analytics over the other data sets, or partner with an existing company (say Visible Technologies). In fact, such a partnership would take the burden of building out an analytics engine away from the small Twitter team, allowing them to continue to focus on infrastructure and ensuring the flow of this valuable data stream.

Comments

Matthew, thanks for the mention. I'd venture that tweet mineability is also easier because short messages cover a single topic.

Short messages are easy to post, so folks can post more frequently. So maybe the more interesting thing to mine from Twitter is message propagation. Then from propagation threads and connectedness patterns, one could infer influence networks and knowledge about the types, topics, and forms of messages that travel farthest and fastest.
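
As a rough illustration of the propagation idea, the sketch below builds a repost graph from (reposter, source) pairs and ranks users by how many distinct accounts sit downstream of them. The data, the pair format, and the function names are hypothetical; real propagation data would have to be reconstructed from the messages themselves.

```python
from collections import defaultdict, deque


def build_graph(reposts):
    """Map each source user to the set of users who reposted them directly."""
    graph = defaultdict(set)
    for reposter, source in reposts:
        graph[source].add(reposter)
    return graph


def reach(graph, origin):
    """Count distinct users downstream of origin, using breadth-first search."""
    seen = {origin}
    queue = deque([origin])
    while queue:
        user = queue.popleft()
        for reposter in graph.get(user, ()):
            if reposter not in seen:
                seen.add(reposter)
                queue.append(reposter)
    return len(seen) - 1  # exclude the origin itself


# Hypothetical (reposter, source) pairs, for illustration only.
reposts = [
    ("bob", "alice"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("erin", "frank"),
]
graph = build_graph(reposts)
ranking = sorted(graph, key=lambda user: reach(graph, user), reverse=True)
print(ranking)
```

Downstream reach is only one crude proxy for influence; speed of propagation and topic would need timestamps and content analysis on top of this.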

I don't get it, anyone can mine twitter for sentiment (using the search API)... why would twitter reinvent the wheel?

Nice analysis, Matthew - I'd also say that, since part of the way people use Twitter is to share links to interesting content/conversations elsewhere, analysing the networks around the Twitter streams is very important indeed.

There's a reality of the value of the raw data to the marketplace, which I'll get to in a minute. Regardless, short messages may very well be harder to search, not easier. Here are some reasons why:

* For indexing purposes, it's not only the size of the text corpus that matters, it's the number of objects. So a search architecture has to take that into account. It's a non-trivial problem, especially with the kind of volumes involved here. Not to mention that servers are going to be thrashed with reading/writing if anything is meant to be done in real time. (Perhaps less so for batch analysis, of course.)

* Next we have the nature of the messages themselves. Due to the 140-character limit, there's an increase in odd acronyms even beyond brb, lol, etc. Perhaps synonym dictionaries could be produced, but the variability here seems extreme, just based on anecdotal experience.

* Regarding sentiment mining, that's difficult enough in larger text, but may be harder in small text. Not for raw sentiment where the phrases are obvious. But sentiment analysis lags with regard to humor and sarcasm, which may need more markers to divine actual meaning.
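
The last two points can be sketched in a few lines: a toy synonym dictionary that expands acronyms, followed by a naive word-count sentiment score. The dictionary entries, word lists, and function names are all made up for illustration; this is not how any particular vendor does it, and the second example shows the sarcasm problem directly.

```python
import re

# Toy dictionaries, assumed purely for illustration.
ACRONYMS = {
    "brb": "be right back",
    "lol": "laughing out loud",
    "imo": "in my opinion",
}
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


def normalize(text):
    """Lowercase, tokenize, and expand known acronyms into plain words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    expanded = []
    for token in tokens:
        expanded.extend(ACRONYMS.get(token, token).split())
    return expanded


def raw_sentiment(text):
    """Positive word count minus negative word count; ignores context."""
    tokens = normalize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


print(raw_sentiment("IMO this phone is awful"))
print(raw_sentiment("Oh great, my flight got cancelled again"))
```

The first message scores negative as expected, but the sarcastic second one scores positive because "great" is counted at face value, and a short message offers few other markers to correct it.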

These are solvable problems. And in the latter case, it might not matter that terribly much if some stuff gets missed as general trends can still be spotted easily enough. Personally, I feel confident someone(s) will work this out to some reasonable degree of satisfaction.

Next, as to the dollars. I can tell you from experience the industry does not value the raw data terribly highly for specific social media data streams. The value-added analysis? Yes. The actual data? Not so much. This is because it's easy enough for a variety of people to crawl blogs, forums and so forth. And several do, though in some cases there are really only a couple of providers feeding data to the 60+ reputation monitoring companies.

Unless Twitter made itself the sole source of the full data stream, they wouldn't be able to command that great a price. I'm just guesstimating based on past experience with other data types here, but MAYBE 1M / month if they sold to every rep services company out there. (Who would in turn add analysis and re-sell for much more.) That's decent money, but it's not 'to the moon' money. I could be wrong here. People are valuing this stuff more highly. But to really capitalize on it, there's no way they could just let anyone suck down all they could eat off the stream. Which means less open. Which is fine. They're entitled to do so.

We'll see!

Nice post. I have added this blog to my RSS subscriptions. Very nicely put on data mining using social media. I honestly have not thought about it in this much detail, but it makes sense and could be used as a great competitor intelligence tool!

I'm still working on it, but Twitter data sure is tasty. Lots of goodies!

Thanks for the rundown.

Mike

www.wannadevelop.com

There is a lot of potential in analyzing tweets: segmentation of users and sentiment analysis, to name a few. In my experience, the fact that tweets are at most 140 characters makes things easier, both in catching emerging trends and in text analysis.

Combining Information Extraction and Ontologies (using IE to mark up text and NLP to insert information into an ontological setting) is the way to go, although it requires considerable effort.
