Jan 10, 2012

The mining of crowd-sources


While some are wondering why scientists appear not to appreciate tools like Twitter for communication, there is more evidence of the value of the meta-information that can be plucked from the stream of micro-utterances.
Roughly two years ago we speculated about ways to extract (useful) crowd-information. An increase in mentions of umbrellas or rain, combined with location data, could for example provide valuable input to the weather forecast. As we put it in 'Meta Mining': "If the noise of individual utterances will be systematically analyzed for overlying macro-structures and for phase-transitions from the purely random to the organized, there will be more information gained than individually and knowingly put in. The sheer boundless chatter of Twitter and alike corresponds to the cells, the web is the organism." We encouraged readers to step back and look at structures rather than at individual tweets.
In a recent report in "The American Journal of Tropical Medicine and Hygiene" that is reviewed in Nature, scientists show how analysis of Twitter-messages would have been a quick way to detect and track the deadly cholera outbreak in Haiti - simply by looking at the number of 'cholera' posts on Twitter. They found a stunning correlation between the official number of cases and the volume of chatter related to that.
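The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the report: the helper names `count_keyword_posts` and `pearson` are made up for this sketch, and the posts and case numbers are invented toy values, not data from Haiti. The idea is to count keyword posts per day and correlate those counts with the official daily case numbers.

```python
from math import sqrt


def count_keyword_posts(posts, keyword):
    """Count, per day, the posts whose text mentions the keyword (case-insensitive)."""
    counts = {}
    needle = keyword.lower()
    for day, text in posts:
        if needle in text.lower():
            counts[day] = counts.get(day, 0) + 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented toy data (hypothetical): (day, post text) pairs and official cases per day.
toy_posts = [
    (1, "Cholera reported in the next village"),
    (1, "Raining again today"),
    (2, "More cholera cases at the clinic"),
    (2, "cholera everywhere, stay safe"),
    (3, "Cholera ward is full"),
    (3, "New cholera cases this morning"),
    (3, "Is the water safe? cholera fears"),
]
toy_official_cases = {1: 10, 2: 20, 3: 30}

daily_posts = count_keyword_posts(toy_posts, "cholera")
days = sorted(toy_official_cases)
post_series = [daily_posts.get(day, 0) for day in days]  # days without posts count as 0
case_series = [toy_official_cases[day] for day in days]

print("Correlation between post volume and official cases:",
      pearson(post_series, case_series))
```

A high coefficient on such data only says that the two curves move together. It says nothing about why, which is the caveat taken up in the comments below.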
This is only one more - scientifically documented - example of the potential of the data deluge.
It is only a matter of time until publicly available analysis tools mine crowd-sources like Twitter (or even de-personalized SMS…) for real-time input to forecasting tools.

3 comments:

Sandor Ragaly said...

Hi Carsten, really interesting as usual. While it is of course not new at all to extract higher-level information from observed basic units by aggregating data or composing meaning-loaded indicators, it is indeed a major novelty to try to integrate the -internet openness of mass information- offered by the ongoing "net revolution" - and this is still only the beginning.

You introduced a pretty ambitious cell-organism metaphor - the -still simple- examples are plausible, like tweet intensity over time as an indicator of the number of cases of an illness. Furthermore, early warnings and forecasts seem fascinating and important opportunities to explore. However, in my opinion, tools for macro-analysis of crowd communication/behaviour using (e.g. forecast) modelling may indeed yield sensible results - and then again, may not... Why is that?

The problem lies in the fact that at least one of the variables in such models has to do with the dynamics of -(human) communication- (internet or other). E.g. empirical communication research has shown that -mass media coverage of ecological (or other) issues over time- often did NOT correspond to the intensity of (or threat posed by) the respective problems. (Environmental) pressure and media coverage could even develop in opposite directions (which is partly also the case for ecological issues and the issues on the political or public agenda).

The reason is that ecological problems like air pollution (and of course your example, cholera cases, too) belong (primarily) to the sphere of physics and chemistry (measured by relatively simple "real-world indicators") - while communication and "issues", be it in the classic media, on the internet or elsewhere, are extremely dependent on -social construction-. We have to deal with a construction in which the inputs are not only the physical information about air or soil burdens, but also societal preferences and perception patterns, media issue cycles, trigger events, framing efforts, political action and so on... Communication is as complex a phenomenon as can be... (btw here we touch on the earlier blog discussion on brain, determinism and complexity).

So, similar to the classic mass media, what you have with Twitter and cholera is a relation which in other cases will very often be -broken-, mediated, even reversed or destroyed (by random results).
This of course is no argument against witty "crowd indicators", but it points at (euphoria-)intervening variables ;-) that have to be factored in, and at a use that may be more limited than it seems at first sight.

Carsten Hucho said...

Thank you for your very valid remarks! My examples were deliberately simple. A meaningful evaluation of crowd-behavior has to factor in all of the caveats you point at.
An important difference from the bias function of mass media, however, is obvious. While mass media play to the *perceived* interest of a mass audience and more often than not *generate* that interest, the crowd behavior itself is an intrinsic phenomenon - a reaction to external stimuli.
Of course, singularities (like catastrophic feedback) are conceivable - but also those are not without charm.
Crowd-sourcing of this kind is at its very beginning - it will be at its best if and when the results are not fed back. In an ideal case the analyzed source of chatter is completely decoupled from the extracted results - an assumption not too often true in real life...

Sandor Ragaly said...

You're right in that mass media have their own rules and dynamics - however, so have the individuals forming a crowd: they also have their own communication patterns. So these patterns might also develop differently from, or contrary to, "real-world indicators" like the number of illnesses, water eutrophication or the number of jobless. But precisely the media comparison is strong, because (you remember an earlier comment of mine?) journalists' news factors (influencing the news value of events/issues and their selection and salience for publication) are derived from (anticipated) audience needs and also from general human perception patterns. So we can expect significant similarity between individuals' communication (=aggregatable to crowd behaviour) and the developing aggregated media agenda - with the shortcomings related to "real-world" problem developments described in the last comment.