
The mining of crowd-sources


While some wonder why scientists appear not to appreciate tools like Twitter for communication, there is growing evidence of the value of the meta-information that can be plucked from the stream of micro-utterances.
Roughly two years ago we speculated about possibilities to extract (useful) crowd-information. Increasing mentions of umbrellas or rain, together with localization, could for example give valuable input to the weather forecast. As we put it in 'Meta Mining': "If the noise of individual utterances will be systematically analyzed for overlying macro-structures and for phase-transitions from the purely random to the organized, there will be more information gained than individually and knowingly put in. The sheer boundless chatter of Twitter and the like corresponds to the cells, the web is the organism." We encouraged readers to step back and look at structures rather than at individual tweets.
In a recent report in The American Journal of Tropical Medicine and Hygiene, reviewed in Nature, scientists show how analysis of Twitter messages would have been a quick way to detect and track the deadly cholera outbreak in Haiti - simply by looking at the number of 'cholera' posts on Twitter. They found a striking correlation between the official number of cases and the volume of related chatter.
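The comparison at the heart of the study can be illustrated in a few lines. Below is a toy sketch with invented numbers (not the study's data): two parallel daily time series, official case counts and keyword-mention counts, compared via Pearson's correlation coefficient.

```python
# Synthetic illustration only - the numbers are made up, not from the study.
daily_cases = [12, 30, 85, 140, 210, 180, 130, 90, 60, 40]
daily_cholera_tweets = [5, 18, 60, 110, 170, 150, 100, 70, 45, 30]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(daily_cases, daily_cholera_tweets)
print(f"Pearson r = {r:.3f}")  # close to 1: chatter tracks the case counts
```

A high r on series like these is what "stunning correlation" means quantitatively; the real analysis would of course also have to handle time lags and reporting noise.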
This is only one more - scientifically grounded - example of the potential of the data deluge.
It is only a matter of time before publicly available analysis tools mine crowd-sources like Twitter (or even de-personalized SMS…) for real-time input to forecasting tools.
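A minimal sketch of what such a mining step could look like, assuming a stream of timestamped, de-personalized messages (the message format and sample texts are invented for illustration): tally how many messages per day mention a keyword, yielding the aggregate signal a forecasting tool would consume.

```python
from collections import Counter

# Hypothetical de-personalized message stream: (date, text) pairs.
messages = [
    ("2010-10-20", "anyone else hearing about cholera near the river?"),
    ("2010-10-20", "cholera cases reported at the hospital"),
    ("2010-10-21", "stocking up on water after the cholera scare"),
    ("2010-10-21", "beautiful sunset today"),
]

def keyword_volume(stream, keyword):
    """Count messages per day that mention the keyword (case-insensitive)."""
    volume = Counter()
    for day, text in stream:
        if keyword.lower() in text.lower():
            volume[day] += 1
    return dict(volume)

print(keyword_volume(messages, "cholera"))
# {'2010-10-20': 2, '2010-10-21': 1}
```

Only the daily counts leave this step - the individual utterances stay in the noise, which is exactly the "structures rather than individual tweets" point.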

Comments

Sandor Ragaly said…
Hi Carsten, really interesting as usual. While it is of course not new at all to extract higher-level information from observed basic units by aggregating data, or by composing meaning-laden indicators, it is indeed a real novelty to try to integrate the internet openness of mass information being offered by the ongoing "net revolution" - and this is still only the beginning.

You introduced a pretty ambitious cell-organism metaphor - the still simple examples are plausible, like tweet intensity over time as an indicator of the number of cases. Furthermore, early warning and forecasting seem fascinating and important possibilities to explore. However, in my opinion, tools for macro-analysis of crowd communication/behaviour using (e.g. forecast) modelling may indeed yield sensible results - and then again, may not… Why is that?

The problem lies in the fact that at least one of the variables in such models has to do with the dynamics of (human) communication (internet or otherwise). In empirical communication research, for example, it has been shown that mass-media coverage of ecological (or other) issues over time often did NOT correspond with the intensity of (or threat posed by) the respective problems. (Environmental) pressure and media coverage could even develop contrariwise (which is partially also the case with ecological issues and the issues on the political or public agenda).

The reason is that ecological problems like air pollution (and of course your example, cholera cases, too) belong primarily to the sphere of physics and chemistry, measured by relatively simple "real-world indicators" - while communication and "issues", be it in the classic media, on the internet or elsewhere, are extremely dependent on social construction. We have to deal with a construction in which not only the physical information of air or soil burdens is an input, but also societal preferences and perception patterns, media issue cycles, trigger events, framing efforts, political action and so on… Communication is as complex a phenomenon as can be… (btw, here we touch the earlier blog discussion on brain, determinism and complexity).

So, similar to the classic mass media, what you have with Twitter and cholera is a relation which in other cases will very often be broken, mediated, even reversed or destroyed (by random results).
This is of course no argument against witty "crowd indicators", but it points at intervening variables (tempering the euphoria ;-) ) to be counted in, and at a use perhaps more limited than it seems at first sight.
Carsten Hucho said…
Thank you for your very valid remarks! My examples were deliberately simple. A meaningful evaluation of crowd-behavior has to factor in all of the caveats you point at.
An important difference from the bias-function of the mass media, however, is obvious. While the mass media play to the *perceived* interest of a mass audience and more often than not *generate* that interest, crowd-behavior itself is an intrinsic phenomenon - a reaction to external stimuli.
Of course, singularities (like catastrophic feedback) are conceivable - but also those are not without charm.
Crowd-sourcing of this kind is at its very beginning - it will be at its best if and when the results are not fed back. In the ideal case, the analyzed source of chatter is completely decoupled from the extracted results - an assumption not too often true in real life…
Sandor Ragaly said…
You're right that mass media have their own rules and dynamics - however, so do the individuals forming a crowd: they also have their own communication patterns. So these patterns might also develop differently from, or contrary to, "real-world indicators" like the number of cases, water eutrophication or the jobless. But exactly here the media comparison is strong, because (you remember an earlier comment of mine?) journalists' news factors (influencing the news value of events/issues and their selection and salience for publication) are derived from (anticipated) audience needs and also from general human perception patterns. So we can expect significant similarity between individuals' communication (aggregatable to crowd behaviour) and the developing aggregated media agenda - with the shortcomings, described in the last comment, relative to "real-world" problem developments.
