
The mining of crowd-sources

While some wonder why scientists do not seem to appreciate tools like Twitter for communication, there is growing evidence for the value of the meta-information that can be plucked from the stream of micro-utterances.
Roughly two years ago we speculated about ways to extract (useful) crowd-information. An increase in mentions of umbrellas or rain, combined with location data, could for example give valuable input to the weather forecast. As we put it in 'Meta Mining': "If the noise of individual utterances will be systematically analyzed for overlying macro-structures and for phase-transitions from the purely random to the organized, there will be more information gained than individually and knowingly put in. The sheer boundless chatter of Twitter and alike corresponds to the cells, the web is the organism." We encouraged readers to step back and look at structures rather than at individual tweets.
In a recent report in "The American Journal of Tropical Medicine and Hygiene", reviewed in Nature, scientists show how an analysis of Twitter messages would have been a quick way to detect and track the deadly cholera outbreak in Haiti - simply by looking at the number of 'cholera' posts on Twitter. They found a striking correlation between the official number of cases and the volume of related chatter.
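To make the idea concrete, here is a minimal sketch of such a comparison: count the posts per day that mention a keyword, then compute the Pearson correlation between those counts and the official daily case numbers. This is not the method of the report, only an illustration. The function names, the sample posts and the case numbers are all made up for this example.

```python
from collections import Counter
from math import sqrt


def daily_keyword_counts(posts, keyword):
    """Count, per day, the posts whose text mentions the keyword (case-insensitive).

    `posts` is an iterable of (day, text) pairs; this layout is an assumption.
    """
    counts = Counter()
    needle = keyword.lower()
    for day, text in posts:
        if needle in text.lower():
            counts[day] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical posts and hypothetical official case numbers, for illustration only.
posts = [
    ("day 1", "Heard about cholera in the next village"),
    ("day 1", "Nice weather today"),
    ("day 2", "Cholera cases reported at the clinic"),
    ("day 2", "More cholera, people are worried"),
    ("day 3", "cholera everywhere"),
    ("day 3", "Clinic is full, cholera again"),
    ("day 3", "Another cholera case in our street"),
    ("day 3", "Is the water safe? cholera fears"),
]
official_cases = {"day 1": 10, "day 2": 25, "day 3": 40}

mentions = daily_keyword_counts(posts, "cholera")
days = sorted(official_cases)
r = pearson([mentions[d] for d in days], [official_cases[d] for d in days])
print(f"correlation between mentions and cases: {r:.2f}")
```

A value of r close to 1 means the chatter rises and falls with the case numbers. It says nothing about why, and a simple substring match will also count jokes, news links and retweets.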
This is just one more - scientifically documented - example of the potential of the data deluge.
It is only a matter of time until publicly available analysis tools mine crowd-sources like Twitter (or even de-personalized SMS…) for real-time input to forecasting tools.


Sandor Ragaly said…
Hi Carsten, really interesting as usual. While it is not new at all, of course, to extract higher-level information out of observed basic units by aggregating data, or composing meaning-loaded indicators, it's indeed a master novelty to try to integrate the -internet openness of mass information- being offered by the ongoing "net revolution" - and this is still only the beginning.

You introduced a pretty ambitious cell-organism metaphor - the -still simple- examples are plausible, like Tweet intensity over time as an indicator for the number of sicknesses. Furthermore, early-warning/forecasts seem fascinating and important chances to be explored. However, in my opinion, tools for macro-analysis of crowd communication/behaviour using (e.g. forecast) modelling may yield sensible results in fact - and then again, may not... Why that?

The problem lies in the fact that at least one of the variables in such models has to do with the dynamics of -(human) communication- (internet or other). E.g. in empirical communication research, it has been shown that -mass media coverage of ecological (or other) issues over time- often did NOT correspond with the intensity of (or threat posed by) the respective problems. (Environmental) pressure and media coverage could even develop in opposite directions (which is partially also the case with ecological issues and the issues on the political or public agenda).

The reason is that ecological problems like air pollution (and of course your example, cholera cases, too) belong (primarily) to the sphere of natural physics and chemistry (measured by relatively simple "real-world indicators") - while communication and "issues", be it in the classic media, on the internet or elsewhere, are extremely dependent on -social construction-. We have to deal with construction, in which the inputs are not only the physical information on air or soil burdens, but also societal preferences and perception patterns, media issue cycles, trigger events, framing efforts, political action and so on... Communication is as complex a phenomenon as can be... (btw here we touch the earlier blog discussion on brain, determinism and complexity).

So, similar to the classic mass media, what you have with Twitter and cholera is a relation which will in other cases very often be -broken-, mediated, even reversed or destroyed (by random results).
This of course is no argument against witty "crowd indicators", but points at (euphoria-);-)intervening variables to be counted in, and to a use more limited than seeming at first sight, maybe.
Carsten Hucho said…
Thank you for your very valid remarks! My examples were deliberately simple. A meaningful evaluation of crowd-behavior has to factor in all of the caveats you point at.
An important difference to the bias-function of mass-media, however, is obvious. While mass-media play to the *perceived* interest of a mass-audience and more often than not *generate* that interest, the crowd-behavior itself is an intrinsic phenomenon - a reaction to external stimuli.
Of course, singularities (like catastrophic feedback) are conceivable - but also those are not without charm.
Crowd-sourcing of this kind is at its very beginning - it will be at its best if and when the results are not fed back. In an ideal case the analyzed source of chatter is completely decoupled from the extracted results - an assumption not too often true in real life...
Sandor Ragaly said…
You're right in that mass media have their own rules and dynamics - however, so have the individuals forming a crowd: they also have their own communication patterns. So these patterns might also develop differently from, or even contrary to, "real-world indicators" like the number of cholera cases, water eutrophication or the jobless. But precisely the media comparison is strong, because (you remember an earlier comment of mine?) journalists' news factors (influencing the news value of events/issues and their selection and salience for publication) are derived from (anticipated) audience needs and also general human perception patterns. So we can expect significant similarity between individuals' communication (=aggregatable to crowd behaviour) and the developing aggregated media agenda - with the shortcomings related to "real-world" problem developments described in the last comment.
