Open Data (2): Effective Data Use

Posted on September 9, 2010


My previous blogpost on Open Data went viral (going from an average of 80 hits for each new blogpost to roughly 4,000 hits in two days!) after being commented on by several leading open data advocates and then being actively re-blogged and tweeted. But perhaps more usefully it also got caught up in some crowd-sourcing, and a number of extremely useful comments and suggestions were made, both on the blog and by email, discussing the overall notions in the blogpost. These have significantly advanced my thinking in this area, so thank you to all who commented, and my apologies for not referencing each of you individually.

In the following I want to extend the argument to include some discussion and analysis that was left out of the original post, and to integrate some of the comments that folks have made into the overall argument.

First is Tim Davies’ extremely useful distinction between Open Data “supply” and Open Data “demand” (“use” in my lingo). By “supply” Tim is referring to the various background conditions and design features governing how the data being made available (open to the user) is structured, configured, and otherwise pre-processed before it reaches the end user community. I would refer readers to Tim’s Master’s thesis, where, among other things, he goes into considerable and useful detail on these issues and particularly on how they impact the overall capacity of end users to make use of the data.

A second commentator on the blog, Writeruns, makes a somewhat similar argument but relates the supply side of data more directly to the broader social and cultural context from which the data has been gathered (and processed), and further links this to how the now-decontextualized data is recontextualized in a new form. Both processes (decontextualizing and recontextualizing) have significant impacts on the semantic content of the data and thus on how the data gains meaning for the end user. One example Writeruns refers to is that of crime statistics, whose meaning (and thus whose use) may be very much a function of, for example, the geographical divisions by which the data is formatted and made available (i.e. where geographical boundaries are built into the data descriptions).

This leads to a third and perhaps even more important point, raised by a number of commentators: looking at “open data” requires a three-step process of “access”, “interpretation” and “use”. In my original blogpost I referred only to the “access” and “use” elements and made some unarticulated assumptions about “interpretation” (meaning- or sense-making).

The point here is that interpreting or understanding “open data” is a separate process from making (effective) use of the data, and that any critical analysis of “open data” use has to include how, and under what conditions, the data being made available is contextualized and given meaning. Thus, among the cases I discussed in the earlier post, in the one drawn from Tim Berners-Lee’s presentation the “interpretation” (sense-making) of the data was the contribution of the consulting firm, presumably based on their experience and expertise in ongoing work with geographically based information and advocacy.

In the Solana case, a specific feature of the UCLA Health survey was to provide training in data interpretation. This training enabled the Solana community to take their local (anecdotal) experience of high incidences of asthma and interpret the data made available through the survey to support their advocacy. In the case of the land digitization in Bangalore (and in a very interesting parallel example concerning digitization of land records in Nova Scotia, Canada), it was the expertise available to the wealthy landowners that enabled them to exploit the digitization process.

In the journal article where I first discussed the concept of “effective use” (as applied to Internet access and the digital divide), I presented a seven-layer model of how to achieve effective use of the Internet, going beyond simple Internet access as a response to the digital divide. In this current blogpost I am updating this model in the context of responding to an anticipated “data divide”.

In the following I will itemize what I think are the various elements required to be in place on the end user side for effective use of open data to take place. Some of these are more essential than others, but to my mind some component of each needs to be in place, or large numbers of those who might otherwise make use of Open Data to improve their lives, particularly the poor and marginalized, will be excluded from making “effective use” of open data.

These include:

1.      Internet access – having an available telecommunications/Internet access service infrastructure sufficient to support making the data available to all. Issues here include:

a.       the affordability of Internet access – a major issue for many, particularly in the developing world.

b.      the availability of sufficient bandwidth for the range of uses to which the data might effectively be put – e.g. whether data access has been designed on the assumption that broadband is necessary to use the data being made available.

c.       the accessibility of the network – e.g. where access to the network or to connectivity is restricted for political or other reasons.

d.      the physical accessibility/usability of access sites, for example for the physically disabled.

2.      Computers and software – having access to machines/computers/software to access and process the available data, and machines sufficiently powerful to do the various analyses; having sufficient time on the equipment to do the analyses (many people need to share computers); and having knowledge of how to operate the equipment sufficient to access and analyse the data. Does use of the data require more powerful (and expensive) computers or software than might be generally available, for example?

3.      Computer/software skills – having sufficient knowledge/skill to use the software required for the analyses, for making the mashups, for doing the crosstabs, etc. Techies know how to do the visualization work, and university and professional types know how to use the analytical software, but ordinary community people might know how to do neither, and getting that expertise/support might be difficult or expensive or both.

4.      Content and formatting – having the data available in a format (language, coding for display, appropriate geo-coding, and so on) that allows for effective use at a variety of levels of linguistic and computer literacy. What language, computer literacy, and data analytic literacy levels are required for effective use of the “open data”? Does use of the data presume a professional user, and are there means through which such professionals might be available to those who cannot afford expensive fees?

5.      Interpretation/Sense-making – having sufficient knowledge and skill to see which data uses make sense (and which don’t) and to add local value (interpretation and sense-making); being able to identify the worthwhile information and to figure out how to put the data into the right format or context, so that what might otherwise be numbers on a page becomes something that can change people’s lives.

6.      Advocacy – having supportive individual or community resources sufficient for translating data into activities for local benefit. Availability of skills and local resources, community infrastructures, training, and the means for advocacy and representation are all required to enable effective local interventions based on open (or other) data.

7.      Governance – the financing, legal, regulatory, or policy regime required to enable the use to which the data would be put.
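To make the “interpretation” element (#5) a little more concrete, here is a minimal sketch, in Python, of the kind of value-adding step a community advocate needs to perform: converting raw open-data counts into comparable per-capita rates so that the numbers can actually support (or undermine) a local claim. All district names and figures below are invented for illustration only.

```python
# Hypothetical illustration of the "interpretation" step (#5):
# raw open data (counts) only becomes advocacy-ready once it is
# turned into comparable rates. All figures below are invented.

raw_data = {
    # district: (reported_asthma_cases, population)
    "District A": (120, 8_000),
    "District B": (45, 9_500),
    "District C": (210, 7_200),
}

# Convert raw counts into rates per 1,000 residents, so that
# districts of different sizes can be meaningfully compared.
rates = {
    district: cases / population * 1_000
    for district, (cases, population) in raw_data.items()
}

# The regional average gives the local figures their context.
overall = (
    sum(cases for cases, _ in raw_data.values())
    / sum(pop for _, pop in raw_data.values())
    * 1_000
)

for district, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    flag = "ABOVE regional average" if rate > overall else "at or below average"
    print(f"{district}: {rate:.1f} cases per 1,000 ({flag})")
```

The point of the sketch is not the arithmetic, which is trivial, but that someone in the community has to know that raw counts mislead (a large district will always have “more cases”), that rates need a baseline for comparison, and that the result must be presented in a form a councillor or journalist can read at a glance.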

Looking closely at the above list and cross-checking it with the cases discussed in the earlier post, it is clear that in the Zanesville case cited by Berners-Lee all of the elements are in place, but many of them, particularly the data formatting and analysis, interpretation, and advocacy (#3, 4, 5 and 6), are being provided by expert professionals. In the Solana case the UCLA Centre is providing a degree of support for the local application (#4) and targeted training for community advocates to interpret the data (#5), and the community, very likely with the support of State funding, is providing for turning the data into advocacy (#6). The fact that Solana is located in a wealthy and highly developed part of the world ensures that they have access to the required infrastructure and software supports (#1, 2 and 3) and to a legal/regulatory system that is open to this kind of data-driven advocacy (#7).

The wealthy landowners in Bangalore are, as a matter of course, able to provide themselves with the basic technical infrastructure of Internet access, computers, and software (#1, 2 and 3). The Government of India, through its digitization program, is ensuring that the data is available in a usable format and that there is a supportive legal and regulatory system for enforcing the outcomes of decisions and actions based on this data (#4 and 7); and given their financial resources, the wealthy landowners are able to hire professionals for interpretation and (self-interested) “advocacy” (#5 and 6). In the case of Bangalore again, even if there are publicly accessible means of gaining Internet access and computer use (#1, 2 and 3), and even though the actions of the Government of India provide what they would consider a “level playing field” for elements #4 and 7, in the absence of financial resources to interpret the data and then develop advocacy actions based on it (#5 and 6), the poor and marginalized would be unable to use their data access in any meaningful way.

What the above analysis suggests is that for “open data” to have a meaningful and supportive impact on the poor and marginalized, direct intervention is required to ensure that elements currently absent from the local technology and social ecosystem are in fact made available.

In the absence of such interventions, as Tim O’Reilly so correctly observed in tweeting my original blogpost, not only can open data not be used by the poor but in fact “open data” can be used “against the poor”!