Big Data, Open Data and Official Statistics

There are (at least) two big challenges that official statistics will face in the next few years, and which may change its quasi-monopolistic position.

On the input side it’s Big Data

‘“Big Data” is a term used to describe massive information stores – generally measured in petabytes and exabytes – and also refers to the methods and technologies used to analyze these large data volumes. The core principles of Big Data (data mining, analytics) have been around for some time, but recent technology has enabled the collection and analysis of previously unimaginable data volumes at extremely high speeds.’ So says SAP, for example, which also gives some examples of how Big Data will change your life (big words, and they show how the big software and hardware players are beginning to occupy the field).

Official Statistics has already put this on the agenda! And so has the United Nations Statistics Division (UNSD) in its Friday Seminar on Emerging Issues, 22 February 2013.

Some papers from this Seminar:

Gosse van der Veen, Statistics Netherlands. High-Level Group for the Modernisation of Statistical Production and Services. Big Data: Big Opportunity!

2013-04-15_vanderveen-statcom2013

The High-Level Group for the Modernisation of Statistical Production and Services (HLG) established an informal Task Team of national and international experts, coordinated by the UNECE Secretariat. The paper by this group gives an excellent overview of the topic: What Does “Big Data” Mean for Official Statistics?

2013-04-15_HLG-BIGData-Paper

Andrew Wyckoff, Big Data for Policy, Development and Official Statistics, Directorate for Science, Technology & Industry, Organisation for Economic Co-operation and Development (OECD) (personal opinion).

2013-04-15_BigDataRoles-WykoffOECD

Aspects of Big Data and real-time analytics are provided in another paper by Global Pulse (an innovation initiative launched by the Executive Office of the United Nations Secretary-General): Big Data for Development: Opportunities & Challenges

2013-04-15_globalpulse

The discussion has been launched and, as the HLG paper mentions: ‘To use Big data, statisticians are needed with a different mind-set and new skills. The processing of more and more data for official statistics requires statistically aware people with an analytical mind-set, an affinity for IT (e.g. programming skills) and a determination to extract valuable ‘knowledge’ from data. These so-called “data scientists” can be derived from various scientific disciplines.’

On the output side it’s (Linked) Open Data in combination with APIs

Open Data is not at all a new topic for Official Statistics. National Statistical Institutes were forerunners in openly providing data; organizations like the UN or Eurostat went this way as well.

Several Open Data initiatives (USA, UK, France, EU …) consist mostly of data catalogues, and are in that sense also public relations initiatives. A large part of the data so provided consists of statistical data that are often already available on the website of the National Statistical Institute concerned. The EU portal, for instance, offers 5716 statistical datasets out of a total of 5893 (as of April 2013).
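As a quick check of how dominant statistical data are in the EU portal, the share can be computed directly from the two counts quoted above. This is only a small sketch; the variable names are illustrative.

```python
# Counts quoted in the text for the EU open data portal (April 2013).
statistical_datasets = 5716
total_datasets = 5893

share = statistical_datasets / total_datasets
other_datasets = total_datasets - statistical_datasets

print(f"Statistical datasets: {share:.1%} of the catalogue")  # about 97%
print(f"Non-statistical datasets: {other_datasets}")
```

In other words, only a small remainder of the catalogue came from sources other than official statistics at that time.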

Further central questions are the licensing of data as well as their availability in machine-readable formats.

Machine-readable statistical data, Application Programming Interfaces (APIs) to the data and especially Linked Open Data (LOD) (–> essentials, –> tutorial) open the way to creative applications and new models of presenting information.

2015-01-25_berners lee

A Europe-wide Linked Open Data (LOD2) project ‘was launched in September 2010 and will run for four years. It addresses exploitation of the web as a platform for data and information integration, and the use of semantic technologies to make government data more useable.’

Looking for third-party APPs

Data providers follow applications and mashups built on their data with great interest, and they even sponsor competitions and hack days (like Apps4EU) to stimulate the reuse of open data, especially from the public sector.

The most popular app creator and statistical storyteller is Hans Rosling with Gapminder. Rosling himself is a pioneer in fighting for open data.

http://www.youtube.com/watch?feature=player_embedded&v=jbkSRLYSojo

Changing paradigms

Open Data, Linked Open Data and APIs are changing the dissemination paradigm of statistical agencies. More people with new skills will do new things. Coding is becoming the new literacy, says, for example, Garrett Heath in his advice for his unborn daughter: ‘I was blown away that the buzz is not around mobile apps, but rather around using APIs. Ten years ago saw the creation of the social networking platforms. The past five years has been about accumulating the data. The next five years and beyond will be about interpreting that data. [My daughter will have access to] a boatload of interesting data sitting in accessible databases that is waiting to be exposed and interpreted with her [the programmer’s] creativity.’

Storytelling with data

Storytelling based on data is less and less the domain of statistical agencies. Storytelling can access multiple (new) resources and take on new forms.  To satisfy the basic idea of an easily understandable and appealing presentation of statistical content, statistical institutions cannot avoid taking certain measures to improve their content and presentation. The “composer” must know how the music is to be played, that is as a quick, competent, qualitatively unique, reliable and indispensable data source.
But this presentation job can no longer be done on one’s own: cooperative partnerships are necessary and have already begun to some extent, both with partners outside statistical institutions and between such institutions. This discussion has been launched.

Statistical Storytelling revisited! More in a paper from IMAODBC Vilnius 2010:

2013-04-20_storytellingrevisited2010.

And this: Many small open data give big data insights

FORGET BIG DATA, SMALL DATA IS THE REAL REVOLUTION, says Rufus Pollock, co-Director of the Open Knowledge Foundation: ‘… the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.’

small-data-640x120

Country Portraits – Open and Embeddable

Looking for important statistical indicators of European countries? Comparing these countries? Taking the application to your own website? Making a brochure of it?

All this is provided by a newly designed application on Statistics Switzerland’s portal.

2013-03-10_Portrait2

Embed

2013-03-10_embed

And download all countries as a brochure

2013-03-10_Brochure

Open Data

The source data (from Eurostat and Swiss Statistics) are available as an Excel file, so the data are open and the app made from them is open, too. The app supports selecting indicators, embedding, and exporting all indicators as a PDF file. It can be embedded into third-party websites, and other people can write their own apps from the same data.

2013-03-09_countryportr-explanations-downloadxls

App made with a CMS

This Portrait-App is one of several apps of the same flavour. There are also portraits of the 26 Swiss cantons, the biggest cities and the (more than) 2500 communes.

2013-03-10_portraits-full

A content management system helps build these Portrait-Apps once the data are in the correct shape, and it does so in a very short time (hours).

API and Apps: An example from official statistics

An example of an API access to statistical data

The U.S. Census Bureau now offers some of its public data in machine-readable format. This is done via an Application Programming Interface (“API”).
Based on this API, an app has been developed that helps query data from the 2010 Census:

No data without legal clarification. The Census Bureau handles it as follows:

‘Use
You may use the Census Bureau API to develop a service or service to search, display, analyze, retrieve, view and otherwise “get” information from Census Bureau data.
Attribution
All services, which utilize or access the API, should display the following notice prominently within the application: “This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.” You may use the Census Bureau name in order to identify the source of API content subject to these rules. You may not use the Census Bureau name, or the like to imply endorsement of any product, service, or entity, not-for-profit, commercial or otherwise.’
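To make the idea of API access concrete, here is a minimal Python sketch of how a client might build a query and read the answer. The endpoint path, the variable code `P001001` and the array-of-arrays response layout are assumptions that should be checked against the Census Bureau's own API documentation; the sample payload is illustrative and is not fetched live. The attribution notice is the one quoted in the terms above.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the 2010 Census Summary File 1; verify it in the
# Census Bureau's API documentation before relying on it.
BASE_URL = "https://api.census.gov/data/2010/dec/sf1"

# Attribution notice required by the terms of use quoted above.
ATTRIBUTION = (
    "This product uses the Census Bureau Data API but is not endorsed "
    "or certified by the Census Bureau."
)


def build_query_url(variables, geography, api_key=None):
    """Build a query URL from a list of variable codes and a geography."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    return f"{BASE_URL}?{urlencode(params, safe=',:*')}"


def rows_to_records(payload):
    """Convert an array-of-arrays JSON payload (header row first) to dicts."""
    header, *rows = json.loads(payload)
    return [dict(zip(header, row)) for row in rows]


# Illustrative payload in the assumed response layout (not fetched live).
SAMPLE_PAYLOAD = '[["P001001","NAME","state"],["4779736","Alabama","01"]]'

url = build_query_url(["P001001", "NAME"], "state:*")
records = rows_to_records(SAMPLE_PAYLOAD)

print(url)
print(records[0]["NAME"], records[0]["P001001"])
print(ATTRIBUTION)
```

A real app would fetch `url` over HTTP and display the attribution notice prominently, as the terms require.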

Open Government Data Benchmark: FR, UK, USA

Finally, there’s a very interesting comparison of OGD in three leading countries.

qunb did it. Have a look at this presentation.

1) There are lots of duplicates on OGD platforms

2) There are very few structured data yet

3) Apps are the real challenge

There are different strategies for fostering the development of apps made with open data. The U.K. method seems to be one of the most productive.

The presentation in French

Linked Data: It’s not a top-down system. Berners-Lee and OpenGov

There’s not much noise about the Semantic Web these days. But in the fascinating and creative Semantic Web niche, activities go on.

Once more, Tim Berners-Lee explains what Linked Data are.

The 5-star system helps measure, for example, how close open-government data are to being part of the Semantic Web.

★

Available on the web (whatever format), but with an open licence

★★

Available as machine-readable structured data (e.g. excel instead of image scan of a table)

★★★

as (2) plus non-proprietary format (e.g. CSV instead of excel)

★★★★

All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff

★★★★★

All the above, plus: Link your data to other people’s data to provide context
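The cumulative logic of the star scheme can be sketched in a few lines of Python. The `Dataset` class and its field names are hypothetical and only serve the illustration; the rule encoded is the one listed above, where each level requires all the levels below it.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    open_licence_on_web: bool      # 1 star
    machine_readable: bool         # 2 stars
    non_proprietary_format: bool   # 3 stars
    uses_w3c_standards: bool       # 4 stars (RDF, SPARQL)
    linked_to_other_data: bool     # 5 stars


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively: a level counts only if all lower ones hold."""
    criteria = [
        ds.open_licence_on_web,
        ds.machine_readable,
        ds.non_proprietary_format,
        ds.uses_w3c_standards,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# An Excel table under an open licence: machine-readable but proprietary.
excel_table = Dataset(True, True, False, False, False)
# The same table published as CSV.
csv_table = Dataset(True, True, True, False, False)

print(star_rating(excel_table), star_rating(csv_table))
```

Under this reading, a scanned PDF with an open licence earns one star, and a dataset without an open licence earns none, whatever its format.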

And there is an interesting example of how Linked Data are published:

‘Linked data is data in which real-world things are given addresses on the web (URIs), and data is published about them in machine-readable formats at those locations. Other datasets can then point to those things using their URIs, which means that people using the data can find out more about something without that information being copied into the original dataset.

This page lists the sectors for which we currently publish linked data and some additional resources that will help you to use it. Most sectors have one or more SPARQL endpoints, which enable you to perform searches across the data; you can access these interactively on this site.’
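What a search against such a SPARQL endpoint looks like can be sketched as follows. The endpoint URL and the sample result are hypothetical; the sketch assumes the standard SPARQL protocol (the query travels in a `query` parameter) and the W3C SPARQL JSON results layout, and it does not contact any server.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint URL; substitute the address the publisher documents.
ENDPOINT = "https://example.org/sparql"

# Ask for a handful of things together with their human-readable labels.
QUERY = """
SELECT ?thing ?label
WHERE { ?thing <http://www.w3.org/2000/01/rdf-schema#label> ?label }
LIMIT 5
"""


def build_request_url(endpoint, query):
    """SPARQL protocol: a query can be sent via GET in the `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


def extract_bindings(results_json):
    """Flatten a SPARQL JSON results document into plain dicts."""
    doc = json.loads(results_json)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


# Illustrative response in the SPARQL JSON results layout (not fetched).
SAMPLE_RESULTS = json.dumps({
    "head": {"vars": ["thing", "label"]},
    "results": {"bindings": [{
        "thing": {"type": "uri", "value": "https://example.org/id/school/1"},
        "label": {"type": "literal", "value": "Example School"},
    }]},
})

request_url = build_request_url(ENDPOINT, QUERY)
rows = extract_bindings(SAMPLE_RESULTS)

print(request_url)
print(rows)
```

The URI in the `thing` column is exactly the kind of web address the quote describes: other datasets can point to it instead of copying the information.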

Evaluating

What’s the effect of open data? Some newspapers (like the Guardian) make ample use of open data, but there is no widespread activity or commitment, and few evaluation studies are to be seen. InformationWeek just published an article about US open government and found that a lot remains to be done, as only small groups seem to take notice of this government activity. ‘The most difficult part of open government may be getting the public to participate. … the “if you build it, they will come” approach simply doesn’t work.’ (InformationWeek, Feb 21, 2011: Open Government Reality Check: Federal agencies are making progress on the Obama administration’s Open Government Directive, but there’s still a long way to go. Here’s our list of top priorities.)

Using facebook apps

As facebook captures more and more of the time users spend online, content providers are increasingly deciding to move to this continent of more than 500 million users.

Facebook allows this by  integrating apps. Lots of companies specialise in this field and provide facebook-app-development services.

A much discussed facebook app is the one of the London School of Business and Finance LSBF.

LSBF goes where they think people interested in this school are.

Why not follow this idea and put some statistical literacy topics on facebook?

And here are some stats about facebook:

http://www.facebakers.com/facebook-statistics/

Bye-bye Browser (?)

For more and more online users the device of choice is a mobile device, and for more and more of these users ‘Apps are the Web and the Web is Apps’.

Applications (apps) for mobile devices can be downloaded and installed in seconds. These apps focus on certain needs, and perhaps half a dozen apps meet the daily online needs of you and me.

With Apple’s planned App Store for laptop and desktop computers, these devices will adopt this philosophy, too. So what about the future of web surfing with classic browsers? And what about the future of complex websites offering many levels of browser navigation and tons of pages delivering information?

The discussion (the fight) is under way and the users will decide.

For information suppliers like statistical agencies this issue is of huge importance.

How can the mission of public information and democracy be fulfilled given such developments in the online world?

– with traditional websites?
– with (small) Apps (or Widgets) with specific, user-focused information portions?
– or both (for how long)?
– with integration into existing Apps or platforms where people are, like facebook or Google?

Some interesting developments in statistical dissemination already give partial answers.

So have a look at:

CBS iPhone App (search CBS Statline in the iPhone App store)

And also at some of the widgets, for example: https://blogstats.wordpress.com/2010/09/10/imaodbc-2010-and-the-winner-is/

Aggregating statistical news with twitter

More and more statistical agencies are using twitter to spread their news. It’s quite simple: open a twitter account and populate it via the RSS feeds the agency already offers. Several tools like twitterfeed, hootsuite or Google’s feedburner help to do this. Here is the Slovenian example (using twitterfeed).
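What such a feed-to-twitter tool does internally can be sketched in Python. This is not how twitterfeed or the other tools are actually implemented; the sample feed is made up, the 140-character limit is the one twitter applied at the time, and counting raw characters is a simplification. Posting to twitter is left out on purpose.

```python
import xml.etree.ElementTree as ET

TWEET_LIMIT = 140  # twitter's character limit at the time of writing

# Illustrative RSS 2.0 feed; titles and links are made up.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Statistical Office</title>
  <item><title>Consumer prices, September</title>
        <link>https://example.org/news/1</link></item>
  <item><title>Labour force survey, second quarter</title>
        <link>https://example.org/news/2</link></item>
</channel></rss>"""


def to_tweet(title, link, limit=TWEET_LIMIT):
    """Join title and link, shortening the title if the text is too long."""
    room = limit - len(link) - 1  # one space between title and link
    if len(title) > room:
        title = title[: room - 3].rstrip() + "..."
    return f"{title} {link}"


def feed_to_tweets(feed_xml):
    """Turn every <item> of an RSS feed into one tweet-sized string."""
    root = ET.fromstring(feed_xml)
    return [
        to_tweet(item.findtext("title", ""), item.findtext("link", ""))
        for item in root.iter("item")
    ]


tweets = feed_to_tweets(SAMPLE_FEED)
for tweet in tweets:
    print(tweet)
```

The link is always kept whole and only the title is shortened, since the link is what leads readers back to the agency's website.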

Make a list

But how do you get an overview of the many twitter accounts you are interested in? You can follow them from your own account (and only you will see the result), or you can populate your own or another twitter account with all the RSS feeds of the many agencies.
But there’s an even better solution. Twitter offers a good tool: lists. Put all the twitter accounts you like into a list and you will find the news in one place.

An example: I put (surely not all) twitter accounts of statistical agencies into the list ‘stats-all’.

Here it is:

Embed the list

The advantage of a list: it can be embedded in a third-party website using the twitter widget service. And there it is updated continually.

See this example:

http://status.li/WordPress/?page_id=82

And this example, embedding the list in another aggregator like netvibes ;)

Netvibes. International Statistics. Experimental.

Create a magazine

And that’s not all: with such a list you can create a Statistical Magazine on the iPad without any further programming, using one of the great free apps, for example flipboard.
Add this list http://twitter.com/grar/stats-all to flipboard and it’s done. flipboard shows more than the two or three lines twitter displays and gives a first insight into the content behind each update.

More data.govs

USA: On May 21, 2009, the Obama administration launched Data.gov, a web site that provides access to raw data from federal government agencies.

UK: In June 2009 Gordon Brown gave Tim Berners-Lee the job of helping to open up access to government data.

And now data.gov.uk is ready to get official.


The media show huge interest in unveiling the story behind this new data access. From “Whitehall’s web revolution: the inside story” (The Prospect) to “Tim Berners-Lee unveils government data project” (BBC), all flavours are there.

On data.gov.uk, some interesting applications that others have created and submitted can already be seen. They take data directly (API, linked data, csv downloads???) from the government data silos (such as ONS and especially ONS Census geography) and use these data to offer public services.

As an example have a look at the house prices:

There is also a SPARQL endpoint, but no help page provided for it.

But not everything is a web service in the modern sense: some public data are offered just as simple downloads in xls or pdf (even tables!), for instance births outside marriage. Most statistical offices have offered this kind of service for some time already; in the UK it is also part of a PR campaign.

See the listed ONS (Office for National Statistics) offer on data.gov.uk -> here

See this new article comparing the two data.gov sites.

Timetric Makes Web Data Useful with Time Series Analysis

(from ReadWriteStart)

Written by Jolie O’Dell / August 5, 2009 8:40 PM

A winner at this year’s mini-Seedcamp in London, Timetric is an app from Inkling Software, a three-principle shop composed of chemistry and physics PhDs.

The premise is fairly simple: Timetric was created to store, share, and analyze data over time. For predicting trends, proving assertions, or recommending actions, time series analysis is a highly valuable tool. It’s Facebook’s Lexicon all grown up and actually useful, pulling data from all over the web and querying this huge database to serve significant results.

Aviary timetric-com Picture 2

Yahoo Pipes – still going strong

Mixing up several web services into a new web service – a mashup –  is made easy with Yahoo Pipes (see earlier post).

Yahoo Pipes can also integrate CSV files, which makes it interesting for statistics (globalizes official statistics).

An example with data from the US Census Bureau:

csv pipe

Another one for UK Cities

csv pipe-uk

This example uses an interesting workflow, which its author describes in full detail:

‘So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.’
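The last steps of this workflow (consume CSV, emit KML) can be sketched outside Yahoo Pipes in plain Python. This is not the author's pipe: the sketch assumes the CSV already carries latitude and longitude columns, whereas the original workflow geocoded the place names inside the pipe. The column names and sample rows are illustrative, with approximate coordinates.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative CSV; column names are assumptions, coordinates approximate.
SAMPLE_CSV = """name,lat,lon
London,51.5074,-0.1278
Birmingham,52.4862,-1.8904
"""

KML_NS = "http://www.opengis.net/kml/2.2"


def csv_to_kml(csv_text):
    """Turn rows with name/lat/lon columns into a KML document string."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f"{row['lon']},{row['lat']}"
    return ET.tostring(kml, encoding="unicode")


kml_text = csv_to_kml(SAMPLE_CSV)
print(kml_text)
```

The resulting KML string can be saved to a file and opened in any map viewer that reads KML.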