Reading a Picture

Visual storytelling

Visualising data helps us understand facts.
Sometimes a graph is very easy to understand; sometimes it has to be read and studied to discover unknown territory.

Such graphs are little masterpieces. Here is one of them, and I am sure the authors went through more than one iteration and discussion while creating it.
The graph tells the story of the average disposable income and savings of households in Switzerland, published by the Swiss Federal Statistical Office (FSO).

[Figure: Disposable income and savings of households in Switzerland (FSO)]

The authors kindly give a short explanation:

How to read this graph.
In one-person households aged 64 or under, the upper-income group has a disposable income of CHF 8487 per month and savings of CHF 2758 per month. Representing 4.0% of all households, this income group corresponds to a fifth of one-person households aged 64 or under (20.1%).
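A quick cross-check of the two percentages (a derived figure, not stated in the explanation): if this income group makes up 4.0% of all households and at the same time 20.1% of one-person households aged 64 or under, then that household type accounts for about 4.0% / 20.1% ≈ 19.9% of all households.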

There is another nice graph, a little less elaborate, also explained by the authors:

[Figure: Poverty rates]

Statistics ♥

But there’s one thing that is not explained:

[Figure: Poverty rates with confidence intervals]

The confidence interval!

‘A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data’, and the above poverty data are from a sample of ‘approximately 7000 households, i.e. more than 17,000 persons who are randomly selected…’.
Or:
‘The confidence intervals for the mean give us a range of values around the mean where we expect the “true” (population) mean is located (with a given level of certainty, see also Elementary Concepts). […] As we all know from the weather forecast, the more “vague” the prediction (i.e., the wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. […]’
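As a worked illustration of these quotes, here is a minimal sketch in Python that computes a 95% confidence interval for a mean. The sample values are made up for the example; they are not the FSO survey data.

```python
# Minimal sketch: a 95% confidence interval for a mean,
# computed from a fictitious sample (not the FSO survey data).
import math
import statistics

sample = [7.2, 6.8, 7.9, 6.5, 7.4, 8.1, 6.9, 7.6]  # made-up rates in %

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

z = 1.96  # normal approximation for ~95% coverage
lower, upper = mean - z * sem, mean + z * sem
print(f"mean {mean:.2f}% with 95% CI [{lower:.2f}%, {upper:.2f}%]")
```

Note how the width depends on the sample size, just as the quote says: with the roughly 7000 households of the survey instead of 8 made-up values, the standard error, and hence the interval, would be far narrower.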

Khan Academy offers lectures on topics like confidence intervals, sampling, etc.

[Screenshot: Khan Academy]

Which one?

The graphs above use just a few of the many possible ways of visualising data.

[Figure: The Data Visualisation Catalogue]

Severino Ribecca’s Data Visualisation Catalogue is one of many websites that try to give an overview. And there is a risk of getting lost in these compilations.
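Which chart type to pick from such a catalogue depends on what the data should say. A minimal sketch, with made-up numbers, of the same series drawn in two of the many possible ways:

```python
# Minimal sketch: one (made-up) data series, two of the many chart types.
import matplotlib.pyplot as plt

years = [2012, 2013, 2014, 2015, 2016]
rate = [7.6, 7.0, 6.7, 7.0, 7.5]  # fictitious rates in %

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(years, rate)               # bar chart: emphasises single years
ax2.plot(years, rate, marker="o")  # line chart: emphasises the trend
ax1.set_title("Bar chart")
ax2.set_title("Line chart")
plt.show()
```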

[Image: swim rings. © listverse.com]

LOD MOOC

Massive Open Online Courses (MOOCs) are available worldwide and cover tons of topics, including Linked Open Data (LOD): an easy way to enter the Semantic Web.

Two examples:

HPI

[Screenshot: openHPI]

The Hasso Plattner Institute in Potsdam has for some years now provided a course in Linked Data Engineering, with a certificate. I took it some years ago and enjoyed it.

FUN (INRIA)

[Screenshot: the FUN LOD course]

The French platform FUN offers a LOD course, too. (Thanks to Adrian at zazuko.com for the hint.)

And books

Step by step, Bob DuCharme introduces RDF, SPARQL, LOD …

[Image: Bob DuCharme’s SPARQL book]


[Image: the book’s preface]
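To give a flavour of what RDF and SPARQL look like in practice, here is a minimal sketch that sends a query to the public DBpedia endpoint using only Python’s standard library. The query is an illustration, not an example from the book.

```python
# Minimal sketch: asking a public SPARQL endpoint for Switzerland's capital.
# The query is illustrative, not taken from DuCharme's book.
import json
import urllib.parse
import urllib.request

query = """
SELECT ?capital WHERE {
  <http://dbpedia.org/resource/Switzerland>
      <http://dbpedia.org/ontology/capital> ?capital .
}
"""

url = "https://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)
with urllib.request.urlopen(url) as response:
    results = json.load(response)

for row in results["results"]["bindings"]:
    print(row["capital"]["value"])  # e.g. http://dbpedia.org/resource/Bern
```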

20 Years Ago

1996

On 2 September 1996, Statistics Switzerland published its brand-new website, www.bfs.admin.ch. It was one of the first (if not the first) websites of the Swiss Federal Administration (www.admin.ch).

[Image: info-internet]

In three languages…

… and already with quite rich structure and content.

[Screenshot: SwissStats, April 1997]

The Wayback Machine …

… shows the developments since 1996:

[Screenshot: the Wayback Machine]

https://archive.org/web/.

1996:
Handmade, with FrontPage as editing software

[Screenshot: SwissStats subject fields]

https://web.archive.org/web/19970502093430/http://www.admin.ch/bfs/stat_ch/eber_m.htm

November 2004:
New layout, made with Day Communiqué as the Content Management System (CMS) and a database for file download

[Screenshot: StatSchweiz, November 2004]

December 2007 ff.:
Layout adapted to the general layout of admin.ch, using the same CMS, since bought by Adobe and renamed Adobe Experience Manager (AEM)

[Screenshot: StatSchweiz, December 2007]

[Screenshot: www.bfs.admin.ch, 2016]

The next one …

… must be based on a new technology:

  • CMS remains state of the art for content presentation
  • Assets come from databases
  • Web services (via a web service platform) deliver assets from databases to the presentation platform.

And with such a three-layer architecture, the website will be able to display data from ubiquitous databases and also offer data to third parties via web services: open-data compatible.
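What such a web-service layer could look like, as a minimal sketch: an in-memory database stands in for the asset databases, and a tiny HTTP service delivers the assets as JSON to the presentation layer and to third parties alike. Route, table and figures are hypothetical, not the FSO’s actual service.

```python
# Minimal sketch of the web-service layer in the three-layer architecture.
# Table, route and figures are hypothetical.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Layer 2: assets come from a database (in-memory stand-in here).
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE indicators (name TEXT, year INTEGER, value REAL)")
db.execute("INSERT INTO indicators VALUES ('population_millions', 2015, 8.3)")

# Layer 3: a web service delivers the assets as JSON.
class AssetService(BaseHTTPRequestHandler):
    def do_GET(self):
        rows = db.execute("SELECT name, year, value FROM indicators").fetchall()
        body = json.dumps(
            [{"name": n, "year": y, "value": v} for n, y, v in rows]
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Layer 1, the CMS, and any third party consume the same endpoint.
    HTTPServer(("localhost", 8000), AssetService).serve_forever()
```

The point of the design is that the presentation platform and external re-users consume the very same service, which is what makes the setup open-data compatible.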

Disrupting Dissemination – From Print to Bots

Digitally disrupted data production

“The collection of statistics has been digitally disrupted, along with everything else, and there are important questions about collection methods and whether or not “big data” genuinely offers promise for a giant leap forward in the productivity of official statistics.”

This statement, from the Financial Times of 20 August 2015, deals with the UK’s Office for National Statistics (ONS). The article’s title: “UK needs a statistical strategy to catch up with digital disruption”. Its message: the ONS (and, I think, all of official statistics) has problems keeping up “with the profound changes in the structure of the economy during recent decades.”

The “Independent Review of UK Economic Statistics” (March 2016) by Professor Sir Charles Bean, Professor of Economics at the London School of Economics, goes deeper and gives 24 recommendations, some of them obviously valid for the production and producers of statistics in general.

[Image: the Independent Review of UK Economic Statistics]

“Innovation and technological change are the wellspring of economic advancement. The rapid and sustained rise in computing power, the digitisation of information and increased connectivity have together radically altered the way people conduct their lives today, both at work and play. These advances have also made possible new ways of exchanging goods and services, prompted the creation of new and disruptive business models, and made the location of economic activity more nebulous. This has generated a whole new range of challenges in measuring the economy.” (p.71)

“Measuring the economy has become even more challenging in recent times, in part as a consequence of the digital revolution. Quality improvements and product innovation have been especially rapid in the field of information technology. Not only are such quality improvements themselves difficult to measure, but they have also made possible completely new ways of exchanging and providing services. Disruptive business models, such as those of Spotify, Amazon Marketplace and Airbnb, are often not well-captured by established statistical methods, while the increased opportunities enabled by online connectivity and access to information provided through the internet have muddied the boundary between work and home production. Moreover, while measuring physical capital – machinery and structures – is hard enough, in the modern economy, intangible and unobservable knowledge-based assets have become increasingly important. Finally, businesses such as Google operate across national boundaries in ways that can render it difficult to allocate value added to particular countries in a meaningful fashion. Measuring the economy has never been harder.” (p. 3)

And: “Recommended Action 4: In conjunction with suitable partners in academia and the user community, ONS should establish a new centre of excellence for the analysis of emerging and future issues in measuring the modern economy.”  (p.118)

Disrupting Dissemination of Statistics

The rise of new technologies, followed by new information behaviour, has also disrupted existing dissemination formats (from print to digital) and dissemination practices (from quasi-monopolistic to open and multiple).

A well-known example of disruptive dissemination is given by Wikipedia, and its subject is Wikipedia itself:
“The free, online encyclopedia Wikipedia was a disruptive innovation that had a major impact on both the traditional, for-profit printed paper encyclopedia market (e.g., Encyclopedia Britannica) and the for-profit digital encyclopedia market (e.g., Encarta). The English Wikipedia provides over 5 million articles for free; in contrast, a $1,000 set of Britannica volumes had 120,000 articles.” (Article: https://en.wikipedia.org/wiki/Disruptive_innovation)

In fact, disruptive tendencies happen on both sides: in producing and in presenting or accessing statistical information.

Some thoughts about this:

  1. Until the end of the 20th century, print was the main channel for disseminating statistics. Libraries, in statistical offices and in society at large, played a vital role.
  2. The internet opened a new channel: statistical offices’ websites appeared, and access to databases and attractive data presentation (visual, storytelling, see e.g. this) were top themes and the stuff of long discussions. Access to data became simple and open to everyone.
  3. With the open data initiatives, not only accessing but also disseminating statistical information became much easier. Nearly everyone could become a data provider. License fees no longer hindered the redissemination of official statistics, and APIs or web services provided by statistical offices made this possible in an automated way (see the first sketch after this list).
    Statistics can easily be integrated into the websites and apps of non-official data providers, with all the chances this offers for democratic conversation and all the risks of data misuse.
  4. All this gives statistics a much more important role in communication processes. On the other hand, communicating with statistics gets simpler: letters, telephone calls and even e-mails become cumbersome compared with the possibilities bots (will) provide. With a stats bot in my daily messenger, I ask for a piece of statistical information, and the bot uses a search engine or connects me directly to a statistical expert (see the second sketch after this list).
    “Brands that already have full-fledged apps and responsive websites can take advantage of bots’ ability to act as concierges, handling basic tasks and micro-interactions for users and then gracefully connecting users with apps or websites, as appropriate, for a more involved experience.” (Adam Fingerman, VentureBeat, 20.7.2016)
  5. What’s next? Innovation with disruption goes on, but disruption does not always mean destruction: it is still a wise decision to keep some information in paper format. A statistical yearbook with key data lasts for centuries; not so a website, an API or a bot.
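A minimal sketch of point 3, consuming such a web service from a third-party app; the URL and the response fields are hypothetical, standing in for whatever endpoint a statistical office publishes:

```python
# Minimal sketch of point 3: re-disseminating official statistics via an API.
# The URL and response fields are hypothetical.
import json
import urllib.request

URL = "https://api.example-statistical-office.org/poverty-rate?year=2014"

with urllib.request.urlopen(URL) as response:
    data = json.load(response)

# A third-party website or app can now embed the figure automatically.
print(f"Poverty rate {data['year']}: {data['value']}%")
```

And a minimal sketch of the stats bot from point 4; the tiny knowledge base and the hand-off to an expert are, of course, hypothetical:

```python
# Minimal sketch of point 4: a stats bot in a messenger.
# Knowledge base and hand-off are hypothetical.
FACTS = {
    "population": "Switzerland had about 8.3 million inhabitants in 2015.",
    "poverty": "See the FSO poverty figures discussed above.",
}

def stats_bot(message: str) -> str:
    """Answer from the knowledge base, or hand over to a human expert."""
    for keyword, answer in FACTS.items():
        if keyword in message.lower():
            return answer
    return "I don't know that one; connecting you to a statistical expert …"

print(stats_bot("What is the population of Switzerland?"))
print(stats_bot("How do I bake bread?"))
```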


Statistics is Dead – Long Live Statistics

To be an expert in a thematic field!

Lee Baker wrote an article that will please the whole community of official statistics, where specialists in many thematic fields (and not only statisticians or mathematicians or … data scientists) collect, analyse, interpret, explain and publish data.
It’s this core message that counts:
“… if you want to be an expert Data Scientist in Business, Medicine or Engineering” (or, vice versa, an expert statistician in a field of official statistics like demography, economics, etc.) “then the biggest skill you’ll need will be in Business, Medicine or Engineering. … In other words … you really do need to be an expert in your field as well as having some of the other listed skills.”

Here is his chain of arguments:

“Statistics is Dead – Long Live Data Science…

by Lee Baker

I keep hearing Data Scientists say that ‘Statistics is Dead’, and they even have big debates about it attended by the good and great of Data Science. Interestingly, there seem to be very few actual statisticians at these debates.

So why do Data Scientists think that stats is dead? Where does the notion that there is no longer any need for statistical analysis come from? And are they right?

Is statistics dead or is it just pining for the fjords?

I guess that really we should start at the beginning by asking the question ‘What Is Statistics?’.
Briefly, what makes statistics unique and a distinct branch of mathematics is that statistics is the study of the uncertainty of data.
So let’s look at this logically. If Data Scientists are correct (well, at least some of them) and statistics is dead, then either (1) we don’t need to quantify the uncertainty or (2) we have better tools than statistics to measure it.

Quantifying the Uncertainty in Data

Why would we no longer have any need to measure and control the uncertainty in our data?
Have we discovered some amazing new way of observing, collecting, collating and analysing our data that we no longer have uncertainty?
I don’t believe so and, as far as I can tell, with the explosion of data that we’re experiencing – the amount of data that currently exists doubles every 18 months – the level of uncertainty in data is on the increase.

So we must have better tools than statistics to quantify the uncertainty, then?
Well, no. It may be true that most statistical measures were developed decades ago when ‘Big Data’ just didn’t exist, and that the ‘old’ statistical tests often creak at the hinges when faced with enormous volumes of data, but there simply isn’t a better way of measuring uncertainty than with statistics – at least not yet, anyway.

So why is it that many Data Scientists are insistent that there is no place for statistics in the 21st Century?

Well, I guess if it’s not statistics that’s the problem, there must be something wrong with Data Science.

So let’s have a heated debate…

What is Data Science?

Nobody seems to be able to come up with a firm definition of what Data Science is.
Some believe that Data Science is just a sexed-up term for statistics, whilst others suggest that it is an alternative name for ‘Business Intelligence’. Some claim that Data Science is all about the creation of data products to be able to analyse the incredible amounts of data that we’re faced with.
I don’t disagree with any of these, but suggest that maybe all these definitions are a small part of a much bigger beast.

To get a better understanding of Data Science it might be easier to look at what Data Scientists do rather than what they are.

Data Science is all about extracting knowledge from data (I think just about everyone agrees with this very vague description), and it incorporates many diverse skills, such as mathematics, statistics, artificial intelligence, computer programming, visualisation, image analysis, and much more.

It is in the last bit, the ‘much more’ that I think defines a Data Scientist more than the previous bits. In my view, if you want to be an expert Data Scientist in Business, Medicine or Engineering then the biggest skill you’ll need will be in Business, Medicine or Engineering. Ally that with a combination of some/all of the other skills and you’ll be well on your way to being in great demand by the top dogs in your field.

In other words, if you want to call yourself a Data Scientist you really do need to be an expert in your field as well as having some of the other listed skills.

Are Computer Programmers Data Scientists?

On the other hand – as seems to be happening in Universities here in the UK and over the pond in the good old US of A – there are Data Science courses full of computer programmers that are learning how to handle data, use Hadoop and R, program in Python and plug their data into Artificial Neural Networks.

It seems that we’re creating a generation of Computer Programmers that, with the addition of a few extra tools on their CV, claim to be expert Data Scientists.

I think we’re in dangerous territory here.

It’s easy to learn how to use a few tools, but much much harder to use those tools intelligently to extract valuable, actionable information in a specialised field.

If you have little/no medical knowledge, how do you know which data outcomes are valuable?
If you’re not an expert in business, then how do you know which insights should be acted upon to make sound business decisions, and which should be ignored?

Plug-And-Play Data Analysis

This, to me, is the crux of the problem. Many of the current crop of Data Scientists – talented computer programmers though they may be – see Data Science as an exercise in plug-and-play.

Plug your dataset into tool A and you get some descriptions of your data. Plug it into tool B and you get a visualisation. Want predictions? Great – just use tool C.

Statistics, though, seems to be lagging behind in the Data Science revolution. There aren’t nearly as many automated statistical tools as there are visualisation tools or predictive tools, so the Data Scientists have to actually do the statistics themselves.

And statistics is hard.
So they ask if it’s really, really necessary.
I mean, we’ve already got the answer, so why do we need to waste our time with stats?

Booooring….

So statistics gets relegated to such an extent that Data Scientists declare it dead.”

The original article and discussion: here


About the Author

Lee Baker is an award-winning software creator with a passion for turning data into a story.
A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!
Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress – but 100 times the fun!


This post is taken from Data Science Central and has been published previously in Innovation Enterprise and LinkedIn Pulse.