Statistics is Dead – Long Live Statistics

To be an expert in a thematic field!

Lee Baker wrote an article that will please the whole community of official statistics, where specialists from many thematic fields (and not only statisticians, mathematicians or … data scientists) collect, analyse, interpret, explain and publish data.
It’s this core message that counts:
“… if you want to be an expert Data Scientist in Business, Medicine or Engineering” (or, analogously, an expert statistician in a field of official statistics like demography, economy, etc.) “then the biggest skill you’ll need will be in Business, Medicine or Engineering. … In other words, … you really do need to be an expert in your field as well as having some of the other listed skills.”

Here is his chain of arguments:

“Statistics is Dead – Long Live Data Science…

by Lee Baker

I keep hearing Data Scientists say that ‘Statistics is Dead’, and they even have big debates about it attended by the good and great of Data Science. Interestingly, there seem to be very few actual statisticians at these debates.

So why do Data Scientists think that stats is dead? Where does the notion that there is no longer any need for statistical analysis come from? And are they right?

Is statistics dead or is it just pining for the fjords?

I guess that really we should start at the beginning by asking the question ‘What Is Statistics?’.
Briefly, what makes statistics unique and a distinct branch of mathematics is that statistics is the study of the uncertainty of data.
So let’s look at this logically. If Data Scientists are correct (well, at least some of them) and statistics is dead, then either (1) we don’t need to quantify the uncertainty or (2) we have better tools than statistics to measure it.

Quantifying the Uncertainty in Data

Why would we no longer have any need to measure and control the uncertainty in our data?
Have we discovered some amazing new way of observing, collecting, collating and analysing our data that we no longer have uncertainty?
I don’t believe so and, as far as I can tell, with the explosion of data that we’re experiencing – the amount of data that currently exists doubles every 18 months – the level of uncertainty in data is on the increase.

So we must have better tools than statistics to quantify the uncertainty, then?
Well, no. It may be true that most statistical measures were developed decades ago when ‘Big Data’ just didn’t exist, and that the ‘old’ statistical tests often creak at the hinges when faced with enormous volumes of data, but there simply isn’t a better way of measuring uncertainty than with statistics – at least not yet, anyway.

So why is it that many Data Scientists are insistent that there is no place for statistics in the 21st Century?

Well, I guess if it’s not statistics that’s the problem, there must be something wrong with Data Science.

So let’s have a heated debate…

What is Data Science?

Nobody seems to be able to come up with a firm definition of what Data Science is.
Some believe that Data Science is just a sexed-up term for statistics, whilst others suggest that it is an alternative name for ‘Business Intelligence’. Some claim that Data Science is all about the creation of data products to be able to analyse the incredible amounts of data that we’re faced with.
I don’t disagree with any of these, but suggest that maybe all these definitions are a small part of a much bigger beast.

To get a better understanding of Data Science it might be easier to look at what Data Scientists do rather than what they are.

Data Science is all about extracting knowledge from data (I think just about everyone agrees with this very vague description), and it incorporates many diverse skills, such as mathematics, statistics, artificial intelligence, computer programming, visualisation, image analysis, and much more.

It is in the last bit, the ‘much more’ that I think defines a Data Scientist more than the previous bits. In my view, if you want to be an expert Data Scientist in Business, Medicine or Engineering then the biggest skill you’ll need will be in Business, Medicine or Engineering. Ally that with a combination of some/all of the other skills and you’ll be well on your way to being in great demand by the top dogs in your field.

In other words, if you want to call yourself a Data Scientist you really do need to be an expert in your field as well as having some of the other listed skills.

Are Computer Programmers Data Scientists?

On the other hand – as seems to be happening in Universities here in the UK and over the pond in the good old US of A – there are Data Science courses full of computer programmers that are learning how to handle data, use Hadoop and R, program in Python and plug their data into Artificial Neural Networks.

It seems that we’re creating a generation of Computer Programmers that, with the addition of a few extra tools on their CV, claim to be expert Data Scientists.

I think we’re in dangerous territory here.

It’s easy to learn how to use a few tools, but much much harder to use those tools intelligently to extract valuable, actionable information in a specialised field.

If you have little/no medical knowledge, how do you know which data outcomes are valuable?
If you’re not an expert in business, then how do you know which insights should be acted upon to make sound business decisions, and which should be ignored?

Plug-And-Play Data Analysis

This, to me, is the crux of the problem. Many of the current crop of Data Scientists – talented computer programmers though they may be – see Data Science as an exercise in plug-and-play.

Plug your dataset into tool A and you get some descriptions of your data. Plug it into tool B and you get a visualisation. Want predictions? Great – just use tool C.

Statistics, though, seems to be lagging behind in the Data Science revolution. There aren’t nearly as many automated statistical tools as there are visualisation tools or predictive tools, so the Data Scientists have to actually do the statistics themselves.

And statistics is hard.
So they ask if it’s really, really necessary.
I mean, we’ve already got the answer, so why do we need to waste our time with stats?

Booooring….

So statistics gets relegated to such an extent that Data Scientists declare it dead.”

The original article and discussion –> here
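As an aside on Baker’s core point, here is what ‘quantifying the uncertainty in data’ typically looks like in practice: reporting an estimate together with a measure of its uncertainty rather than the point value alone. The sketch below is purely illustrative (the sample is simulated and all numbers are invented) and uses a simple percentile bootstrap for the mean.

```python
# Illustrative only: a simulated sample and a percentile-bootstrap
# confidence interval for its mean, i.e. an estimate *with* uncertainty.
import random

random.seed(42)

# Toy sample standing in for any measured variable (values are invented).
sample = [random.gauss(100, 15) for _ in range(500)]


def bootstrap_ci(data, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


estimate = sum(sample) / len(sample)
low, high = bootstrap_ci(sample)
print(f"mean = {estimate:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

The width of that interval is exactly the kind of quantity Baker argues still has to be measured, no matter how big the dataset gets.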


About the Author

Lee Baker is an award-winning software creator with a passion for turning data into a story.
A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!
Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress – but 100 times the fun!


This post is taken from datascience.central and was previously published in Innovation Enterprise and LinkedIn Pulse.

Next Step in OGD Websites

What DataUsa is doing could be – I guess – the next step in the evolution of Open Government Data websites. It’s the step from offering file downloads to presenting data (and not files) interactively. And it’s a kind of presentation many official statistical websites would surely be proud of.

César A. Hidalgo from MIT discusses the philosophy behind this; more on that at the end of this post. But first, a short look at the website itself.

[Screenshot: DataUsa home page]

Bringing data together

Merging data from different sources may have been the most expensive and challenging task and the conditio sine qua non for the existence of this website. And perhaps it’s more an organizational than a technical challenge.

Seven public data sources are accessible via DataUsa:

[Screenshot: DataUsa data sources]

Presenting data

Adapting to what internet users typically do, the main entry point is a search bar:

[Screenshot: DataUsa home page search bar]

Thematic and geographical profiles are available, too, but in a hidden menu.

The presentation of the data is a mix of generated text and various types of graphs.

[Screenshots: DataUsa data presentations – generated text and graphs]

 

The options above every graph allow you to share, embed and download the chart, get the underlying table and even an API link for the data.

[Screenshot: DataUsa graph options – share, embed, download, table, API]
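The API option is particularly handy for re-using the data elsewhere. As a hedged sketch (the endpoint and the “drilldowns”/“measures” parameters below reflect the publicly documented DataUsa API and are assumptions, not taken from this post), fetching a small table could look like this:

```python
# Hedged sketch: pulling a table from the DataUsa JSON API instead of
# downloading a file. Endpoint and parameter names are assumptions.
import requests

resp = requests.get(
    "https://datausa.io/api/data",                     # assumed endpoint
    params={"drilldowns": "Nation", "measures": "Population"},
    timeout=30,
)
resp.raise_for_status()

# The response is JSON with a "data" list of records (assumed layout).
for row in resp.json().get("data", []):
    print(row.get("Year"), row.get("Nation"), row.get("Population"))
```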

 

And finally, thematic maps provide other views and insights:
[Screenshot: DataUsa thematic map]

Storytelling

But the most fascinating part is the Stories section:
[Screenshots: DataUsa Stories]

Various authors write stories focussing on special topics and using the presentation techniques of the site.

Background

A glossary explains technical terms and the About Section presents the authors and their aim:
‘In 2014, Deloitte, Datawheel, and Cesar Hidalgo, Professor at the MIT Media Lab and Director of MacroConnections, came together to embark on an ambitious journey — to understand and visualize the critical issues facing the United States in areas like jobs, skills and education across industry and geography. And, to use this knowledge to inform decision making among executives, policymakers and citizens.’

And this leads to the
Philosophy behind 

César A. Hidalgo, one of the website’s authors, explains why they did what they did in a blog post titled ‘What’s Wrong with Open-Data Sites–and How We Can Fix Them.’

Here’s the design philosophy in a visual nutshell:

[Screenshot: DataUsa design philosophy]

‘Our hope is to make the data shopping experience joyful, instead of maddening, and by doing so increase the ease with which data journalists, analysts, teachers, and students, use public data. Moreover, we have made sure to make all visualizations embeddable, so people can use them to create their own stories, whether they run a personal blog or a major newspaper.’

And:

‘After all, the goal of open data should not be just to open files, but to stimulate our understanding of the systems that this data describes. To get there, however, we have to make sure we don’t forget that design is also part of what’s needed to tame the unwieldy bottoms of the deep web.’

 

 

Open Data Portals: News

There are new or refurbished open data portals to announce.

opendata.swiss

Switzerland just relaunched opendata.swiss with a new look for a better presentation of data. See the press release.

[Screenshot: opendata.swiss home page]

[Screenshot: opendata.swiss – about]

europeandataportal.eu

The European Commission launched the European Data Portal some months ago.

[Screenshot: European Data Portal home page]

europeandataportal.eu is much more than a collection of open data. It is an ecosystem with lots of documents explaining and promoting open data.

[Screenshot: European Data Portal – aims]

SPARQL inside!

The portal offers metadata as linked open data with a SPARQL endpoint for powerful searching.

[Screenshot: SPARQL search on the European Data Portal]

The following query returns all data categories (themes) together with the number of datasets in each:

select ?theme (count(?theme) as ?count) where { ?s a dcat:Dataset . ?s dcat:theme ?theme } GROUP BY ?theme LIMIT 100
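For programmatic use, the same query can be sent to the portal’s SPARQL endpoint over HTTP. The sketch below is a minimal example in Python; the endpoint URL is an assumption (the portal documents its own address), and the dcat prefix is declared explicitly so the query is self-contained:

```python
# Minimal sketch: run the themes/counts query against the portal's SPARQL
# endpoint and print the results. The endpoint URL is assumed.
import requests

ENDPOINT = "https://www.europeandataportal.eu/sparql"  # assumed endpoint URL

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?theme (COUNT(?theme) AS ?count)
WHERE { ?s a dcat:Dataset . ?s dcat:theme ?theme }
GROUP BY ?theme
LIMIT 100
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
resp.raise_for_status()

# Standard SPARQL JSON results: results -> bindings -> variable -> value.
for binding in resp.json()["results"]["bindings"]:
    print(binding["theme"]["value"], binding["count"]["value"])
```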

Impact studies

Most of these data are already published on other websites. The advantages of such open data portals are centralized access and clear licence information. A main intention is to attract developers, to foster data usage and, with this, economic growth.

A Swiss study (January 2014) assesses the economic impact of Open Government Data: ‘The report determined that the economic benefits from OGD for Switzerland lie most likely between CHF0.9 B and CHF1.2 B’.

[Screenshot: Swiss OGD impact study] All the details >>> here (look for the extended executive summary).

A European study (November 2015), carried out in the context of the launch of the European Data Portal, produced these results: “The aim of this study is to collect, assess and aggregate economic evidence to forecast the benefits of the re-use of Open Data for the EU28+. Four key indicators are measured: direct market size, number of jobs created, cost savings, and efficiency gains. Between 2016 and 2020, the market size of Open Data is expected to increase by 36.9%, to a value of 75.7 bn EUR in 2020. The forecasted number of direct Open Data jobs in 2016 is 75,000 jobs. From 2016 to 2020, almost 25,000 extra direct Open Data jobs are created. The forecasted public sector cost savings for the EU28+ in 2020 are 1.7 bn EUR. Efficiency gains are measured in a qualitative approach.”

[Screenshot: EU impact study] See the details >>> here

Next: LOD

Open and machine-readable formats help to access data and foster the economic impact. Even better when the data come with metadata in a standardized description. Linked Open Data (LOD) in RDF format provides this; europeandataportal.eu uses this format to describe the harvested datasets (metadata). The next step will, and must, be to publish the data themselves in this format, in order to link masses of data in the linked data cloud.
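To make the idea of LOD metadata concrete, here is a small, hypothetical sketch of a DCAT dataset description built with the rdflib library and serialised as RDF (Turtle). The dataset URI, title and publisher are invented for illustration; real portals publish much richer descriptions.

```python
# Hypothetical example: a minimal DCAT description of one dataset,
# serialised as Turtle. All identifiers and literals are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dataset = URIRef("http://example.org/dataset/population-2015")  # invented URI
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Resident population 2015", lang="en")))
g.add((dataset, DCTERMS.publisher, Literal("Example Statistical Office")))
g.add((dataset, DCAT.theme, URIRef("http://example.org/themes/population")))

print(g.serialize(format="turtle"))
```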

With data.admin.ch, a first step has been made in Switzerland.

[Screenshot: data.admin.ch]

Linked Data? In europeandataportal.eu’s ecosystem, well-made videos provide explanations:

[Screenshot: European Data Portal – Linked Data training videos]

Optimism with Data

What will our future be like? Is there any hope that things will evolve in a good direction? Will we make progress?

Data play a crucial role in answering these questions.

Steven Pinker (Harvard University, Department of Psychology), in his answer to the 2016 EDGE question, considers Quantifying Human Progress to be the most interesting recent (scientific) news:

‘But the most interesting news is that the quantification of life has been extended to the biggest question of all: Have we made progress? Have the collective strivings of the human race against entropy and the nastier edges of evolution succeeded in improving the human condition?’

‘Human intuition is a notoriously poor guide to reality. …. But the cognitive and data revolutions warn us not to base our assessment of anything on subjective impressions or cherry-picked incidents. As long as bad things haven’t vanished altogether, there will always be enough to fill the news, and people will intuit that the world is falling apart. The only way to circumvent this illusion is to plot the incidence of good and bad things over time. Most people agree that life is better than death, health better than disease, prosperity better than poverty, knowledge better than ignorance, peace better than war, safety better than violence, freedom better than coercion. That gives us a set of yardsticks by which we can measure whether progress has actually occurred.

The interesting news is that the answer is mostly “yes.” …. Economic historians and development scholars (including Gregory Clark, Angus Deaton, Charles Kenny, and Steven Radelet) have plotted the growth of prosperity in their data-rich books, and the case has been made even more vividly in websites with innovative graphics such as Hans Rosling’s Gapminder, Max Roser’s Our World in Data, and Marian Tupy’s HumanProgress.’

What may be true for the world as a whole is not necessarily true for individuals.
Let’s have a look at these mostly well-known data sites:

Max Roser: Our World in Data

‘Max Roser is the founder of OurWorldInData. He is an economist working at the University of Oxford. His background is in economics, geoscience and philosophy. His research is focusing on the long-term growth and distribution of living standards.’

‘On my website I am presenting the long-term data on how we are changing our world. The idea is to tell the history of our present world – based on empirical data and visualised in graphs.’

[Screenshot: Our World in Data – about/method]

‘Most of the long-run trends are positive and paint an optimistic view of our world. Topic by topic, the empirical view of our world shows how the Enlightenment continues to make our world a better place. It chronicles how we are becoming less violent and increasingly more tolerant. The data displays how new ideas continue to improve living standards, allowing us to live a healthier, richer and happier life. It is the story of declining poverty and better food provision in a world we care about.

The empirical view on our world shows how misplaced doom and defeatism is and my aim is to encourage those who work to make our world a better place still. At the same time my hope is also to help to change the mind of those of you who do not think that we are creating a better world. By looking at the empirical data I want to explain why I am optimistic about how we are changing our world and why I think it is worthwhile to engage in the global long-term project of Enlightenment. Although most trends are clearly going in the right direction I also show where this is not the case. In a world of hysteria we cannot focus on what is important, but a fact based view on our world should help us to focus on the topics that are most important.’  http://ourworldindata.org/about/

[Screenshot: Our World in Data – health]

HumanProgress.org

Human Progress’ mission statement (http://humanprogress.org/about):

‘Evidence from academic institutions and international organizations shows dramatic improvements in human well-being. These improvements are especially striking in the developing world.
Unfortunately, there is often a wide gap between the reality and public perception, including that of many policymakers, scholars in unrelated fields, and intelligent lay persons. To make matters worse, the media emphasizes bad news, while ignoring many positive long-term trends.

We hope to help in correcting misperceptions regarding the state of humanity through the presentation of empirical data that focuses on long-term developments. All of our wide-ranging data comes from third parties, including the World Bank, the OECD, the Eurostat, and the United Nations. By putting together this comprehensive data in an accessible way, our goal is to provide a useful resource for scholars, journalists, students, and the general public.

While we think that policies and institutions compatible with freedom and openness are important factors in promoting human progress, we let the evidence speak for itself. We hope that this website leads to a greater appreciation of the improving state of the world and stimulates an intelligent debate on the drivers of human progress.

Note: HumanProgress.org is a project of the Cato Institute with major support from the John Templeton Foundation, the Searle Freedom Trust, the Brinson Foundation and the Dian Graves Owen Foundation.’

Some data:

[Screenshot: HumanProgress.org infographics]

Gapminder

And here is the top star Hans Rosling with his gapminder.org, where he deconstructs misleading, ‘60-years-behind-reality’ opinions with data.

An example: Hans Rosling asks: Has the UN gone mad?

‘The United Nations just announced their boldest goal ever: To eradicate extreme poverty for all people everywhere, already by 2030.
Looking at the realities of extremely poor people the goal seems impossible. The rains didn’t fall in Malawi this year. The poor farmers Dunstar & Jenet, gather a tiny maize harvest in a small pile on the ground outside their mud hut. But Dunstar & Jenet know exactly what they need to break the vicious circle of poverty. And Hans Rosling shows how billions of people have already managed. This year’s “hunger season” may very well be Dunster’s & Jenet’s last.
Up-to-date statistics show that recent global progress is ‘the greatest story of our time – possibly the greatest story in all of human history. The goal seems unrealistic to many highly educated people because their worldview is lagging 60 years behind reality.’

[Screenshot: Gapminder – extreme poverty]

[Screenshot: Gapminder – Don’t Panic]

A focussed view: OXFAM’s new study

‘An Economy for the 1%

 Runaway inequality has created a world where 62 people own as much wealth as the poorest half of the world’s population – a figure that has fallen from 388 just five years ago, according to an Oxfam report published on January 18th.
How privilege and power in the economy drive extreme inequality and how this can be stopped. The global inequality crisis is reaching new extremes. The richest 1% now have more wealth than the rest of the world combined.
Power and privilege is being used to skew the economic system to increase the gap between the richest and the rest. A global network of tax havens further enables the richest individuals to hide $7.6 trillion.’ -> Methodology
[Screenshot: Oxfam report ‘An Economy for the 1%’]
OXFAM’s conclusion:
‘The fight against poverty will not be won until the inequality crisis is tackled.’

Income distribution: data on Max Roser’s Our World in Data

[Screenshot: Our World in Data – income inequality]
‘A lesson that we can take away from this empirical research is that political forces at work on the national level are possibly important for how incomes are distributed. If there was a universal trend towards more inequality it would be in line with the notion that inequality is determined by global market forces and technological progress where it is very hard (or for other reasons undesirable) to change the forces that lead to higher inequality. Inequality would then be inevitable. The reality of different inequality trends within countries suggests that the institutional and political framework in different countries play a role in shaping inequality of incomes.’

Blog about Stats 2015 in review

The WordPress.com stats helper monkeys prepared a 2015 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 14,000 times in 2015. If it were a concert at Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.

Click here to see the complete report.