Semantic Web and Official Statistics – an open discussion. Some incomplete remarks about how to bring statistics into the Semantic Web (I am still learning)
Introductory note: Made for machines
Semantic Web or the Web of Data is made first and foremost for machines. ‘In the future the entire web will be one giant tightly interconnected information asset. Beyond just publishing information for humans, every site will expose its content in a way that’s readable by machines. Those machines will mix, match, filter and aggregate information to greatly improve things for us humans. We’re not there yet, but that’s the vision of the Semantic Web.’ (http://semanticproxy.opencalais.com/about.html)
‘On the Semantic Web, computers do the browsing for us. The “SemWeb” enables computers to seek out knowledge distributed throughout the Web, mesh it, and then take action based on it. … RDF is the W3C standard for encoding knowledge….. RDF applications can put together RDF files posted by different people around the Internet and easily learn from them new things that no single document asserted. It does this in two ways, first by linking documents together by the common vocabularies they use, and second by allowing any document to use any vocabulary. This allows enormous flexibility in expressing facts about a wide range of things, drawing on information from a wide range of sources.’ (Joshua Tauberer, http://www.rdfabout.com/intro/)
‘The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities’ (Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American)
So it’s not surprising that the tools for accessing RDF-structured files on the Web are very different from traditional Web browsers (see e.g. Tabulator or SPARQL endpoints).
How to bring statistics into the Web of Data
- Statistical data must be converted into a Semantic Web format and made accessible on the Web (via a URI, a Uniform Resource Identifier, that identifies where the data can be found).
- This format may be pure RDF into which data from databases are migrated.
- This format could also be a web page enriched with RDFa. In such pages additional XHTML attributes describe the content in a structured RDF manner.
- In both cases (pure RDF or RDFa), special browsers (like Tabulator) and search tools (SPARQL endpoints) return relevant and precise results without thousands of (garbage) pages. This is because these tools recognize the defined structures and ‘understand’ the content.
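To make the first two steps above a little more concrete, here is a minimal sketch of how one statistical value could be written out as RDF in the N-Triples syntax. It only uses the Python standard library. The `http://example.org/stat/` namespace, the property names (`indicator`, `area`, `year`) and the figure itself are invented for illustration; they do not come from any real statistical vocabulary or dataset.

```python
# Hypothetical namespace, invented for this illustration only.
BASE = "http://example.org/stat/"
# These two URIs are the real W3C identifiers for rdf:value and xsd:decimal.
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def uri(term):
    """Wrap an absolute URI in the angle brackets N-Triples requires."""
    return "<" + term + ">"


def observation_to_ntriples(obs_id, indicator, area, year, value):
    """Return the N-Triples lines describing one statistical value."""
    subject = uri(BASE + "observation/" + obs_id)
    triples = [
        (subject, uri(BASE + "indicator"), uri(BASE + "indicator/" + indicator)),
        (subject, uri(BASE + "area"), uri(BASE + "area/" + area)),
        (subject, uri(BASE + "year"), '"%s"' % year),
        (subject, uri(RDF_VALUE), '"%s"^^%s' % (value, uri(XSD_DECIMAL))),
    ]
    # Each statement is "subject predicate object ." on its own line.
    return [" ".join(t) + " ." for t in triples]


# Made-up example figure, not an official statistic.
lines = observation_to_ntriples("o1", "unemployment-rate", "CH", 2008, "3.4")
print("\n".join(lines))
```

Every statement uses URIs for the things it talks about, which is what lets tools from different publishers link the same indicator or area together.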
Official statistics are already very well structured: the official statistics community has been collaborating for years on defining and harmonizing metadata. Common nomenclatures govern broad fields of statistical observation; these could be seen as vocabularies or ontologies in the field of statistics.
A broadly accepted standard for the exchange of statistical data and metadata is emerging, called SDMX – a common language which makes it possible to “mashup” data from diverse institutions.
For the time being, however, I cannot detect a lively discussion in official statistics about bringing data into RDF and offering this format to the public, or, for instance, about the relation of SDMX to RDF.
In academia, by contrast, bringing public statistics to the Semantic Web is a topic of discussion.
Lee Feigenbaum, a member of the W3C Data Access Working Group (now: SPARQL Working Group), wrote a substantial post about this: Modeling Statistics in RDF – A Survey and Discussion.
He discusses various strategies for modeling statistical data in RDF and for making these data searchable via RDF browsers or SPARQL queries.
Among these strategies are:
U.S. Census Bureau’s annual Statistical Abstract
Transferring the nearly 1500 spreadsheets that make up the U.S. Census Bureau’s annual Statistical Abstract of the United States into RDF.
The 2000 U.S. Census.
Joshua Tauberer converted the 2000 U.S. Census data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. In one of its modes, the script outputs the schema from SAS table layout files. (http://www.rdfabout.com/demo/census/)
D2R Server (Database to RDF) at FU Berlin.
D2R Server is a tool for publishing relational databases on the Semantic Web. It enables RDF and HTML browsers to navigate the content of the database, and allows applications to query the database using the SPARQL query language. It uses data from EUROSTAT.
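As a rough illustration of what "query using SPARQL" means in practice, the sketch below builds a SPARQL SELECT query and the GET request URL for it. The SPARQL Protocol passes the query text in a `query` parameter. The endpoint address and the `http://example.org/stat/` property names are assumptions made up for this example; they are not the address or vocabulary of the D2R Server at FU Berlin. Nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only to show the shape of the request.
ENDPOINT = "http://example.org/sparql"

# Ask for items located in one area, together with their values.
# The ex-style property and area URIs are invented placeholders.
QUERY = """
SELECT ?item ?value
WHERE {
  ?item <http://example.org/stat/area> <http://example.org/stat/area/CH> .
  ?item <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> ?value .
}
LIMIT 10
""".strip()


def build_request_url(endpoint, query):
    """Return the GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
# Decode the URL again to check that the query text survives the encoding.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

An application would fetch this URL with any HTTP client and receive the matching rows as a result set, instead of a list of pages to read.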
Riese is another effort to convert the EUROSTAT data to RDF (riese=RDFizing and Interlinking the EuroStat Data Set Effort).
The Riese schema (note: this is a simple RDF schema; no OWL in sight) models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).
(http://riese.joanneum.at/ and http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/EuroStat)
The data are also shown in a collection of tables (using RDFa); see e.g. this table: Tax rate on low wage earners: Tax wedge on labour cost.
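The item-based modeling described for the Riese schema can be pictured with a small in-memory sketch: each item links to a dataset, a time, a geo entity and a value, and a lookup is just a match on those links. Only `rdf:value` is taken from the description above; the `ex:` property names, the item identifiers and the numbers are placeholders invented here, not the actual Riese vocabulary or Eurostat figures.

```python
RDF_VALUE = "rdf:value"

# Placeholder triples; names and numbers are invented for illustration.
triples = [
    ("ex:item1", "ex:dataset", "ex:tax-wedge"),
    ("ex:item1", "ex:time", "2007"),
    ("ex:item1", "ex:geo", "ex:AT"),
    ("ex:item1", RDF_VALUE, "1.0"),
    ("ex:item2", "ex:dataset", "ex:tax-wedge"),
    ("ex:item2", "ex:time", "2007"),
    ("ex:item2", "ex:geo", "ex:BE"),
    ("ex:item2", RDF_VALUE, "2.0"),
]


def match(store, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def values_for(store, geo, time):
    """Collect the rdf:value literals of items linked to this geo and time."""
    in_geo = {s for s, _, _ in match(store, p="ex:geo", o=geo)}
    in_time = {s for s, _, _ in match(store, p="ex:time", o=time)}
    return sorted(
        o
        for item in in_geo & in_time
        for _, _, o in match(store, s=item, p=RDF_VALUE)
    )


print(values_for(triples, "ex:AT", "2007"))
```

A SPARQL endpoint does the same kind of pattern matching, only over far more triples and with a standard query language.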