Why pursue linked data • CLIR

… with some cautionary notes

1. Elevator pitches … via Semanticweb.com

Defining the Semantic Web in a Few Sentences, Angela Guess, March 2011

A Quora user posed this challenge to the network: “How do you explain semantic web to a nine-year old child in one sentence?” The challenge was followed by a quote from Albert Einstein:

If you can’t explain it to a six year old, you don’t understand it yourself.

Some of the best responses so far include:

A web where computers better understand the real meaning of the words we use to communicate with them.
Hi Timmy, the web is like one giant big book written by a lot of people. And the semantic web is another book describing how the first book should be read.
Semantic web is like the magic mirror in Shrek: you ask it, ‘Can I go to a pool?’ and it tells you, ‘Yes, you can, because the weather is good and the pool you like to go to is open.’

We at Semantic Web recently offered up a challenge of our own: give us your best elevator pitch answering the question, “What is the Semantic Web?” We’ve received some great pitches so far (listed below) and we’re still accepting new pitches:

2. The knowledge web … Danny Hillis

Hillis wrote his Aristotle: (The knowledge web) essay in 2000. It was published on the web by the Edge Foundation, Inc. in May 2004 with an introduction by John Brockman as part of Edge’s The Third Culture. The essay is accompanied by thoughts of more than a dozen commentators under the rubric of Edge’s The Reality Club.

On the state of the web in 2000:

The information in the Web is disorganized, inconsistent, and often incorrect. Yet for all its faults, the Web is good enough to give us a hint of what is possible. Its most important attribute is that it is accessible, not only to those who would like to refer to it but also to those who would like to extend it. As any member of the computer generation will explain to you, it is changing the way we learn.

A better infrastructure for publishing:

The shared knowledge web will be a collaborative creation in much the same sense as the World Wide Web, but it can include the mechanisms for credit assignment, usage tracking, and annotation that the Web lacks. For instance, the knowledge web will be able to give teachers and authors recognition or even compensation for the use of their materials.

What makes the knowledge web different?

One way to think about the knowledge web is to compare it with other publishing systems that support teaching and learning. These include the World Wide Web, Internet news groups, traditional textbooks, and refereed journals. The knowledge web takes important ideas from these systems. These ideas include peer-to-peer publishing, vetting and peer review, linking and annotation, mechanisms for paying authors, and guided learning.

Summary: an idea whose time has come

With the knowledge web, humanity’s accumulated store of information will become more accessible, more manageable, and more useful. Anyone who wants to learn will be able to find the best and the most meaningful explanations of what they want to know. Anyone with something to teach will have a way to reach those who want to learn. Teachers will move beyond their present role as dispensers of information and become guides, mentors, facilitators, and authors. The knowledge web will make us all smarter. The knowledge web is an idea whose time has come.

Hillis’ addendum to his 2000 essay on the knowledge web appeared in May 2004, as part of Edge’s The Third Culture:

In retrospect the key idea in the “Aristotle” essay was this: if humans could contribute their knowledge to a database that could be read by computers, then the computers could present that knowledge to humans in the time, place and format that would be most useful to them. The missing link to make the idea work was a universal database containing all human knowledge, represented in a form that could be accessed, filtered and interpreted by computers.

One might reasonably ask: Why isn’t that database the Wikipedia or even the World Wide Web? The answer is that these depositories of knowledge are designed to be read directly by humans, not interpreted by computers. They confound the presentation of information with the information itself. The crucial difference of the knowledge web is that the information is represented in the database, while the presentation is generated dynamically.

Like Neal Stephenson’s storybook, the information is filtered, selected and presented according to the specific needs of the viewer.

John, Robert and I started a project, then a company, to build that computer-readable database. How successful we will be is yet to be determined, but we are really trying to build it: a universal database for representing any knowledge that anyone is willing to share. We call the company Metaweb, and the free database, Freebase.com. Of course it has none of the artificial intelligence described in the essay, but it is a database in which each topic is connected to other topics by links that describe their relationship. It is built so that computers can navigate and present it to humans. Still very primitive, a far cry from Neal Stephenson’s magical storybook, it is a step, I hope, in the right direction.
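Hillis’ distinction between information held in a database and presentation generated on demand can be made concrete with a small sketch. The Python below is an illustration only and does not reflect Metaweb’s or Freebase’s actual data model or API: the Link record, the sample data, and the two present_* functions are hypothetical names invented for this example. The point it tries to show is that the same stored links can yield more than one view.

```python
from collections import namedtuple

# A link names the relationship between two topics (hypothetical structure).
Link = namedtuple("Link", ["subject", "relation", "target"])

# Illustrative data only.
LINKS = [
    Link("Danny Hillis", "co-founded", "Metaweb"),
    Link("Metaweb", "developed", "Freebase"),
    Link("Neal Stephenson", "wrote", "The Diamond Age"),
]


def links_about(topic, links=LINKS):
    """Return every link in which the topic appears as subject or target."""
    return [link for link in links if topic in (link.subject, link.target)]


def present_as_sentences(topic, links=LINKS):
    """One presentation: plain sentences for a human reader."""
    return [
        f"{link.subject} {link.relation} {link.target}."
        for link in links_about(topic, links)
    ]


def present_as_neighbors(topic, links=LINKS):
    """Another presentation of the same data: directly connected topics."""
    neighbors = set()
    for link in links_about(topic, links):
        neighbors.add(link.target if link.subject == topic else link.subject)
    return sorted(neighbors)
```

Calling present_as_sentences("Metaweb") and present_as_neighbors("Metaweb") draws on the same stored links but produces a list of sentences in one case and a list of related topics in the other, which is the separation of information from presentation that the addendum describes.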

3. Fiscal and social dimensions of linked data outweigh all the fiddly bits

Ed Summers’ recent riff on Ranganathan’s “Five laws” addressed criteria for evaluating digital repositories. Tampering a bit with his basic premise:

… the fiscal and social dimension of repositories are a whole lot more important in the long run than the technical bits of how a repository is assembled in the now. I’m a software developer, and by nature I reach for technical solutions to problems, but in my heart of hearts I know it’s true.

One easily can bend his “repository objects” enough to allow for useful reflection on those aspects of “linked data” that would ensure ongoing, consumer-facing, robust discovery and navigation services over the digital content available from the collections and services of libraries, museums, and archives:

Repository Objects Are For Use

We can build repositories that function as dark archives. But it kind of rots your soul to do it. It rots your soul because no matter what awesome technologies you are using to enable digital preservation in the repository, the repository needs to be used by people. If it isn’t used, the stuff rots. And the digital stuff rots a whole lot faster than the physical materials. Your repository should have a raison d’être. It should be anchored in a community of people that want to use the materials that it houses. If it doesn’t the repository is likely to not be good.

Every Reader His/Her Repository Object

Depending on their raison d’être (see above), repositories are used by a wide variety of people: researchers, administrators, systems developers, curators, etc. It does a disservice to these people if the repository doesn’t support their use cases. A researcher probably doesn’t care when fixity checks were last performed, and an administrator generating a report on fixity checks doesn’t care about how a repository object was linked to and tagged in Twitter. Does your repository allow these different views, for different users to co-exist for the same object? Does it allow new classes of users to evolve?

Every Repository Object Its Reader

Are the objects in your repository discoverable? Are there multiple access pathways to them? For example, can someone do a search in Google and wind up looking at an item in your repository? Can someone link to it from a Wikipedia article? Can someone do a search within your repository to find an object of interest? Can they browse a controlled vocabulary or subject guide to find it? Are repository objects easily identified and found by automated agents like web crawlers and software components that need to audit them? Is it easy to extend, enhance, and refine your description of what the repository object is as new users are discovered?

Save the Time of the Reader

Is your repository collection meaningfully on the Web? If it isn’t, it should be, because that’s where a lot of people are doing research today … in their web browser. If it can’t be open access on the web, that’s OK … but the collection and its contents should be discoverable so that someone can arrange an onsite visit. For example, can a genealogist do a search for a person’s name in a search engine and end up in your repository? Or do they have to know to come to your application to type in a search there? Once they are in your repository can they easily limit their search along familiar dimensions such as who, what, why, when, and where? Is it easy for someone to bookmark a search, or an item for later use? Do you allow your repository objects to be reused in other contexts like Facebook, Twitter, Flickr, etc., which put the content where people are, instead of expecting them to come to you?

The Repository is a Growing Organization

This is my favorite. Can you keep adding numbers and types of objects, and scale your architecture linearly? Or are you constrained in how large the repository can grow? Is this constraint technical, social, and/or financial? Can your repository change as new types or numbers of users (both human and machine) come into existence? When the limits of a particular software stack are reached, is it possible to throw it away and build another without losing the repository objects you have? How well does your repository fit into the web ecosystem? As the web changes do you anticipate your repository will change along with it? How can you retire functionality and objects, to let them naturally die, with respect, and make way for the new?

4. Kate Ray’s Web 3.0 video … a story about the semantic web

If you have not seen this 15-minute video commentary, do take the time to watch it. In the words of one comment left on the video’s documentation site:

This is a superb. As the Pew survey reminds us, we need to do a better job of introducing the Semantic Web to people. This film is a great way to do that. It gives a sense of why we are all excited by this.

problem
vision
critics
schism
future

5. Set the data free, and value will follow … John Battelle

from his Searchblog:

Perhaps the largest problem blocking our industry today is the retardation of consumer-driven data sharing. We’re all familiar with the three-year standoff between Google and Facebook over crawling and social graph data. Given the rise of valuable mobile data streams (and subsequent and rather blinkered hand wringing about samesaid) this issue is getting far worse.

Every major (and even every minor) player realizes that “data is the next Intel inside,” and has, for the most part, taken a hoarder’s approach to the stuff. Apple, for example, ain’t letting data out of the iUniverse to third parties except in very limited circumstances. Same for Facebook and even Google, which has made hay claiming its open philosophy over the years.

[snip]

A generation from now our industry’s approach to data collection and control will seem outdated and laughable. The most valuable digital services and companies will be rewarded for what they do with openly shareable data, not by how much data they hoard and control.

Now, I live in the real world, and I understand why companies are doing what they are doing at the moment. Facebook doesn’t want third party services creating advertising networks that leverage Facebook’s social graph; that’s clearly on Facebook’s roadmap to create in the coming year or so (Twitter has taken essentially the same approach). But if you are a publisher (and caveat, I am), I want the right to interpret a data token handed to me by my reader in any way I chose. If my interpretation is poor, that reader will leave. If it adds value, the reader stays, perhaps for a bit longer, and value is created for all. If that token comes from Facebook, Facebook also gets value.

Imagine, for example, if back in the early search days, Google decided to hoard search refer data: the information that tells a site what the search term was which led a visitor to click on a particular URL. Think of how that would have retarded the web’s growth over the past decade.

Scores of new services are emerging that hope to enable a consumer-driven ecosystem of data. Let’s not lock down data early. Let’s trust that what we’re best at doing is adding value, not hoarding it.

6. Use cases … a sampling

a. Information management: a proposal (1989, CERN)

The proposal discusses the problems of loss of information about complex evolving systems and derives a solution based on a distributed hypertext system.

A useful bit of explication in terms of linked data is available here.

b. File under: Metaservices, the rise of [cited with commentary here]

John Battelle writes:

… every app, every site, and every service needs to be more than just an application or a content directory. It needs to be a platform, capable of negotiating ongoing relationships with other platforms on behalf of its customers in real time.

c. Use of semantic web technologies on the BBC web sites

BBC’s work posted among the W3C use cases for Semantic Web:

Creating web identifiers for every item the BBC has an interest in, and considering those as aggregations of BBC content about that item, allows us to enable very rich cross-domain user journeys. This means BBC content can be discovered by users in many different ways, and content teams within the organization have a focal point around which to organize their content.
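The idea of one web identifier per item, treated as an aggregation of content about that item, can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the BBC’s implementation: the example.org identifiers, the record fields, and the function name aggregate_by_identifier are all hypothetical.

```python
from collections import defaultdict

# Hypothetical content records from separate editorial domains; each one is
# tagged with the web identifier of the thing it is about.
CONTENT = [
    {"domain": "music", "title": "Artist profile",
     "about": "http://example.org/things/artist-1"},
    {"domain": "programmes", "title": "Radio session",
     "about": "http://example.org/things/artist-1"},
    {"domain": "news", "title": "Tour announced",
     "about": "http://example.org/things/artist-1"},
    {"domain": "news", "title": "Match report",
     "about": "http://example.org/things/team-9"},
]


def aggregate_by_identifier(records):
    """Group content from every domain under the identifier it describes."""
    pages = defaultdict(list)
    for record in records:
        pages[record["about"]].append((record["domain"], record["title"]))
    return dict(pages)
```

In this sketch the entry for artist-1 gathers items from the music, programmes, and news domains, which is one way to picture the cross-domain journeys the BBC describes: a user arriving from any one domain can reach the others through the shared identifier.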

d. W3C semantic web use cases

Case studies and use cases (44 of them) posted here:

– content discovery (12)

– content management (4)

– customization (3)

– data integration (30)

– domain modeling (21)

– improved search (21)

– modeling (1)

– natural lang. interface (1)

– portal (16)

– provenance tracking (2)

– repair and diagnostic help

– schema mapping (1)

– semantic annotation (11)

– service integration (2)

– simulation and testing (1)

– social networks (4)

– text mining (1)

e. W3C Library Linked Data Incubator Group

The wiki page for deliverables cites these types of use cases:

– bibliographic data

– authority data

– vocabulary alignment

– archives and heterogeneous data

– citations

– digital objects

– social and new uses

– related topics

– notes

f. via Chris Bourg (Stanford University Libraries AUL for Public Services)

From her library services perspective:

on serendipity here, and here, and here:

… Linked Data’s biggest payoff for scholarship will be that it will rationalize and democratize a certain kind of scholarly serendipity

… her vision of what the library of the future could/should be all about

… can our online environments replicate or even improve on that experience of looking at a physical book and deciding quickly if it will be useful to my research?

g. Behavior of the research of the future (BL/JISC project)

A description of the project, the report, and a presentation based on the project’s findings

h. Reading styles … from a Jodi Schneider post and presentation

i. Wing nuts … Christine Connors (Semanticweb.com) posts:

It is the simplest of use cases, easily transformed into any of hundreds of frustrations experienced by knowledge workers today. I won’t take up your time with the back story, but for reasons perfectly logical I find myself researching the history of the wing nut.

j. Use cases at Semantic Technology Conference, 2011

Angela Guess (Semanticweb.com) posts:

The upcoming Semantic Technology Conference this June in San Francisco will feature a number of case studies that highlight real-world semantic technology applications. Here are just a few:

Dynamic Semantic Publishing at the BBC

Details on how the BBC sport site currently uses embedded Linked Data identifiers, ontologies and associated inference plus RDF semantics to improve navigation, content re-use, re-purposing, and search engine rankings.

Enabling Business Users to Define Rules and Configure the Semantic Enterprise at Amdocs

How a team of developers using semantic technology and an expressive business language made a significant breakthrough to help business users create, extend and alter high level business concepts and create natural language rules. We recently had a webcast with Craig Hanson from Amdocs, the speaker on this session.

Improving Web Content Management at the Leading Media Group in Brazil

Organizações Globo is a conglomerate that includes TV and radio stations, newspapers and magazines. The company is using semantic technologies to improve the way it organizes and presents multimedia content ranging from hard news to sports coverage to celebrity gossip, using ontologies to annotate content and developing a common software framework that allows each web site to work with semantic annotations to classify and present content in attractive ways.

Linked Data for Digital Archives and Libraries – Recollection

Recollection is a free platform in use at the Library of Congress for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience their digital collections. Earlier this year, the creators offered us this article about Recollection.

Discovering and Using RDF for Books at O’Reilly Media

How the tech publisher ended up with a Linked Data, Semantic, RESTful, URI-based solution mostly by accident and through ruthless pragmatism, and how they began using RDF, SPARQL, and other semantic technology to sell $4 million in Ebooks.

k. ‘Follow your nose’ across the globe … Tom Heath (Talis) posts:

Imagine you’ve just arrived in an unfamiliar place, perhaps on a business trip (or recently beamed down from the Starship Enterprise). One of the first things you’ll probably want to do is find out what things are nearby. Google Maps provides a great “search nearby” function (try entering just a * to get everything), but this is geared more towards businesses, and the data isn’t exactly open, making it hard to reuse in other applications. We wanted to try something similar, using the growing range of liberally licensed Linked Data sets with a geographic component. Here’s what we did…

l. Best Buy … Jennifer Zaino (Semanticweb.com) posts:

Just a few months ago Jay Myers, lead web development engineer at Best Buy, talked to The Semantic Web Blog about using RDFa to mark up the retailer’s product detail pages and more semantic things he’d like to do, including mashing up its online catalog data with some other data sources.

Well, in just the last week he’s been stoking the semantic data foundation, pushing Best Buy’s product visibility and discovery further along with the help of RDFa and pulling in some semantic data too, all geared to building up what he calls the company’s Insight Engine. And there’s more coming soon, as Myers has a personal agenda of stretching RDFa just about as far as he can in Best Buy product pages.

m. MESA (Medieval Electronic Scholarly Alliance) [Mellon funded workshop, 2011]

Peter Robinson of the University of Saskatchewan gave an invigorating presentation on the electronic Dante’s Commedia project he helps to curate and edit.

Peter’s talk focused on the ‘under the hood’ aspects of the project. While the surface looks very similar to the previous version, the content has been completely redone in RDF triples. Peter talked about the significant advantages this application of linked data delivers: RDF triples were employed to the line level (they can be used down to the granularity of a letter and a punctuation mark if desired). In this new scheme, variations across platforms and operating systems become irrelevant; interoperability is inherent. Just as importantly, the reader/user can create an unlimited number of new ontologies based on preferred topic and theme. This capability allows for a plethora of different interfaces to be created. The main point of the talk was that by applying the RDF triple scheme to this resource, new scholarship and customized reconstitution of the content are powerfully supported. While it looks the same ‘up front,’ the RDF triples have essentially created a new resource much more attuned to variant scholarly interpretation. Peter opined that as RDF triples become more prevalent, a new for-hire service will be the creation of ontologies in service to teaching and research across a growing array of linked content.
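To give a feel for what line-level triples and reader-defined themes might look like, here is a minimal sketch in plain Python. It is not the Commedia project’s data or vocabulary: the identifiers, the predicate names (partOf, hasReading, attestedIn, hasTheme), and the helper functions are all hypothetical, and a real project would more likely use an RDF library and a published ontology.

```python
# Each statement is a (subject, predicate, object) triple. All identifiers
# and predicate names here are invented for illustration.
TRIPLES = [
    ("text:canto1/line1", "partOf", "text:canto1"),
    ("text:canto1/line1", "hasReading", "reading:A"),
    ("text:canto1/line1", "hasReading", "reading:B"),
    ("reading:A", "attestedIn", "witness:ms1"),
    ("reading:B", "attestedIn", "witness:ms2"),
    # A statement a reader might add under a theme of their own choosing.
    ("text:canto1/line1", "hasTheme", "theme:journey"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def witnesses_for_line(triples, line):
    """Follow line -> reading -> witness links for a single line."""
    readings = [o for _, _, o in match(triples, s=line, p="hasReading")]
    return sorted(
        o for r in readings for _, _, o in match(triples, s=r, p="attestedIn")
    )


def lines_with_theme(triples, theme):
    """A reader-defined view: every line annotated with the given theme."""
    return sorted(s for s, _, _ in match(triples, p="hasTheme", o=theme))
```

Because annotations are just additional triples about the same line identifiers, a new thematic view can be layered on without touching the underlying text statements, which is roughly the flexibility the talk attributed to the approach.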

n. linkypedia … see how wikipedians are citing and annotating resources from libraries, museums, and archives that have a presence (a URI) on the web. [about] [linkypedia itself]

7. Caution … is the glass half empty?

a. Andy Powell in a March 2011 post Waiter, my resource discovery glass is half-empty provides a thoughtful take on the pace of change in cultural heritage environs:

I had a bit of a glass half-empty moment last week, listening to the two presentations in the afternoon session of the ESRC Resource Discovery workshop …. Not that there was anything wrong with either presentation. But it struck me that they both used phrases that felt very familiar in the context of resource discovery in the cultural heritage and education space over the last 10 years or so (probably longer): “content locked in sectoral silos”, “the need to work across multiple websites, each with their own search idiosyncrasies”, “the need to learn and understand multiple vocabularies,” and so on.

In a moment of panic I said words to the effect of, “We’re all doomed. Nothing has changed in the last 10 years. We’re going round in circles here.” Clearly rubbish… and, looking at the two presentations now, it’s not clear why I reached that particular conclusion anyway. I asked the room why this time round would be different, compared with previous work on initiatives like the UK JISC Information Environment, and got various responses about community readiness, political will, better stakeholder engagement and whatnot. I mean, for sure, lots of things have changed in the last 10 years (I’m not even sure the alphabet contained the three letters A, P and I back then and the whole environment is clearly very different) but it is also true that some aspects of the fundamental problem remain largely unchanged. Yes, there are a lot more cultural heritage, scientific and educational resources out there (being made available from within those sectors) but it’s not always clear the extent to which that stuff is better joined up, open and discoverable than it was at the turn of the century.

There is a glass half-full view of the resource discovery world, and I try to hold onto it most of the time, but it certainly helps to drink from the Google water fountain! Hence the need for initiatives like the UK Resource Discovery Task Force to emphasize the ‘build better websites’ approach. We’re talking about cultural change here, and cultural change takes time. Or rather, the perceived rate of cultural change tends to be relative to the beholder.

b. Eric Hellman reminds us that well-curated data and links come with associated costs in his The library IS the machine:

Bad data is everywhere. If a publisher asks authors for citations, 10% of the submitted citations will be wrong. If a librarian is given a book to catalog, 10% of the records produced will start out with some sort of transcription error. If a publisher or library is asked to submit metadata to a repository, 10% of the submitted data will have errors. It’s only by imposing the discipline of checking, validating and correcting data at every stage that the system manages to perform acceptably.

Linking real world objects together doesn’t happen by magic. It’s a lot of work, and no amount of RDF, SPARQL, or URI fairy dust can change that. The magic of people and institutions working together, especially when facilitated by appropriate semantic technologies, can make things easier.
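Hellman’s 10% figures suggest why checking at every stage matters. As a rough worked example, suppose (this is a simplifying assumption made here, not a claim from his post) that a record passes through several stages, each of which independently introduces an error into 10% of records, and that nothing is corrected along the way. The hypothetical function clean_fraction below computes the share of records that would remain error-free.

```python
def clean_fraction(error_rate, stages):
    """Share of records still error-free after `stages` unchecked steps,
    assuming each step independently corrupts `error_rate` of the records."""
    return (1 - error_rate) ** stages
```

Under those assumptions, three unchecked stages at 10% each leave about 0.9 × 0.9 × 0.9 ≈ 0.729 of records clean, so more than a quarter would carry at least one error. Real pipelines will not behave this neatly, but the compounding illustrates why validation at each step is part of the cost of well-curated links.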

c. Georgi Kobilarov (Uberlic Labs) challenges the web data community to sort between problems and solutions:

As an entrepreneur it is my job to create solutions for problems. You could also call that creating products that meet a need. But as a very technical person (despite my business degree), I like to think of problems and solutions. In my previous life as researcher it was my job to come up with and prototype ideas that may be turned into solutions if matched with the right problem. Linked Data is something where the problem & solution matching has gone wrong, and in this post I share what I believe is the reason.

Linking Open Data is one of those projects I got involved in as a researcher. The idea was to interlink data on the web (as opposed to making the services on top of the data talk to each other, as in traditional EAI) and “turn the Web into a database.” Okay. That was back in 2007.

So what happened was that organizations, early on from the media industry in Europe and among the very first my friends at the BBC, began to use Linked Data. So apparently, Linked Data (in particular the early prototypes around DBpedia and a few other data sets) provided a solution to a problem, or at least seemed promising enough as such. Great.

But neither Linked Open Data nor the Semantic Web have really taken off from there yet. I know many people will disagree with me and point to the famous Linked Open Data cloud diagram, which shows a large (and growing) number of data sets as part of the Linked Data Web. But where are the showcases of problems being solved?

If you can’t show me problems being solved then something is wrong with the solution. “we need more time” is rarely the real issue, esp. when there is some inherent network effect in the system. Then there should be some magic tipping point, and you’re just not hitting it and need to adjust your product and try again with a modified approach.

My point here is not that I want to propose any particular direction or change, but instead I want to stress what I believe is an issue in the community: too few people are actually trying to understand the problem that Linked Data is supposed to be the solution to. If you don’t understand the problem you can not develop a solution or improve a half-working one. Why? Well, what do you do next? Which part to work on? What to change? There is no ground for those decisions if you don’t have at least a well informed guess (or better some evidence) about the problem to solve. And you can’t evaluate your results.

And the second, even more important point: don’t confuse the solution with the problem. Turning the Web into a database is not the problem. It may be a possible solution to the problem of people needing to use data in applications they build or use. Without a deep understanding of the user’s need, you don’t know what to build for.

Publishing data on the Web as Linked Data is also not the problem. Because nobody wants to publish data. People may want other people to consume their data, and in order to reach that goal they need to publish data on the Web.

So, Linked Data community, let’s have that discussion and do the analysis about the problems that Linked Data could solve, shall we? And decide about development directions on that basis. People who want to disrupt any industry with an innovation need to do their best to understand their user’s problem, test their assumptions, and adjust along the way. Otherwise you end up with something highly sophisticated that nobody really cares about.

Designers and entrepreneurs understand that process of analysis, prototyping, testing and iteration quite well. Let’s make the research community more aware of it.

d. Mike Ellis (again) from his take on the state of affairs in his post Linked Data: my challenge:

Now, I’m deeply aware that actually I don’t actually know much about Linked Data. But I’m also aware that for someone like me, with my background and interests, to not know much about Linked Data, there is somewhere in the chain a massive problem.

I genuinely want to understand Linked Data. I want to be a Linked Data advocate in the same way I’m an API/MRD advocate. So here is my challenge, and it is genuinely an open one. I need you, dear reader, to show me:

Why I should publish Linked Data. The “why” means I want to understand the value returned by the investment of time required, and by this I mean compelling, possibly visual and certainly useful examples.
How I should do this, and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach.
Some compelling use-cases which demonstrate that this is better than a simple API/feed based approach.

There you go: the challenge is on. Arcane technical types need not apply.

e. Shion Deysarkar (CEO, 80legs) in comments within a new year’s semanticweb.com post:

In general, I feel the semantic web community as a whole is far too focused on manual curation of data. This means manually updated linked data sets, markup [of] individual web pages or documents, etc. This is not a scalable approach. Additionally, the semantic web is still confined to being understood and used by a very technical group of people. Wide-scale adoption of the semantic web requires providing non-technical users the ability to automatically convert existing documents to semantic form, as well as clearly demonstrating the benefit of doing so. This requires (a) tools to convert data to semantic structures and (b) tools that provide utility to content owners. There are niche providers for both (a) and (b), but I can only assume that the benefits of the semantic web are not general enough yet, since we haven’t yet seen wide-scale adoption.

f. William Mougayar (CEO, Eqentia) in comments within a new year’s semanticweb.com post:

2010 did not turn out to be a killer year for the Semantic Web unfortunately. It’s like Waiting for Godot.

I think that the semantic web will be more “inside” than outside in 2011. Anything semantic is really an enabler for something else. I know for us, we’re lightening the load on semantic extractions because there comes a point of diminishing returns on how much effort you want to exert to precisely tag content. It appears that end-users have not cared so much about the pure semantic label. They want a medium, but not heavy dose of it.

There’s definitely a continued separation between Semantic Technologies and the Semantic Web. As much [as] the Semantic Web’s vision is superb and well articulated by Tim Berners-Lee, it seems to be stuck in a corner due to technical implementation complexities. Its biggest benefits are seemingly with big enterprises, but those have few IT resources that are versed in these technologies. The ray of hope is around Gov 2.0/Open Gov/Open Data initiatives, and it would be good to see significant implementations around these areas. Scientific/medical applications is perhaps another segment that has promise for it.

g. CNI December 2010 (Coalition for Networked Information)

At levels above grass-roots technology activists, the administrators of libraries and other cultural heritage organizations have little exposure to, experience with, or knowledge of linked data. Experience shows that they have real difficulty when they are faced with needing to frame issues or raise substantive questions about linked data futures, and how such might affect their organizations.

A telling example comes from the December 2010 CNI meeting, where a session Linked Open Data: the promises and the pitfalls…where are we and why isn’t there broader adoption? was presented.

Most of the time went to three case studies:

Internet Archive, Kris Carpenter Negulescu
Smithsonian, Martin Kalfatovic
Vivo, Dean Krafft

MacKenzie Smith provided a cogent summary beginning at the 43:30 mark of the video.

After her remarks, she opened the floor for questions near the 52:30 mark…and got no response from the audience of junior and senior administrators of cultural heritage institutions. (Eric Miller, of Zepheira, did ask about difficulties each presenter encountered in their environments).