How to pursue linked data • CLIR

… concepts, not the tool sets

1. Simple, simpler, simplest … ?

Dean Allemang (Semantic web for the working ontologist) comments on a presentation about Facebook’s Open Graph Protocol at an August 2010 Silicon Valley Semantic Web Meetup:

One of the things I found most informative about the talk came in the discussion in response to various questions about design decisions. From the point of view of metatags, the Open Graph Protocol is really simple; just a handful of required tags with a simplified syntax (simpler even than standard RDFa). Even so, Facebook user studies showed that this was almost too complicated.

Even very small complications (additional namespaces, some slightly twisty syntax from RDFa) were found to have a severe damping effect on technology adoption. It seems that even the levels of simplicity we argue for in our Semantic Universe blog entry on technology adoption [and here] are not enough; for some audiences, simple really has to be simple. This is a tough pill for any technologist to swallow; looking at OGP makes it look as if the baby has been thrown out with the bathwater.

But there are now hundreds of millions of new ‘like’ buttons around the web; simplicity pays off. As another commenter pointed out, regardless of the purity (or lack thereof) of the Facebook approach, OGP has still made the biggest splash in terms of bringing semantic web to the attention of the public at large. So who’s the bandwagon, and who’s riding?
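
To make the "handful of required tags" concrete, here is a minimal sketch of reading Open Graph metatags from a page using only the Python standard library. The sample page, its URLs, and the names `OpenGraphParser` and `extract_open_graph` are assumptions made for illustration; the four properties shown (title, type, url, image) are the ones the published protocol lists as required.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


# Hypothetical page; the example.org URLs are placeholders.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="http://example.org/the-rock" />
<meta property="og:image" content="http://example.org/rock.jpg" />
</head><body></body></html>
"""


def extract_open_graph(html):
    """Return a dict of og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


og = extract_open_graph(SAMPLE_PAGE)
print(og["og:title"])
```

The whole consumer side fits in a couple of dozen lines, which is roughly the level of simplicity the user studies seem to have demanded.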

2. The zen of open data

Chris McDowall posted this to the New Zealand Open Government Ninjas forum:

Open is better than closed.

Transparent is better than opaque.

Simple is better than complex.

Accessible is better than inaccessible.

Sharing is better than hoarding.

Linked is more useful than isolated.

Fine grained is preferable to aggregated.

(Although there are legitimate privacy and security limitations.)

Optimise for machine readability; they can translate for humans.

Barriers prevent worthwhile things from happening.

“Flawed, but out there” is a million times better than “perfect, but unattainable”.

Opening data up to thousands of eyes makes the data better.

Iterate in response to demand.

There is no one true feed for all eternity; people need to maintain this stuff.

_________
Many people inadvertently contributed to this text. One particularly strong influence was a panel discussion between Nat Torkington, Adrian Holovaty, Toby Segaran and Fiona Romeo at Webstock’09.

3. Enough with all this intertwingularity! … Let’s share what we know

Dan Brickley’s An RDF wishlist email addresses the ennui of some among the linked-data community after a W3C Workshop – RDF Next Steps held at Stanford in June 2010. One contributor remarked:

part of me wants to completely forget about RDF, never think about an ontology or a logic ever again, and go off and do something completely different, like art or philosophy.

Excerpts from Brickley’s comments:

Is RDF hard to work with? I think the answer remains ‘yes’, but we lack consensus on why. And it seems even somehow disloyal to admit it. If I had to list reasons, I’d leave nits like ‘subjects as literals’ pretty low down. Many of the reasons I think are unavoidable, and intrinsic to the kind of technology and problems we’re dealing with. But there are also lots of areas for improvement. Most of these are nothing to do with fixups to W3C standards documentation. And finally, we can lessen the perception of pain by improving the other side: getting more decent linked data out there, so the suffering people go through is “worth it.”

[a list of reasons why RDF is annoying and hard follows here]

RDF enthusiasts share 99.9% of their geek DNA with the microformats community, with XML experts, with OWL people, … but time and again end up nitpicking on embarrassing details. Someone “isn’t really” publishing Linked Data because their RDF doesn’t have enough URIs in it, or they use unfashionable URI schemes. Or their Apache Web server isn’t sending 303 redirects. Or they’ve used a plain XML language or other standard instead. This kind of partisan hectoring can shrink a community passionate about sharing data in the Web, just at a time when this effort should be growing more inclusive and taking a broader view of what we’re trying to achieve.

The formats and protocols are a detail. They’ll evolve over time. If people do stuff that doesn’t work, they’ll find out and do other things instead. The thing that keeps me involved is the common passion for sharing information in the Web. If we keep that as an anchor point rather than some flavour of some version of RDF, I think a lot of the rest falls into place. I love “Let’s Share What We Know”, an ancient slogan of the early Web project. [notes] If we take “Let’s share what we know” as a central anchor, rather than triples, we can evaluate different technical strategies in terms of whether they help by making it easier to “share what we know” using the Web.

Going back to my list, I think the reason to use RDF will simply be that others have also chosen to use it. Nothing more really, it’s about the data, above all. Sure the reason we can all choose to use it and gain value from each others’ parallel decision is the emphasis on linking, sharing, mixing, decentralisation. But when choosing whether to bother with RDF, I think for future decision makers it’ll all be about the data not the implementation techniques.

The reason is *not* the tooling, the fabulous parsers, awe-inspiring inference engines, expressive query languages or cleverly designed syntaxes. Those are all means-to-an-end, which is sharing information about the world. Or getting hold of cheap/free and bulky background datasets, if you prefer to couch it in less idealistic terms.

And why would anyone care to get all this semi-related, messy Web data? Because problems don’t come nicely scoped and packaged into cleanly distinct domains. Whenever you try to solve one problem, it borders on a dozen others that are a higher priority for people elsewhere. You think you’re working with ‘events’ data but find yourself with information describing musicians; you think you’re describing musicians, but find yourself describing digital images; you think you’re describing digital images, but find yourself describing geographic locations; you think you’re building a database of geographic locations, and find yourself modeling the opening hours of the businesses based at those locations. To a poet or idealist, these interconnections might be beautiful or inspiring; to a project manager or product manager, they are as likely to be terrifying.

Any practical project at some point needs to be able to say “Enough with all this intertwingularity! This is our bit of the problem space, and forget the rest for now.” In those terms, a linked Web of RDF data provides a kind of safety valve. By dropping in identifiers that link to a big pile of other people’s data, we can hopefully make it easier to keep projects nicely scoped without needlessly restricting future functionality. An events database can remain an events database, but use identifiers for artists and performers, making it possible to filter events by properties of those participants. A database of places can be only a link or two away from records describing the opening hours or business offerings of the things at those places. Linked Data (and for that matter FOAF…) is fundamentally a story about information sharing, rather than about triples. Some information is in RDF triples; but lots more is in documents, videos, spreadsheets, custom formats, or [hence FOAF] in people’s heads.
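
The "safety valve" idea can be sketched in a few lines: an events database stores only an identifier for each performer, and properties of the performer come from someone else's data keyed by the same identifier. Everything below (the example.org URIs, the records, and the function name `events_by_genre`) is hypothetical and stands in for whatever shared identifiers a real project would adopt.

```python
# Our own, narrowly scoped data: events that point at performers by identifier.
EVENTS = [
    {"title": "Friday gig", "performer": "http://example.org/artist/1"},
    {"title": "Sunday recital", "performer": "http://example.org/artist/2"},
    {"title": "Open mic", "performer": "http://example.org/artist/99"},
]

# Somebody else's data about artists, keyed by the same identifiers.
ARTISTS = {
    "http://example.org/artist/1": {"name": "Artist One", "genre": "jazz"},
    "http://example.org/artist/2": {"name": "Artist Two", "genre": "classical"},
}


def events_by_genre(events, artists, genre):
    """Filter events by a property that lives in the external artist data."""
    matching = []
    for event in events:
        # Unknown identifiers simply yield no properties; the event data stays valid.
        artist = artists.get(event["performer"], {})
        if artist.get("genre") == genre:
            matching.append(event["title"])
    return matching


print(events_by_genre(EVENTS, ARTISTS, "jazz"))
```

The events data never has to model musicians itself; it only has to agree on how to name them.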

4. BBC … a middle way for linked data

Ed Summers’ commentary on the 2nd London Linked Data Meetup (February 2010) provides a summary of what the BBC has been up to:

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder that provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), that then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content that are on a particular topic.
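
As a rough illustration of why a shared topic vocabulary makes the linking automatic, the sketch below indexes programme records by the DBpedia URIs they are tagged with and then looks up each topic page's programmes. The records, titles, and function names are invented for the example and are not taken from the BBC's systems.

```python
from collections import defaultdict

# Hypothetical programme records, each tagged with DBpedia URIs as topic identifiers.
PROGRAMMES = [
    {"title": "Television programme A",
     "topics": ["http://dbpedia.org/resource/Lion",
                "http://dbpedia.org/resource/Cheetah"]},
    {"title": "Radio programme B",
     "topics": ["http://dbpedia.org/resource/Lion"]},
]

# Hypothetical topic guide pages, keyed by the same URIs.
TOPIC_PAGES = {
    "http://dbpedia.org/resource/Lion": "Lion",
    "http://dbpedia.org/resource/Cheetah": "Cheetah",
    "http://dbpedia.org/resource/Giraffe": "Giraffe",
}


def index_by_topic(programmes):
    """Build an inverted index from topic URI to programme titles."""
    index = defaultdict(list)
    for programme in programmes:
        for topic in programme["topics"]:
            index[topic].append(programme["title"])
    return index


def topic_guide_links(topic_pages, programmes):
    """For each topic page, list the programmes that share its topic URI."""
    index = index_by_topic(programmes)
    return {label: index.get(uri, []) for uri, label in topic_pages.items()}


links = topic_guide_links(TOPIC_PAGES, PROGRAMMES)
print(links["Lion"])
```

No mapping table between the two sites is needed; the common identifiers do the work.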

[snip]

For more information check out the Semantic Web Case Study the folks at the BBC wrote summarizing their approach for the W3C.

5. Chronicling America … one set of linked data methodologies

Ed Summers’ commentary regarding the “how to” evidenced in the BBC’s work (2nd London Linked Data Meetup) included a sketch of the Library of Congress’s approach to linked data in the Chronicling America project:

The really powerful message that the BBC is helping promote is this idea that good websites are APIs. Tom mentioned Paul Downey’s notion that Web APIs Are Just Web Sites. It’s a subtle but extremely important point that I learned primarily working closely with Dan Krech for a year or so. It’s an unfortunate side effect of lots of market-driven talk about web2.0, web3.0 and Linked Data in general that this simple REST message gets lost. We took it seriously in the design of the “API” at the Library of Congress’ Chronicling America. It’s also something I tried to talk about later in the week at dev8d when I had to quickly put a presentation together:

[snip]

The slides probably won’t make much sense on their own, but the basic message was that we often hear about Linked Data in terms of pushing all your data to some triple store so you can start querying it with SPARQL and doing inferencing, and suddenly you’re going to be sitting pretty, totally jacked up on the Semantic Web.

If you are like me, you’ve already got databases where things are modeled, and you’ve created little web apps that have extracted information from the databases and put them on the web as HTML docs for people around the world to read (cue some mid 1990s grunge music). Expecting people to chuck away the applications and technology stacks they have simply to say they do Linked Data is wishful thinking. What’s missing is a simple migration strategy that would allow web publishers to easily recognize the value in publishing the contents of their database as Linked Data, and how it complements the HTML (and XML, JSON) publishing they are currently doing. My advice to folks at dev8d boiled down to:

Keep modeling your stuff how you like
Identify your stuff with Cool URIs in your webapps
Link your stuff together in HTML
Link to machine friendly formats (RSS, Atom, JSON, etc)
Use RDF to make your database available on the web using vocabularies other people understand.
Start thinking about technologies like SPARQL that will let you query pools and aggregated views of your data.
Consider joining the public-lod discussion list and joining the conversation
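
A small sketch of the middle steps in that list: take a row from an existing database, give it a stable URI, and emit it as N-Triples using a vocabulary other people already understand (Dublin Core terms here). The base URI, the row, and the helper names are assumptions for illustration; this is not how Chronicling America itself is implemented.

```python
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(base_uri, record):
    """Serialize selected fields of a database row as N-Triples lines."""
    subject = f"<{base_uri}{record['id']}>"
    lines = []
    # Only fields that map onto Dublin Core terms are published here.
    for field in ("title", "publisher", "issued"):
        if field in record:
            literal = escape_literal(record[field])
            lines.append(f'{subject} <{DCTERMS}{field}> "{literal}" .')
    return "\n".join(lines)


# Hypothetical row; the id doubles as the path of the record's URI.
ROW = {"id": "newspaper/42", "title": 'The "Daily" Example', "publisher": "Example Press"}

ntriples = record_to_ntriples("http://example.org/", ROW)
print(ntriples)
```

The existing database and web application stay as they are; the RDF view is one more representation published alongside the HTML.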

6. Capturing context an integral function of publication

Brian O’Leary recently took publishers at O’Reilly Media’s Tools of Change 2011 conference on a provocative tour of the challenges that agile, web-savvy purveyors of content pose for those whose model for delivering content relies on the traits of traditional physical containers. If we set aside O’Leary’s treatment of the problems associated with content being shoehorned into physical book/journal containers, and think instead about the containers from our realm (metadata bounded by the strictures of database records), his analysis of the container-bound publishers’ dilemma provides us with a roadmap for thinking beyond our present day’s record-based (container-based) approach to metadata:

book, magazine and newspaper publishing is unduly governed by the physical containers
those containers … necessarily [ignore] that which cannot or does not fit
the process of filling the container strips out context: the critical admixture of tagged content, research, footnoted links, sources, audio and video background …
only after we fill the physical container do we turn our attention to rebuilding the digital roots of content: the context, including tags, links, research and unpublished material, that can get lost on the cutting-room floor
we can’t afford to build context into content after the fact … doing so irrevocably truncates the deep relationships that authors and editors create
building back those lost links is redundant, expensive and ultimately incomplete …
ultimately, [it’s] a function of workflow

The twenty-minute screencast of O’Leary’s Context first: a unified field theory of publishing provides valuable food for thought in terms of what the Stanford Workshop aims to accomplish.

7. Data first vs. structure first

Stefano Mazzocchi’s post from 2005 is an interesting predictor of the technology at the company where he now works (Freebase):

Data First strategies have higher usability efficiency (all rest being equal) than Structure First strategies.

The reasons are not so obvious:

Data First is how we learn and how languages evolve. We build rules, models, abstractions and categories in our minds after we have collected information, not before. This is why it’s easier to learn a computer language from examples than from its theory, or a natural language by just being exposed to it instead of knowing all rules and exceptions.
Data First is more incrementally reversible, complexity in the system is added more gradually and it’s more easily rolled back.
Because of the above, Data First’s Return on Investment is more immediately perceivable, thus lends itself to be more easily bootstrappable.

But then, one might ask, why is everybody so obsessed with design and order? Why is it so hard to believe that self-organization could be used outside the biological realm as a way to manage complex information systems?

One important thing can be noted: On a local time-scale and once established, “Structure First” systems are more efficient.

This basically means that in any given instant and with infinite energy to establish them, structure first systems are preferable. Problem is that both bootstrapping costs and capacity to evolve over time of any given designed system are endemically underestimated, making pretty much any ‘Structure First’ project appear more appealing over ‘Data First’ ones, at least at design time.

But there is more: we all know that a complete mess is not a very good way to find stuff, so “data first” has to imply “structure later” to be able to achieve any useful capacity to manage information. Here is where things broke down in the past: not many believed that useful structures could emerge out of collected data.

But look around now: the examples of ‘data emergence’ are multiplying and we use them every day. Google’s PageRank, Amazon’s co-shopping, Citeseer’s co-citation, del.icio.us and Flickr co-tagging, Clusty clustering, these are all examples of systems that try to make structure emerge from data, instead of imposing the structure and pretending that people fill it up with data.
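
Co-tagging is perhaps the easiest of these to sketch. Nobody designs a taxonomy up front; relationships between tags emerge from counting how often they appear together. The bookmark data and function names below are made up for the example, and real systems use far more refined measures than raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks, each reduced to the set of tags a user applied.
BOOKMARKS = [
    {"rdf", "semanticweb", "data"},
    {"rdf", "semanticweb"},
    {"python", "data"},
    {"rdf", "data"},
]


def co_tag_counts(tag_sets):
    """Count how often each unordered pair of tags appears on the same item."""
    counts = Counter()
    for tags in tag_sets:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


def related_tags(tag, counts):
    """Return tags that co-occur with `tag`, most frequent first."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == tag:
            related[second] += n
        elif second == tag:
            related[first] += n
    return related.most_common()


counts = co_tag_counts(BOOKMARKS)
print(related_tags("rdf", counts))
```

The "structure later" here is the related-tags list, which was never declared by anyone and changes as the data does.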

Some believe that the semantic web is an example of ‘structure first’ but it’s really not the case…. yet, many and many people truly believe that in order to be successful a ‘Structure First’ design (well “ontology first” in this case) is the way you build interoperability.

As you might have guessed, I disagree.

I think that RDF is a good data model for graph-like structures and that complex, real life systems tend to exhibit graph-like structures. I also believe that the value is not in the ontology used to describe the data but in the ability to globally identify (and isolate) information fragments and in the existence (or lack thereof!) of relationships between them.

Don’t get me wrong, some common vocabularies (RDF, RDF Schema and Dublin Core) go a long way in reducing the bootstrapping effort and make basic interoperability happen. At the same time, I believe people will “pick cherries” in the ontology space and when they don’t find anything satisfying they will write their own. Sometimes use and abuse will be hard to tell apart, creating a sort of Babel of small deviations that will have to be processed with a ‘Data First’ approach in mind. An immune system will have to be created, trusted silos established, peer review enforced.

Next time you spend energy writing the ontology, or the database schema, or the XML schema, or the software architecture, or the protocol, that ‘foresees’ problems that you don’t have right now, think about you ain’t gonna need it, do the simplest thing that can possibly work, keep it simple stupid, release early and often, if it ain’t broken don’t fix it, and all the various other suggestions that tell you not to trust design as the way to solve your problems.

But don’t forget to think about ways to make further structure emerge from the data, or you’ll be lost with a simple system that will fail to grow in complexity without deteriorating.

8. … Isn’t HTTP already the API?

Leigh Dodds regarding RDF data access options:

This is a follow-up to my blog post from yesterday about RDF and JSON.

Ed Summers tweeted to say:

…your blog post suggests that an API for linked data is needed; isn’t http already the API?

I couldn’t answer that in 140 characters, so am writing this post to elaborate a little on the last section of my post in which I suggested that “there’s a big data access gulf between de-referencing URIs and performing SPARQL queries.” What exactly do I mean there? And why do I think that the Linked Data API helps?

Mo McRoberts’ comment included:

The bigger complaint is of the gulf you’ve described in this and the previous blog post. People who aren’t part of the Linked Data “community” see those two words (or some approximate synonyms) and leap to conclusions: that they need to figure out RDF (probably), that they need to stuff everything they have in one of these newfangled triplestores (maybe), they need to expose a SPARQL endpoint (quite often not), which means learning SPARQL (again, quite often not) and figuring out how it’s all supposed to work together. To the web developer thinking “hey, this Linked Data stuff sounds pretty cool, maybe my site and its data could join in!” it all seems awfully complicated and offputting.

Ed Summers’ comment included:

Thanks very much for the thoughtful response to my hastily-tapped tweet. I like the catch-phrase “Your website is your API”, it’s kind of like the elevator pitch of RESTful web development I think. But I also think “website” is kind of a nebulous term – that isn’t really part of the terminology of Web standards, architecture, etc. So its utility kind of breaks down once you get past a superficial look at how data should be made available on the web.
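
One way to read “isn’t HTTP already the API?” is that the same URL can serve people and programs through content negotiation. The sketch below picks a representation from an HTTP Accept header. It is deliberately simplified (it honours q-values and `*/*` but ignores partial wildcards such as `text/*`), and the list of supported media types is an assumption for the example rather than anything the quoted authors specify.

```python
# Hypothetical representations a site might offer for each resource.
SUPPORTED = ["text/html", "application/json", "application/rdf+xml"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    preferences = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0]
        if not media_type:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for parameter in pieces[1:]:
            if parameter.startswith("q="):
                try:
                    quality = float(parameter[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((media_type, quality))
    # sorted() is stable, so equal weights keep the client's original order.
    return sorted(preferences, key=lambda item: item[1], reverse=True)


def choose_representation(accept_header, supported=SUPPORTED):
    """Return the best supported media type, or None if nothing matches."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return supported[0]
        if media_type in supported:
            return media_type
    return None


print(choose_representation("application/rdf+xml;q=0.9, text/html;q=0.5"))
```

A browser asking for HTML and a script asking for RDF both use the same address; the gulf Dodds describes begins only when a client needs more than one resource at a time.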

Leigh Dodds, SemTech thoughts (at SemTech 2011):

Our goal is to make data as useful in as many different contexts and by as many different developers as possible. You can find the slides for these on Slideshare … Creating APIs over RDF data sources

9. Linked data and why we (librarians) should care

From a presentation by Emmanuelle Bermes (Bibliothèque Nationale de France) regarding a use case in which:

a publisher provides basic information about a book
a national library adds bibliographic and authority control
my local library adds holdings
some nice guy out there adds links from Wikipedia
my library provides a web-page view of this and related books (subject, bio, wikipedia, amazon)

results:

no crosswalk / mapping: each one uses his own metadata format, all triples can be aggregated
no data redundancy: each one creates only the data he needs, and retrieves already existing information
no harvesting: the data is available directly on the web
no branding issue: the URIs allow tracking down the original data whatever its origin
no software-specific developments: everything relies on open standards such as RDF, SPARQL …; no need to learn a new protocol or query language
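
The claim that “all triples can be aggregated” can be illustrated with plain sets of (subject, predicate, object) tuples: each party contributes statements about the same book URI, and merging them is a set union with no crosswalk. The URIs, the prefixed property names, and the function names are hypothetical, and a real system would use an RDF library rather than bare tuples.

```python
BOOK = "http://example.org/book/1"  # hypothetical shared identifier for the book

# Each party publishes only the statements it is responsible for.
publisher_data = {
    (BOOK, "dcterms:title", "An Example Book"),
}
national_library_data = {
    (BOOK, "dcterms:title", "An Example Book"),
    (BOOK, "dcterms:creator", "http://example.org/person/7"),
}
local_library_data = {
    (BOOK, "ex:heldBy", "http://example.org/library/branch-3"),
}


def aggregate(*sources):
    """Merge triple sets; identical statements collapse automatically."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(subject, triples):
    """Group everything known about one subject by predicate."""
    description = {}
    for s, p, o in sorted(triples):
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


merged = aggregate(publisher_data, national_library_data, local_library_data)
print(describe(BOOK, merged))
```

The duplicated title statement appears only once after the merge, which is the “no data redundancy” point in miniature.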

10. URLs for metadata records and author as bibliographic resource?

From a post by Ed Summers, plus a variety of useful comments: bibliographic records on the web

… there is some desire in the library community to model an author as a Bibliographic Resource and then relate this resource to a Person resource. While I can understand wanting to have this level of indirection to assert a bit more control, and to possibly use some emerging vocabularies for RDA, I think (for now) using something like FOAF for modeling authors as people is a good place to start.

Jakob, now I think we are in agreement. Our bibliographic records really should be just a class of web documents with URLs. I could imagine some linked data advocates making a case for an abstract notion of a bibliographic record (so called Real World Object), so that they could be described. You actually can see some of this approach in the use of viaf:EstablishedHeading, viaf:NameAuthorityCluster, etc., in the Virtual International Authority File (example). But I don’t think this is necessary, and would ultimately make it difficult for people to put their records on the web.
