Identifiers • CLIR

1. In general

a. Kyle Neath URL design

You should take time to design your URL structure. If there’s one thing I hope you remember after reading this article it’s to take time to design your URL structure. Don’t leave it up to your framework. Don’t leave it up to chance. Think about it and craft an experience.

URL design is a complex subject. I can’t say there are any “right” solutions; it’s much like the rest of design. There’s good URL design, there’s bad URL design, and there’s everything in between; it’s subjective. But that doesn’t mean there aren’t best practices for creating great URLs. I hope to impress upon you some best practices in URL design I’ve learned over the years …

Why you need to be designing your URLs
Top level sections are gold
Namespacing is a great tool to expand URLs
Querystrings are great for filters and sorts
Non-ASCII URLs are terrible for English sites
URLs are for humans, not for search engines
A URL is an agreement
Everything should have a URL
A link should behave like a link
Post-specific links need to die

b. Jeni Tennison URL design

Kyle Neath’s post on URL design (go read it) reflects a lot of the thinking that we went through in the design of the legislation.gov.uk URIs and the linked data API as used within data.gov.uk.
[excerpts, as follow]

Content negotiation

It’s very powerful to provide multiple formats at close variants of a given URI (preferably using content negotiation to serve up an appropriate one). Kyle illustrates this with the use of .diff and .patch extensions on /pull URIs within GitHub, which of course complement the usual HTML view.

The biggest benefit of this approach is that the human-readable views of the information that’s being presented are closely bound to the computer-readable views. This aids developers by providing context and descriptive information that helps them to better understand the data the API is providing to them, far better than separate documentation would.

A secondary benefit is that the information required to generate a human-readable version of a page is likely to be useful to other reusers of the underlying data. Basing the HTML pages of a site purely on information available via the API increases the quality of the API.
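As a rough illustration of serving several formats from close variants of one URI, here is a minimal sketch in Python. It assumes an explicit extension wins and the `Accept` header is consulted otherwise. The function names, the extension table, and the `text/x-diff` and `text/x-patch` media types are all assumptions made for this example; they are not taken from GitHub or legislation.gov.uk.

```python
# Hypothetical extension-to-media-type table; the .diff/.patch types are assumptions.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".diff": "text/x-diff",
    ".patch": "text/x-patch",
}


def parse_accept(accept_header):
    """Return the media types named in an Accept header, most preferred first."""
    ranked = []
    for position, entry in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in entry.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_media_type(path, accept_header, default="text/html"):
    """An explicit extension on the path wins; otherwise negotiate on Accept."""
    for extension, media_type in EXTENSION_TYPES.items():
        if path.endswith(extension):
            return media_type
    supported = set(EXTENSION_TYPES.values())
    for media_type in parse_accept(accept_header):
        if media_type in supported:
            return media_type
    return default
```

With this sketch, a hypothetical `/pull/1.diff` path always yields the diff view, while `/pull/1` yields whichever supported type the client ranks highest. Wildcards such as `*/*` are not interpreted and simply fall through to the default.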

Fragment identifiers

Kyle touches on the use of fragment identifiers (the bit of a URI after a #) to point to scrollable positions within a page, but they can be used for more than that. The important and useful thing about fragment identifiers is that they are stripped from the URI before it is submitted to the server. You can therefore have multiple fragment identifiers on the same actual page, which can then be served from a (local or intermediate or accelerator) cache without adding load to the server.
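The point about fragments being stripped before the request can be sketched with Python's standard `urllib.parse.urldefrag`. The example URIs are hypothetical.

```python
from urllib.parse import urldefrag


def request_target(uri):
    """Split a URI into the part sent to the server and the client-side fragment."""
    url, fragment = urldefrag(uri)
    return url, fragment


# Hypothetical URIs: two different fragments on the same page.
uris = [
    "http://example.org/act/2010/section#part-1",
    "http://example.org/act/2010/section#part-2",
]
targets = {request_target(uri)[0] for uri in uris}
# Both URIs reduce to the same request target, so one cached
# response can serve either of them.
```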

c. Rob Styles Choosing URIs, not a five minute task

Chris Keene at Sussex is having a tough time making a decision on his URIs so I thought I’d wade in and muddy the waters a little.

He’s following advice from the venerable Designing URI Sets for the UK Public Sector. An eleven page document from the heady days of October 2009.
[snip]

Chris discusses the choice between data.lib.sussex.ac.uk and www.sussex.ac.uk/library/ in terms of elegance, data merging and running infrastructure. He’s leaning toward data.lib.sussex.ac.uk on the basis that data.organisation.tld is the prevailing wind.

There are many more aspects worth considering, and while data.organisation.tld may be a way to get up and running quickly you might get longer term benefit from more consideration; after all we don’t want these URIs to change.

Thinking about changing responsibilities over time, I have to say I would choose neither. It is perfectly conceivable that the mass observation may at some time move and not be under the remit of the University of Sussex Library, or even the university.

I would choose a hostname that can travel with the archive wherever it may live. Fortunately it already has one, http://www.massobs.org.uk/. Ideally the catalogue would live at something like catalogue.massobs.org.uk or maybe massobs.org.uk/archive or something like that.

My leaning on this is really because this web of data isn’t something separate from the web of documents, it’s “as well as” and “part of” the web as one whole thing. data.anything makes it somehow different; which in essence it’s not.

Postscript
Oh, on just one more thing…

URI type, for example one of:

id – Identifier URI
doc – Document URI, Representation URI
def – Ontology URI
set – Set URI

Personally, I really dislike this URI pattern. It leaves the distinguishing piece early in the URI, making it harder to spot the change as the server redirects and harder to select or change when working with the URIs.

I much prefer the pattern

/container/reference to mean the resource
/container/reference.rdf for the rdf/xml
/container/reference.html for the html

and expanding to

/container/reference.json, /container/reference.nt, /container/reference.xml

and on and on.

My reasoning is simple, I can copy and paste the document URI from the address bar, paste it to curl on the command line and simply backspace a few to trim off the extension. Also, in the browser or wget, this pattern gives us files named something.html and something.rdf by default. Much easier to work with in most tools.
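The "backspace a few to trim off the extension" habit can be sketched as a pair of helper functions. This is only an illustration of the /container/reference pattern; the function names and the `example.org` host are assumptions, and the extension set is the one listed above.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Extensions named in the pattern above.
KNOWN_EXTENSIONS = {".rdf", ".html", ".json", ".nt", ".xml"}


def resource_uri(document_uri):
    """Trim a known format extension to recover the URI of the resource itself."""
    parts = urlsplit(document_uri)
    root, extension = posixpath.splitext(parts.path)
    if extension in KNOWN_EXTENSIONS:
        parts = parts._replace(path=root)
    return urlunsplit(parts)


def document_uri(resource, extension):
    """Append a format extension to a resource URI to name one representation."""
    parts = urlsplit(resource)
    return urlunsplit(parts._replace(path=parts.path + extension))
```

Because the distinguishing piece sits at the very end of the path, both directions are a one-line string operation.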

d. Other resources:

2. Books and journals, e.g. ISBN, ISSN, …

a. Jon Orwant, Creating a Trillion-field Catalog: Metadata in Google Books

as reported by Don Hawkins

Problems are encountered with inconsistencies, particularly with multi-volume works and languages using non-Roman character sets. One might think that ISBNs would help, but they are far from unique; in fact, ISBN 753305353 is shared by 1,413 books, and 6,000 ISBNs are associated with more than 20 titles each!

b. Peter Murray eBook Identifier Confusion Shakes Book Industry

Eric Hellman writes about his views of the dysfunction surrounding ISBN assignments for ebooks.

“What problems?” you might ask. Eric has an example of how Barnes and Noble was enhancing some ebooks for their Nook platform. By itself, this activity wouldn’t result in assigning a new ISBN. But because publishers are now exerting more control over setting the prices of ebooks (the so-called “agency model”), the existence of these Nook-enhanced versions needs to cross back-and-forth between the publisher’s and retailer’s electronic systems. The only commonly agreed upon identifier? The ISBN.

And this proliferation of ISBN assignments is making trouble for libraries’ efforts to effectively identify material, which is to say nothing about what it is doing to our efforts to shoehorn these distinctions between various works into the MARC format used by our catalogs. Is that a separate record for that manifestation with a different ISBN?

c. Other resources:

3. Co-references, reconciliation

a. Stefano Mazzocchi, a series of three posts on reconciliation: part 1, part 2, part 3

Suppose that you are given two fragments of data, each representing the same objective fact about the same thing (say, the fact that Paris is the capital of France and that the Eiffel Tower is located in Paris) but using different models (aka schemas/ontologies) and different identifiers for the entities described in the data.

Reconciling these fragments means to align the different identifiers given to the same ‘entity’ (in this case ‘Paris’) and fold them together so that the two facts are now related to the same thing (how this is done in practice is not important for now).

This reconciliation activity seems mechanical and artificial at first, but digging deeper into the way natural languages emerge sheds some light on the fact that reconciliation can be seen as a form of categorization: we are lumping together all things that indicate “Paris”, just like we do naturally for synonyms or for words with different sounds (imagine “Paris” pronounced in French, ‘pah-ree’).

[snip]

The idea behind RDF (and all syntactic forms of the RDF model like RDFa, Turtle, ntriples, RDF/XML etc.) is that describing data fragments on the web with it (or other things like Microformats that could be easily and mechanically RDFized) allows harvesters to merge data naturally since RDF is, in a sense, already liquid.

There is one problem though: two RDF models always merge… but not necessarily in the way that you would want them to. In the example above, if I had two RDF fragments, written by different people and harvested from different URLs, it is very likely that their identifiers for Paris could be globally unique, but different.

Which means that you don’t know two assertions about Paris, you know one assertion about “urn:france:paris” and another assertion about “http://wikipedia.com/en/paris”… but the RDF engine doesn’t know, unless you load another piece of information that explicitly says so, that these two identifiers are equivalent and they mean to identify the same exact entity in real life.
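To make the merging problem concrete, here is a minimal sketch of identifier reconciliation over plain Python tuples, using a small union-find over explicit equivalence statements. It is not how any particular RDF engine works. The two Paris identifiers come from the text above; the other identifiers and the predicate names are hypothetical.

```python
def reconcile(triples, same_as):
    """Rewrite triples so that identifiers declared equivalent share one canonical id."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for left, right in same_as:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the lexicographically smaller id as canonical, for determinism.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep

    return {(find(s), p, find(o)) for s, p, o in triples}


# Two fragments from different sources (identifiers other than Paris are hypothetical).
triples = {
    ("urn:france:paris", "capitalOf", "urn:france"),
    ("http://example.org/eiffel_tower", "locatedIn", "http://wikipedia.com/en/paris"),
}
# Without this extra statement the two facts stay about two different things.
same_as = [("urn:france:paris", "http://wikipedia.com/en/paris")]
merged = reconcile(triples, same_as)
```

After the call, both facts refer to a single Paris identifier. With an empty `same_as` list the triples come back unchanged, which is the situation the paragraph above describes.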

[snip]

The difference between efforts like Freebase and efforts like Linking Open Data hinges around their model for reconciliation.

Freebase spends a considerable amount of resources performing a priori reconciliation of all the bulk loads of data to try to have the most compact and densest possible graph, even at the cost of limiting the rate with which new data can be acquired. On the other hand, Linking Open Data follows the a posteriori reconciliation model where it is assumed that identifier reconciliation is a low-energy point and the world-wide web of data will, once big enough, tend to naturally reconcile identifiers and schemas toward an increased graph density.

Both are huge bets: there is no indication that a priori reconciliation costs are not a function of the quantity of data already contained in the graph (which would eventually saturate its ability to grow); and there is no indication that a denser graph is naturally a lower energy point for unreconciled agglomerations of datasets and that an increase in relational density would happen naturally and spontaneously.

It’s important that I mention explicitly the reason why I stress ‘relational density’ as a critically important property for a web of data: without it there would be very little value in it compared to what traditional search engines are already doing. The problem text-based search engines have is that they have a really hard time emerging from the token soup of their inverted indices even the most trivial of the relationships between data fragments (here it is worth mentioning that while Google Squared inspires awe and admiration from data geeks, myself included, it is still a vastly useless tool for any low-tech end user given how noisy its results are).

[from part 2]

Basically, a-posteriori sameAs identifier equivalences are enough to reconcile if and only if the two items being merged are described using the same exact data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.

So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and make an effort to do so, the result is relationally sparse and ontologically inconsistent… and I don’t think this is because LOD is doing a bad job but because their model revolves around the idea that the more data the better, no matter how relationally dense, which is in striking contrast to what Freebase does, focusing more on higher relational density than higher item or domain counts.

[from part 3]

The difference between RDFa and Microdata (syntactic differences aside) is basically the fact that the proponents of the first believe that once everybody naturally starts reusing existing ID schemes and ontologies a densely connected web of semantically reconciled information will come together naturally. The second just want to focus on immediate values and avoid speculating on what’s going to happen next.

This is not different than the debate Exhibit vs. Tabulator: the first is useful (and in use) today and promotes the surfacing of structured data but does little to promote linkage between isolated datasets, the second is much less useful for end users but acted as a catalyst to the concept of “linkable data,” a methodology where identifiers don’t just identify but can also be used, as-is, as web locators.

They both use the same underlying model (and can even read and write the same syntax)… yet they serve completely different purposes and have radically different aspirations and social dynamics around them: I see the same issue for RDFa vs. Microdata.

The RDFa camp see it as a vector to promote the growth of the web of data, while the Microdata camp focuses on solving practical problems of embedding richer machine-processable information in web pages: the model they use is isomorphic (meaning that, in a closed world scenario, you can always translate one into the other), but their aspirations and the social dynamics they expect around them are different.

It’s not a secret I tend to side with pragmatism and paving-cow-paths strategies on these debates and I find it frankly disheartening that purists still believe that the secret to a useful web of data is already there in the guts of the architecture of the web and that simply turning a URI into a URL will cause enough social pressure to solve the other issues.

b. Co-referencing, from JISC EXPO meeting, July 11-12, 2011, Pete Johnston

Yesterday I found myself as “scribe” for a discussion on the “co-referencing” question, i.e., how to deal with the fact that different data providers assign and use different URIs for “the same thing.” And these are my rather hasty notes of that discussion.

c. Other resources:

4. Provenance

a. Jeni Tennison Establishing Trust by Describing Provenance

One of my favourite tweets from Rob McKinnon (aka @delineator) is this one:

feeling upset RDF enthusiasts oversell
RDF, ignoring creation, provenance,
ambiguity, subjectivity + versioning
problems #linkeddata #london
(9:51 AM Sep 9th from the web)

because it’s one of the things that bugs me on occasion too, and because the issues he mentions are so vitally important when we’re talking about public sector information but (because they’re the hard issues) are easy to de-prioritise in the rush to make data available.
[snip]

This pattern [embedded in discussing two vocabularies Open Provenance Model and the Provenance Vocabulary] for providing provenance information isn’t a complete answer because it doesn’t address how you might assess the provenance of a particular statement. If I went to [the URI] http://statistics.data.gov.uk/id/region/H the only way I could establish that the rdfs:label (say) for the region was generated through the process described above would be to match the URI to the void:uriRegexPattern above, get hold of the original RDF from the cache and work out whether it contains the rdfs:label statement that I’m interested in.

I have a hunch that this would be more viable with named graphs: if statements with different provenance were actually placed in different graphs, then it would be possible with a SPARQL query to identify the graph(s) in which a statement was made, and their provenance.
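The named-graph hunch can be sketched without a SPARQL engine by treating each statement as a quad of subject, predicate, object, and graph name. The graph names, the label value, and the provenance descriptions below are hypothetical; only the region URI and the `rdfs:label` property come from the text.

```python
# Each quad is (subject, predicate, object, graph name).
# Graph names and the literal value are hypothetical.
QUADS = {
    ("http://statistics.data.gov.uk/id/region/H", "rdfs:label",
     "Example label", "http://example.org/graph/converted-2010"),
    ("http://statistics.data.gov.uk/id/region/H", "rdf:type",
     "http://example.org/def/Region", "http://example.org/graph/hand-curated"),
}

# Provenance is attached to graphs rather than to individual statements (hypothetical).
GRAPH_PROVENANCE = {
    "http://example.org/graph/converted-2010": "generated by a conversion script",
    "http://example.org/graph/hand-curated": "entered manually",
}


def graphs_containing(quads, statement):
    """Return the names of every graph in which the given triple is asserted."""
    return {g for (s, p, o, g) in quads if (s, p, o) == statement}


def provenance_of(quads, statement, graph_provenance):
    """Look up the provenance of each graph that contains the statement."""
    return {g: graph_provenance.get(g) for g in graphs_containing(quads, statement)}
```

In a SPARQL store the same lookup would be a query with a `GRAPH ?g { ... }` pattern; the Python version only shows the shape of the idea.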

b. Yves Raimond Named Graphs and Quad Stores (BBC)

As mentioned in one of my previous posts, one thing that I am really keen on is the idea of having triples that belong to multiple graphs. This situation is already happening in the wild a fair bit. If you look at a /programmes aggregation feed (all available Radio 4 programmes starting with g), it mentions multiple programmes. All of the statements about each of these programmes also belong to the corresponding programme graph (e.g., Gardener’s Question).

[from resulting comments]

On Tuesday 21 December 2010, 19:59 by Dan Brickley

(genuine question!)

To what extent is this really an issue with the formal account of named graphs (and SPARQL’s version of it), versus an implementation detail for people actually building RDF storage and query systems? Can’t the redundancy you mention be optimised away internally within the implementation?

On Wednesday 22 December 2010, 11:49 by Yves

@danbri It is an implementation detail 🙂 I don’t think there is anything wrong with the formal account of Named Graphs (well, apart from the fact that it requires graphs to be named, but that’s another thing). I am just wondering why the ‘quad-store’ trick for implementing named graphs is dominant, when it does seem to lead to a fair lot of data duplication.
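The duplication being discussed can be illustrated with two toy storage layouts: one row per (triple, graph) pair, versus one entry per triple with a set of graph names. This is a sketch of the trade-off only and does not describe any real store. The graph and programme identifiers are hypothetical.

```python
from collections import defaultdict


def as_quads(graph_triples):
    """Quad layout: one stored row per (triple, graph) pair."""
    return {(s, p, o, g)
            for g, triples in graph_triples.items()
            for (s, p, o) in triples}


def as_triple_index(graph_triples):
    """Alternative layout: each triple stored once, with the graphs it belongs to."""
    index = defaultdict(set)
    for g, triples in graph_triples.items():
        for triple in triples:
            index[triple].add(g)
    return dict(index)


# Hypothetical data: the same statement appears in a feed graph and a programme graph.
shared = ("http://example.org/programmes/p1", "dc:title", "A programme starting with g")
graph_triples = {
    "http://example.org/graphs/feed-g": {shared},
    "http://example.org/graphs/programme-p1": {shared},
}

quads = as_quads(graph_triples)         # the triple is stored twice
index = as_triple_index(graph_triples)  # the triple is stored once
```

Whether the second layout is preferable in practice depends on query patterns and indexing, which this sketch does not model.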

c. Scott Meyer A Brief Tour of Graphd (Freebase)

Graphd primitives (tuples) are identified by GUIDs which consist of a database id and a primitive id. In a database, primitive ids are assigned sequentially as primitives are written. For example, 9202a8c04000641f8000000000006567 is the guid which corresponds to the one known to you as “Arnold Schwarzenegger.” The front part, 9202a8c04000641f8, is the database id and the back part, 6567, is the primitive id. As you might surmise based on the number of intervening zeros, we’re quite ambitious. Each graphd primitive consists of:

left – A guid, the feathered end of a relationship arrow.
right – A guid, the pointy end of a relationship arrow.
type – A guid, used in conjunction with left and right to specify the type of a relationship.
scope – A guid, identifying the creator of a given primitive.
prev – A guid, identifying the previous guid in this lineage.
value – A string used to carry literal values, strings, numbers, dates, etc.

And a few other odds and ends.
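The guid layout described above can be sketched as a small parsing helper. The split point is an assumption inferred from the single example in the text (a 17-character database id followed by a zero-padded primitive id); the excerpt does not specify the format, and graphd's actual encoding may differ.

```python
# Assumption: inferred from the one example guid, not from a graphd specification.
DATABASE_ID_LENGTH = 17
GUID_LENGTH = 32


def split_guid(guid):
    """Split a 32-character hex guid into (database id, primitive id)."""
    if len(guid) != GUID_LENGTH:
        raise ValueError("expected a 32-character guid")
    int(guid, 16)  # raises ValueError if the guid is not hexadecimal
    database_id = guid[:DATABASE_ID_LENGTH]
    # Drop the zero padding in front of the sequentially assigned primitive id.
    primitive_id = guid[DATABASE_ID_LENGTH:].lstrip("0") or "0"
    return database_id, primitive_id


database_id, primitive_id = split_guid("9202a8c04000641f8000000000006567")
```

For the guid quoted in the text this reproduces the two parts the author names, 9202a8c04000641f8 and 6567.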

d. W3C Provenance working group

The Provenance Working Group had its first Face-to-Face (F2F) meeting last week in Boston after 3 months of hard work (July 16, 2011). [notes]

e. Whose data is it? A linked data perspective

A post growing out of the UK LOCAH project:

A comment on the blog post announcing the release of the Hub Linked Data maybe sums up what many archivists will think: “the main thing that struck me is that the data is very much for someone else (like a developer) rather than for an archivist. It is both ‘our data’ and not our data at the same time.”

f. Other resources:

5. Persistence

a. Daniel Chudnov Better living through linking (2009) [excerpts from this presentation]

Linked data is:

a way to connect our [library] stuff
draw our stuff deeper into the web
not just files to download, but part of the web

[our stuff] becomes crawl-able, mine-able
this is doing web stuff better
just by doing HTML / HTTP
doesn’t fit the web, it is the web

How … the next part’s hard

how to make it last
if your site breaks when links break, cache and link yourself
so if a remote link breaks, your local links still work
make every cache its own linked data source
if one goes down, the others live on
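The "cache and link yourself" advice can be sketched as a resolver that refreshes a local copy whenever the remote link works and falls back to that copy when it does not. The function name and the injected `fetch_remote` callable are assumptions for this example; a real deployment would also need expiry and storage decisions that are out of scope here.

```python
def resolve(uri, fetch_remote, cache):
    """Fetch a linked resource, falling back to the local cache if the remote fails.

    fetch_remote is any callable taking a URI and returning the body; it is
    expected to raise OSError (which includes urllib's URLError) on failure.
    """
    try:
        body = fetch_remote(uri)
    except OSError:
        if uri in cache:
            return cache[uri]  # the remote link broke, the local link still works
        raise
    cache[uri] = body  # refresh the local copy on every successful fetch
    return body
```

Passing the fetcher in as a parameter keeps the sketch testable without network access. With the standard library, one option is a small wrapper around `urllib.request.urlopen`.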

b. Zepheira PURLz PURL Federation Development Effort

The National Center for Biomedical Ontology and Zepheira are pleased to announce work on a PURL Federation. A PURL Federation will allow multiple PURL service operators to cooperate in PURL resolutions, covering for each other in the case of service outages and allowing the persistent resolution of PURLs as funding levels and organizational details change with time.

PURL Federations are intended to enhance the ability of Semantic Web and Linked Data communities to ensure the persistence of their identifiers. Current status here.

c. Other resources:

Ian Davis, … – purl.org was rejecting ca. 20 to 60 requests/second
Hugh Glaser – DBpedia hosting burden
Andy Powell – Federating purl.org?
Ed Summers – on persistence
Rob Vesse, … – preserve link integrity via hypermedia techniques [slides]
