Vocabulary & Schema • CLIR


The objective here is to provide a selective scan of the wide range of resources that can be used to organize data and map it in useful ways. The boundary between schema, vocabulary, and protocol is blurred by intention.

1. Schema.org [home] [getting started]

From Michael O’Connor’s (Bing) post of July 20th, 2011:

On June 2nd we announced a collaboration between Bing, Google and Yahoo to create and support a standard set of schemas for structured data markup on web pages. Although our companies compete in many ways, it was evident to us that collaboration in this space would be good for each search engine individually and for the industry as a whole.

In the short time since the announcement we’ve received a lot of great feedback. We have participated in some conferences and we have seen a lot of discussion about our proposal on the web. We’ve also seen the Association of Educational Publishers publicly announce their intention to build an industry specific extension! We’re genuinely pleased to see so much interest in the topic as it is very important to the work we do.

We have been reading all of the feedback, following the discussions and debating amongst our working group many of the concerns and suggestions that have been raised. We have not been able to respond to all of the feedback, but we have incorporated some of it into our site already and we will continue to iterate on that.

Going forward, this blog will serve as a vehicle for the team to share our thoughts, solicit feedback, announce schema updates and so on.

In that spirit, I’m happy to announce that we will be hosting a schema.org workshop to take place on September 21st in Silicon Valley. As a group we are deeply committed to working with the standards communities, tools vendors and organizations committed to driving industry specific extensions so we hope this workshop will be the first of many successful collaborations.

Over the next couple of weeks we’ll be reaching out to the leaders in the appropriate standards communities, amongst the tools vendors and in the vertical industries where extensions make the most sense inviting them to participate. This will be a working session -taking feedback, discussing options and figuring out the best way to incorporate it to make this simple and useful for publishers and the search engines. If you would like to be involved please send an email to workshop@schema.org.

We are really looking forward to these discussions and we will share what we learn here on this blog. In the meantime, please continue to share your feedback.

a. At SemTech 2011

A group of W3C constituents got together informally at SemTech 2011 for a conversation with Kavi Goel (Google): introduction and notes.

Ed Summers was there as was Dan Brickley. Keep in mind that many of the speakers have a deeply vested interest in RDF and RDFa … it did capture the temperature of some in the semWeb/linkedData community immediately after schema.org was announced.

b. A sampling of commentary from early days

Benjamin Nowack, Peter Mika (Yahoo), Mike Bergman, Danny Ayers

Jeni Tennison offered some thoughtful notes of caution, and Harry Halpin helps summarize the conflicts among proponents of various environs and schema that are in play (microdata, schema.org, microformats, RDF, RDFa, rich snippets, …). [the entire W3C email thread is located here]

c. … and on further reflection:

Semanticweb.com summed up the first few weeks of discussion here and Eric Hellman addresses aspects of metadata for books here and here.

As thinking continues to evolve, a number of practical means of addressing the what and how of microdata have begun to appear:

2. Linked data in general

a. Richard Cyganiak Top 100 most popular RDF namespace prefixes

I run prefix.cc, a website for RDF developers where anyone can register and look up the expansion URIs for namespace prefixes such as foaf, dc, qb or void. The site tracks which prefixes gets looked up most often. This allows some insight into the popularity of RDF vocabularies and datasets.
[note the caveats]
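
To make the idea of "expansion URIs for namespace prefixes" concrete, here is a minimal sketch of prefix expansion. The prefix map is hand-written for illustration (it is not fetched from prefix.cc), and the function name `expand` is an assumption of this sketch.

```python
# A small, hand-picked prefix map. A real application might look these
# up in a registry such as prefix.cc instead of hard-coding them.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "qb": "http://purl.org/linked-data/cube#",
    "void": "http://rdfs.org/ns/void#",
}


def expand(compact_name: str, prefixes: dict = PREFIXES) -> str:
    """Expand a compact name such as 'foaf:name' into a full URI."""
    prefix, sep, local = compact_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {compact_name!r}")
    return prefixes[prefix] + local


if __name__ == "__main__":
    for name in ("foaf:name", "dc:title", "void:Dataset"):
        print(name, "->", expand(name))
```
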

b. Dublin Core Metadata Initiative Terms; News; Status Reports

c. Europeana Document Library … technical documents

NB: focus is on OAI-ORE, DublinCore and SKOS

d. Freebase

(1) A very high level snapshot of “schemas”

Jamie Taylor’s introduction to Freebase from the perspective of linked data

(2) Access to detailed maps of schemas within Freebase

(3) John Giannandrea’s 2008 video introduction to Freebase’s design objectives

Part of what makes this open database unique is that it spans domains, but requires that a particular topic exist only once in Freebase. Thus freebase is an identity database with a user contributed schema which spans multiple domains. For example, Arnold Schwarzenegger may appear in a movie database as an actor, a political database as a governor, and in a bodybuilder database as Mr. Universe. In Freebase, however, there is only one topic for Arnold Schwarzenegger that brings all these facets together. The unified topic is a single reconciled identity, which makes it easier to find and contribute information about the linked world we live in.
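
The "one topic, many facets" idea in the passage above can be sketched as a toy data structure. This is not Freebase's data model or API; the class names, the domain labels, and the use of a plain name as the identity key are all simplifying assumptions (real reconciliation is considerably harder than matching on a name).

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A single reconciled identity carrying facets from several domains."""
    name: str
    facets: dict = field(default_factory=dict)


class TopicStore:
    """Toy store that keeps exactly one Topic per identity key."""

    def __init__(self):
        self._topics = {}

    def __len__(self):
        return len(self._topics)

    def assert_facet(self, name: str, domain: str, properties: dict) -> Topic:
        # Reuse the existing topic if there is one, rather than creating
        # a second record for the same identity in another domain.
        topic = self._topics.setdefault(name, Topic(name))
        topic.facets.setdefault(domain, {}).update(properties)
        return topic


if __name__ == "__main__":
    store = TopicStore()
    # Domain labels and property names are illustrative only.
    store.assert_facet("Arnold Schwarzenegger", "film", {"role": "actor"})
    store.assert_facet("Arnold Schwarzenegger", "government", {"office": "governor"})
    topic = store.assert_facet("Arnold Schwarzenegger", "bodybuilding", {"title": "Mr. Universe"})
    print(len(store), sorted(topic.facets))
```
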

(4) Andrew Hogue’s 2011 The Structured Search Engine video presentation

e. Schemapedia [ about]

A community built, searchable compendium of RDF schemas for use with Linked Data. Schemas are described in a simple, friendly way and feature examples of how you can …

Service provided by Ian Davis using the Talis Platform. Project hosting by Google Code.

All text and data are in the Public Domain

f. VoID [ about]

VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data, with applications ranging from data discovery to cataloging and archiving of datasets. This document is a detailed guide to the VoID vocabulary. It describes how VoID can be used to express general metadata based on Dublin Core, access metadata, structural metadata, and links between datasets. It also provides deployment advice and discusses the discovery of VoID descriptions.
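
As a rough illustration of the kinds of metadata described above, the following sketch emits a small VoID description in Turtle: general metadata via a Dublin Core title, access metadata via a SPARQL endpoint, and structural metadata via a triple count. The dataset URI, title, endpoint, and count are invented placeholders, the helper name is an assumption, and the string building does no escaping, so it is only suitable for simple values.

```python
def void_description(dataset_uri: str, title: str,
                     sparql_endpoint: str, triples: int) -> str:
    """Return a minimal VoID dataset description as Turtle text."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',                  # general metadata
        f"    void:sparqlEndpoint <{sparql_endpoint}> ;",  # access metadata
        f"    void:triples {triples} .",                   # structural metadata
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(void_description(
        "http://example.org/dataset/demo",
        "Demo dataset",
        "http://example.org/sparql",
        12345,
    ))
```

Links between datasets, the fourth kind of metadata mentioned, would be described with a separate `void:Linkset` resource, which this sketch leaves out.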

3. Specific environs

a. Archives

(1) EAC-CPF (Encoded Archival Context-Corporate Bodies, Persons, and Families)

Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF) primarily addresses the description of individuals, families and corporate bodies that create, preserve, use and are responsible for and/or associated with records in a variety of ways. Over time, other types of contextual entities may evolve under the larger EAC umbrella, but currently its primary purpose is to standardize the encoding of descriptions about agents to enable the sharing, discovery and display of this information in an electronic environment.

(2) EAD Revision ( Encoded Archival Description)

(Timetable estimates draft release December 2012, final release August 2013)

(3) HASSET ( Humanities and Social Science Electronic Thesaurus)

HASSET is our subject thesaurus that has been developed by the UK Data Archive over more than 20 years. It allows you to retrieve data and related documentation accurately and can help you select the most relevant search terms for your area of interest.

HASSET-the Humanities and Social Science Electronic Thesaurus-displays the relationships between terms and so can help you broaden your search or make it more specific. Cross referencing to synonyms will suggest alternative terms as well as provide links to other conceptually related terms.

The thesaurus’ subject coverage reflects the subject content of the UK Data Archive holdings focusing on the social sciences and humanities. Coverage is most comprehensive in the core subject areas of social science disciplines: politics, sociology, economics, education, law, crime, demography, health, employment, and, increasingly, technology and environment.

(4) LOCAH (Linked Open Copac Archives Hub)

Pete Johnston’s several posts on his RDF work related to EADs:

The “things” in EAD: a first cut at a model

Some more “things”: some extensions to the Hub model

Describing the “things”: the RDF terms used (part 1)

Describing the “things”: the RDF terms used (part 2)

(5) SNAC ( Social Networks and Archival Context)

Ed Summers’ post on converting GraphML data from SNAC to Arch flavored RDF:

The SNAC project have aggregated archival finding aid data for manuscript collections at the Library of Congress, Northwest Digital Archives, Online Archive of California and Virginia Heritage. They then used authority control data from NACO/LCNAF, Getty Union List of Artist Names Online (ULAN) and VIAF to knit these archival finding aids using the Encoded Archival Context-Corporate Bodies, Persons, and Families ( EAC-CPF).

I wrote about SNAC here about 9 months ago, and how much potential there is in the idea of visualizing archival collections across institutions, along the axis of identity. I had also privately encouraged Brian to look into releasing some portion of the data that is driving their prototype. So when Brian delivered I felt some obligation to look at the data and try to do something with it. Since Brian indicated that the project was interested in an RDF serialization, and Mark had pointed me at Aaron Rubenstein’s arch vocabulary, I decided to take a stab at converting the GraphML data to some Arch flavored RDF.
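
For readers unfamiliar with this kind of conversion, here is a minimal sketch of turning GraphML edges into RDF triples (serialized as N-Triples). It is not Ed Summers' code and does not use the Arch vocabulary: the base URI and the predicate are hypothetical placeholders, and the sample graph is invented.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hypothetical base URI and predicate, for illustration only; a real
# conversion would use the project's identifiers and chosen vocabulary.
BASE = "http://example.org/id/"
ASSOCIATED_WITH = "http://example.org/vocab#associatedWith"

SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""


def graphml_to_ntriples(graphml_text: str) -> list:
    """Emit one N-Triples line per GraphML edge."""
    root = ET.fromstring(graphml_text)
    triples = []
    for edge in root.iter(GRAPHML_NS + "edge"):
        subject = BASE + edge.get("source")
        obj = BASE + edge.get("target")
        triples.append(f"<{subject}> <{ASSOCIATED_WITH}> <{obj}> .")
    return triples


if __name__ == "__main__":
    for line in graphml_to_ntriples(SAMPLE):
        print(line)
```
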

b. Books

(1) BIBO ( BIBliographic Ontology) and associated Google Group

(2) British Library Linked-Data Model (July, 2011)

Made public at Talis’ Linked Data and Libraries 2011 in Neil Wilson’s (Head of Metadata Services for the British Library) presentation. A video stream of his remarks is at 1:00:45 in the afternoon videostream. A map of the model is here and an overview by Tim Hodson of Talis who consulted with the BL during development of the model is here. As Hodson notes, the overview is the first in a series of posts to delve deeper into specific aspects of the model and explain some of the thoughts behind the modelling.

(3) OPDS ( Open Publication Distribution System)

A syndication format for electronic publications based on Atom and HTTP. Catalogs enable the aggregation, distribution, discovery, and acquisition of electronic publications. OPDS Catalogs use existing or emergent open standards and conventions, with a priority on simplicity.
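
Because OPDS catalogs are Atom documents, a catalog entry can be sketched with a standard XML library. The example below builds one Atom entry with an acquisition link, using the `http://opds-spec.org/acquisition` link relation from the OPDS specification. The title, identifier, date, and URL are invented placeholders, and a complete catalog would wrap entries in an Atom feed with further required elements.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Serialize Atom elements without a prefix (as the default namespace).
ET.register_namespace("", ATOM)


def acquisition_entry(title: str, entry_id: str, updated: str,
                      href: str, media_type: str) -> ET.Element:
    """Build a single Atom entry carrying an OPDS acquisition link."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM}}}link", {
        "rel": "http://opds-spec.org/acquisition",
        "href": href,
        "type": media_type,
    })
    return entry


if __name__ == "__main__":
    # Placeholder values for illustration.
    entry = acquisition_entry(
        "Example Book",
        "urn:uuid:00000000-0000-0000-0000-000000000000",
        "2011-07-20T00:00:00Z",
        "http://example.org/books/example.epub",
        "application/epub+zip",
    )
    print(ET.tostring(entry, encoding="unicode"))
```
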

(4) RDTF (UK Resource Discovery Taskforce) Metadata Guidelines ( RDTF in general)

Pete Johnston and Andy Powell were asked to submit recommendations.

The process and results are discussed in these posts: one two three four five

c. Museums

(1) CIDOC-CRM (CIDOC Conceptual Reference Model) [about]

From the ResearchSpace project’s updates

The ResearchSpace team agreed the final mapping of the Museum’s collection data to the CRM ontology this week. The full data conversion will now take place allowing full data testing to start

(2) Getty has produced a rich set of tools and resources, but not as open data

AAT (Art & Architecture Thesaurus)
CONA (Cultural Objects Name Authority)
MEGA (Middle Eastern Geodatabase for Antiquities)
TGN (Thesaurus of Geographic Names)
ULAN (Union List of Artist Names)

d. Research Data

(1) Datacite metadata schema

via a post by Dorothea Salo:

Or it could just be that the DCMS is a sensible minimum that solves the problem at hand (identifying and citing digital datasets) without gobs of cruft or gobs of oversimplification. They’ve also acknowledged the need to revisit and change the scheme over time, and are working on how that will happen (Open Archives Initiative, I am training laser-eyes on you).

DCMS is not perfect; in my opinion, they’ll need to go beyond DOIs to handles and ARKs and PURLs. (Yes, I know all DOIs are handles; not all handles are DOIs.) But for a first cut, it’s pretty darn good, and it’ll stay that way if they can resist the temptation to cruft it up. Good job, standardistas!

(2) JISC MRD (Managing Research Data) MRDonto group

The MRDonto group agreed to adopt certain standards as ‘uncontroversial’: ORCID for personal identifiers; DataCite DOIs for datasets; and OAI-ORE for packaging composite data objects.

(3) Science … [BBC Science Ontology - Take Three] April 2011

(4) SPAR ( Semantic Publishing and Referencing Ontologies)

Together, they provide the ability to describe far more than simply bibliographic entities such as books and journal articles, by enabling RDF metadata to be created to relate these entities to reference citations, to bibliographic records, to the component parts of documents, and to various aspects of the scholarly publication process.

e. Topics … as in sense / meaning associated with terms

(1) Cyc and openCyc [ summary]

Cyc is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning. The project was started in 1984 by Douglas Lenat at MCC and is developed by the company Cycorp. Parts of the project are released as OpenCyc, which provides an API, RDF endpoint, and data dump under an open source license.

(2) Lexvo.org [about] and lingvoj.org [about]

(3) Library of Congress Subject Headings [ at id.loc.gov]

An interesting discussion from a recent Code4lib email thread [excerpts as follow]:

[ BD]

Wait, so is it possible to know if “England” means the free-floating
geographic entity or the country? Or is that just plain unknowable.

[ KM]

OK, as a cataloger who has been confused by the jurisdictional/place name
distinction, I’m going to jump in here.
Whether “England” means the free-floating geographic entity or the country
is not quite unknowable-it depends on the MARC codes that accompany it.

The brief answer is this: a field used in a 651$a or a $z should match a 151
in the LC authorities.
If the MARC field is 151 or 651 (let’s just say x51), then the $a should
match a 151 in the authority file.
MARC subfield z ($z) is always a geographic subdivision and should match a
151.

Here’s where it gets tricky …

[snip]

I hope I’m not pointing out the obvious …

[ BD]

> I hope I’m not pointing out the obvious,
That made me laugh so hard I almost ruptured something.

Thank you so much for such a complete (please, god, tell me it’s
complete…) explanation. It’s a little depressing, but at least now I know
why I’m depressed 🙂

[ KM] a more detailed description of the intricacies of encoding with the added factor of substantive change in rules/practice over time

[and in a subsequent post]

[ BD]

OK, so I’ve been trying to follow all of this, and have to say, I’m finding
it all very interesting. I want to give a special shout-out to the catalogers
who have joined in; I (and, I think, much of code4lib) need this kind of
input on a much more regular basis than we’ve been getting it.

At the same time, I’m finding it hard to determine if we’re converging on
“when trying to turn LCSH into reasonable facets, here’s what you need to
do” or “when trying to turn LCSH into reasonable facets, you haven’t got
a freakin’ prayer”. Can someone help me here?

[ SS]

For FAST, see Chan and O’Neill (2010). There are large parts of FAST where
the editors wisely opted to punt on the more intractable parts.

Chan, Lois Mai and O’Neill, Ed (2010). FAST, Faceted Application of Subject
Terminology: Principles and Application. Libraries Unlimited. ISBN:
9781591587224
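
The brief rule KM gives earlier in the thread (the $a of a 151 or 651, and any $z, should match a 151 in the authority file) can be sketched as a small function. This covers only that brief rule, not the "tricky" cases the thread goes on to mention; the function name, the tuple representation of subfields, and the sample fields are assumptions made for illustration.

```python
def geographic_headings(tag: str, subfields: list) -> list:
    """Return the subfield values that, under KM's brief rule,
    should match a 151 record in the authority file.

    `subfields` is a list of (code, value) pairs, e.g. [("a", "England")].
    """
    matches = []
    for code, value in subfields:
        if code == "z":
            # $z is always a geographic subdivision, whatever the field.
            matches.append(value)
        elif code == "a" and tag in ("151", "651"):
            # $a of an x51 field is a geographic name.
            matches.append(value)
    return matches


if __name__ == "__main__":
    # Illustrative fields, not taken from real records.
    print(geographic_headings("651", [("a", "England"), ("x", "History")]))
    print(geographic_headings("650", [("a", "Architecture"), ("z", "England")]))
    print(geographic_headings("650", [("a", "Architecture")]))
```
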

(4) Panlex (a project of Utilika Foundation and University of Washington’s Turing Center)

( cited with commentary here)

(5) UDC ( Universal Decimal Classification)

From Dan Brickley’s post on Lonclass and RDF

There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do much of the work when the data is finally opened up…

[snip]

Classification systems with compositional semantics can be enriched when we map their basic terms using identifiers from other shared data sets. And those in the UDC/Lonclass tradition, while in some ways they’re showing their age (weird numeric codes, huge monolithic, hard-to-maintain databases), … are also amongst the most interesting systems we have today for navigating information, especially when combined with Linked Data techniques and companion datasets.

(6) UMBEL (Upper Mapping and Binding Exchange Layer) [cited and described here]

(7) Wordnet [about] [in Freebase]

4. Other types

a. CULTOS (Cultural Units of Learning – Tools and Services)

The CULTOS project develops concepts, systems and tools for knowledge publishing. It includes a knowledge model of intertextual studies, a standardised hypermedia document model and a prototype multimedia authoring tool for use in the publishing life-cycle.

In order to arrive at a “semantic web,” information technology needs to provide mechanisms that allow domain experts to express their knowledge, and combine that knowledge with medial presentations and illustrations.

b. DCAT (Data CATalog vocabulary)

An increasing number of government agencies make their data available on-line in the form of data catalogs such as data.gov (see global map of data catalogs at CTIC). Catalogs exist at national, regional and local level; some are operated by official government bodies and others by citizen initiatives; some have general coverage, while others have a specific focus (e.g., statistical data, historical datasets).

c. Historillo ( Ontologies and semantic interoperability for humanities data, 2007)

Show me the way to Historillo …, slides 21 ff.

d. Institution Identifiers (NISO I2 … library supply chain entities)

e. KOAN[ about]

Ontologies

BibTeX
Evolution log
KOAN SERVER
Research
VISION

f. Media Resources. From a semanticweb.com post:

The W3C announced yesterday that it has published a last call working draft of “Ontology for Media Resources 1.0.” The draft was published by the Media Annotations Working Group. Comments are welcome on the draft through March 31, 2011.

g. Music Ontology [ about]

h. Ontopia Knowledge Suite and Navigator Framework [about]

Ontopia is an open source suite of tools for building applications based on Topic Maps, providing features like an ontology designer, an instance data editor, a full-featured query language, web service access points, database storage, and so on.

The product suite is highly mature. Ontopia 1.0 was released in June 2001, and we are now nearing the release of Ontopia 5.1. Ontopia has been in production use in a number of commercial projects on three continents for many years now, and the core engine has been very stable over most of that period.

example: The Italian Opera Topic Map

The purpose of this web site is to demonstrate the use of topic maps to drive web portals. The application is being built using the Ontopia Knowledge Suite and the Ontopia Navigator Framework. It is not yet finished and is therefore not publicly available, so please do not publicise the URLs.

The web site contains no static HTML pages. Instead, every page (including all the links it contains) is generated on the fly, based on information contained in the underlying topic map. The topic map used for this demo is the Italian Opera topic map that is distributed with Ontopia’s free topic map browser, the Omnigator. This topic map (opera.ltm) can also be browsed in the online version of the Omnigator.

The chief difference between the present application and the Omnigator is that the latter is a generic topic map browser, whereas this one is specific to the Italian Opera topic map:

i. OGP (Open Graph Protocol)

The Open Graph Protocol enables you to integrate your Web pages into the social graph. It is currently designed for Web pages representing profiles of real-world things—things like movies, sports teams, celebrities, and restaurants. Including Open Graph tags on your Web page makes your page equivalent to a Facebook Page. This means when a user clicks a Like button on your page, a connection is made between your page and the user. Your page will appear in the “Likes and Interests” section of the user’s profile, and you have the ability to publish updates to the user. Your page will show up in same places that Facebook pages show up around the site (e.g., search), and you can target ads to people who like your content. The structured data you provide via the Open Graph Protocol defines how your page will be represented on Facebook.

A how-to tutorial and an Open Graph protocol checker
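
To show what "Open Graph tags on your Web page" look like in practice, here is a small sketch that reads `og:` properties out of a page's `<meta>` elements using Python's standard HTML parser. The sample page, its URLs, and the class name are invented for illustration; this is not the checker linked above.

```python
from html.parser import HTMLParser

# Invented sample page carrying four common Open Graph properties.
SAMPLE_HTML = """<html><head>
<title>Example Film</title>
<meta property="og:title" content="Example Film" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://example.org/films/example" />
<meta property="og:image" content="http://example.org/films/example.jpg" />
</head><body></body></html>"""


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


if __name__ == "__main__":
    parser = OpenGraphParser()
    parser.feed(SAMPLE_HTML)
    for name, value in parser.properties.items():
        print(name, "=", value)
```
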

j. ORE (Open Archives Initiative Object Reuse and Exchange)

From the library world’s perspective, Michael Witt’s Object Reuse and Exchange for ALA (2010)

NB: included by Europeana

k. Rich snippets Google’s current help page

With rich snippets, webmasters with sites containing structured content—such as review sites or business listings—can label their content to make it clear that each labeled piece of text represents a certain type of data: for example, a restaurant name, an address, or a rating.

(Note: Marking up your data for rich snippets won’t affect your page’s ranking in search results, and Google does not guarantee that markup on any given page or site will be used in search results.)

The properties supported by Google are based on a number of different formats. For more information, see the following articles:

For specific vocabulary and examples, see:

Google also recognizes markup for video content and uses it to improve our search results.

Check your markup using the rich snippets testing tool.

l. VICODI ontology (VIsual COntextualisation of DIgital content)

The Work package 6 aims to the integration of a tailored multilingual module via a user friendly sophisticated access.

Based on SYSTRAN’s machine translation technology, this module will provide also terminology extraction and machine translation customization tools for the construction and retrieval of personalised metadata within the aim to create new multilingual digital documents and multilingual ontologies in Czech, Polish, Spanish, Portuguese, German, Italian, English, French, Danish, Russian and Serbo-Croatian.

This report aims to present the integration of the Vicodi ontology developed during a 24-month IST project Visual Contextualization of Digital Data http://www.vicodi.org applied to the application http://www.eurohistory.net .

For Enrich translation customization purposes, SYSTRAN has extracted the Vicodi ontology distances to build a Enrich-specific dictionary including all the person names and events hierarchised via ontologies.