Knowledge management in general • CLIR

1. The web in general

a. Danny Hillis and David Brin (among several others) at the Edge’s Reality Club discussion of Carr’s “Is Google making us stoopid?” and fallout from same in sparring between Shirky and Carr.

Hillis: We evolved in a world where our survival depended on an intimate knowledge of our surroundings. This is still true, but our surroundings have grown. We are now trying to comprehend the global village with minds that were designed to handle a patch of savanna and a close circle of friends. Our problem is not so much that we are stupider, but rather that the world is demanding that we become smarter. Forced to be broad, we sacrifice depth. We skim, we summarize, we skip the fine print and, all too often, we miss the fine point. We know we are drowning, but we do what we can to stay afloat.

As an optimist, I assume that we will eventually invent our way out of our peril, perhaps by building new technologies that make us smarter, or by building new societies that better fit our limitations. In the meantime, we will have to struggle. Herman Melville, as might be expected, put it better: “well enough they know they are in peril; well enough they know the causes.”

Brin: Bullshit makes great fertilizer. But (mixing metaphors a bit) shouldn’t there be ways to eventually let the pearls rise and the worst of the noxious toxins go away, like Phlogiston and Baal worship? More to the point, isn’t that what happens in the older Enlightenment systems: markets, democracy, science and law courts? After argument and competitive discourse in those arenas, aren’t decisions eventually reached, so that people can move on to the next problem, and the next?

The crux: today’s web and blogosphere have only half of the process that makes older Enlightenment “accountability arenas” function. Imagination and creativity are fostered. But we also need the Dance of Shiva, destroying the insipid and vicious and untrue and stupid, to make room for more creativity! No censors or priests or arbiters of taste can do that, but a market could, if today’s Web offered tools of critical appraisal and discourse, in addition to tools of fecund opinionation.

b. Stefano Mazzocchi Freebase Gridworks, Data-Journalism and Open Data Network Effects

Resonance and traction are great rewarding properties of a successful product launch but they usually only paint the picture of individual interest, at least at first.

As we have seen with the accent on the social aspect of the web in recent years, even the simplest and most trivial of services (say, microblogging services like Twitter) can assume a completely different scope of impact and importance once sustainable network effects come into play. So how does Gridworks fare in the realm of Open Data network effects and what are the obstacles on its path?

First of all, it’s worth noting that all successful and sustainable network effects share a unique property: the system needs to be beneficial for the individual independently of how many others use it. If this is not the case, a ‘chicken/egg’ problem surfaces where the system is beneficial for them only if many people use it but nobody wants to use it until it’s beneficial for them.

The regular web, the blogosphere and microblogging all share this fundamental property: people find expressing themselves rewarding, independently of how many others read what they write. But these systems naturally create self-sustaining network effects: once other people read what you wrote, they often want to write something too; if it’s easy/cheap enough for them to do so, this starts a chain reaction that sustains the network effect.

Because David [Huynh] and I have been working on untangling chicken/egg problems of the web of data for years (more than 7 now that I think about it) and gained a lot of experience with previous tools (timeline, exhibit and timeplot) that data lovers really liked, we knew that first and foremost a data tool should feel immediately powerful and rewarding even just for individual use and that was our major focus for the first phase of Gridworks.

At the same time, no network effect will emerge unless Gridworks becomes even more useful when others use it too.

It is in that spirit that this tweet today from @delineator made me stop and ponder (emphasis is mine):

@symroe I’m making a lot of use of gridworks too - are you uploading your data back into freebase? not sure if I want to give them the scoop

This is something that was in the back of my mind but I had not put in such clear terms before: the people digging for open data gold might be keen to praise and support all efforts that make more free data and free tools available (as they feel it makes it easier for them to find their digital gold), but while they have clear and established incentives to reveal their findings (what is the story and where they found it, which is the foundation of their credibility as journalists), they do not (yet) have incentives to reveal how they got to it or to share the result of the data curation effort with others. This is because they worry that it might only make it easier for others to find other stories from that pile of already cleaned data and thus, de facto, ‘steal’ it from them.

This is not much different, for example, to what happened with the human genome project when public and private institutions started to race to compile the entire map of the human DNA: only when the costs of DNA sequencing became so low as to make their proprietary advantage in data hoarding marginal, private institutions started to share their data with public efforts.

The principal network effect attractor for Gridworks is the notion that internal consistency, external reconciliation and data integration between heterogeneous datasets are surprisingly expensive even for the most trivial and well covered data domain (this is something Metaweb learned the hard way while building Freebase).

This fact makes “curated open data hoarding” an unstable equilibrium: all it takes is one person to be a little less selfish and share their partially curated datasets in an open shared space in order to share the curation cost with others to disrupt the proprietary advantage of hoarding. This is very similar to the idea of creating a vendor branch of an open source project and making money off of the proprietary fork: it works only if the vendor branch is as effective as the open community to keep up with innovation and evolution of the ecosystem (and history of open software shows this is hardly ever a sustainable business model if the underlying community is healthy and vibrant).

2. Publishing trends

a. Joseph Esposito Publishing after the apocalypse

One of the abiding myths of publishing and scholarly communications is that we are living in a world that is hurtling toward the future. I use the term “myth” not to mean something that is not true but rather as a controlling metaphor for the way we think about things. This thought was prompted by my recent participation in a panel at the 30th Annual Charleston Conference (amusingly identified in Roman numerals, the “XXX Annual Charleston Conference,” to the discomfort of the many librarians in attendance), from which this essay is derived. The name of this panel captures the myth: I Can Hear the Train a-Coming. Good name, taken from a great song. But some of us are likely to reflect that trains are an old technology and if they are taking us anywhere, it is from one point in the past to another point in the past. We might well begin to wonder to what extent we are blinded by the metaphors we use, the hurtling train among them.
[snip]

To sum this up, when the train comes into the station, we will experience post-apocalypse publishing, a new period of relative calm in which investments can be made and profits earned. A future of ongoing disruption is unlikely, as it will not permit the creation of capital to fund the next disruption. The publishing landscape will have two broad models: supply-side publishing, which is effectively the heir to the current open access movement; and demand-side publishing, heir to today’s traditional publishers, which will increasingly move to new forms of attention marketing. The interaction between these two models will be fascinating to watch, but it seems unlikely to me that the supply-side form will be entirely co-opted by demand-side publishers. The Internet is not finished with us yet.

b. Emergent forms of “publication”

c. Journal publication futures

d. University presses

Two major university press e-book initiatives, Project MUSE Editions (PME) and the University Press e-book Consortium (UPeC), have joined forces. The result of this merger, the University Press Content Consortium (UPCC), will launch January 1, 2012.

The partnership allows e-books from an anticipated 60-70 university presses and non-profit scholarly presses (representing as many as 30,000 frontlist and backlist titles) to be discovered and searched in an integrated environment with content from nearly 500 journals currently on MUSE.

e. Beyond PDF Workshop, January 19-21, 2011, University of California San Diego

The goal of the workshop was not to produce a white paper! Rather it was to identify a set of requirements and a group of willing participants to develop a mandate, open source code and a set of deliverables to be used by scholars to accelerate data and knowledge sharing and discovery. Our starting point, and the only prerequisite to participating, was the belief that we need to move Beyond the PDF (meant to capture a common philosophy, not necessarily to be taken literally).

Short talks (see program, webcasts, twitterfeed and twitter archive) were followed by discussion and the major issues identified leading to a La Jolla Manifesto (under review) and a set of deliverables to move us towards common goals. One major goal is to use our tools to have some positive impact on the understanding and treatment of spinal muscular atrophy (SMA).

An emerging list of software tools discussed at the workshop can be found here.

3. Dataculture industries

a. Eric Hellman Inside the Dataculture Industry

Some datasets are produced in data factories. I had a chance to see one of these “factories” on my trip to India last month. Rooms full of data technicians (women do the morning shift, men the evening) sit at internet connected computers and supervise the structuring of data from the internet. Most of the work is semi-automated; software does most of the data extraction. The technicians act as supervisors who step in when the software is too stupid to know when it’s mangling things and when human input is really needed.

b. Shashi Gupta and Margaret Boryczka Apex CoVantage

As we grew, we developed other sophisticated computer models to manage workflows, to control quality, to conform data from different systems to client’s needs, and to reduce costs. The most important knowledge we brought to this work was a creative appreciation of man-machine systems and their interfaces: what should be done by machines, what by men/women, where the interfaces between the two should be, and where geographically should the work stream be located in order to produce the most efficient and high quality product.

4. Libraries in transition

a. The Idea of Order: Transforming Research Collections for 21st Century Scholarship

The Idea of Order [pdf] explores the transition from an analog to a digital environment for knowledge access, preservation, and reconstitution, and the implications of this transition for managing research collections.

The volume comprises three reports.

Can a New Research Library be All-Digital? (Lisa Spiro, Geneva Henry) explores the degree to which a new research library can eschew print.
On the Cost of Keeping a Book (Paul Courant, Matthew “Buzzy” Nielsen) argues that from the perspective of long-term storage, digital surrogates offer a considerable cost savings over print-based libraries.
Ghostlier Demarcations (Charles Henry and Kathlin Smith) examines how well large text databases being created by Google Books and other mass-digitization efforts meet the needs of scholars, and the larger implications of these projects for research, teaching, and publishing.

The reports are introduced by Charles Henry; the volume includes a conclusion by Roger Schonfeld and an epilogue by Charles Henry.

5. Data and text mining

a. Anand Rajaraman and Jeff Ullman Mining of Massive Datasets

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory.

Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
Frequent item set mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
Algorithms for clustering very large, high-dimensional datasets.
Two key problems for Web applications: managing advertising and recommendation systems.
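
As a rough illustration of one technique named in the list above, minhashing for similarity search, the following Python sketch estimates the Jaccard similarity of two sets from short signatures. It is not taken from the book: the function names (minhash_signature, estimate_jaccard), the use of seeded SHA-1 as a stand-in for a family of random hash functions, and the signature length are all assumptions made for this example.

```python
import hashlib


def _seeded_hash(seed, item):
    """Hash a string to a 64-bit integer; each seed acts as a different hash function.

    Assumption: seeded SHA-1 is used here only as a convenient stand-in for
    a family of independent random hash functions.
    """
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """Return the minhash signature of a non-empty set of strings.

    For each of num_hashes hash functions, keep the smallest hash value
    seen over all items in the set.
    """
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical sets that share half of their elements.
    set_a = {str(i) for i in range(0, 100)}
    set_b = {str(i) for i in range(50, 150)}
    print("exact:    ", jaccard(set_a, set_b))
    print("estimated:", estimate_jaccard(minhash_signature(set_a),
                                         minhash_signature(set_b)))
```

The signatures have a fixed length regardless of how large the sets are, which is what makes the approach attractive when the full sets do not fit in main memory. Locality-sensitive hashing, also named in the list, would then group similar signatures into buckets so that not every pair has to be compared; that step is omitted here.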
