Aggregating Web Resources

Interoperability changes how we build collections

May 27, 2010

The Open Archives Initiative Object Reuse and Exchange specification defines a set of new standards for the description and exchange of aggregations of web resources. This presents an exciting opportunity to revisit how digital libraries are provisioned. ORE and its concept of aggregation—that a set of digital objects of different types and from different locations on the web can be described and exposed together as a single, compound entity—may present the next major disruptive technology for librarians who develop and manage collections of digital information.

Speaking in generic terms, an aggregation is simply a group or collection of things. For example, you may aggregate food to prepare a meal. You can begin with recipes that include lists of ingredients and descriptions of how to prepare the dishes you’ve chosen to make. Some of the ingredients may come from different places. You probably have some of them locally in your fridge or cabinet, but you may need to fetch some of them from various remote locations. For example, you may pick up a loaf of bread at the bakery or a bottle of Merlot from your local wine shop. You may even be interested in a particular instance of wine, perhaps from a specific year, that has been recommended to you by a friend.

Everything for your meal has been represented all together above as an aggregation, but you can also view the dishes and their recipes and ingredients as their own aggregations. Aggregations can include other aggregations.
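The nesting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the ORE specification: the class names, the "salad" dish, and its ingredients are hypothetical, chosen to echo the meal example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Resource:
    """A single item, identified by name and by where it comes from."""
    name: str
    source: str


@dataclass
class Aggregation:
    """A named group whose members are resources or other aggregations."""
    name: str
    members: List[Union["Resource", "Aggregation"]] = field(default_factory=list)

    def resources(self) -> List[Resource]:
        """Return every resource, descending into nested aggregations."""
        found: List[Resource] = []
        for member in self.members:
            if isinstance(member, Aggregation):
                found.extend(member.resources())
            else:
                found.append(member)
        return found


# Hypothetical dish: an aggregation of local ingredients.
salad = Aggregation("salad", [
    Resource("lettuce", "fridge"),
    Resource("olive oil", "cabinet"),
])

# The meal aggregates the dish plus items fetched from remote locations.
meal = Aggregation("meal", [
    salad,
    Resource("bread", "bakery"),
    Resource("Merlot", "wine shop"),
])

for item in meal.resources():
    print(f"{item.name} (from {item.source})")
```

Asking the meal for its resources walks into the salad as well, which is the sense in which one aggregation can contain another.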

This concept of aggregation is not new to librarians, who have been aggregating content into library collections for centuries. The problem, though, is that most digital libraries have been provisioned for people, not computer programs, to use.

Opening silos

Currently, the management and presentation of digital library collections revolve mostly around the digital library systems that house them. A librarian decides what digital resources go together and then works within the capabilities of the system to present the resources in an appropriate and orderly context. The result is typically a series of web pages that human beings need to navigate in order to find links to resources that meet their information needs. While the system may expose its metadata for harvesting or its index for federated searching, the digital resources themselves are tucked deeply inside proprietary silos.

ORE presents the possibility of breaking down these silos by exposing the semantics of these resources and providing hooks to retrieve them without the need for a human being to read a web page and click on a link. Liberating digital library content from these silos for reuse and exchange may very well explode the construct of the “collection” as we know it today because it will no longer be the exclusive domain of librarians to aggregate digital library resources and dictate the context of their presentation for use. Human beings and machines will be able to assemble their own “collections.”

The need for librarians to help make sense of interoperable digital information by provisioning resources with care and quality metadata and by connecting users to resources—and resources to resources—is greater than ever. In order to capitalize on these technologies, librarians must first understand them and be able to relate them to the professional practice of librarianship.

History of the Open Archives Initiative

In 1999, Paul Ginsparg, Rick Luce, and Herbert Van de Sompel issued a call for participation to bring together developers and managers of e-print repositories to explore possible collaborations. The resulting Santa Fe Convention begat the Open Archives Initiative (OAI), whose stated goal was "to transform scholarly communication by providing a technical and organizational framework to facilitate interoperability among repositories."

Under the leadership of Carl Lagoze from Cornell University and Herbert Van de Sompel from Los Alamos National Laboratory, the OAI collaboratively developed the OAI-PMH and ORE specifications and grew to include a diverse community of scientists, software developers, repository managers, publishers, and librarians who shared a common interest in facilitating scholarly communication.

The current mission statement of the OAI says that it "develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program. The fundamental technological framework and standards that are developing to support this work are, however, independent of the both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials. As a result, the Open Archives Initiative is currently an organization and an effort explicitly in transition, and is committed to exploring and enabling this new and broader range of applications. As we gain greater knowledge of the scope of applicability of the underlying technology and standards being developed, and begin to understand the structure and culture of the various adopter communities, we expect that we will have to make continued evolutionary changes to both the mission and organization of the Open Archives Initiative."

You could make an argument that, with some knowledge of this specific institutional repository and collection, you could write a program that is aware of the link behind the Download button and could accomplish this task. You might even be able to reassemble some of the structured metadata by indexing the page or applying some other heuristics, like those that Google uses for ranking relevant search results. But would this program work with a different digital library that presents different representations of its objects? Chances are it wouldn't work with any precision because the splash pages that it would encounter would be constructed differently. For example, the Download button might be located somewhere else on the page, or instead of a Download button, the title of the object might be a link that the user is expected to click to download the object. There are some other important questions that could be asked. Could such a program be able to differentiate between conference posters and other types of objects in the collection? What if you wanted your program to download and assemble all of the posters or their supplementary files from a particular conference and those files were archived across multiple institutional repositories? What if you wanted to move a set of objects from a digital library to a preservation repository or another digital library platform without losing their semantics?
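By contrast, a machine-readable description removes the guesswork. The sketch below is a deliberately simplified illustration, assuming a poster and a supplementary data file held in two different repositories. All of the example.org and example.net URIs are hypothetical, and the description is held as plain Python tuples rather than in one of the serializations ORE defines (Atom, RDF/XML, or RDFa). It also omits the descriptive metadata a real Resource Map would carry. Only the ORE terms themselves (ResourceMap, Aggregation, describes, aggregates) come from the ORE vocabulary.

```python
# Namespaces for the ORE vocabulary and the RDF "type" property.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs for a Resource Map and the Aggregation it describes.
REM = "http://repository.example.org/rem/poster-42"
AGG = "http://repository.example.org/aggregation/poster-42"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (REM, RDF_TYPE, ORE + "ResourceMap"),
    (REM, ORE + "describes", AGG),
    (AGG, RDF_TYPE, ORE + "Aggregation"),
    (AGG, ORE + "aggregates", "http://repository.example.org/poster-42/poster.pdf"),
    (AGG, ORE + "aggregates", "http://other.example.net/poster-42/dataset.csv"),
]


def aggregated_resources(statements, resource_map):
    """List the URIs aggregated by whatever the given Resource Map describes."""
    aggregations = {
        obj for subj, pred, obj in statements
        if subj == resource_map and pred == ORE + "describes"
    }
    return [
        obj for subj, pred, obj in statements
        if subj in aggregations and pred == ORE + "aggregates"
    ]


for uri in aggregated_resources(triples, REM):
    print(uri)
```

Nothing in this program depends on where a Download button sits on a splash page. The same few lines would work against any repository that publishes descriptions using the same terms, including one whose aggregated files live on other servers.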

Examples of Aggregations and Applications of ORE

Examples of Aggregations

  • A simple unordered set, or bag, of Resources, such as a collection of favorite images from various web sites.
  • A multi-page HTML document where the pages are linked together by hyperlinks that provide “previous page” and “next page” access.
  • Information available from “social networking” sites, which contain content and related social activity around that content. An example is Flickr, where each participant has an entry page providing access to images in multiple sizes and resolutions that are organized in sets and collections. All of these entities are separate Resources. These are then linked to additional Resources that are comments and annotations about the images.
  • A scholarly publication stored in an ePrint repository such as arXiv or in a DSpace, ePrints, or Fedora repository. Such a publication may appear on the Web as multiple Resources, each with an individual URI. The set of Resources typically consists of a human readable “splash page”, that links to the body of the publication in multiple formats such as LaTeX, PDF, and HTML. In addition, the publication may have citation links to other publications, each existing as one or more Resources.
  • An overlay journal issue that aggregates multiple scholarly publications as described above, each located in their origin repository, into an issue. Issues may be recursively aggregated themselves into volumes, and then into the journal itself.
  • A semantically-linked group of cellular images, each available as a Resource resident in repositories from research laboratories, museums, libraries, and the like, in the manner implemented in the ImageWeb Project.
  • Published scientific results such as those envisioned by Clifford Lynch that, in addition to the features of the scholarly publication described above, incorporate data plus the tools to visualize and analyze that data.*

Examples of Applications

  • Crawler-based search engines could use such descriptions to index information and provide search results sets at the granularity of the aggregations rather than or in addition to their individual parts.
  • Browsers could leverage them to provide users with navigation aids for the aggregated resources, in the same manner that machine-readable site maps provide navigation clues for crawlers.
  • Other automated agents such as preservation systems could use these descriptions as guides to understand a “whole document” and determine the best preservation strategy for the document.
  • Systems that mine and analyze networked information for citation analysis/bibliometrics could achieve better accuracy with the knowledge of aggregation structure contained in these descriptions.
  • Institutional repository applications could use them as the basis of interoperability for exchange and service interaction with other institutional repositories.
  • These machine-readable descriptions could provide the foundation for advanced scholarly communication systems that allow the flexible reuse and refactoring of rich scholarly artifacts and their components.

—Excerpt from the ORE User Guide: Primer,


