Rethinking the National Library

Preserving a record of US life, achievement, and history in a digital world

November 1, 2016

Bernard F. Reilly Jr.

The recent appointment of Carla Hayden as Librarian of Congress makes this a good time to reconsider the library’s role in the life of the nation and confront the problems caused by years of stasis at this revered national institution. During the most transformative decades in the history of information since Gutenberg, the Library of Congress (LC)—like other national libraries—was outpaced by the lightning developments in digital technology and now finds itself dwarfed by information behemoths like Amazon, Bloomberg, and Google and struggling to remain relevant.

The end of the James H. Billington era has prompted an outpouring of advice from scholars and pundits on a new agenda for LC, from mass digitization of the library’s book collections and wholesale archiving of websites to relaxing copyright restrictions. Yet many of these prescriptions align poorly with the realities of the digital present and are based on outdated notions of what libraries do. At a time when much of the basic infrastructure of knowledge has been privatized, greater aspirations for the national library are in order. Present circumstances call for a radical renegotiation of the role of the national library in a democratic society.

The dubious wisdom of digitizing the past

One idea put forward—mass digitization of the library’s collections—made sense before the arrival of Amazon and Google Books, when information was not ubiquitous. In fact, two decades ago the library did create the vast American Memory digital library, making terabytes of books, manuscripts, photographs, newspapers, and films available online. American Memory became a model for national libraries at the time and is regularly mined by educators, researchers, and ordinary citizens.

But today such an effort would be—to paraphrase hockey great Wayne Gretzky—skating to where the puck was, rather than where it is going to be. Since digital publishing is relatively inexpensive, adding more to the oceans of free content would accomplish little. Practical obstacles like author and publisher copyrights also stand in the way. The National Library of the Netherlands, after spending more than $200 million to digitize its collection of historic broadcasts, found itself in a quagmire of rights issues that prevented its putting more than a small fraction of the new digital corpus online. Other complexities surround opening up sensitive materials such as the papers of former secretaries of state and recent Supreme Court justices.

Relaxing copyright or, as has been suggested, encouraging authors to relinquish protections on their intellectual property would enable LC to expose more of its holdings on the web but would also raise economic and political issues. As Hayden pointed out in her April Senate hearing, the interests of the creative community have already been harmed by exploitative business models in the technology sector. Should writers, musicians, photographers, and other content creators be expected to relinquish even more revenue and creative control?

Weakening copyright protections could also be perceived as hostile to those who contribute important public goods to society, and it could damage a relationship that has served the public well. As a trusted repository of the papers of distinguished American composers, choreographers, and political cartoonists, LC would probably not want to be a party to undermining the viability of cultural and intellectual production.

The myth of “archiving” the web

The idea that LC should mimic the example of certain national libraries abroad and aggressively collect born-digital materials is simplistic as well. True, the British Library, Bibliothèque nationale de France, and other national libraries are systematically harvesting ebooks, websites, and other electronic content. Meanwhile, more than two decades after The New York Times launched its website, LC still lacks the ability to systematically capture the critical text, multimedia, and still and moving-image web content that is the new “first, rough draft of history,” as many have said of newspapers.

Unfortunately, substantial investment abroad in this arena has paid meager returns. Copyright prevents most electronic works collected from being accessed outside most national libraries’ premises, hardly a winning solution given how much learning and research takes place online. LC did in fact wade into this territory. Between 2000 and 2007 its National Digital Information Infrastructure and Preservation Program (NDIIPP) invested close to $55 million in an effort to develop new strategies and technologies for preserving websites and other born-digital media. To date, LC’s website harvests, like those of other national libraries, have produced seriously flawed data sets: Sites captured are often incomplete and rife with broken links.

The heart of the matter

The problem is that digital media by nature defy traditional notions of archiving, which are based on the idea of libraries as repositories of discrete physical works, like books, manuscripts, and films. Digital works change constantly as enabling technologies evolve, and they are often platform dependent. LC’s recent experience with preserving social media is instructive. In 2010, plans were announced for LC to archive all of Twitter. Five years later, Politico reported that the project was in limbo and that LC was “still grappling with how to manage an archive that amounts to something like half a trillion tweets.”

Scale was not the only problem. When social media feeds are removed from their native environment, documentary integrity is undermined. Metadata on the source, timing, and geolocation of posts is often lost; information on follows, retweets, and other circulation indicators disappears. The explosive growth of network analysis suggests that metadata is as important as the content itself. No surprise, then, that creating a separate preservation platform to replicate and maintain the true functionality of Twitter content turns out to be a heavy lift.

A new division of labor

The old paradigms at work here sell libraries short. More than just repositories, libraries have long maintained a symbiotic relationship with the creative sector, one that has served the citizenry well. In simple terms, writers, musicians, and other creative individuals produced works, and publishers and the media bore the costs of editing, publishing, and distributing them. Libraries provided a secondary market for those works and made them available to a wider public.

This was good for libraries and publishers. Publishers earned revenue on sales to libraries, and out-of-print works remained on library shelves. This enabled public libraries to level the playing field for Americans, providing all citizens access to useful and practical information, knowledge, and culture.

Central to this arrangement was the mechanism of copyright. The US Copyright Office, based at LC, helped authors and publishers protect their investment in intellectual, scientific, and creative activity. And the deposit requirement for copyright protection was a powerful engine that built many of LC’s unparalleled collections.

This longstanding symbiotic relationship no longer functions as it once did. The internet has rerouted the information supply chain. Today libraries are where people go to access scholarly, legal, genealogical, and business databases, which are hosted by publishers. Under these circumstances, how then does LC fulfill its responsibility to preserve “a comprehensive record of American life, achievement, and history”?

Public knowledge and the cloud

The Washington Post and others have asserted, not without cause, that LC is poorly equipped to confront the challenges of digital information, citing its failure to invest adequately in IT. But the problem may be structural: the growing asymmetry in IT capability between the private and public sectors. Cloud-based information providers like Amazon, Bloomberg, Google, and YouTube sit atop a vast new infrastructure scaled to meet the demands of Big Data and its users. By many estimates, more content now resides in the cloud than in the national libraries of all major countries combined.*

That content is melded with tools and analytical capabilities that enable users to mine it for meaningful patterns, trends, and new knowledge. Proprietary systems enrich the content with metadata—subject tags, geospatial coordinates, timestamps, and information on authorship and rights—endowing it with powerful functionality. LC will have to engage the aggregators and cloud services in the project of ensuring access to knowledge in ways that serve the interests of all citizens and future generations.

For starters, it might be appropriate for the new Librarian of Congress to step into the public conversation on net neutrality. The interests of the general information consumer are not well represented, and the playing field favors telecommunications giants and major platform and content providers.

The library could also attempt to broker something akin to national site licenses to make key legal, financial, and public affairs databases available to all citizens. In an age of income inequality, providing access to such data for entrepreneurs, proprietors of small businesses, independent scholars, and students could be empowering. The nation’s public libraries, speaking with one voice through LC and wielding the bargaining power of the single payer, might well democratize access to important knowledge.

Such leverage might yield another social good: privacy for information consumers. US libraries rigorously protect circulation records and other data they collect about what people read, view, and listen to. But when content is served from the cloud, publishers and platform providers are privy to that information and sell and trade it in myriad ways. The library might negotiate greater transparency and even reasonable curbs on such practices.

The lever of copyright

The long-tail economics of digital content create incentives for providers to keep their content alive, a task once left to libraries. Yet corporations fail, and libraries tend to endure. For the wealth of cloud-based content to remain accessible and functional over time, applications and supporting technical platforms must be able to survive provider failure or abandonment. Perhaps copyright registration could be refashioned to accommodate escrow of the code for the enabling delivery platforms for nationally licensed databases. Then the Copyright Office would once again be an engine for creating public assets and central to the national digital preservation apparatus.

Perhaps LC could also prevail upon Congress to offer tax incentives for media companies that maintain content in ways that make it easy to eventually release to the public domain. Together, LC and the technology community might succeed in enabling full functionality of digital works on multiple platforms in much the same way LC enlisted publishers in the 1970s to adopt acid-free paper.

An obligation to protect

There was another important dimension to the historic division of labor. In the past, LC was a repository of last resort for records and evidence unlikely to be preserved in the private sector. Shortly after World War II, LC took custody of a small collection of photographic negatives by famed photographer Ansel Adams. The negatives documented the confinement of thousands of Japanese Americans in internment camps. Sympathy for the internees was scarce at the time, and copies of Adams’s 1944 book Born Free and Equal were burned publicly. Adams turned to LC for the safekeeping of his negatives.

The library has also been a refuge for politically sensitive documentation from outside the US. During the Cold War, its Overseas Operations Division (OvOp) preserved newspapers, political posters, government reports, and other materials from developing regions and conflict zones. OvOp staff and their agents were “on the street” in Islamabad, Jakarta, Kabul, and Nairobi, where gathering such materials could be difficult and even dangerous. While national libraries in many countries function as tools of the regime, as likely to suppress the literature of dissent as to preserve it, LC documented political opinion and ideology of all stripes.

Unfortunately, the OvOp program has suffered in recent years from funding cuts and shortsighted policy decisions. And as the web emerged as the new “street,” agents on the ground became less effective. Congressional policymakers must now rely on web monitoring by private-sector operations like the SITE Intelligence Group and costly databases from commercial providers like the Economist Intelligence Unit and Bloomberg L.P. Because so much information resides in the cloud, corporate behaviors like the alleged engineering of bias into the Facebook and Twitter platforms can distort the public record. These developments threaten to weaken policy research in the public realm and tilt the playing field toward private-sector entities: lobbyists, trade associations, and political action committees. Again, engaging key providers could help increase transparency and affordability in the market for critical information.

LC may not set the terms for access to knowledge worldwide but can at least be, as it once was, a force for equalizing those terms. The challenges are formidable and must be met in a time of dwindling resources. But the discussion needs to move beyond dated strategies like mass digitization and web archiving, strategies that are at best rear-guard actions.

Crafting a new curatorial role for LC will require imagination and rigor, capabilities that it has found in the past and can surely summon once again. At stake is nothing less than the integrity of vital information and evidence, for the nation and for the world.

*In 2012 LC reported three petabytes of digital collections. On May 23, 2012, Sebastian Anthony estimated on the ExtremeTech blog (“How Big is the Cloud?”) that Amazon, Facebook, and Microsoft held 500 to 1,400 petabytes. So even at the low end of that range, those companies plus Google could hold the equivalent of more than 100 LCs.