Given the recent attention paid to book digitization projects, it is time to step back and consider developments to date. This post will first describe the projects launched by Google and the Open Content Alliance, and then consider some of the legal issues raised by Google Print, which is the subject of two major lawsuits. What follows is somewhat lengthy, but it takes some space to do this interesting topic justice.

I will first describe Google Print, which is the digitization project that has made the most progress (and attracted the most controversy), and will briefly consider other initiatives, including the Open Content Alliance. I will then consider the legal controversy raised by Google Print, which turns on whether its practice of scanning books held in participating libraries, without the explicit permission of the publisher or author holding copyright in those books, is copyright infringement. As presented below, there is a strong argument under United States copyright law that Google’s current practice does not in fact infringe copyright in those works.

Book digitization initiatives

Google Print is an initiative launched by Google in December 2004. The idea behind Google Print is to enable searching within the text of books, both those in the public domain and those protected by copyright, and to help users locate copies of those books. The content of Google Print comes from two main sources: publishers and libraries. At issue is Google’s policy regarding copyright-protected books that are being scanned via the Library Project.

The Google Print Publisher Program allows publishers holding copyright in particular books to specify whether those works should be included (or not included) in the Google Print database. Terms and conditions apply to publishers’ involvement in Google Print, placing restrictions on Google’s use of copyrighted content, and providing for profit sharing with the relevant publisher in contextual advertising placed on search result pages (as well as, it seems, a percentage of the revenue generated by Google AdWords on those pages). Publishers who join Google Print may choose to opt in or opt out with respect to specific works, and receive a percentage of the advertising revenue for only those works they authorize Google to scan. Participating publishers may remove particular titles from the Publisher Program at any time.

To create the database, Google is scanning entire books. However, users are able to view only limited excerpts, namely the page where the user’s search term is located, and a page or so on either side. Google Print displays only three sample areas of the copyrighted book that display the search term.

Although Google Print currently only includes titles in English, it is in the process of expanding its range to include foreign-language publications. It should also be noted that Google Print is part of Google.com, the US website.

The Google Print Library Project is being undertaken in partnership with five libraries (the ‘Google 5’), with the aim of creating ‘a comprehensive, searchable, virtual card catalog of all books in all languages that helps users discover new books and publishers find new readers.’ The Library Project was launched in December 2004. The Google 5 are the libraries of Harvard University, Stanford University, the University of Michigan, and the University of Oxford, as well as The New York Public Library. (The University of Michigan has made available online its contract with Google.)

The Library Project differs from the Publisher Program in a few significant ways:

— Google will scan the entirety of books in the partner libraries into the Library Project database (unless, in the case of copyrighted works, the publisher has specifically opted-out);
— where scanned works are no longer under copyright, and have entered into the public domain, users will have unlimited access to the whole work;
— where works are still protected under copyright, access to images will be more restricted than is the case with books submitted under the Publisher Program, and will consist of ‘snippets’ of text including the search terms, with a few sentences on either side (with a maximum of three snippets from any single work), and not a full page as with copyright-protected books in the Publisher Program;
— the participating library will receive a digital copy of each book scanned from their collection, enabling the creation of a digital archive.

The format of book page screenshots in Google Print and the Library Project is accordingly somewhat different.

The controversy surrounding Google Print lies in its policy for including copyright-protected works in the Library Project. As discussed above, publishers specifically opt-in or opt-out their titles for inclusion in the Google Print database. By contrast, Google initially intended to scan all books held by partner libraries into the Library Project database, whether or not they are in the public domain (ie, whether or not they are still protected by copyright). So the pilot digitization of works held by Stanford, Harvard, Oxford, the University of Michigan, and the New York Public Library includes not only works in the public domain, but many still protected by copyright.

The only way for a publisher to prevent copyrighted books from being scanned via the Library Project is to submit details about which titles it wishes excluded, which encourages, but does not require, participation in the Publisher Program. Titles will be excluded from the Library Project only if the publisher explicitly ‘opts out’ for them. Although the digital copies of the in-copyright works held by the participating libraries are fully searchable, results are displayed in the form of ‘snippets’ only. Nevertheless, entire works are being scanned to enable this facility, unless publishers opt out. This policy places the burden on publishers rather than on Google itself, and has understandably caused some consternation.

In response to opposition to the ‘opt out’ policy, Google announced on 11 August 2005 that it had temporarily frozen Library Project scanning of in-copyright works, until a ‘deadline’ of 1 November 2005 (today):

If you’re in the Publisher Program (or you decide to join it), you can now give us a list of the books that, if we scan them at a library, you’d like to have added immediately to your account. This way you can have your books in Google Print, which will put them into Google.com search results, direct potential buyers to your website, provide ongoing reports about user interest in your books, and your books will also earn revenue from contextual advertising – even if they are out of print.

We think most publishers and authors will choose to participate in the publisher program in order to introduce their work to countless readers around the world. But we know that not everyone agrees, and we want to do our best to respect their views too. So now, any and all copyright holders – both Google Print partners and non-partners – can tell us which books they’d prefer that we not scan if we find them in a library [through the Library Project]. To allow plenty of time to review these new options, we won’t scan any in-copyright books from now until this November.

This delay has not satisfied opponents. On 20 September, the Authors Guild brought an action against Google alleging copyright infringement, followed on 19 October by a similar suit from members of the Association of American Publishers. These lawsuits are considered below.

The Open Content Alliance (‘OCA’) is a Yahoo-backed book digitization project whose contributors include Adobe Systems Inc., Hewlett-Packard Co., the Internet Archive, O’Reilly Media Inc., the University of California, Columbia University, Rice University, the University of Toronto, and the National Archives of Britain. (Interestingly, while Harvard University is a Google Print Library Project partner, certain of its collections are contributors to the OCA.) The OCA was inspired by Brewster Kahle, the founder of the Internet Archive, which itself is an initiative to build a database of web pages. Access to the Alliance’s database is not due until sometime during 2006.

On 25 October, Microsoft announced that it too was joining the OCA, while at the same time providing its own service, called MSN Book Search. Microsoft has made the largest contribution to the OCA to date (US$5 million, or enough to scan about 150,000 books). Microsoft’s move to join the Open Content Alliance has been met with skepticism by some, including Tim O’Reilly, who has interpreted the company’s involvement in the OCA as part of its opposition to Google generally.

The business models and format of the OCA and MSN Book Search facilities have not yet been finalized. But the contributors to the OCA, Microsoft included, have decided not to scan the contents of books that are protected by copyright unless the rights holder grants explicit permission. Moreover, the OCA is currently negotiating with publishers to determine how best to make available material still under copyright.

There are other digitization projects afoot, including one by 19 European libraries.

Legal issues raised by Google Print’s Library Project

Is Google’s scanning of copyright-protected works in the Library Project copyright infringement? Or is this conduct protected by the fair use doctrine? To date, two lawsuits have been brought alleging copyright infringement. My focus here is on Google’s scanning of full in-copyright works, and not on the use it makes of these copies, which is to display snippets or other excerpts from those works, as the latter is clearly covered by the fair use defense.

On 20 September 2005, the Authors Guild and certain published authors filed a complaint in federal court (the United States District Court for the Southern District of New York) on behalf of themselves and unnamed plaintiffs, ie, as a class action. The alleged class is defined as ‘all persons or entities that hold the copyright to a literary work that is contained in the library of the University of Michigan.’ (It is not explained why, of the Google 5, Michigan is specified, although it is likely because that university is the only one to have made public its contract with Google.)

The basic allegations are that Google has copied copyrighted works in Michigan’s collection without the permission of the copyright holders, and has thus infringed copyright in those works. As a result, the plaintiffs assert, Google has reduced the value of those works to the rights holders, caused lost profits, and damaged the goodwill and reputation of those rights holders. While the plaintiffs note that Google stands to gain from scanning these works, potential arguments regarding the fair use defense are not considered. (Interestingly, Google alone is named as a defendant, and not the University of Michigan, although that university arguably authorized copyright infringement by participating in the Library Project.)

The suit brought on 19 October 2005 by certain publishers (McGraw-Hill, Pearson, Penguin USA, Simon & Schuster, and John Wiley & Sons), all of which are members of the Association of American Publishers, is similar to the Authors Guild action. It alleges that Google is infringing copyright in published works held by the University of Michigan by scanning them via the Library Project without the rights holders’ permission, and it has been brought in the same federal court. This action differs from the Authors Guild suit primarily in that it is not a class action. It also provides somewhat more analysis, particularly of Google’s policies regarding works in the Publisher Program and the Library Project, but it too fails to consider any potential fair use defense.

General opinions as to the merits of these lawsuits have been published on various blogs, including by Tim O’Reilly (who is supportive of Google, notwithstanding the fact that O’Reilly Media is a contributor to the OCA), and Daniel Brandt (who is significantly less enamored of Google Print). These arguments are focused less on the legal issues than on the merits (or not) of Google’s commercial approach.

In ‘The Authors Guild v. The Google Print Library Project’, Jonathan Band considers the legal issues in depth, particularly whether Google might succeed on either fair use or implied license arguments. Section 107 of the Copyright Act of 1976 specifies four factors to be weighed when determining whether doing acts comprised in copyright without authorization from the rights holder is permitted under the fair use doctrine:

(1) the purpose and character of the use (ie, is it for a commercial use);
(2) the nature of the copyrighted work (ie, is it more creative or fact-based);
(3) the amount and substantiality of the portion used; and
(4) the effect of that use on potential markets for the work, or its value.

Band argues that Kelly v Arriba Soft, 336 F.3d 811 (9th Cir. 2003) applies to the Google case. In Kelly, the United States Court of Appeals for the Ninth Circuit found that an online database of images created by copying pictures from websites without express permission was a fair use of those images, and thus did not constitute copyright infringement. Although the full image was copied, these copies were not disseminated. Instead, thumbnail images were generated, and only those versions were displayed by the search engine; users who wished to obtain the full-size image could only do so from an authorized source.

Band’s (rather convincing) conclusion is that, as in Kelly v Arriba Soft, the use of the full text of in-copyright books in the Library Project is likely to be found a fair use, and thus not copyright infringement. Although the Authors Guild’s and the publishers’ suits were brought in the Second Circuit rather than the Ninth Circuit, Band points out that the precedent has some weight, as the Kelly court relied heavily on the Supreme Court’s decision in Campbell v Acuff-Rose Music, Inc., 510 U.S. 569 (1994).

Nevertheless, it may also be necessary to consider Rakoff J’s ruling in UMG Recordings, Inc. v MP3.com, Inc, 92 F Supp 2d 349 (SDNY 2000), which is, significantly, a decision from a district court within the Second Circuit. In that case, the court held that it was copyright infringement for MP3.com to copy music CDs without permission to create an online catalog of songs that users could play from any computer connected to the Internet. The defendant’s fair use argument was not successful.

Which precedent ultimately has more weight in the Google Print disputes will depend in great part on the court’s evaluation of the facts and understanding of the technology involved.

Band also argues that an implied license theory might work in Google’s favor. The ‘implied license’ argument is typically made in the context of posting material on the Web: if someone posts material on the Internet, it is a reasonable and foreseeable consequence that someone will make use of that material, for example by linking to it. If a website operator does not want the general public to link to a particular web page, they will make a statement to that effect, require a login or subscription to access the page, or use other technological means to prevent free and open access.

I am not so convinced by Band’s implied license argument, not least because it is untested. To my knowledge, the only decision to even mention that an implied license might exist to enable communication via the Internet is a Canadian federal court case: Guillot v Arvic Search Services Inc, No T-119-98 (2001) FCT 799 (Fed Ct Canada).

The lack of judicial authority aside, my general understanding of an implied license is this: permission to do an act that would otherwise be prohibited by law is inferred from particular types of conduct. In the context of material placed on the Internet for all to access without restriction, the argument for an implied license to view and link to such material is pretty convincing. But I cannot see that the mere fact a book has been published is reason to infer permission to copy it (copying which, absent a fair use defense, is clearly copyright infringement).

Conclusion

Who stands to gain from Google Print? Everyone does, to differing degrees. Once complete, Google Print’s database will be a valuable resource, allowing users to locate books in libraries or from online bookstores that might have been difficult to obtain before. Libraries and users alike will benefit from the archive of published works created by the database, which will provide a backup to hard copies in case of their destruction by fire or otherwise. Google’s process of scanning texts may also help identify the owners of ‘orphaned’ works, by encouraging them to come forward and, if users indicate interest, to re-publish titles that may be out of print. For publishers and authors this database may also be valuable, if the inclusion of titles leads to increased sales; participating publishers will receive a portion of the revenue generated by advertisements on search result pages. Finally, Google is set to profit from its share of the advertising revenue, not only from contextual advertising but also from Google AdWords.

The controversial aspects of Google Print may also be its most valuable. As commented by Tim O’Reilly, the Google Library Project approach is the only one to date that tackles the problem of ‘lost’ books – ie, books that are still protected by copyright, but out of print. According to an interesting paper by the Online Computer Library Center on mass digitization initiatives in general (and Google Print in particular), between 66% and 82% of the books held by the Google 5 are still protected by copyright. Tim O’Reilly, referring to this study, observes:

This 20% of books out of copyright is the realm of efforts like OCA. Meanwhile, another 10-20% are under copyright, in print, and being commercially exploited. This is the realm of titles opted in by publishers to programs like Google Print or Amazon Search Inside the Book. That leaves 60-70% of all titles ever published in the twilight zone, out of print, but still under copyright. For many of these books, no one even knows any longer who owns the rights, and there is no commercial incentive to figure it out, making the publishers’ request for ‘opt in’ a fig leaf that will ultimately lead only to continued neglect.

O’Reilly’s conclusion is that if Google had an ‘opt in’ policy (as does the OCA), a significant percentage of books to which no rights are asserted would be skipped over in the scanning process.

I should point out that, as the OCLC paper indicated, to say that only 20% of books held by the Google 5 are in the public domain may be understating things somewhat; that figure could be as high as 34%. Secondly, from what data does O’Reilly determine that 10-20% of the titles are protected by copyright and currently in print? That figure is significant, as it determines the important figure: how many titles are protected by copyright, but not in print and not easily accessible. If this percentage is large, the public interest arguments for Google Print’s approach are particularly strong... although such arguments have little to do with legal liability, of course.
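To see how sensitive the ‘twilight zone’ figure is to these inputs, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions of mine rather than on anything in the OCLC paper or O’Reilly’s post: that every title falls into exactly one of three categories (public domain; in copyright and in print; in copyright but out of print), and that O’Reilly’s 10-20% in-print estimate can simply be subtracted from the OCLC range of 66-82% of titles in copyright. The function name is purely illustrative.

```python
def out_of_print_share(in_copyright_pct, in_print_pct):
    """Percentage of titles that are in copyright but out of print.

    Assumes in-print titles are a subset of in-copyright titles.
    """
    return in_copyright_pct - in_print_pct


# OCLC paper: between 66% and 82% of Google 5 holdings are in copyright.
in_copyright_range = (66, 82)
# O'Reilly's estimate: 10-20% are in copyright AND in print (taken at face value).
in_print_range = (10, 20)

# Lowest case: fewest in-copyright titles, most of them in print.
low = out_of_print_share(in_copyright_range[0], in_print_range[1])
# Highest case: most in-copyright titles, fewest of them in print.
high = out_of_print_share(in_copyright_range[1], in_print_range[0])

print(f"In copyright but out of print: {low}% to {high}%")
```

On those assumptions, the out-of-print, in-copyright share could sit anywhere between roughly 46% and 72%, which is a wider band than O’Reilly’s 60-70% and shows how much turns on the in-print estimate.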