Thursday, October 24, 2013

10-31 Nunberg, Geoffrey. (2009) Google Books: A metadata train wreck.

33 comments:

  1. 1. A large number of commenters on the site, as well as I, immediately leapt to ‘crowdsourcing’ as a possible solution to some of the metadata errors on Google Books, but there are numerous problems with this, starting with the vetting of crowdsourced contributions, which can end up as inaccurate as the original metadata in some cases. Still, does it seem like a viable supplement to the metadata crisis presented, were Google to open up that process?

    2. In looking briefly forward from 2009, I located a paper by James and Weiss (http://www.tandfonline.com/doi/abs/10.1080/19386389.2012.652566 since html isn't allowed) which suggests that the errors in Google Books metadata remain very prevalent; they found an error rate of 36%. It is difficult to tell whether this is an improvement since the original post, and if it is, what rate we should expect. How much incentive does Google have to fix flawed metadata?

    3. There is some suggestion (James and Weiss, commenters) that with full-text searching, the metadata is not as important to have. This does not seem terribly accurate, as the author explained--even searching for a fondly-remembered phrase will produce any source that quotes it and even things that come close. Full-text searches remain quite useful, but can they usefully compare to or replace good metadata entries?

  2. 1. I think Dan Clancy’s suggestion on crowdsourcing the metadata task is reasonable. It is similar to the practice of folksonomy, which takes advantage of the intelligence of the crowd and breaks the task into small pieces. But I don’t think correcting the metadata is the responsibility of the end users. And Google should provide some incentive mechanism, for example a gamified system, to engage users to participate.


    2. I think the automated extraction of the pub dates still has huge value in it, since it could handle the majority of the books. My suggestion is to use automated extraction as the first step, and then follow up with other steps to validate the data. The Google team can catalog all the types of errors found, against which they can validate the automatically extracted data. For example, one category of error is a pub date from a different era than the one in which the author was alive. So the Google team can detect erroneous data by comparing authors’ lifespans with the pub dates.

    3. I can foresee that there will be more projects like Google Books to digitize all the books. So I wonder what lessons book publishers and libraries can learn from all these errors, to avoid such troubles for the digitizer (the Google Books engineering team)?
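The lifespan check described in point 2 above can be sketched as follows. This is a hypothetical illustration only: the record fields and the 75-year grace period for posthumous editions are assumptions, not anything Google's actual pipeline is known to use.

```python
def flag_suspect_dates(records):
    """Return records whose pub year falls before the author's birth
    or implausibly long after the author's death."""
    GRACE_YEARS = 75  # allow posthumous editions and reprints
    suspects = []
    for rec in records:
        born, died, pub = rec.get("born"), rec.get("died"), rec.get("pub_year")
        if pub is None or born is None:
            continue  # not enough data to validate this record
        if pub < born or (died is not None and pub > died + GRACE_YEARS):
            suspects.append(rec)
    return suspects

records = [
    {"title": "Madame Bovary", "born": 1821, "died": 1880, "pub_year": 1856},
    {"title": "Christine", "born": 1947, "died": None, "pub_year": 1899},
]
print(flag_suspect_dates(records))  # flags the 1899 "Christine" record
```

A rule like this would only surface candidates for review; a 1899 placeholder date, as in the Stephen King example from the post, is exactly the kind of error it would catch.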

  3. 1. A thought I had, partly thanks to all the discussion we had about Google and relevance in the last class: if Google goes through such rigors to train their raters to help refine their search algorithm, why not do something similar for Google Books? While the work of hunting down all of these errors seems kind of tedious, it seems simple enough to do given the examples of some of these mistakes.

    2. That said, again I have to wonder: if Google's automated function to create metadata can be so screwy, are there any automated tools out there that have proven to do better and could be viable support options for people?

    3. What is it about giving the catalog information to Google that has these libraries withholding permission to post it, when somewhere like WorldCat it can easily be found?

  4. 1 - While I find it amusing and hilarious that Google Books is in the midst of a metadata nightmare (I've had music librarians tell me that iTunes is just as terrible) I would like to know more about how users are engaging with Google Books right now before I get terribly worked up. In an archive/repository, search functions are not as finely-tuned, so metadata needs to be clean in order to connect users with the correct information. However, with Google Books, the only time I've ever used it is to look up a book by title/author, and then I navigate to the front page of the book to obtain relevant information such as publication date, etc. In the grand scheme of things, how many people are say, browsing by topic in Google Books?

    2 - The Google Books project as a whole has engaged in some questionable behavior - e.g. scanning most of the books at the Benson and now refusing the library access to the scanned book copies. I'm curious as to how libraries can "pick up the slack" to either be competitive with Google Books, or to work together in a more effective way to provide decent metadata. I'm also confused as to why OCLC and WorldCat aren't involved, since they could easily match a book by ISBN or barcode with metadata that has been approved and generally used/accepted in most libraries, at least in the case of copy cataloguing.
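The ISBN-matching idea above could look something like this minimal sketch: normalize ISBNs and use them as join keys so a scan's metadata can be overwritten by a trusted, library-approved record. The field names and sample records are illustrative assumptions, not OCLC's or Google's actual data model.

```python
def normalize_isbn(isbn):
    """Strip hyphens/spaces and uppercase any 'x' check digit."""
    return "".join(ch for ch in isbn if ch.isalnum()).upper()

def merge_by_isbn(scanned, trusted):
    """Prefer the trusted record's fields whenever ISBNs match."""
    trusted_index = {normalize_isbn(t["isbn"]): t for t in trusted}
    merged = []
    for rec in scanned:
        match = trusted_index.get(normalize_isbn(rec["isbn"]))
        # dict unpacking: fields in `match` override the scanned fields
        merged.append({**rec, **match} if match else rec)
    return merged

scanned = [{"isbn": "0-14-044721-2", "title": "Madame Bovary", "pub_year": 1899}]
trusted = [{"isbn": "0140447212", "title": "Madame Bovary", "pub_year": 1856,
            "author": "Gustave Flaubert"}]
print(merge_by_isbn(scanned, trusted)[0]["pub_year"])  # 1856
```

The catch, of course, is that ISBNs only date from the late 1960s, so a scheme like this could not cover the older works that make up much of the Google Books corpus.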

  5. This comment has been removed by the author.

  6. 1. Nunberg, though he spends a good deal of the blog post bashing Google books for their pretty horrendous mistakes, seems to think highly of the organization. He claims “[Google Books] is almost certainly the Last Library”. Is this true? I’m pretty sure the library is the Last Library. Yes, everything is going digital, but the use of digital libraries (excluding Google) is becoming more and more widespread. I don’t think Google Books is going to be the last institution to carry the cultural and literary history of…the world…while the libraries wither away. Right?

    2. Nunberg also tells readers that a proposed solution to Google’s trainwreck was user input. He states that “Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors, presumably via some kind of independent cataloging effort”. That’s an idea, but wouldn’t it just give the potential for a bigger mess to occur? I mean, if Google engineers were so careless as to attribute Madame Bovary to Henry James, what makes people think that users would be so much better at incorporating metadata? Or are the Google Book people just lazy – and users wouldn’t be? But again, where’s the guarantee that users can be trusted?

    3. I recently attended a catalogers meeting in which representatives from UT libraries came together to discuss how the implementation of RDA (the replacement for AACR2) was going. One of the new “hot topics” was linked data and how library catalogs would probably be looking more and more like Wikipedia, in that if you clicked on an author or title on the record, it would take you to the authority record or WorldCat record for that person/title. Would the linked data system be useful for Google? I mean, the metadata would have to be correct in the first place in order for this to work, but if Google Books scanned in a new book and were able just to link it to an existing authority record, wouldn’t this method solve a lot of problems?

    Replies
    1. 3. I am interested in the feasibility of a university library incorporating the scanned books from Google into its own catalog. I think that it would be very beneficial for libraries to use the resources that Google has put the money into digitizing. This could cut down on the library's costs of digitizing, or of buying books for collections if Google is providing the book for free. I am wondering how this would affect the time and costs for the catalogers. Would it be worth the time for university catalogers to catalog books that aren't in their collection? Or, by seeing that a book is available in Google, will they not catalog their own records as much, and focus on the materials that Google doesn't provide? Would the availability of certain books on Google guide collection development at a library toward gathering materials that Google hasn't posted, as this would give the library new relevance?

  7. 1. The author writes about the disastrous metadata that is found on Google Books, but is this metadata automatically generated or is it taken from a source like the publishers? How much of the fault really lies with Google Books?

    2. Depending on the sources for the metadata, is the problem that the people supplying the metadata aren’t aware of the importance of making things easily findable? Is there, even years after these articles, persistent ignorance of how to accurately provide metadata or relevant information about a specific item? Should anyone describing data for work be required to take some kind of crash course in metadata, so that they’re more aware of the effects of their halfhearted descriptions? Or maybe people understand the concepts of metadata, but don’t understand the material they’re providing metadata for, so they have no frame of reference?

    3. If Google Books is the future, as the author states in the beginning of his blog post, but they’re not ‘doing it right’, should there be steps taken to either fix the blunders or initiate alternative projects with similar goals? I know that towards the end, he stated that Google had committed (somewhat) to fixing the rampant errors, but is that truly sufficient? Does Google need to hire metadata librarians/specialists specifically to ensure that their products are more easily findable and the information about them correct?

  8. 1. I find it interesting to consider this article in conjunction with the readings we did on crowdsourcing previously. With so many large scale projects like this in existence it seems that many of these errors could be cleaned up over time by volunteers (librarians anywhere?). With any large scale project you are inevitably going to have massive issues at the outset but that doesn’t seem to be a reason to not do them.

    2. One particular comment from Jon Orwant was very informative in dealing with the author's complaints about the 1899 date showing up consistently. He mentioned that the project had received a large catalog of work that used 1899 as a placeholder date, which caused over 250,000 books to be mislabeled. If issues like this are causing such confusion, wouldn’t it be logical to have someone overseeing this metadata before it goes into the catalog, to ensure that massive errors like this are spotted before going out into the wild?

    3. A project like GB raises some really interesting possibilities for the future of library search. For example, it would be amazing if libraries could access all of the scholarly material from one source where metadata could be formalized in some way, leaving the amount of access up to the library instead of having to deal with many providers with different login procedures and restrictions. A system like this could simply say “sorry, you don’t have access, would you like to request that your university pay for this material?” instead of a complicated explanation and an interlibrary loan form.

  9. 1 The author mentions that help from the users of Google Books might be a solution to the issue at hand. I think that crowdsourcing or tagging might help Google find their errors and problems. However, letting the users correct the errors may still lead to more problems, since different users have different backgrounds and understandings of the books.

    2 The author also mentions the conflict between machine-extraction techniques and scholars’ needs. Is that caused by the limitations of the technology, or by a metadata system that cannot satisfy us?

    3 The problem of whether Google’s engineers should be trusted to make all the decisions about metadata design and implementation actually does not exist, I think. What Google is doing is just leading us toward a way of thinking about this world, not ruling the world.

  10. 1. It seems that there are many mistakes in the metadata of Google Books. Why are there so many errors in the dates and classifications of books? Where does the metadata come from? Why doesn’t Google check it when creating metadata? How can we control the quality of metadata?

    2. ‘While Google's machine classification will certainly improve, extracting metadata mechanically simply isn't sufficiently reliable for scholarly purposes.’ I totally agree with that. I think we could find a new way to create metadata, and auto-extraction technology seems a good one. But there may also be some insufficiencies. So how could we solve this problem?

    3. I have some questions about the process of creating metadata. Who will build the metadata framework? What are the duties of users in this process? If there are many sources of metadata, which one is the best, and how can we decide that it is the best?

  11. Google Books
    1. Google invests a lot of money into projects that will eventually make them money in return. It seems to me that Google Books is just that - a way of developing technology, then testing and improving this technology with a large set of content that users want access to. But what is Google's real intention for the technology being developed? Why are they investing the time and money in a public good as a private company?

    2. In this article we see how automated metadata generation has created many errors - for example, using the publishing company’s founding date as the copyright date of the work. In a previous article there was discussion about using visual-recognition technology to auto-detect page types. If that technology were applied here, mistaking a nameplate for a copyright page hopefully would not have been an option.

    3. Google Books tried the BISAC cataloging schema to classify a selection of the books. In doing research about BISAC, it appears that there have been different editions of BISAC since 1975, evolving as needed to keep up with bookstore evolution. Is it not foolish to use a classification system that changes frequently, based upon marketing and sales?

  12. 1) What does Nunberg mean when he refers to Google Books as the "Last Library"? Surely he can't mean the be-all and end-all collection of published materials, as we will inevitably find many formats in the future that will suit the needs of the moment.

    2) Would crowd sourcing help this metadata train wreck? I think it may help a lot, but Nunberg also points out that common terms and linguistic oddities change over time.

    3) Who should Google be hiring to come up with the right search terms? How can they have such an amazing search engine but such terrible metadata in Google Books?

  13. 1. It seems like the author thought that Google Books has to provide an advanced search engine helping users find information not only on a topic but also on dates, editions, and other aspects. However, I think the main purpose of Google Books is to offer books to general readers rather than to scholars. If scholars want the different editions of the same book, they could use the Google search box as well as other tools like Google Scholar. Similarly, it is hard to find every edition of one book in a library. I think it might not be worthwhile for Google Books to collect everything when building its metadata framework.

    2. As for those mistakes in Google Books, such as the date errors and classification errors, I think letting users take some of the responsibility for fixing them is a good idea. It might be difficult to evaluate those users’ work, but it could be an approach worth taking into account. Users could add tags and report errors, which would help refine the metadata.

    3. I’m a little bit confused about the categories and metadata. The author said the metadata and classifications are simply too poor. However, in my opinion, if the classifications have some problems, they could be very much related to the metadata. It seems not very appropriate to discuss these problems separately.

  14. 1. I read the comments under the article on the web page and noticed that some of them mentioned using crowdsourcing as a way to bump up the accuracy of Google Books metadata. I also think this is a possible solution. Would it work, or at least be an additional approach to making the metadata better?

    2. Did the product managers of Google Books analyze the causes of the problems mentioned in the article? And which is the major cause among them?

    3. Regarding Google Books, I'm wondering if the metadata for books could be generated in some automatic way that would also cut down on the amount of wrong metadata.

  15. Nunberg singles out automated processing from the OCR’d texts as being the culprit behind a lot of the wrong dates given to the books he mentions. Until such time as there is an acceptable margin of error for the work done by the automation, would having a crowdsourced quality control be a valid option?

    Despite the claims made by Dan Clancy, it seems like the classification errors come from Google itself. If this is the case, what justification could be made for applying the BISAC system to a collection which exceeds what I would call the intended purpose and reach of the BISAC system? Google had to have realized there were some outlandish classifications assigned by the BISAC system when they used it on the library.

    Nunberg mentions that Google knows of the problems but they note that Google claims fixing these problems is not “a priority.” Given the nature of metadata and the goal they are hoping to accomplish, shouldn’t the generation of accurate metadata be the second highest priority on whatever list they have? I assume simply scanning the books is first on the list, but it would seem counterproductive to ignore inputting the correct metadata just to scan the next book in the queue.

  16. 1. Google Books’ metadata is said to be messed up and incorrect in many situations. Even if this is causing a lot of misinterpretations, shouldn’t we consult another source as well, to check whether it matches, in case we need a piece of literature for important purposes? How important is accurate metadata in this case?
    2. As mentioned, a Google engineer claims that the dates were provided by libraries. Who is responsible for the maintenance of accurate metadata? It is always possible that the libraries have inaccurate metadata as well. Is it really Google’s fault, or should there be some other organization to look into these errors and maintain records?
    3. Does the future entirely depend on Google Books, as mentioned? How can we conclude that? There can always be another organization that comes along and scans these books. Fifteen years ago, the world did not imagine that something like Google would come along. In an ever-changing world, how can we conclude that Google Books is the ‘last library’?

  17. 1. Crowdsourcing understandably comes with its challenges, however in the case of Google Books, it seems like one of the most logical choices in addressing the metadata madness happening. Is there a reason Google Books hasn't offered this option yet? If not, outside of crowdsourcing, what other effective ways are there to address the issues contained within such a large volume of material?
    2. Thinking about the inaccuracy and mislabeling of materials in Google Books--I wonder how often metadata is intentionally manipulated to hide information? It seems as though we make a gross assumption that metadata is used exclusively to help retrieve and contextualize information, instead of any alternative. To that end, have there been any instances of this?
    3. How does the notion of accountability fit into the process of creating and using metadata? Ultimately, who is responsible for it, and who is responsible for validating its accuracy?

    Replies
    1. 2 - Well, a good example of metadata being screwed with to hide information would be the SB-5 vote in the Texas state senate a few months ago: http://www.theatlanticwire.com/politics/2013/06/watch-texas-state-senator-anti-abortion-bill-filibuster/66583/
      You bring up a very interesting point about how powerful metadata is or can be: incorrect metadata can render an object invisible or completely inaccessible. A somewhat quirky version of this is a digitized book at the Benson with a typo in its name: the Libro De Professiones, which is in the catalog as Libro de Profesiones (http://catalog.lib.utexas.edu/record=b7013004). While the item isn't impossible to find, I'd imagine that with issues like Google Books', an object could virtually disappear into the ether (into the cloud?) when the metadata is wrong. This is one of the primary reasons why metadata/taxonomy has become such a hot topic within the for-profit sector; if you lose a document or valuable trade secrets because your company's tagging system sucks, you've essentially flushed money down the drain.

  18. 1. I’m intrigued by this idea of relying on the users to fix the problems. Doesn’t this assume that users will be able to recognize all of the problems? Perhaps I'm not giving users due credit, but say that not a single user of a certain item knows that its date is wrong. Won't the assumption then be made, with every use of the item, that the date is correct, and doesn’t this pose serious problems not only for the authenticity of the object but for people’s understanding and use of it?

    2. At one point, it is mentioned that Google Books will be “the universal library for a long time to come”. Do you see it heading in this direction or is this an exaggeration, especially in consideration of the Chief Engineer who seems eager to blame its inefficiencies and problems on others, namely the library system?

    3. I’m not familiar with Google Books, and while I’m very much bothered by their approach to this issue of metadata, I am interested to know: is there a problem with having its design be more like that of a bookstore, assuming it's done correctly? The author talks about how applying bookstore headings is not necessarily appropriate, and even writes that “Google has taken the great research libraries of the English-speaking world and returned them in the form of a suburban mall bookstore”. The tone is negative, indeed, but in my opinion, this set-up could make the site more accessible, granted they deal with their metadata problems.

  19. 1. This article brings up the issues with wrong metadata in Google Books records. I am wondering, though, what the typical error rate for library metadata is. Many more books are available to people because of Google Books, so does the benefit of bringing books to more people outweigh the metadata errors?

    2. The article suggests that Google Books will become a universal library, but questions the consequences of Google having its own financial and marketing agenda. What kinds of problems should we be watching for with Google Books? For many centuries libraries were restricted to certain patrons and limited the types or subjects of books on their shelves. Is it fair to question Google’s motives when libraries in the past had their own agendas as well?

  20. 1. So Google Books received metadata from a Portuguese metadata provider. I presume that this is a company that provides metadata as a business and that GB paid for the information. Is this correct? If so, how is it that a business that provides metadata for a charge can't figure out the correct dates for 250,000 books such as "Christine" by Stephen King and furthermore, why would GB use such an inept provider?

    2. Orwant never specifically responds to Nunberg's suggestions on crowd-sourcing. It seems that since GB has the huge task of providing metadata for 170,000,000 books, they would jump at the chance for free outside assistance. Has anyone explained their resistance to the idea?

    3. Nunberg's blog mentions orphan books and one commenter mentioned something along the lines of many of these are not really orphan books. What exactly are orphan books and how do they fit in with this very long discussion on metadata errors?

  21. 1) I was expecting “trainwreck” to be an exaggeration, but it really wasn’t! It strikes me as very strange that Google didn’t simply port in the library-provided metadata in the first place, as this would at least have given them a good baseline—and the fact that they used automated OCR to pull metadata instead means it’s almost incomprehensible that Clancy blames the errors on the institutions that provided the materials. What was their reasoning for building their metadata in this way?

    2) The use of commercial book categorizations (i.e. the BISAC system) in Google Books seems like a similarly bizarre choice, especially given how many materials are scholarly or simply too vintage or unusual to fit neatly into commercial subject headings. Even if the metadata had been applied perfectly, why was the BISAC system used instead of one of the more established subject-heading systems?

    3) I am curious how Google developed and implemented its Books system. How much input did it draw from, e.g., librarians or archivists? Was it mostly programmers who developed it? Is Nunberg correct in his implications and did commercial interests have significant input? There may or may not be a way to answer these questions, but I would be interested in knowing.

    ((And a quick bonus question: has the metadata on Google Books improved much since the article was written?))

  22. 1. This post talks a lot about fixing the metadata errors. Wouldn't it be fairly easy to fix at least some of the easier errors like dates? Surely there are other records of these books that are accurate. Could they not just search and compare and make changes according to the more trusted source?

    2. What motivation does Google have to ensure the accuracy of the metadata in its Books service? Obviously if the service is poor enough then it won't be used, but for a lot of the examples this author highlights, I wouldn't have known there was an error otherwise.

    3. I agree with the author that it should be Google's responsibility to fix these errors but the author offers no suggestions for ways in which to fix these errors. This makes this post seem a lot more like a rant than anything that deserves serious thought.

  23. 1. Nunberg says that “Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries and publishers” How does Google Books decide which libraries and publishers can supply data to Google Books? Also, how does GB deal with cataloguing reprints of books?

    2. GB scans books into its collection. Are these books new and have manufactured labels from libraries and publishers? Or are they all old books bought on the cheap and labels are created for them? What does Google do with the books after they have been scanned? Is it a physical or all online process? Both?

    3. I noticed that Google Books connects also to Google’s search engine. What then is the point of even having GB if you can use one of the world’s most powerful search engines to find a book you are looking for? In all my years as a student I have never used Google Books and now that I am aware of how awful it is I really can’t see myself using it or suggesting others to use it.

  24. 1. Dan Clancy said the errors in the Google books were all supplied by the libraries. But if the information from libraries is wrong, what source could other teams rely on when they want to digitize so many books?

    2. I am quite interested in the suggestion that it should fall on users to take some of the responsibility for fixing these errors. Google can learn something from the collaborative tagging which we are now reviewing. So how could it work in the specific case of Google Books?

    3. I am wondering whether the metadata is generated automatically by computer or manually. If it is generated automatically, it could be additional evidence of the poor accuracy of auto-generated metadata, which is also mentioned in other readings. What could we do to improve its quality?

  25. 1. Google is a for-profit company providing a service (GoogleBooks) for free. This, to me, brings up a lot of interesting paths of speculation regarding the economics, ethics, and efficiency of crowd-sourcing corrections to its wildly incorrect metadata. Should users provide free labor, a la Wikipedia? Should Google pay information specialists/librarians/subject specialists to make these corrections?

    2. Now that Google is aware of the incompetence of their metadata schema (or piecemeal lack-thereof), what is their plan in future for GoogleBooks scans, cataloging, and access? Would it be better to scan and make available before supplying (correct) metadata, or does it make more sense to institute quality control on each item before it is made public? (both of these options are presented with special consideration of the vast volume of texts and pages); is there a third option?

    3. Who is working at Google these days (er, in 2009), that they can't do a better job of this?! Amazon gets it mostly right with, presumably, privately developed and/or publisher-provided metadata. Are any enterprising young iSchool graduates shopping specially developed metadata schemes, software, and/or training to Google?

  26. 1. Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. Although the author pointed out later that a very large proportion of the errors are clearly Google's doing, errors might very possibly be caused by original data or former data managers, so I am wondering whether information professionals need to take responsibility for recognizing, or even revising, these data errors.

    2. In this article, the author spent much time criticizing Google Books' machine classification, and I believe that GB has also noticed this problem. Dan also suggested that it should fall on users to take some of the responsibility for fixing these errors. However, why doesn't GB just take classification tags from libraries for this function? And is there anything impeding GB from opening the creation and fixing of tags to users, given that they have been aware of this problem but taken no action?

    3. Actually, I do not use Google Books very often. And, after reading this article, I am curious about how Google Books attracts users despite these problems. In other words, what advantages does GB have, compared to libraries and other e-book web sites, that make it competitive?

  27. 1. In this blog post the author points out that there seem to be a number of problems with the categories that are assigned to the books in the Google Books database. The author stated that several of the books are in categories that are either completely unrelated to the book or seem to be tangentially related. However in the case of the categories that seem tangentially related isn’t this a subjective issue? There might be someone who wants that book to be in that category. Who should be the person to decide which category a book belongs in?
    2. In this article the author points out that the Google Books database organizes their books according to the BISAC categories. He argues that organizing the books into these categories is not a good idea because they were created for retail bookstores and are not relevant to older book collections. Do you agree that the Google Books database should not have used the BISAC categories? If not what are the benefits of using the BISAC categories in this collection?
    3. In this article the author says that someone needs to step up and fix the problems that exist in the Google Books database. He states that Dan Clancy, in a panel discussion on this topic, suggested that users could step up and contribute towards fixing these errors. He refutes this suggestion by stating that there are simply too many errors for the users to fix them all and that it should be the Google engineers who fix the problem. How can the Google engineers, who are probably much fewer than the number of people who use Google Books, do a better job of fixing all of the errors in Google Books when there are too many errors for the users?

    ReplyDelete
  28. 1. Is it really fair to claim that no one is going to scan these books again, and furthermore that they are the definitive texts scholars will be researching? While I understand that the world is full of backlogged materials waiting to be scanned and digitized, there is, on the other hand, the inclination to upgrade projects like these, or to migrate them as technology changes. Especially if a project has received funding and attention once, it seems all the more likely to get funding to update it again later. And these updates are a prime time to smooth out kinks like metadata. (I just saw this precise series of events come to pass with an audio collection in New York, so I know it happens.) I realize Google Books is happening on a huge scale, but still.

    2. Towards the end of the article, Nunberg asks why Google decided to use the BISAC headings in the first place, but I didn't understand the parenthetical aside about Google competing with Amazon, which I take to be his attempt at answering this question. What was he trying to say? I also would like to know why they're using BISAC.

    3. Is this one of those articles where the author complains without even gesturing at a solution? I kept waiting for Nunberg to offer some insight into tackling Google's clear oversights in metadata, but it never came. Instead, at the end of the article he posits that, “the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractual obligation, and only limited commercial incentives, to get it right. That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Books Settlement over the coming month.” Huh?

    ReplyDelete
  29. 1. In the section where the Chief Engineer for the Google Books project was blaming libraries for erroneous dates, the author mentions that a few collections were systematically misdated, including a large group of Portuguese-language works dated 1899. Why or how were they systematically misdated? Were the dates not known and this was given as a best guess, or was this date chosen arbitrarily? I just wonder why you would systematically misdate something; if you don't know the date, can't you enter a value of date unknown or provide a range of possible dates?
    2. Misdating books because of automated metadata produced from OCR'd text made me question the process by which Google was digitizing these materials. Wouldn't it be simple for them to enter the most basic metadata elements like date and author while they are digitizing the items? Wouldn't it be prudent to get these basic elements right the first time, given the possibility, stated at the beginning of the article, that the item may never be digitized again?
    3. I don’t mind the Chief Engineer's suggestion to have users fix the metadata, but that is only possible if the item can be found in the first place. If the metadata is poor or nonexistent and the classification isn't up to par for such a large collection, a title could be lost forever, making the scan useless for any purpose. If users can't find the object, how can they fix it?
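    The suggestion in point 2 above, entering basic metadata at scan time, could be enforced with a simple required-fields check before a scan is ingested. A minimal sketch, assuming records are plain dictionaries; the field names and function are hypothetical, not Google's actual pipeline:

```python
# Hypothetical quality gate: hold a scan for review if any
# required metadata field is missing or empty at ingest time.
REQUIRED_FIELDS = ("title", "author", "pub_date")

def missing_fields(record):
    """Return the names of required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

record = {"title": "Tristram Shandy", "author": "Laurence Sterne", "pub_date": ""}
print(missing_fields(record))  # → ['pub_date']
```

    Catching a blank pub_date at ingest is far cheaper than asking end users to notice and correct it later, which speaks to the findability problem raised above.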

    ReplyDelete
  30. Q1: How did they get the statistics about metadata errors? The author didn't present the methodology used to conduct the experiments, nor indicate how large the sample was, so is the result accurate and persuasive?
    Q2: The author mentioned that most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text. If we rely only on enhancing the accuracy of the technology, the errors can be reduced but not eliminated. How do we tackle this problem?
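    One way to reduce (though, as Q2 notes, not eliminate) such errors is to pair the automated extraction with sanity checks that flag implausible dates for human review rather than guessing. A minimal sketch, assuming the OCR'd front matter is available as a string; the thresholds and function name are hypothetical illustrations, not Google's actual method:

```python
import re

def extract_pub_year(ocr_text, scan_year=2009, earliest=1450):
    """Find candidate four-digit years in OCR'd text and keep only
    those within a plausible printing range (hypothetical heuristic)."""
    candidates = [int(y) for y in re.findall(r"\b(1[4-9]\d{2}|20\d{2})\b", ocr_text)]
    plausible = [y for y in candidates if earliest <= y <= scan_year]
    # Return None (flag for a human) instead of guessing when nothing passes.
    return min(plausible) if plausible else None

print(extract_pub_year("Printed in London, 1834. Reprinted 1899."))  # → 1834
print(extract_pub_year("No legible date on this page."))             # → None
```

    The key design choice is failing loudly: a record with no plausible year goes to a review queue instead of receiving an automated best guess like the systematically misdated 1899 Portuguese works mentioned above.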

    ReplyDelete
  31. 1. I suppose this answers my previous question about automated metadata assessment. Lucky for us, I guess, this leaves a lot of open areas for further research and development!

    2. The author suggests crowdsourcing as a means of correcting mistaken metadata, but in the next breath asks whether Google engineers, with limited financial incentive, are best suited for this kind of work. How can you justify suggesting that getting paid won't get it done, but that people will surely do it for free?

    3. Could future problems like this be solved by systematically coordinating or standardizing publication conventions? For instance, might print books published in the future all feature identically laid-out title pages, or pages devoted specifically to metadata, to facilitate machine reading?

    ReplyDelete