DGG, I appreciate your points. Would we be so motivated by this thread if it weren't a complex problem?
The fact that all of this is quite new, and that there are so many
unknowns and gray areas, actually makes me consider it more likely
that a body of wikimedians, experienced with their own form of
large-scale authority file coordination, is in a position to say
something meaningful about how to achieve something similar for tens
of millions of metadata records.

> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.

In some areas that is certainly so. In others, Wikimedia communities
have useful recent experience. I hope that those who understand these
problems on both sides recognize the importance of sharing what they
know openly -- and showing others how to understand them as well. We
will not succeed as a global community if we say that this class of
problems can only be solved by the limited group of people with an
MLS and a few years of focused training. (How would you name the sort
of training you mean here, btw?)
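> The difficulty of merging the many thousands of partially correct
> and incorrect sources of available data typically requires the
> manual resolution of each of the tens of millions of instances.

That manual-resolution cost is the crux, and it is worth making
concrete. Here is a toy sketch -- hypothetical records and a
deliberately naive matcher, in Python, not anyone's production code --
of why string heuristics alone cannot decide these merges:

    # Three records for the "same" work from three sources; no single
    # field is reliable enough to merge on by itself.
    records = [
        {"title": "Adventures of Huckleberry Finn",
         "author": "Twain, Mark", "year": 1885},
        {"title": "The Adventures of Huckleberry Finn",
         "author": "Mark Twain", "year": 1884},
        {"title": "Huckleberry Finns aventyr",   # a Swedish translation
         "author": "Twain, Mark", "year": 1957},
    ]

    def same_work(a, b):
        """Naive matcher: normalized title overlap plus author match."""
        norm = lambda s: set(s.lower().replace(",", "").split())
        title_overlap = len(norm(a["title"]) & norm(b["title"]))
        return title_overlap >= 3 and norm(a["author"]) == norm(b["author"])

    print(same_work(records[0], records[1]))  # True: caught despite noise
    print(same_work(records[0], records[2]))  # False: same work, missed

Every pair a heuristic like this gets wrong becomes a human decision.
The open question is whether those tens of millions of decisions live
inside a closed cataloguing workflow or on an open wiki with a visible
history.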
SJ

On Thu, Aug 13, 2009 at 12:57 AM, David Goodman <dgoodma...@gmail.com> wrote:
> Yann & Sam
>
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem" which could be at any of several
> levels. This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with
> just an MLS degree to work with a small subset of this. I have
> thirty years of experience in related areas of librarianship, and I
> know only enough to be aware of the problems.
> For an introduction to the current state of this, see
> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf
>
> The difficulty of merging the many thousands of partially correct
> and incorrect sources of available data typically requires the
> manual resolution of each of the tens of millions of instances.
>
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
>
> On Wed, Aug 12, 2009 at 1:15 PM, c <y...@forget-me.net> wrote:
>> Hello,
>>
>> This discussion is very interesting. I would like to make a summary,
>> so that we can go further.
>>
>> 1. A database of all books ever published is one of the things
>> still missing.
>> 2. This needs massive collaboration by thousands of volunteers, so
>> a wiki might be appropriate; however...
>> 3. The data needs a structured web site, not a plain wiki like
>> MediaWiki.
>> 4. A big part of this data is already available, but scattered
>> across various databases, in various languages, with various
>> protocols, etc. So a big part of the work needs as much database
>> management knowledge as librarian knowledge.
>> 5. What is most missing in these existing databases (IMO) is
>> information about translations: there is no general database of
>> translated works anywhere, at least not in English or French. It is
>> very difficult to find out whether a translation exists for a given
>> work. Wikisource has some of this information, with interwiki links
>> between work and author pages, but only for a (very) small number
>> of works and authors.
>> 6. It would be best not to duplicate work in several places.
>>
>> Personally I don't find OL very practical. Maybe I am too used to
>> MediaWiki. ;oD
>>
>> We still need to create something attractive to contributors and
>> readers alike.
>>
>> Yann
>>
>> Samuel Klein wrote:
>>>> This thread started out with a discussion of why it is so hard to
>>>> start new projects within the Wikimedia Foundation. My stance is
>>>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>>>> fine as they are, and there is no need to duplicate their effort
>>>> within the WMF. The example you gave was this:
>>>
>>> I agree that there's no point in duplicating existing
>>> functionality. The best solution is probably for OL to include
>>> this explicitly in their scope and add the necessary
>>> functionality. I suggested this on the OL mailing list in March.
>>> http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
>>>
>>>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>>>> OpenLibrary, merging WikiCite ideas)
>>>>
>>>> To me, that sounds exactly like what OpenLibrary already does (or
>>>> could be doing in the near term), so why even set up a new
>>>> project that would collaborate with it? Later you added:
>>>
>>> However, this is not what OL or its wiki do now. And OL is not run
>>> by its community; the community helps support the work of a
>>> centrally directed group. So there is only so much I feel I can
>>> contribute to the project by making suggestions. The wiki built
>>> into the fiber of OL is intentionally not used for general
>>> discussion.
>>>
>>>> I was talking about the metadata for all books ever published,
>>>> including the Swedish translations of Mark Twain's works, which
>>>> are part of Mark Twain's bibliography, of the translator's
>>>> bibliography, of American literature, and of Swedish language
>>>> literature. In OpenLibrary all of these are contained in one
>>>> project. In Wikisource, they are split into one section for
>>>> English and another section for Swedish. That division makes
>>>> sense for the contents of the book, but not for the book
>>>> metadata.
>>>
>>> This is a problem that Wikisource needs to address, regardless of
>>> where the OpenLibrary metadata goes. It is similar to the
>>> Wiktionary problem of wanting some content - the array of
>>> translations of a single definition - to exist in one place and be
>>> transcluded in each language.
>>>
>>>> Now you write:
>>>>
>>>>> However, the project I have in mind for OCR cleaning and
>>>>> translation needs to
>>>>
>>>> That is a change of subject. That sounds just like what
>>>> Wikisource (or PGDP.net) is about. OCR cleaning is one thing, but
>>>> it is an entirely different thing to set up "a wiki for book
>>>> metadata, with an entry for every published work". So which of
>>>> these two project ideas are we talking about?
>>>
>>> They are closely related.
>>>
>>> There needs to be a global authority file for works -- a [set of]
>>> universal identifier[s] for a given work -- in order for
>>> Wikisource (as it currently stands) to link the German translation
>>> of the English transcription of OCR of the 1998 photos of the 1572
>>> Rotterdam Codex... to its metadata entry [or entries].
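To make the shape of that concrete: the property such an authority
file needs is that identifiers survive renames and merges, the way
wiki redirects and page histories do. A minimal toy model of that
property -- my own sketch in Python, not OL's or anyone's actual data
model:

    # A wiki-like authority record: every change appends to a history,
    # and a merge leaves a redirect so old identifiers keep resolving.
    class WorkRecord:
        def __init__(self, work_id, title):
            self.work_id = work_id
            self.title = title
            self.redirect_to = None   # set when merged into another record
            self.history = [("created", title)]

        def rename(self, new_title):
            self.history.append(("renamed", self.title, new_title))
            self.title = new_title

        def merge_into(self, other):
            # Keep this identifier alive as a redirect, like a wiki page move.
            self.redirect_to = other.work_id
            self.history.append(("merged_into", other.work_id))

    def resolve(records, work_id):
        """Follow redirects so stale identifiers still find the work."""
        rec = records[work_id]
        while rec.redirect_to is not None:
            rec = records[rec.redirect_to]
        return rec

    records = {"W1": WorkRecord("W1", "Huckleberry Finn"),
               "W2": WorkRecord("W2", "Adventures of Huckleberry Finn")}
    records["W1"].merge_into(records["W2"])
    print(resolve(records, "W1").title)   # the old id still resolves

A split would work the same way in reverse: new identifiers whose
history entries point back to the record they came from.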
>>> I would prefer for this authority file to be wiki-like, as the
>>> Wikipedia authority file is, so that it supports renames, merges,
>>> and splits with version history and minimal overhead; hence I wish
>>> to see a wiki for this sort of metadata.
>>>
>>> Currently OL does not quite provide this authority file, but it
>>> could. I do not know how easily.
>>>
>>>> Every book ever published means more than 10 million records.
>>>> (It probably means more than 100 million records.) OCR cleaning
>>>> attracts hundreds or a few thousand volunteers, which is
>>>> sufficient to take on thousands of books, but not millions.
>>>
>>> Focusing efforts on notable works with verifiable OCR, and using
>>> the sorts of helper tools that Greg's paper describes, I do not
>>> doubt that we could effectively clean and publish OCR for all
>>> primary sources that are actively used and referenced in
>>> scholarship today (and more besides). Though 'we' here is the
>>> world - certainly more than a few thousand volunteers have at
>>> least one book they would like to polish. Most of them are not
>>> currently Wikimedia contributors, that much is certain -- we don't
>>> provide any tools to make this work convenient or rewarding.
>>>
>>>> Google scanned millions of books already, but I haven't heard of
>>>> any plans for cleaning all that OCR text.
>>>
>>> Well, Google does not believe in distributed human effort. (This
>>> came up in a recent Knol thread as well.) I'm not sure that is the
>>> best comparison.
>>>
>>> SJ
>>
>> --
>> http://www.non-violence.org/ | Collaborative site on non-violence
>> http://www.forget-me.net/ | Alternatives on the Net
>> http://fr.wikisource.org/ | Free library
>> http://wikilivres.info | Free documents