Thank you for the help; I will see where this leads me.

On Nov 23, 2011, at 10:01 AM, Michael Sokolov <soko...@ifactory.com> wrote:
> In my experience, books and other semi-structured text documents are best
> handled as XML. There are many many different XML "vocabularies" for doing
> this, each of which has benefits for different kinds of documents. You
> probably should look at TEI, NLM Book, and DocBook though - these are some
> widely-used standard formats for capturing structured book-type texts. There
> are other standards for journal articles and other kinds of documents.
>
> The question of how to store, index and retrieve the kind of information and
> structure captured by XML documents has gotten a lot of attention, too.
> There are XML-specific data stores such as MarkLogic and eXist (which uses
> Lucene for full text search). Or you could consider "rolling your own" with
> something like Solr/Lucene as a search index. Because you're posting on this
> list, I assume you're considering the last option, which is a good one, but
> will require some development effort as you consider how to map document
> structures into indexes, how to preserve document structure when you
> highlight query terms, etc.
>
> -Mike Sokolov
>
> PS - if you are interested in professional help, please consider our platform
> (pubfactory.net) and drop me an e-mail.
>
> On 11/17/2011 3:46 PM, logic.cpp wrote:
>> tl;dr version:
>>
>> We're converting tons (hundreds of thousands?) of books into digital text.
>>
>> What is the best format/markup/ebook standard/document standard/other to use
>> for easiest and best text search support?
>>
>> ***
>>
>> Longer version;
>>
>> The following are some desired user experience features of the project,
>> these probably influence the way in which the content should preferably be
>> stored;
>>
>> - Granular access to the text content.
>> Users would be able to fetch a specific phrase in a specific paragraph in a
>> specific page in a specific chapter in a specific book. (A 'document' may
>> consist of a single chapter of a book).
>>
>> - Cross referencing.
>> Most likely achieved through a RDBMS, users should have references to/from
>> content that refers or mentions a topic or quotes related content in other
>> books.
>> (Similar to Wikipedia articles linking to one-another.)
>>
>> - Full text search
>> This is probably where Lucene comes in.
>>
>>
>> So which format/markup/standard would allow for software to easily fetch and
>> cross-reference granular bits of data, as well as be easily indexable by
>> Lucene?
>>
>> Would it maybe be better to store all the books' digital text straight into
>> the RDBMS? In which case, can Lucene index such data?
>>
>> Thanks
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
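
For reference, here is a rough sketch of the "map document structures into indexes" step mentioned in the thread. It is only an illustration under stated assumptions: the <book>/<chapter>/<page>/<p> markup and the record field names are made up for this example (real TEI or DocBook is considerably richer), and the code stops short of calling any search library. It flattens a book into one record per paragraph and keeps the book, chapter, page and paragraph position with each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup used only for this sketch;
# it is not TEI, NLM Book, or DocBook.
SAMPLE = """\
<book id="b1" title="Example Book">
  <chapter n="1">
    <page n="1">
      <p>First paragraph of the book.</p>
      <p>Second paragraph, still on page one.</p>
    </page>
    <page n="2">
      <p>A paragraph on page two.</p>
    </page>
  </chapter>
</book>
"""


def paragraph_records(xml_text):
    """Flatten a book into one record per paragraph, keeping its location."""
    book = ET.fromstring(xml_text)
    book_id = book.get("id")
    records = []
    for chapter in book.findall("chapter"):
        for page in chapter.findall("page"):
            # Paragraph numbers restart on every page in this sketch.
            for n, p in enumerate(page.findall("p"), start=1):
                chap_n = int(chapter.get("n"))
                page_n = int(page.get("n"))
                records.append({
                    # Assumed id scheme: book/chapter/page/paragraph.
                    "id": f"{book_id}/c{chap_n}/p{page_n}/para{n}",
                    "book": book_id,
                    "chapter": chap_n,
                    "page": page_n,
                    "paragraph": n,
                    # itertext() also picks up text inside inline child tags.
                    "text": "".join(p.itertext()).strip(),
                })
    return records


records = paragraph_records(SAMPLE)
for r in records:
    print(r["id"], "->", r["text"])
```

Each record could then be handed to whatever indexer is chosen as one searchable unit, with the location values kept alongside the text so a hit can be traced back to its paragraph. The same ids could also serve as keys for cross-references held in an RDBMS. Whether the paragraph, the page or the chapter is the right unit depends on how results need to be displayed.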