Re: Best document format / markup for text indexing?

logic.cpp Wed, 23 Nov 2011 17:12:39 -0800

Thank you for the help; I will see where this leads me.



On Nov 23, 2011, at 10:01 AM, Michael Sokolov <soko...@ifactory.com> wrote:

> In my experience, books and other semi-structured text documents are best 
> handled as XML.  There are many different XML "vocabularies" for doing 
> this, each of which has benefits for different kinds of documents.  You 
> probably should look at TEI, NLM Book, and DocBook though - these are some 
> widely-used standard formats for capturing structured book-type texts.  There 
> are other standards for journal articles and other kinds of documents.
> 
> The question of how to store, index and retrieve the kind of information and 
> structure captured by XML documents has gotten a lot of attention, too.  
> There are XML-specific data stores such as MarkLogic and eXist (which uses 
> Lucene for full text search).  Or you could consider "rolling your own" with 
> something like Solr/Lucene as a search index.  Because you're posting on this 
> list, I assume you're considering the last option, which is a good one, but 
> will require some development effort as you consider how to map document 
> structures into indexes, how to preserve document structure when you 
> highlight query terms, etc.
> 
> -Mike Sokolov
> 
> PS - if you are interested in professional help, please consider our platform 
> (pubfactory.net) and drop me an e-mail.
> 
> On 11/17/2011 3:46 PM, logic.cpp wrote:
>> tl;dr version:
>> 
>> We're converting tons (hundreds of thousands?) of books into digital text.
>> 
>> What is the best format/markup/ebook standard/document standard/other to use 
>> for easiest and best text search support?
>> 
>> ***
>> 
>> Longer version:
>> 
>> The following are some desired user-experience features of the project; 
>> they probably influence how the content should be 
>> stored:
>> 
>> - Granular access to the text content.
>> Users would be able to fetch a specific phrase in a specific paragraph in a 
>> specific page in a specific chapter in a specific book. (A 'document' may 
>> consist of a single chapter of a book).
>> 
>> - Cross-referencing.
>> Users should be able to follow references to and from content that 
>> mentions a topic or quotes related content in other books, most likely 
>> implemented through an RDBMS.
>> (Similar to Wikipedia articles linking to one another.)
>> 
>> - Full-text search.
>> This is probably where Lucene comes in.
>> 
>> 
>> So which format/markup/standard would allow for software to easily fetch and 
>> cross-reference granular bits of data, as well as be easily indexable by 
>> Lucene?
>> 
>> Would it maybe be better to store all the books' digital text straight into 
>> the RDBMS? In which case, can Lucene index such data?
>> 
>> Thanks
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
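
A minimal sketch of the "map document structures into indexes" step Mike mentions, in Python with only the standard library. Everything named here is an assumption for illustration: the element names (book, chapter, para) and attributes (id, n, page) are made-up placeholders rather than TEI, NLM Book, or DocBook, and find_phrase is a naive substring scan standing in for a real Lucene query. The idea is that each paragraph becomes one flat record carrying its own location, which is roughly the shape one might hand to Lucene as one document per paragraph.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; these element and attribute names are
# invented for this sketch and do not follow any real book vocabulary.
SAMPLE = """<book id="b1" title="Example Book">
  <chapter n="1" title="Beginnings">
    <para page="1">It was a quiet morning.</para>
    <para page="2">The index was still empty.</para>
  </chapter>
  <chapter n="2" title="Searching">
    <para page="3">Lucene found the phrase at once.</para>
  </chapter>
</book>"""


def flatten_book(xml_text):
    """Turn one book into a list of per-paragraph records.

    Each record keeps the book id, chapter number, page number, and the
    paragraph's position within its chapter, so a search hit can be
    traced back to an exact location.
    """
    root = ET.fromstring(xml_text)
    records = []
    for chapter in root.iter("chapter"):
        for position, para in enumerate(chapter.iter("para"), start=1):
            # Join nested text nodes and normalize whitespace.
            text = " ".join("".join(para.itertext()).split())
            records.append({
                "book_id": root.get("id"),
                "chapter": int(chapter.get("n")),
                "page": int(para.get("page")),
                "paragraph": position,
                "text": text,
            })
    return records


def find_phrase(records, phrase):
    """Naive case-insensitive substring match (placeholder for a real index)."""
    needle = phrase.lower()
    return [r for r in records if needle in r["text"].lower()]


records = flatten_book(SAMPLE)
hits = find_phrase(records, "the phrase")
for hit in hits:
    print(hit["book_id"], hit["chapter"], hit["page"], hit["paragraph"])
```

Storing the location fields alongside the text is one design choice among several; the same keys could equally be the primary key of a row in an RDBMS, with the search index holding only the key and the text.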

