Re: Best document format / markup for text indexing?

Michael Sokolov Wed, 23 Nov 2011 07:01:27 -0800

In my experience, books and other semi-structured text documents are best handled as XML. There are many many different XML "vocabularies" for doing this, each of which has benefits for different kinds of documents. You probably should look at TEI, NLM Book, and DocBook though - these are some widely-used standard formats for capturing structured book-type texts. There are other standards for journal articles and other kinds of documents.

The question of how to store, index and retrieve the kind of information and structure captured by XML documents has gotten a lot of attention, too. There are XML-specific data stores such as MarkLogic and eXist (which uses Lucene for full text search). Or you could consider "rolling your own" with something like Solr/Lucene as a search index. Because you're posting on this list, I assume you're considering the last option, which is a good one, but will require some development effort as you consider how to map document structures into indexes, how to preserve document structure when you highlight query terms, etc.
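
To make the "map document structures into indexes" step a bit more concrete, here is a minimal sketch of one possible approach: flatten each paragraph into its own flat record that carries its book and chapter context as separate fields, so a search hit can be traced back to a specific place in the book. Everything in it is an assumption for illustration only. The element names (book, chapter, para, emph), the field names, and the paragraph_records helper are hypothetical and are not taken from TEI, NLM Book, or DocBook. It uses only the Python standard library; handing the resulting records to Solr/Lucene is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are made up for this sketch.
SAMPLE_BOOK = """\
<book id="b1">
  <title>Example Book</title>
  <chapter n="1">
    <title>First Chapter</title>
    <para>Lucene indexes flat documents.</para>
    <para>XML captures <emph>nested</emph> structure.</para>
  </chapter>
  <chapter n="2">
    <title>Second Chapter</title>
    <para>Each paragraph becomes one record.</para>
  </chapter>
</book>
"""


def paragraph_records(xml_text):
    """Flatten a book into one flat record (dict) per paragraph."""
    book = ET.fromstring(xml_text)
    records = []
    for chapter in book.findall("chapter"):
        for position, para in enumerate(chapter.findall("para"), start=1):
            # itertext() also collects text nested inside inline markup
            # such as <emph>; split/join normalizes the whitespace.
            text = " ".join("".join(para.itertext()).split())
            records.append({
                # Identifier that encodes where the paragraph lives.
                "id": f'{book.get("id")}/ch{chapter.get("n")}/p{position}',
                "book_id": book.get("id"),
                "book_title": book.findtext("title"),
                "chapter": chapter.get("n"),
                "chapter_title": chapter.findtext("title"),
                "paragraph": position,
                "text": text,
            })
    return records


if __name__ == "__main__":
    for record in paragraph_records(SAMPLE_BOOK):
        print(record["id"], "|", record["text"])
```

Each record could then become one document in the search index, with "text" as the full-text field and the other keys as stored fields. Indexing at paragraph level is only one choice of granularity; chapter-level or page-level records would follow the same pattern.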


-Mike Sokolov

PS - if you are interested in professional help, please consider our platform (pubfactory.net) and drop me an e-mail.


On 11/17/2011 3:46 PM, logic.cpp wrote:

tl;dr version:

We're converting tons (hundreds of thousands?) of books into digital text.

What is the best format/markup/ebook standard/document standard/other to use 
for easiest and best text search support?

***

Longer version:

The following are some desired user experience features of the project; these 
probably influence how the content should be stored:

- Granular access to the text content.
Users would be able to fetch a specific phrase in a specific paragraph in a 
specific page in a specific chapter in a specific book. (A 'document' may 
consist of a single chapter of a book).

- Cross referencing.
Most likely achieved through a RDBMS, users should have references to/from 
content that refers or mentions a topic or quotes related content in other 
books.
(Similar to Wikipedia articles linking to one-another.)

- Full text search
This is probably where Lucene comes in.


So which format/markup/standard would allow for software to easily fetch and 
cross-reference granular bits of data, as well as be easily indexable by Lucene?

Would it maybe be better to store all the books' digital text straight into the 
RDBMS? In which case, can Lucene index such data?

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


