Thank you for the help. I will see where this leads me.
On Nov 23, 2011, at 10:01 AM, Michael Sokolov wrote:
> In my experience, books and other semi-structured text documents are best
> handled as XML. There are many different XML "vocabularies" for doing
> this, each of which has benefits for different kinds of documents. You
> probably should look at TEI, NLM Book, and DocBook though - these are some
> widely-used standard formats for capturing structured book-type texts. There
> are other standards for journal articles and other kinds of documents.
>
> The question of how to store, index and retrieve the kind of information and
> structure captured by XML documents has gotten a lot of attention, too.
> There are XML-specific data stores such as MarkLogic and eXist (which uses
> Lucene for full text search). Or you could consider "rolling your own" with
> something like Solr/Lucene as a search index. Because you're posting on this
> list, I assume you're considering the last option, which is a good one, but
> will require some development effort as you consider how to map document
> structures into indexes, how to preserve document structure when you
> highlight query terms, etc.
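
To make the mapping step above concrete for myself, here is a minimal sketch of turning XML structure into index-sized units: one record per paragraph, each carrying its book, chapter, page and paragraph position. The element and attribute names (book, chapter, p, page, n) are made up for illustration and are not TEI, NLM Book, or DocBook; each record would presumably become one document in a Lucene index, which this sketch does not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; real TEI, NLM Book, or DocBook is richer.
SAMPLE = """
<book id="b1" title="Example Book">
  <chapter n="1">
    <p page="3">The quick brown fox.</p>
    <p page="4">A lazy dog sleeps.</p>
  </chapter>
  <chapter n="2">
    <p page="10">The fox returns.</p>
  </chapter>
</book>
"""


def flatten(xml_text):
    """Return one dict per paragraph, carrying its structural address."""
    book = ET.fromstring(xml_text)
    records = []
    for chapter in book.iter("chapter"):
        # Paragraph numbering restarts in every chapter.
        for para_no, p in enumerate(chapter.iter("p"), start=1):
            records.append({
                "book": book.get("id"),
                "chapter": int(chapter.get("n")),
                "page": int(p.get("page")),
                "paragraph": para_no,
                # itertext() also collects text nested in inline markup.
                "text": "".join(p.itertext()).strip(),
            })
    return records


records = flatten(SAMPLE)
for record in records:
    print(record)
```

Keeping the structural address on every record is what would let a search hit be reported as "book b1, chapter 2, page 10, paragraph 1" rather than only as a whole book.
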
>
> -Mike Sokolov
>
> PS - if you are interested in professional help, please consider our platform
> (pubfactory.net) and drop me an e-mail.
>
> On 11/17/2011 3:46 PM, logic.cpp wrote:
>> tl;dr version:
>>
>> We're converting tons (hundreds of thousands?) of books into digital text.
>>
>> What is the best format/markup/ebook standard/document standard/other to use
>> for easiest and best text search support?
>>
>> ***
>>
>> Longer version:
>>
>> The following are some desired user-experience features of the project,
>> which probably influence how the content should be stored:
>>
>> - Granular access to the text content.
>> Users would be able to fetch a specific phrase in a specific paragraph in a
>> specific page in a specific chapter in a specific book. (A 'document' may
>> consist of a single chapter of a book).
>>
>> - Cross referencing.
>> Most likely achieved through an RDBMS. Users should be able to follow
>> references to and from content that mentions a topic or quotes related
>> content in other books.
>> (Similar to Wikipedia articles linking to one another.)
>>
>> - Full text search
>> This is probably where Lucene comes in.
>>
>>
>> So which format/markup/standard would allow for software to easily fetch and
>> cross-reference granular bits of data, as well as be easily indexable by
>> Lucene?
>>
>> Would it maybe be better to store all the books' digital text straight into
>> the RDBMS? In which case, can Lucene index such data?
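
On the RDBMS question, as far as I understand it, Lucene does not read from a database by itself; the application reads the rows and hands each row's text to the indexer, keeping the row's primary key so that hits can be mapped back. The sketch below shows only that flow, assuming a made-up paragraph table in SQLite and a toy in-memory inverted index standing in for Lucene. None of it is Lucene API.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical schema: one row per paragraph of digitized text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paragraph ("
    " id INTEGER PRIMARY KEY, book TEXT, chapter INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO paragraph (id, book, chapter, body) VALUES (?, ?, ?, ?)",
    [
        (1, "b1", 1, "The quick brown fox."),
        (2, "b1", 2, "A lazy dog sleeps."),
        (3, "b2", 1, "The fox meets the dog."),
    ],
)


def build_index(connection):
    """Toy stand-in for a search index: lowercased term -> set of row ids."""
    index = defaultdict(set)
    for row_id, body in connection.execute("SELECT id, body FROM paragraph"):
        for term in re.findall(r"\w+", body.lower()):
            index[term].add(row_id)
    return index


def search(index, connection, term):
    """Look the term up in the index, then fetch matching rows by key."""
    ids = sorted(index.get(term.lower(), ()))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return connection.execute(
        "SELECT id, book, chapter, body FROM paragraph "
        f"WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()


index = build_index(conn)
hits = search(index, conn, "fox")
for hit in hits:
    print(hit)
```

The point of the design is that the database stays the system of record (and the natural home for cross-reference tables), while the search index only needs the text plus the key.
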
>>
>> Thanks
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>