Re: Best document format / markup for text indexing?

2011-11-23 Thread logic.cpp
Thank you for the help, I will see where this leads me. On Nov 23, 2011, at 10:01 AM, Michael Sokolov wrote: > In my experience, books and other semi-structured text documents are best > handled as XML. There are many many different XML "vocabularies" for doing > this, each of which has ben

Lucene on Android: indexing, searching and highlighting

2011-11-23 Thread Ilya Zavorin
Hello everyone, I need to write a Lucene-based search and retrieval app for Android. Unfortunately, I am new to both Android development and Lucene, so I am going up two learning curves at the same time. My app needs to do the following: 1. I have a collection of docs that I index 2. I have a s

Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

2011-11-23 Thread Michael Sokolov
could simply index every term with a namespace prefix like Q::term, where Q is the namespace and term the term? Then when you do spell corrections, submit each candidate term with the namespace prefix prepended. -Mike On 11/23/2011 9:28 AM, E. van Chastelet wrote: I currently have an id
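
The prefixing idea in the message above can be sketched outside Lucene. This is only an illustration in plain Python, not the Lucene spell checker API: the `::` separator comes from the message, while `build_dictionary`, `suggest`, and the use of `difflib` for similarity are assumptions made for the example. One variation from the message: the lookup strips the prefix before comparing, so the shared prefix does not inflate the similarity score.

```python
import difflib

SEP = "::"  # separator taken from the "Q::term" example in the message


def prefixed(namespace, term):
    """Return the term as it would be stored: namespace prefix + term."""
    return f"{namespace}{SEP}{term}"


def build_dictionary(docs):
    """Collect prefixed terms from (namespace, text) pairs (hypothetical input shape)."""
    terms = set()
    for namespace, text in docs:
        for term in text.lower().split():
            terms.add(prefixed(namespace, term))
    return terms


def suggest(dictionary, namespace, word, n=3):
    """Suggest corrections using only terms stored under the given namespace."""
    prefix = namespace + SEP
    # Keep entries carrying this namespace prefix, then strip the prefix
    # so the similarity is computed on the bare terms.
    candidates = [t[len(prefix):] for t in dictionary if t.startswith(prefix)]
    return difflib.get_close_matches(word.lower(), candidates, n=n, cutoff=0.6)


docs = [("a", "lucene search"), ("b", "lucerne cheese")]
dictionary = build_dictionary(docs)
print(suggest(dictionary, "a", "lucen"))
print(suggest(dictionary, "b", "lucen"))
```

The same misspelling yields different suggestions per namespace because each lookup only sees terms stored under its own prefix.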

Re: Best document format / markup for text indexing?

2011-11-23 Thread Michael Sokolov
In my experience, books and other semi-structured text documents are best handled as XML. There are many many different XML "vocabularies" for doing this, each of which has benefits for different kinds of documents. You probably should look at TEI, NLM Book, and DocBook though - these are som

Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

2011-11-23 Thread E. van Chastelet
I currently have an idea to get it done, but it's not a nice solution. If we have an index Q with all documents for all namespaces, we first extract the list of all terms that appear for the field namespace in Q (this field indicates the namespace of the document). Then, for each namespace n

Re: Fuzzy Search Sorting

2011-11-23 Thread Ian Lea
You'll have to delve into the output from IndexSearcher.explain, or the details of the Levenshtein (edit distance) algorithm used by FuzzyQuery, to figure out why Smath is beating Smith. But the general way of making sure that exact matches come top is to add an exact match clause to your query,
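
A rough sketch of why an extra exact match clause helps, in plain Python rather than Lucene. Everything here is an assumption made for illustration: the `TERM_WEIGHT` values stand in for whatever scoring factors let Smath outrank Smith in the original index, and `difflib` similarity stands in for the fuzzy score. In Lucene itself the usual shape would be a BooleanQuery combining a FuzzyQuery with a boosted TermQuery, but the message is cut off before giving details.

```python
import difflib

# Hypothetical per-term weights standing in for index-dependent scoring
# factors; chosen so the fuzzy-only ranking puts "Smath" first.
TERM_WEIGHT = {"Smath": 2.0, "Smith": 1.0, "Jones": 1.0}


def fuzzy_score(query, term):
    """Similarity in [0, 1] scaled by the term's (hypothetical) weight."""
    ratio = difflib.SequenceMatcher(None, query.lower(), term.lower()).ratio()
    return ratio * TERM_WEIGHT.get(term, 1.0)


def rank(query, terms, exact_boost=0.0):
    """Rank terms by fuzzy score plus an optional bonus for exact matches."""
    scored = []
    for term in terms:
        score = fuzzy_score(query, term)
        if term.lower() == query.lower():
            score += exact_boost  # the added "exact match clause"
        scored.append((score, term))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored]


terms = ["Smath", "Smith", "Jones"]
print(rank("smith", terms))                   # fuzzy clause only
print(rank("smith", terms, exact_boost=2.0))  # fuzzy clause + exact clause
```

With the fuzzy clause alone the heavier-weighted near match can come first; the boosted exact clause adds enough score to put the exact match on top.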

Re: highlighter by using term offsets

2011-11-23 Thread Ian Lea
I know nothing about highlighting or TermPositionVector, but the first step in debugging NPEs on complex lines of code should be to break the line down and find out exactly what is causing the exception. Is reader null? hits? Some other problem? -- Ian. On Tue, Nov 22, 2011 at 1:35 PM, starz10de wrote: