RE: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Ilya Zavorin
Yes, I ended up doing essentially that. No need to tokenize; I basically split the input string into a sequence of alternating "word" and "non-word" tokens based on Character.isLetter() and then looked up the words. Ilya
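A minimal sketch of the approach Ilya describes: walk the string once and emit alternating runs of letter and non-letter characters, so whitespace and punctuation survive verbatim and can be re-emitted after the word lookup. The class and method names here are illustrative, not from the original post.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    // Split input into alternating "word" / "non-word" runs using
    // Character.isLetter(); joining the runs reproduces the input exactly.
    public static List<String> split(String input) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= input.length(); i++) {
            if (i == input.length()
                    || Character.isLetter(input.charAt(i)) != Character.isLetter(input.charAt(start))) {
                runs.add(input.substring(start, i));
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Word runs can be looked up/translated; non-word runs pass through.
        System.out.println(split("Hello, world!")); // [Hello, ", ", world, !]
    }
}
```

Because the non-word runs are kept as separate tokens rather than discarded, reassembling the stream after translation is just a string join.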

Creating an IndexReader for a subset from original IndexReader object

2012-01-16 Thread mikaza
Hi! I am trying to extend the "mahout lucene.vector" driver so that it can be fed with arbitrary key-value constraints on Solr schema fields (and generate only a subset of the Mahout vectors, which seems to be a regular use case). So the best (easiest) way I see is to create an IndexReader implemen

Query building performance

2012-01-16 Thread David Olson
I have a situation where there are users that create n keywords. I'm storing them as individual DB fields for aggregating scores and then building the query from the fields. Is it faster for Lucene to parse a query of terms that are OR'd together or to build it up as a loop of BooleanQuery marked a

Re: LUCENE_35 index keyword analyzer only doesn't like indexed sentences

2012-01-16 Thread Ian Lea
Hard to believe this ever worked. KeywordAnalyzer '"Tokenizes" the entire stream as a single token', i.e. there will only be one term. So your document containing "ba foo" would only match a search on "ba foo", not a search on foo. Are you sure you should be using KeywordAnalyzer? Not usually used on s

LUCENE_35 index keyword analyzer only doesn't like indexed sentences

2012-01-16 Thread ejblom
Dear Lucene developers, I switched to using Lucene 3.5 a few weeks ago and suddenly sentences are not correctly indexed anymore. Basically, fields can be correctly queried if they contain one term, but if there are multiple terms the analyzer fails (I use the latest Luke for testing). So my quer

Re: Is Lucene a good candidate for a Google-like search engine?

2012-01-16 Thread Cheng
Great, thanks. On Mon, Jan 16, 2012 at 5:56 AM, findbestopensource <findbestopensou...@gmail.com> wrote: > Check out the presentation. > http://java.dzone.com/videos/archive-it-scaling-beyond > > Web archive uses Lucene to index billions of pages. > > Regards > Aditya > www.findbestopensource.com

Re: best query for one-box search string over multiple types & fields

2012-01-16 Thread Ian Lea
Welcome to the list. This is hard with no quick and easy answers. For a similar index, but books rather than music, I index author and title separately into 2 fields, author and title combined into another field, author and title and blurb and whatever all combined into yet another field. Each s

Re: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Danil ŢORIN
Or you may simply store the field as-is, but index it in whatever way you like (replacing some tokens with others, or maybe storing both words with position increment = 0). On Mon, Jan 16, 2012 at 13:23, Dmytro Barabash wrote: > I think you need index this field with > org.apache.lucene.document

Re: custom scoring

2012-01-16 Thread Ian Lea
Some values in the norm/boost area are stored encoded with some loss of precision. Details in the javadocs somewhere. What values do you get when you change the boost? -- Ian. 2012/1/14 ltomuno : > the following message comes from  Explanation explain >  0.09375  = (MATCH) fieldWeight(name:85
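A toy illustration of the precision loss Ian mentions. Lucene 3.x packs each field norm (which folds in the field boost) into a single byte, so only a few significand bits survive; the quantizer below keeps a 3-bit mantissa to show the effect, but it is NOT Lucene's actual SmallFloat codec, just an assumed stand-in for demonstration.

```java
public class NormDemo {
    // Toy quantizer: keep only 3 mantissa bits of a positive float,
    // mimicking (not reproducing) the lossy one-byte norm encoding.
    public static float quantize(float f) {
        if (f <= 0) return 0f;
        int e = (int) Math.floor(Math.log(f) / Math.log(2)); // binary exponent
        double scale = Math.pow(2, e);
        double mantissa = Math.round(f / scale * 8) / 8.0;   // 3 mantissa bits
        return (float) (mantissa * scale);
    }

    public static void main(String[] args) {
        // 0.09375 (= 3/32) is exactly representable and survives,
        // but 0.1 gets rounded to the nearest representable neighbour.
        System.out.println(quantize(0.09375f)); // 0.09375
        System.out.println(quantize(0.1f));     // 0.1015625
    }
}
```

This is why two slightly different boosts can produce the identical fieldWeight in an Explanation: both round to the same encoded byte.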

Re: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Dmytro Barabash
I think you need to index this field with org.apache.lucene.document.Field.TermVector != NO

Re: Is Lucene a good candidate for a Google-like search engine?

2012-01-16 Thread findbestopensource
Check out the presentation. http://java.dzone.com/videos/archive-it-scaling-beyond Web archive uses Lucene to index billions of pages. Regards Aditya www.findbestopensource.com On Fri, Jan 13, 2012 at 4:31 PM, Peter K wrote: > yes and no! > google is not only the search engine ... > > > Just c

Re: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Danil ŢORIN
Maybe you could simply use String.replace()? Or does the text actually need to be tokenized? On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin wrote: > I am trying to perform a "translation" of sorts of a stream of text. More > specifically, I need to tokenize the input stream, look up every term in a > s
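The "translation" workflow quoted above can be sketched without a full Lucene analyzer: scan letter runs with a regex, look each one up in a substitution map, and copy every non-matching character (whitespace, punctuation) through untouched. The dictionary contents and class name are made up for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamTranslator {
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Replace each word via the map; unknown words and all
    // whitespace/punctuation pass through unchanged.
    public static String translate(String input, Map<String, String> dict) {
        Matcher m = WORD.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());             // copy gap verbatim
            out.append(dict.getOrDefault(m.group(), m.group()));
            last = m.end();
        }
        out.append(input.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dict = Map.of("hello", "bonjour", "world", "monde");
        System.out.println(translate("hello,  world!", dict)); // bonjour,  monde!
    }
}
```

For simple fixed substitutions String.replace() is enough, as Danil suggests; the map-driven scan only earns its keep once the lookup table is large or word boundaries matter.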