Hints on implementing XQuery full-text search

2010-01-13 Thread Paul J. Lucas
Hi - I've used Lucene on a previous project, so I am somewhat familiar with the API. However, I've never had to do anything "fancy" (where "fancy" means things like using filters, different analyzers, boosting, payloads, etc). I'm about to embark on implementing the full-text search feature of

RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba, Did you completely re-index? If you did, then there is some other problem - can you share (more of) your code? Do you know about Luke? It's an essential tool for Lucene index debugging: http://www.getopt.org/luke/ Steve On 01/13/2010 at 8:34 PM, AlexElba wrote: > > Hello, >

Re: RangeFilter

2010-01-13 Thread AlexElba
Hello, I changed the filter to the following: RangeFilter rangeFilter = new RangeFilter("rank", NumberTools.longToString(rating), NumberTools.longToString(10), true, true); and changed the index to store rank the same way.

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Actually I meant to say indexes... However when optimize(numSegments) is used they're segments... On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic wrote: > I think Jason meant "15-20GB segments"? > Otis

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Right... It all blends together, I need an NLP analyzer for my emails On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic wrote: > I think Jason meant "15-20GB segments"? > Otis

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Otis Gospodnetic
I think Jason meant "15-20GB segments"? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Yes... You could hack LogMergePolicy to do something else. I use optimize(numSegments=5) regularly on 80GB indexes that, if optimized to 1 segment, would thrash the IO excessively. This works fine because 15-20GB segments are plenty large and fast. On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong wrote:
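
A minimal sketch of that call under the Lucene 2.9/3.0 API, assuming an already-open IndexWriter named writer; optimize(int) merges the index down to at most the given number of segments instead of a single one:

    // merge down to at most 5 segments rather than 1, bounding the
    // largest merge instead of rewriting the whole 80GB index
    writer.optimize(5);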

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Seems like optimize() only cares about the final number of segments rather than the size of each segment. Is that so? On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > There's a different method in LogMergePolicy that performs the > optimize... Right, so normal mer

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
There's a different method in LogMergePolicy that performs the optimize... Right, so normal merging uses the findMerges method, then there's a findMergeOptimize (method names could be inaccurate). On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong wrote: > Do you mean MergePolicy is only used

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Do you mean MergePolicy is only used during index time and will be ignored by the optimize() process? On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Oh ok, you're asking about optimizing... I think that's a different > algorithm inside LogMergePolicy.

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Oh ok, you're asking about optimizing... I think that's a different algorithm inside LogMergePolicy. I think it ignores the maxMergeMB param. On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong wrote: > Thanks, Jason. > > Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(10

Re: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Erick Erickson
Ooooh, isn't that easier. You just prompted me to think that you don't even have to do that: just index the pairs as single tokens (KeywordAnalyzer? but watch out for no case folding)... On Wed, Jan 13, 2010 at 4:30 PM, Digy wrote: > How about using languages as fieldnames? > Doc1(Ra): >
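
A minimal sketch of the pairs-as-single-tokens idea, with hypothetical field and token names; Field.Index.NOT_ANALYZED keeps each "language:years" pair intact as one token, and lower-casing by hand stands in for the case folding that would otherwise be skipped:

    Document doc = new Document();
    // one token per experience pair; lower-cased manually since no
    // analyzer folds case for a NOT_ANALYZED field
    doc.add(new Field("exp", "java:5", Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("exp", "c:2", Field.Store.NO, Field.Index.NOT_ANALYZED));
    // match resumes with exactly 5 years of Java AND 2 years of C
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("exp", "java:5")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("exp", "c:2")), BooleanClause.Occur.MUST);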

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Thanks, Jason. Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(100) will prevent merging of two segments that are each larger than 100 MB at optimize time? If so, why would I still see segments larger than 200 MB? On Wed, Jan 13, 2010 at 1:43 PM, Jaso

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Hi Trin, There was recently a discussion about this; the max size is for the before-merge segments, rather than the resultant merged segment (if that makes sense). It'd be great if we had a merge policy that limited the resultant merged segment, though that'd be a rough approximation at best. Jas

Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Hi, I am trying to optimize the index, which merges different segments together. Let's say the index folder is 1 GB in total; I need each segment to be no larger than 200 MB. I tried to use LogByteSizeMergePolicy and setMaxMergeMB(100) to ensure no segment after merging would be over 200 MB. How
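
A minimal sketch of the setup being described, using the pre-3.1 setter API and assuming an already-open IndexWriter named writer; note that, as the replies in this thread explain, maxMergeMB limits the segments selected for merging, not the size of the merged result, and the optimize path may ignore it:

    LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy(writer);
    // segments over 100 MB stop being candidates for normal merges;
    // two just-under-100MB segments can still merge into ~200 MB
    mergePolicy.setMaxMergeMB(100.0);
    writer.setMergePolicy(mergePolicy);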

RE: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Digy
How about using languages as fieldnames? Doc1(Ra): Java:5, C:2, PHP:3. Doc2(Rb): Java:2, C:5, VB:1. Query: Java:5 AND C:2. DIGY
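
A minimal sketch of this field-per-language layout; each language becomes its own field holding the years as an exact (NOT_ANALYZED) term, so "Java:5 AND C:2" turns into a conjunction of two TermQuerys:

    Document ra = new Document();
    ra.add(new Field("Java", "5", Field.Store.NO, Field.Index.NOT_ANALYZED));
    ra.add(new Field("C", "2", Field.Store.NO, Field.Index.NOT_ANALYZED));
    ra.add(new Field("PHP", "3", Field.Store.NO, Field.Index.NOT_ANALYZED));
    // Query: Java:5 AND C:2
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("Java", "5")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("C", "2")), BooleanClause.Occur.MUST);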

Re: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Erick Erickson
One approach would be to do this with multi-valued fields. The idea here is to index all your E fields in the *same* Lucene field with an increment gap (see getPositionIncrementGap) > 1. For this example, assume getPositionIncrementGap returns 100. Then, for your documents you have something like
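
A minimal sketch of the multi-valued-field approach against the Lucene 3.0-era analyzer API, with a hypothetical "experience" field; the gap pushes consecutive values 100 positions apart so position-based queries cannot match across two Experience groups:

    class GapAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_30);
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
        }
        @Override
        public int getPositionIncrementGap(String fieldName) {
            return 100; // successive values of the same field sit 100 positions apart
        }
    }
    // index each Experience group as another value of the *same* field:
    doc.add(new Field("experience", "Java 5", Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("experience", "C 2", Field.Store.NO, Field.Index.ANALYZED));
    // a PhraseQuery or SpanNearQuery with slop < 100 stays within one group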

Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread TJ Kolev
Greetings, Let's assume I have to index and search "resume" documents. Two fields are defined: Language and Years. The fields are associated together in a group called Experience. A resume document may have 0 or more Experience groups: Ra{ E1(Java,5); E2(C,2); E3(PHP,3);} Rb{ E1(Java,2); E2(C,5);

Re: RangeFilter

2010-01-13 Thread AlexElba
Thanks, Steve. Mike, for now I cannot upgrade...

Re: Performance Results on changing the way fields are stored

2010-01-13 Thread Paul Taylor
Grant Ingersoll wrote: On Jan 5, 2010, at 7:44 AM, Paul Taylor wrote: So currently in my index I index and store a number of small fields, I need both so I can search on the fields, then I use the stored versions to generate the output document (which is either an XML or JSON representatio

Re: NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Konstantyn Smirnov
Thanks for the answer, Mike. Indeed it is possible, but practically... I start the loop immediately after searcher.search(), and with my index size of 3 MB the whole operation takes at most 100 ms. Given the rate of about 50 updates (addDocument()/expungeDeletes()/IR.reopen()) per day, the probability

Re: RangeFilter

2010-01-13 Thread Michael McCandless
Actually, as of Lucene 2.9 (if you can upgrade), you should use NumericField to index numerics and NumericRangeQuery to do range search/filter -- it all just works -- no more padding. Mike On Wed, Jan 13, 2010 at 1:17 PM, Steven A Rowe wrote: > Hi AlexElba, > > The problem is that Lucene only kn
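
A minimal sketch of the 2.9+ numeric APIs applied to the "rank" field from this thread; NumericRangeFilter is the drop-in counterpart of the RangeFilter being replaced:

    // indexing: rank is stored as a trie-encoded numeric value
    doc.add(new NumericField("rank").setIntValue(rank));
    // searching/filtering: matches 3 <= rank <= 10, both ends inclusive
    Query query = NumericRangeQuery.newIntRange("rank", 3, 10, true, true);
    Filter filter = NumericRangeFilter.newIntRange("rank", 3, 10, true, true);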

RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba, The problem is that Lucene only knows how to handle character strings, not numbers. Lexicographically, "3" > "10", so you get the expected results (nothing). The standard thing to do is transform your numbers into strings that sort as you want them to. E.g., you can left-pad the
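
A short illustration of the left-padding idea with a hypothetical fixed-width helper; NumberTools.longToString, used elsewhere in this thread, does the same job with a base-36 encoding. Either way, the indexed values and the filter endpoints must go through the identical transformation:

    // left-pad so lexicographic order matches numeric order
    static String pad(long n) { return String.format("%010d", n); }
    // pad(3)  -> "0000000003"
    // pad(10) -> "0000000010", and now "0000000003" < "0000000010" as strings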

RangeFilter

2010-01-13 Thread AlexElba
Hello, I am currently using Lucene 2.4 and have documents with 3 fields: id, name, rank. I have a query and a filter, and when I try to use a range filter on rank I am not getting any results back: RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true); I have documents which are in

Re: Extracting contact data

2010-01-13 Thread Erick Erickson
Before answering, how do you measure "proximity"? You can make Lucene work with locations (there's an example in Lucene in Action) readily enough though. HTH, Erick On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca < gianluca.orte...@truvo.com> wrote: > Hi community, > > > > I have a genera

Re: Extracting contact data

2010-01-13 Thread Karl Wettin
Lucene will probably only be helpful if you know what you are looking for, e.g. that you search for a given person, a given street and given time intervals. Is this what you want to do? If you instead are looking for a way to really extract any person, street and time interval that a docum

Re: NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Michael McCandless
Is it possible you are closing the searcher before / while running that for loop? Mike On Wed, Jan 13, 2010 at 9:26 AM, Konstantyn Smirnov wrote: > > Hi all > > Consider the following piece of code: > > Searcher s = this.getSearcher() > def hits = s.search( query, filter, params.offset + params.

Extracting contact data

2010-01-13 Thread Ortelli, Gian Luca
Hi community, I have a general understanding of Lucene concepts, and I'm wondering if it's the right tool for my job: - I need to extract data, e.g. time intervals ("8am - 12pm") and street addresses, from a set of files. The common issue with these data units is that they contain spaces and

Re: Field creation with TokenStream and stored value

2010-01-13 Thread Andrzej Bialecki
On 2010-01-13 15:29, Benjamin Heilbrunn wrote: Thanks! Didn't know that it's so easy ;) 2010/1/13 Uwe Schindler: Why not simply add the field twice, one time with TokenStream, one time stored only? Internally stored/indexed fields are handled like that. Actually, you can implement your own F

Re: Field creation with TokenStream and stored value

2010-01-13 Thread Benjamin Heilbrunn
Thanks! Didn't know that it's so easy ;) 2010/1/13 Uwe Schindler : > Why not simply add the field twice, one time with TokenStream, one time > stored only? Internally stored/indexed fields are handled like that. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaph

NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Konstantyn Smirnov
Hi all Consider the following piece of code: Searcher s = this.getSearcher() def hits = s.search( query, filter, params.offset + params.max, sort ) for( hit in hits.scoreDocs[ lower..

RE: Field creation with TokenStream and stored value

2010-01-13 Thread Uwe Schindler
Why not simply add the field twice, one time with TokenStream, one time stored only? Internally stored/indexed fields are handled like that. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
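
A minimal sketch of this suggestion, assuming hypothetical myTokenStream and originalText variables; the Field(String, TokenStream) constructor covers the indexed side, and a second stored-only Field under the same name carries the value:

    Document doc = new Document();
    // indexed (and only indexed) via the custom token stream
    doc.add(new Field("body", myTokenStream));
    // stored (and only stored) copy of the original value, same field name
    doc.add(new Field("body", originalText, Field.Store.YES, Field.Index.NO));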

Re: Field creation with TokenStream and stored value

2010-01-13 Thread Benjamin Heilbrunn
Sorry for pushing this thing. Would it be possible to add the requested constructor, or would it break any of Lucene's logic? 2010/1/11 Benjamin Heilbrunn : > Hey out there, > > in Lucene it's not possible to create a Field based on a TokenStream > AND supply a stored value. > > Is there a rea

Re: Is there any difference in a document between one added field with a number of terms and a field added a number of times?

2010-01-13 Thread Paul Taylor
So not much help here (I wonder if it's because I posted 3 questions in one day), but I've made some progress in my understanding. I understand there is only one norm per field, and I think Lucene does not differentiate between adding the same field a number of times and adding multiple texts to th

Re: Is there a way to limit the size of an index?

2010-01-13 Thread Michael McCandless
On Sun, Jan 10, 2010 at 7:33 AM, Dvora wrote: > > I'm storing and reading the documents using Compass, not Lucene directly. I > didn't touch those parameters, so I guess the default values are being used > (I do see cfs files in the index). OK. If your index directory has *.cfs files, then you a

Re: Text extraction from ms word doc

2010-01-13 Thread Michael McCandless
We could also fix WhitespaceAnalyzer to filter that character out? (Or you could make your own analyzer to do so...). You could also try asking on the tika-user list whether Tika has a solution for mapping "extended" whitespace characters... Mike On Mon, Jan 11, 2010 at 3:04 PM, maxSchlein wrot
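
A minimal sketch of the "make your own analyzer" route against the Lucene 3.0-era API, assuming the offending character is U+00A0 (non-breaking space), which Character.isWhitespace() does not recognize and WhitespaceTokenizer therefore leaves inside tokens:

    Tokenizer tokenizer = new WhitespaceTokenizer(reader) {
        @Override
        protected boolean isTokenChar(char c) {
            // split on NBSP too, not only Java whitespace
            return super.isTokenChar(c) && c != '\u00A0';
        }
    };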

Re: Supported way to get segment from IndexWriter?

2010-01-13 Thread Michael McCandless
Indeed, getReader is an expensive way to get the segment count (it flushes the current RAM buffer to disk as a new segment). Since SegmentInfos is now public, you could use SegmentInfos.read to read the current segments_N file, and then call its .size() method? But, this will only count as of the
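
A sketch of the SegmentInfos route under the 2.9/3.0 API, assuming a Directory named dir; per the caveat above, this reflects only the last commit, not docs still buffered in RAM:

    SegmentInfos infos = new SegmentInfos();
    infos.read(dir);                 // loads the most recent segments_N file
    int segmentCount = infos.size(); // counts committed segments only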

Re: lucene index file randomly crash and need to reindex

2010-01-13 Thread Michael McCandless
If you follow the rules Otis listed, you should never hit index corruption, unless something is wrong with your hardware. Or, if you hit an as-yet-undiscovered bug in Lucene ;) Mike On Wed, Jan 13, 2010 at 1:11 AM, zhang99 wrote: > > what is the longest time you ever keep index file without req