Re: best practice: 1.4 billions documents

2010-11-22 Thread eks dev
Am I the only one who thinks this is not the way to go? MultiReader (or MultiSearcher) is not going to fix your problems. Having 1.4B documents on one machine is a big number, no matter how you partition them (unless you have some really expensive hardware at your disposal). Did I miss the poin

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
t to speak about > documentation. > > About clear(Object sentinel) - is it still a question (now that you > understood getSentinelValue())? I think we should not make it final anyway. > It restricts PQ extensions unnecessarily ... > > Shai > > On Wed, Sep 30, 2009 at 8:41

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
forget the question about initialize(), reading javadoc before asking already answered questions helps a lot, sorry for the noise. ...NOTE in getSentinelObject() javadoc... - Original Message > From: eks dev > To: java-user@lucene.apache.org > Sent: Wednesday, 30 September

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
o be sentinels again. And of course add a reset() method to TSDC. > > On Wed, Sep 30, 2009 at 5:26 PM, eks dev wrote: > > > Thanks Mark, Shai, > > I was getting confused by so many possibilities to do the "almost the same > > thing" ;) > > > > But have f

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
> You also do want to specify whether or not to collect docs in order if > > you care about performance: > > > > public static TopScoreDocCollector create(int numHits, boolean > > docsScoredInOrder) > > > > ie: > > > > TopScoreDocCollector.create(

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
. - Original Message > From: eks dev > To: java-user@lucene.apache.org > Sent: Wednesday, 30 September, 2009 11:43:26 > Subject: TSDC, TopFieldCollector & co > > Hi All, > > What is the best way to achieve the following and what are the differences, >

TSDC, TopFieldCollector & co

2009-09-30 Thread eks dev
Hi All, What is the best way to achieve the following and what are the differences, if I say "I do not normalize scores, so I do not need max score tracking, I do not care if hits are returned in doc id order, or any other order. I need only to get maxDocs *best scoring* documents": OPTION 1:
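The requirement stated above (no max-score tracking, no ordering guarantee, only the best-scoring hits) can be sketched with a bounded min-heap. This is an illustration in plain Java, not Lucene's TopScoreDocCollector; the class TopNByScore and its methods are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper, not a Lucene class: keeps only the n best-scoring doc ids. */
class TopNByScore {
    /** One (docId, score) pair. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNByScore(int n) {
        this.n = n;
        this.heap = new PriorityQueue<>(Math.max(1, n),
                (a, b) -> Float.compare(a.score, b.score));
    }

    /** Offers one hit; it is kept only if it beats the weakest retained hit. */
    void collect(int doc, float score) {
        if (n <= 0) return;
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(doc, score));
        }
    }

    /** Returns the retained doc ids, best score first. Drains the heap. */
    List<Integer> bestDocs() {
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) out.add(0, heap.poll().doc);
        return out;
    }
}
```

Each hit costs at most one heap replacement, and nothing beyond the n retained hits is tracked.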

Re: Loading an index into memory

2009-07-23 Thread eks dev
I do not know much about RAM FS, but I know for sure if you have enough memory for RAMDirectory, you should go for it. That gives you the fastest and the most stable performance, no OS swaps, no sudden performance drops... Uwe's tip is very good, if you/OS occasionally need RAM for other things

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
> > How do you handle stop words in phrase queries? ok, good point! You found another item for list of BADs... but not for me as we do not use phrase Qs to be honest, I do not even know how they are implemented... but no, there are no positions in such cache... well, they remain slowe

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
t exist with new Lucene... > >> > I did not verify it again on the old one, but hey, who cares. Trunk is > clean > >> and, at least so far, our favourite QA team has nothing to complain about > >> ... > >> > > >> > They will keep it u

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
a while... so if somethings comes up you > will hear from me... > > Thanks again to all. > > > > Cheers, Eks > > > > > > > > - Original Message > >> From: eks dev > >> To: java-user@lucene.apache.org > >> Sent: T

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
up you will hear from me... Thanks again to all. Cheers, Eks - Original Message > From: eks dev > To: java-user@lucene.apache.org > Sent: Thursday, 16 July, 2009 14:40:26 > Subject: Re: speed of BooleanQueries on 2.9 > > > ok new facts, less chaos :) >

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
ok new facts, less chaos :) - LUCENE-1744 fixed it definitely; I have it confirmed Also, we found another example of the Query that was stuck (t1 t2 t3)~2 ... this is also fixed with LUCENE-1744 Re: "some queries are 4X slower than before". Was that a different issue? (Because this issu

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
I am getting lost as well, maybe I managed to confuse myself and everybody else here. But all agree, it would be good to know why it works now Re. Query rewriting. This Query gets printed with /// BooleanQuery q; q.toString() search(q, null, 200): /// => this is the Query that enters

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
Trace taken on trunk version (with Yonik's bug fixed and LUCENE-1744, which fixed the problem somehow). Full trace is too big (3.5Mb for this list), therefore only beginning and end: Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:mar

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
well, QA team is not there, and I am "abusing" customer's sysadmin, and it will cost me only a beer if I stop now :) Will post traces tomorrow, daylight does better ... I will have them done on trunk version (fixed two bugs) ... - Original Message > From: Michael McCandless > To

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
warmduscher :) good night - Original Message > From: Uwe Schindler > To: java-user@lucene.apache.org > Sent: Thursday, 16 July, 2009 1:06:30 > Subject: RE: speed of BooleanQueries on 2.9 > > Same here, too late! Good night! > And the blood glucose level is very low, too - very bad

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
> NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 NAME:pirkarski^0.22073482 > NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 > NAME:polikarski^0.20172001 > NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 > NAME:siekarski^0.20281483))^2.0) > > > > >

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
I just do not see how... Also not really expected, but this query runs over BS2, shouldn't +( whatever whatever1...) run as BS? what does it mean to have MUST +() at the top level? it is a bit late here, I am going to bed ... Thanks a lot to all involved! Eks - Original Message -

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
)^2.0) - Original Message > From: eks dev > To: java-user@lucene.apache.org; yo...@lucidimagination.com > Sent: Wednesday, 15 July, 2009 23:57:22 > Subject: Re: speed of BooleanQueries on 2.9 > > > > it works with current trunk, 10 Minutes ago built?! >

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
it works with current trunk, 10 Minutes ago built?! if I put lucene from yesterday, the same symptoms like yesterday... Mike's instrumented version is running ... - Original Message > From: Yonik Seeley > To: java-user@lucene.apache.org > Sent: Wednesday, 15 July, 2009 23:34:29

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
DocIdSetIterators. The ones from Lucene core > all implement the new API and do it more effective than the example code :-) > > Or does Eks Dev use custom DocIdSetIterators? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen >

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
> If I make a patch that adds verbosity to what BS is doing, can you run > it & post the output? can do, it can take some time - Original Message > From: Michael McCandless > To: java-user@lucene.apache.org > Sent: Wednesday, 15 July, 2009 20:54:25 > Subject: Re: speed of BooleanQuer

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
g > >> >> Sent: Wednesday, 15 July, 2009 17:16:23 > >> >> Subject: Re: speed of BooleanQueries on 2.9 > >> >> > >> >> So now I'm confused. Since your query has required (+) clauses, the > >> >> setAllowDocsOutOfOrder should

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
> Is it possible for you to make the problem happen such that we get > line numbers in this traceback? sure, I will build lucene trunk with debug/line numbers enabled and ask customer's QA to run it again... > Is CPU pegged when it's stuck? Yes!, One core was 100% hot - Original Mes

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
, 2009 at 7:04 PM, eks dev wrote: > >> > > >> > I do not know exactly why, but > >> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, > >> > but > with > >> setAllowDocsOutOfOrder(false); no problems whatsoever > >> &

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
whatsoever > > > > not really scientific method to find such bug, but does the job and makes > > me > happy. > > > > Empirical, "deprecated methods are not to be taken as thoroughly tested, as > they have short life expectancy" > > >

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev
something weird happening w/ BooleanScorer... indeed, my first impression was jvm bug triggered on some rare conditions... but we tried old jvm (1.5), the latest 1.6 U14, -client instead of -Xbatch -server... no changes. We never managed to wait so long to see it finish, so I am not sure if

Re: speed of BooleanQueries on 2.9

2009-07-14 Thread eks dev
ot to be taken as thoroughly tested, as they have short life expectancy" - Original Message ---- > From: eks dev > To: java-user@lucene.apache.org > Sent: Wednesday, 15 July, 2009 0:24:43 > Subject: Re: speed of BooleanQueries on 2.9 > > > Mike, we are definit

Re: speed of BooleanQueries on 2.9

2009-07-14 Thread eks dev
earch(Unknown Source) org.apache.lucene.search.Searcher.search(Unknown Source) - Original Message > From: eks dev > To: java-user@lucene.apache.org > Sent: Monday, 13 July, 2009 13:28:45 > Subject: Re: speed of BooleanQueries on 2.9 > > Hi Mike, > > getMa

Re: speed of BooleanQueries on 2.9

2009-07-13 Thread eks dev
Hi Mike, getMaxNumOfCandidates() in test was 200, Index is optimised and read-only. We found (due to an error in our warm-up code, funny) that only this Query runs slower on 2.9. A hint where to look could be that this Query contains two, the most frequent tokens in two particular fields

Re: OOM with 2.9

2009-07-13 Thread eks dev
Hi Mike, thanks for looking into it... I am now positive, it was definitely a problem for OS to map() large continuous chunk of process memory... if I use this machine for a while as a desktop, eclipse,... I get the same problem again... but after cold restart, mapping succeeds. The proble

speed of BooleanQueries on 2.9

2009-07-12 Thread eks dev
Is it possible that the same BooleanQuery on 2.9 runs significantly slower than on 2.4? we have some strange effects where the following query runs approx 4(ouch!) times slower on 2.9, test done by 1000 times executing the same Query... But! if I run test from some real Query log with mixed Qu

Re: OOM with 2.9

2009-07-12 Thread eks dev
-Xms Xms were set to the same value imo, the problem was to convince OS (Win XP) to map huge continuous block... there were no jvm processes running at the same time, just this one... but after killing some desktop processes and restarting machine it worked. hmm, MMapDirectory has support for

Re: OOM with 2.9

2009-07-12 Thread eks dev
for tips Uwe. > > > > > > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > > > -Original Message- > > > From: eks dev [mailto

Re: OOM with 2.9

2009-07-12 Thread eks dev
> http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: eks dev [mailto:eks...@yahoo.co.uk] > > Sent: Sunday, July 12, 2009 1:24 PM > > To: java-user@lucene.apache.org > > Subject: Re: OOM with 2.9 > > > > > >

Re: OOM with 2.9

2009-07-12 Thread eks dev
Stack trace java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(Unknown Source) at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown Source) at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown Source) at org.apache.lucene.store.MMapDirectory.openInput(Un

OOM with 2.9

2009-07-12 Thread eks dev
Hi, We just upgraded to 2.9 and noticed some (to me) not expected OOM. We use MMapDirectory and after upgrade, on exactly the same Index/machine/jvm/params/setup... we cannot start index as mapping screams "No memory" any explanation why this could be the case? ---

Re: Scaling out/up or a mix

2009-06-29 Thread eks dev
depends on your architecture, will you partition your index? What is max expected size of your index (you said 128G and growing..) what do you mean with growing? You have in both options enough memory to load it into RAM... I would definitely try to have fewer machines and a lot of memory, so that

Re: Optimizing unordered queries

2009-06-26 Thread eks dev
also see, http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/BooleanQuery.html#getAllowDocsOutOfOrder() - Original Message > From: Nigel > To: java-user@lucene.apache.org > Sent: Friday, 26 June, 2009 4:11:53 > Subject: Optimizing unordered queries > > I recently pos

Re: Optimizing unordered queries

2009-06-26 Thread eks dev
You omitNorms(), did you also omitTf()? when something like https://issues.apache.org/jira/browse/LUCENE-1345 gets committed, you will have a possibility to see some benefits (e.g. by packing single postings lists as Filters). The code there optimises exactly that case as filters contain no Sco

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread eks dev
another performance tip, what helps "a lot" is collection sorting before you index. if you can somehow logically partition your index, you can improve locality of reference by sorting. What I mean by this: imagine index with following fields: zip, user_group, some text if typical query
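A minimal sketch of the pre-index sort described above, in plain Java. The record shape (zip, userGroup, text) follows the fields named in the post; the sort key order (zip first, then user group) and the class name PreIndexSort are assumptions for illustration, since the post is cut off before it describes the typical query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: orders records so that similar ones get adjacent doc ids. */
class PreIndexSort {
    static final class Rec {
        final String zip;
        final String userGroup;
        final String text;
        Rec(String zip, String userGroup, String text) {
            this.zip = zip;
            this.userGroup = userGroup;
            this.text = text;
        }
    }

    /** Returns a copy sorted by zip, then user group (assumed partitioning keys). */
    static List<Rec> sortForLocality(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Rec r) -> r.zip)
                .thenComparing((Rec r) -> r.userGroup));
        return sorted;
    }
}
```

Documents would then be added to the index in the returned order, so records sharing a zip end up in neighbouring doc ids and their postings sit close together.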

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread eks dev
We've also had the same problem on 150Mio doc setup (Win 2003, java 1.6). After monitoring response time distribution over time for a couple of weeks, it was clear that such long running response times were due to bad warming-up. There were peaks shortly after index reload (even comprehensive warmi

Re: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread eks dev
there is one case where MMAP does not beat RAM, initial warm-up after process restart. With MMAP it can take a while before you get up to speed. MMAP with reopen is the best, if you run without restart. - Original Message > From: Uwe Schindler > To: java-user@lucene.apache.org >

Re: Binary indexing / query efficiency

2009-04-14 Thread eks dev
you can store binary value? e.g. with: Field(String name, byte[] value, Field.Store store) You could store all your fields as byte[], so you get them back as byte[]. How you index them is just another problem, but you are having no problems with speed in your case, leave it as it is. try simp

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread eks dev
Have you tried NGram SpellChecker + Query expansion? This is quite similar to your proposal, you have your priority queue in SpellChecker - Original Message > From: mark harwood > To: java-user@lucene.apache.org > Sent: Wednesday, 18 February, 2009 11:54:18 > Subject: Re: Lucene sear

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-20 Thread eks dev
The simplest sorting would be to sort your collection before indexing, because Lucene will preserve order of added documents I think nutch sorts index afterward somehow, but I do not know how this works by omitTf() I mean the new feature in the trunk version, see https://issues.apache.org/ji

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-19 Thread eks dev
hi Cedric, has nothing to do with SSD... but > > All queries involves a Date Range Filter and a Publication Filter. > We've used WrappingCachingFilters for the Publication Filter for there > are only a limited number of combinations for this filter. For the > Date Range Filter we just let it r

Re: Fastest way to get just the "bits" of matching documents

2008-07-22 Thread eks dev
no, at the moment you can not make pure boolean queries. But 1.5 seconds on 10Mio document sounds a bit too much (we have well under 200mS on 150Mio collection) what you can do: 1. use Filter for high frequency terms, e.g. via ConstantScoreQuery as much as you can, but you have to cache them (C

Re: How to avoid duplicate records in lucene

2008-07-21 Thread eks dev
you could maintain your bloom filter and check only "positives" if they are not false positives with exact search, if you have small percentage of duplicates (unique documents dominate updates) this will help you a lot on performance side - Original Message > From: markharw00d <[EMA
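The idea above, sketched in plain Java: a Bloom filter answers "definitely new" cheaply, and only the "maybe seen" cases pay for an exact check. The class DuplicateGate, the two hash functions, and the in-memory HashSet standing in for the exact index search are all assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical duplicate gate: Bloom filter first, exact lookup only on positives. */
class DuplicateGate {
    private final BitSet bits;
    private final int size;
    // Stand-in for the expensive exact search against the index.
    private final Set<String> exactStore = new HashSet<>();
    private int exactLookups = 0;

    DuplicateGate(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int h2(String key) {
        int h = 0x811c9dc5; // FNV-1a offset basis
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193; // FNV-1a prime
        }
        return Math.floorMod(h, size);
    }

    /** Returns true if the key was new and has been recorded, false if it is a duplicate. */
    boolean addIfAbsent(String key) {
        int a = h1(key);
        int b = h2(key);
        if (bits.get(a) && bits.get(b)) {
            exactLookups++; // Bloom positive: confirm with the exact check
            if (exactStore.contains(key)) return false;
        }
        bits.set(a);
        bits.set(b);
        exactStore.add(key);
        return true;
    }

    int exactLookups() {
        return exactLookups;
    }
}
```

When unique documents dominate, most calls never reach the exact check, which is where the performance gain comes from.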

Re: Boolean expression for no terms OR matching a wildcard

2008-07-18 Thread eks dev
Analyzer that detects your condition "ALL match something", if possible at all... e.g. "800123456 80034543534 80023423423" -> 800 then you put it in ALL_MATCH field and match this condition against it... if this prefix needs to be variable, you could extract all matching prefixes to this fiiel

Re: Mixing non scored an scored queries

2008-07-15 Thread eks dev
do not forget that Filter does not have to be loaded in memory, not any more since LUCENE-584 commit! Now a skipping iterator is all you need. translated, you could use: ConstantScoreQuery created with Filter made from TermDocs (you need to implement only DocIdSet / DocIdSetIterator, thi

Re: document retrieval 100 times slower after finishing some heavy disk operation

2008-06-29 Thread eks dev
yes, we have seen this many times. The problem is, especially on windows ,that some simple commands like copy make havoc of File System cache, as matter of fact, we are not sure it is the cache that is making problems, generally all IO operations start blocking like crazy (we have seen this effe

Re: index corruption with latest lucene

2008-05-05 Thread eks dev
hmm, if I am not wrong, it looks awfully similar to the Exception we have seen and concluded it is some black magic with corrupt memory chip or what-not, but the fact we are not alone makes me wonder now... Subject of this thread was "Strange Exception"... we were able to use this very same inde

Re: Using Lucene to find duplicate/similar names

2008-04-16 Thread eks dev
NGrams will do ok, depends a lot on what you are up to, if there is a person looking at result lists making decision, it will work fine as default TF/IDF similarity will give you ok order of hits, but if you need to set some cutoff value to decide automatically if this is a match or not, then y
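For readers unfamiliar with the n-gram approach mentioned above, here is a small plain-Java sketch of character n-grams and their overlap between two names. It is not Lucene's scoring (the post refers to the default TF/IDF similarity); the class NameNGrams and the Jaccard measure are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: character n-grams of a name and a simple overlap measure. */
class NameNGrams {
    /** Returns all substrings of length n; a shorter non-empty input yields itself. */
    static List<String> ngrams(String s, int n) {
        List<String> out = new ArrayList<>();
        if (s.isEmpty()) return out;
        if (s.length() < n) {
            out.add(s);
            return out;
        }
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Jaccard overlap of the two n-gram sets, in [0, 1]. */
    static double jaccard(String a, String b, int n) {
        Set<String> ga = new HashSet<>(ngrams(a, n));
        Set<String> gb = new HashSet<>(ngrams(b, n));
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        int union = ga.size() + gb.size() - shared.size();
        return union == 0 ? 0.0 : (double) shared.size() / union;
    }
}
```

With trigrams, "maria" and "marai" share only "mar" out of five distinct grams, which shows how quickly a transposition lowers the overlap and why a fixed cutoff on such a number is delicate.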

Re: Lucene Compression

2008-04-02 Thread eks dev
the example you have sent is too small for the type of compression implemented in lucene. The problem is that you have to store decoding symbol table , header ...* for each* document you compress. The best you can do for this would be to use some compressor with static decoding table (some ent
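The per-document overhead described above can be observed with java.util.zip.Deflater, used here as an assumed stand-in for the compressor under discussion. The class name SmallDocCompression is hypothetical.

```java
import java.util.zip.Deflater;

/** Hypothetical demo: measures deflate output size for a single input. */
class SmallDocCompression {
    /** Returns the number of bytes produced by deflating the input on its own. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[256];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // count bytes, discard content
        }
        deflater.end();
        return total;
    }
}
```

A six-byte input such as "lucene" comes out larger than it went in, because the stream header, block header and checksum are paid per document, while a long repetitive input shrinks considerably.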

Re: Solid State Drives vs. RAMDirectory

2008-03-13 Thread eks dev
>>Upping the amount of RAM does not help us when the index is replaced before we pass the 50.000 queries. have you seen https://issues.apache.org/jira/browse/LUCENE-1035 , It would be interesting to see if this one changes HD numbers . You have plenty of free memory in this setup...

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread eks dev
you said, if an Index is optimized, isDeleted() does not present performance problem? I think there is still check for null in synchronized method, can jvm optimize this, I doubt it? - Original Message From: German Kondolf <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tu

Re: Does Lucene support partition-by-keyword indexing?

2008-03-02 Thread eks dev
I did not follow this discussion from the start, but I guess you could cleanly achieve this by implementing org.apache.lucene.index.FilterIndexReader have fun. e. - Original Message From: 仇寅 <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, 2 March, 2008 3:05:05 AM Sub

Stored Field vs "offset plus external file"?

2008-02-13 Thread eks dev
I would like to try to replace our external storage of documents with Lucene stored field, so a few questions before we proceed: Background: We store currently complete documents in a simple binary file and only keep offsets into this file as a Stored field in Lucene index. Documents (compre

Re: Spell checking street names

2008-01-31 Thread eks dev
Otis, I think it was proposed to have spell checker that works on multiple tokens / Document: where field to be searched with SpellChecker looks like "lucene search library" does not get tokenized and then fed to the SpellChecker, rather having this as a "single token" that gets chopped int

Re: Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread eks dev
300k documents is something I would consider very small. Anything under 10Mio documents IMHO is small for Lucene (meaning, commodity hardware, 1G RAM should give you well under second response times). The number of words is not all that important, much more important would be the number of uniqu

Re: Investigating Lucene's Applicability to [Unusual?] Use Case

2007-06-14 Thread eks dev
sounds easy (I said sounds :), e.g. your Statement becomes Document in Lucene lingo, you make it with 3-4 Lucene fields, CONTENT (Tokenized, not stored) OFFSET(not indexed, stored) - offset in file of the first byte of your statement DOC_LENGTH(not indexed, stored) - if you have no END-OF-Statem
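The OFFSET and DOC_LENGTH fields described above are enough to fetch a statement from the external file. A minimal plain-Java sketch follows; the class StatementFile, the UTF-8 encoding and the append-only layout are assumptions for illustration, and the Lucene side (storing the two numbers as stored fields) is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical external store: statements are appended, then read back by offset. */
class StatementFile {
    /** Appends a statement and returns {offset, length} to keep as stored fields. */
    static long[] append(Path file, String statement) {
        byte[] bytes = statement.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long offset = raf.length();
            raf.seek(offset);
            raf.write(bytes);
            return new long[] { offset, bytes.length };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads exactly length bytes starting at offset and decodes them. */
    static String read(Path file, long offset, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buffer = new byte[length];
            raf.seek(offset);
            raf.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

At search time, a hit's stored offset and length go straight into read(), so the index itself stays small.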

Re: Lucene for name matching

2007-04-06 Thread eks dev
I've been doing this in past couple of years, and yes we use Lucene for some key parts of the problem. Basically, the problem you face is on how to run extremely high recall without compromising precision, hard! the key problem is performance, imagine you have DB with 10Mio persons you need to

Re: Performance between Filter and HitCollector?

2007-03-15 Thread eks dev
: Performance between Filter and HitCollector? eks dev and others - have you tried using the code from LUCENE-584? Noticed any performance increase when you disabled scoring? I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if y

Re: Performance between Filter and HitCollector?

2007-03-14 Thread eks dev
just to complete this fine answer, there is also Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584) that could bring the best of both worlds via e.g. ConstantScoreQuery or another abstraction that enables disabling Scoring (where appropriate) - Original Message From: Ch

Re: Stop long running queries

2007-02-21 Thread eks dev
have a look at LuceneQueryOptimizer.java in nutch - Original Message From: Tim Johnson <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 21 February, 2007 3:34:36 PM Subject: Stop long running queries I'm having issues with some queries taking in excess of 500 secs t

Re: How to improve document retrieval speed.

2006-11-04 Thread eks dev
I would strongly suggest not storing these fields in lucene, just keep them as files and store some kind of url to get them later. that will boost your speed heavily. If you really, really need to store documents in lucene, try some compression. Also, so many fields hurt performance, any chance

Re: "Catalog" backend for document stored fields?

2006-10-20 Thread eks dev
1- is there someone out there that already wrote an extension to Lucene so that 'stored' string for each document/field is in fact stored in a centralized repository? Meaning, only an 'index' is actually stored in the document and the real data is put somewhere else. 2- If not, how ha

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread eks dev
have you considered hadoop "light" messaging RPC, should have significantly smaller latencies than RMI - Original Message From: Simon Wistow <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 4 October, 2006 3:26:38 PM Subject: Re: Searching documents on big index by u

Re: Re[2]: how to enhance speed of sorted search

2006-09-26 Thread eks dev
Paul's Matcher in Jira will almost enable this, indirectly but possible - Original Message From: karl wettin <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 26 September, 2006 11:30:24 PM Subject: Re: Re[2]: how to enhance speed of sorted search On 9/26/06, Chris Hoste

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
d special linguistic tricks are anyhow not so relevant for most situations for searching. Regular stemmer makes much greater distortion than this. Must find this code somewhere, I probably left something out in these emails - Original Message From: eks dev <[EMAIL PR

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
Hi Otis, Depends what you need to do with it, if you need this to be only used as "kind of stemming" for searching documents, solution is not all that complex. If you need linguistically correct splitting then it gets complicated. for the first case: Build SuffixTree with your dictionary (hope you

Re: WildcardFilter

2006-09-05 Thread eks dev
I would rather use this BitSet bits = new BitSet(reader.maxDocs()); //Not sure of exact method, lucene is not on this PC... instead of = new BitSet(reader.maxDocs()) - Original Message From: Mark Miller <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 5 September, 200

Re: Stemmer Implementation Strategy - feedback?

2006-08-08 Thread eks dev
I would suggest you to have a look at Egothor stemmer (http://www.egothor.org/book/bk01ch01s06.html), can be trained rather easily (if your only use of "roots" is for searching) I have only heard of it as a good thing, never tried it On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote: > > > >

Re: running a lucene indexing app as a windows service on xp, crashing

2006-08-06 Thread eks dev
XP Professional / Win 2003 Server, we had this issue on JVMs 1.5/1.6. It seems this happens "not so often" on 1.6/Win2003, but we have had this in production for only 2 weeks. We have single update machine that builds index in batch and replicates to many Index readers, so at least customers ar

Re: running a lucene indexing app as a windows service on xp, crashing

2006-08-04 Thread eks dev
This is windows/jvm issue . Have a look at how ant is dealing with it, maybe we could give it a try with something like that (I have not noticed ant having problems). We are not able to reproduce this in our environment systematically, so it would be great if you could patch your lucene with th

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread eks dev
have you tried to only collect doc-ids and see if the speed problem is there, or maybe to fetch only field values? If you have dense results it can easily be split() or addSymbolsToHash() what takes the time. I see 3 possibilities what could be slow, getting doc-ids, fetching field value or do

Re: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread eks dev
Did not check it, but solr is using SkippingFilter which is not yet committed in Lucene... so this will maybe not work? By the way, any reason today not to commit SkippingFilter to Lucene? I actually see nothing to do for this, but to commit existing SkippingFilter. If there is something I do

Re: question with spellchecker

2006-06-06 Thread eks dev
try your query like ((ducted^1000 duct~2) +tape) Or maybe (duct* +tape) or even better you could try to do some stemming (Porter stemmer should get rid of these ed-suffixes) and some of the above if this does not help, have a look at lingpipe spellChecker class as this looks like exactly what yo

Re: Lucene in Action

2006-06-06 Thread eks dev
Grab it now, it is worth all this money. - Original Message From: digby <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 6 June, 2006 11:59:53 AM Subject: Lucene in Action Does everyone recommend getting this book? I'm just starting out with Lucene and like to have a b

Re: Lucene search optimization

2006-05-31 Thread eks dev
or you could try n-gram approach with Spellchecker (you will find it in the contrib area). get suggestSimilars() and form your query, or even better ConstantScoreQuery via Filter. It works OK. Or if you have not so many Terms (could spare to load all terms in memory), you could try TernarySearch

Re: MMapDirectory vs RAMDirectory

2006-05-28 Thread eks dev
If you can use all that memory for index, I would say RAM. For long running indexes (to get os cache populated), MMAP will do just as good if you have any file system worth using. - Original Message From: Michael Chan <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, 28

Re: Search precondition: matching area

2006-05-16 Thread eks dev
try: 1. query-string: "hello +area:home" to get Filtering effect 2. to minimize scoring use boosts: "(hello)^HIGH_BOOST +(area:home)^LOW_BOOST" 3. If scoring via boosts does not work good enough for you, or is slow, use Filter interface from your code... search this list for Filter - Or

Re: Compressed BitSet

2006-03-09 Thread eks dev
Just a short one, it rocks in some cases (when actual BitSet/IntSet is compressible, long runs of set or clear bits...). Very good general BitSet representation. I have tried it and found no bugs so far (+- 2 months of using it). Unfortunately, there is an issue with license (not ASF compatible :(
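To show why long runs of set or clear bits compress well, here is a run-length sketch in plain Java. It is not the library discussed in this thread (whose encoding is not described here); the class RunLengthBits and its list-of-run-lengths format are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical run-length codec for a BitSet; not the library from the thread. */
class RunLengthBits {
    /**
     * Encodes bits[0..length) as alternating run lengths.
     * The first run counts clear bits and may be 0.
     */
    static List<Integer> encode(BitSet bits, int length) {
        List<Integer> runs = new ArrayList<>();
        boolean current = false; // value of the run being measured
        int pos = 0;
        while (pos < length) {
            int next = current ? bits.nextClearBit(pos) : bits.nextSetBit(pos);
            if (next < 0 || next > length) next = length; // run reaches the end
            runs.add(next - pos);
            pos = next;
            current = !current;
        }
        return runs;
    }

    /** Rebuilds the BitSet from alternating clear/set run lengths. */
    static BitSet decode(List<Integer> runs) {
        BitSet bits = new BitSet();
        boolean current = false;
        int pos = 0;
        for (int run : runs) {
            if (current) bits.set(pos, pos + run);
            pos += run;
            current = !current;
        }
        return bits;
    }
}
```

A set with one long run needs only three numbers regardless of its length, while a set with alternating bits gains nothing, which matches the "rocks in some cases" remark.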

Filter to support DocNrSkipper interface

2005-12-23 Thread eks dev
Hi, Would it be OK to add one method in Filter class that returns DocNrSkipper interface from Paul's "Compact sparse Filter" in jira LUCENE-328 This would be the first step for: - smooth integration of compact representations of the underlying BitSet in Filter (VInt and sorted int[]). They are

Re: A lot of short documents, optimal query?

2005-11-12 Thread eks dev
Hi Hoss, Good to hear that, I felt a bit fuzzy trying to grasp all the possibilities. I've read discussion from Doug's proposal for implementing non-scoring Query features, ConstantScoreQuery, Paul's FilteredQuery patch. And in summary options to avoid scoring: 1. There is a consensus that

Re: A lot of short documents, optimal query?

2005-11-11 Thread eks dev
Everything is perfect with your suggestion, scoring is not needed. I am going to try all also approach with ChainedFilter, but for this I need to think a bit more on how to get it right. The Query in the example is just one variation on the same topic and there are a few more cases I need to cover

Re: A lot of short documents, optimal query?

2005-11-10 Thread eks dev
Thanks Hoss, I've looked into it and you were absolutely right, could not be simpler. Two quick ones on the same topic (my personal education like questions): - What is the purpose of hashCode and equals methods in XxxFilter? (this is a question about actual usage in Lucene, not java elementary

A lot of short documents, optimal query?

2005-11-09 Thread eks dev
(currently using HitCollector) and score is not needed, any way to avoid scoring (would that help at all?) Before adding ZIPS:12* part of the query, Lucene worked like a charm, a lot under 1 second on 25Mio collection! Now it jumped into 10 second range. Trunk is ok for me. Thanks a lot! eks dev

RAMDirectory without positions or frequencies?

2005-06-20 Thread eks dev
Hi, I have a need for minimum memory footprint of the index during search (would like to have it in RAM). Good thing in the story, similarity calculation is not necessary, only pure boolean model is OK. I am sure I have seen somewhere one explanation from Doug about disabling norms... but cannot f

Re: fresh indexing bug?

2005-03-08 Thread eks dev
works like a charm, thanks! as a side note, the latest patch with properly disabled coord helped me a lot as well, made coord usable. --- Doug Cutting <[EMAIL PROTECTED]> wrote: > eks dev wrote: > > When I reindex with the lucene from the latest svn > > snapshot, a lot o

fresh indexing bug?

2005-03-08 Thread eks dev
When I reindex with the lucene from the latest svn snapshot, a lot of .tii files that are deletable appear (checked with luke). This was not happening with previous version using exactly the same code for indexing. At the end of indexing Optimize was successfully finished. Is this a bug? WinXP,

Not storing norms and term positions info, possible?

2005-03-03 Thread eks dev
Hi, Is there a way to create index which does not store norms (.f..) and Positions (.prx files)? In the case I need to support, no length normalisation is needed, the same is with positional info. (Similarity.encodeNorms(float) returns 0; and Term.SetPositionalIncrement(0) is used) From the size