Re: Maximum index file size

2009-10-22 Thread Jake Mannix
Hi Hrishi, The only way you'll know is to try it with some subset of your data - some queries can be very expensive, some are really easy. It'll depend on your document size, the vocabulary (total number and distribution of terms), and kinds of queries, as well as of course your hardware. I wo

RE: Maximum index file size

2009-10-22 Thread Hrishikesh Agashe
Thanks Jake. I have around 75 TB of data to be indexed. So even though I do the sharding, individual index file sizes might still be pretty high, and that's why I wanted to find out whether there is any limit as such, and whether such huge index files can be searched at all. From your
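A minimal sketch of the sharded-search idea discussed in this thread, assuming the data has already been split into several smaller on-disk indexes (the shard paths, field name and query below are placeholders), using the Lucene 2.9 API:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
      public static void main(String[] args) throws Exception {
        String[] shardPaths = { "/nas/shard-0", "/nas/shard-1" };   // placeholder shard locations
        Searchable[] shards = new Searchable[shardPaths.length];
        for (int i = 0; i < shardPaths.length; i++) {
          IndexReader reader = IndexReader.open(FSDirectory.open(new File(shardPaths[i])), true); // read-only
          shards[i] = new IndexSearcher(reader);
        }
        // MultiSearcher queries the shards sequentially; ParallelMultiSearcher does it in parallel.
        Searcher searcher = new MultiSearcher(shards);
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("total hits: " + hits.totalHits);
        searcher.close();
      }
    }

Whether each shard can approach 1 TB is exactly the question Jake answers above: it depends on the documents, the term distribution, the queries and the hardware, so it has to be measured.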

help needed improving lucene concurrent search performance

2009-10-22 Thread Wilson Wu
Dear Friend, I have encountered some performance problems recently in Lucene search 2.9. I use a single IndexSearcher in the whole system. It seems perfect when there are fewer than 10 threads searching concurrently, but if there are more than 100 threads doing concurrent searches, the average resp
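For reference, a minimal sketch of the setup Wilson describes (one IndexSearcher shared by many threads); the index path, field, query and pool size are placeholders, Lucene 2.9 API. A single read-only IndexSearcher is safe to share across threads, and bounding concurrency with a pool is usually kinder to the machine than 100+ raw threads:

    import java.io.File;
    import java.util.concurrent.*;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class SharedSearcherDemo {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/tmp/idx")), true);
        final IndexSearcher searcher = new IndexSearcher(reader);   // one instance for the whole JVM
        ExecutorService pool = Executors.newFixedThreadPool(20);    // bound the concurrency
        for (int i = 0; i < 1000; i++) {
          pool.submit(new Runnable() {
            public void run() {
              try {
                searcher.search(new TermQuery(new Term("body", "lucene")), 10);
              } catch (Exception e) { e.printStackTrace(); }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        searcher.close();
        reader.close();
      }
    }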

Re: Maximum index file size

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe < hrishikesh_aga...@persistent.co.in> wrote: > Can I create an index file with very large size, like 1 TB or so? Is there > any limit on how large index file one can create? Also, will I be able to > search on this 1 TB index file at all? > Leav

Maximum index file size

2009-10-22 Thread Hrishikesh Agashe
I am running Ubuntu 9.04 on 64 bit machine with NAS of 100 TB capacity. JVM is running with 2.5 GB Xmx. Can I create an index file with very large size, like 1 TB or so? Is there any limit on how large index file one can create? Also, will I be able to search on this 1 TB index file at all?

Re: 2.9 per segment searching/caching

2009-10-22 Thread John Wang
Hi Michael: I understand exactly what you mean. I have done some experiments with the multiQ approach by carrying over the bottom to the next segment (which would need to extend the ScoreDocComparator API to support the same type of "convert"; the difference here is that it is optional, sup

Re: 2.9 per segment searching/caching

2009-10-22 Thread Mark Miller
Yes - in many cases, the other wins outweigh the queue transition cost - in some cases it does not. But we are talking degradation as you add more segments, not pure speed. Degradation is worse now in the sort case. John Wang wrote: > With many other coding that happened in 2.9, e.g. the PQ api e

Re: 2.9 per segment searching/caching

2009-10-22 Thread John Wang
With many other code changes that happened in 2.9, e.g. the PQ API etc., sorting is actually faster than 2.4. -John On Thu, Oct 22, 2009 at 5:07 AM, Mark Miller wrote: > Bill Au wrote: > > Since Lucene 2.9 has per segment searching/caching, does query > performance > > degrade less than before (2.9) a

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Chris Lu
All previous suggestions are very good. It's usually just the database; Lucene itself is fast enough. Years ago, when I used a Pentium III, the indexing speed mattered, but after upgrading the CPU to a Xeon etc., the indexing bottleneck is on the database side. Basically use the simplest SQL as

Re: How to loop through all the entries for a field

2009-10-22 Thread Mark Miller
But with Lucene 2.9 you would want to use StringHelper.intern right? adviner wrote: > Thank you > > > Uwe Schindler wrote: > >> Use this one: >> >> >> >> String fieldname="BookTitle"; >> >> >> >> fieldname = fieldname.intern(); // because of this we need no >> String.equals() >> >> TermEnum

RE: Question about extending the query parser to support NumericField on Lucene 2.9.0

2009-10-22 Thread Uwe Schindler
If you look into the testcase I provided with my QueryParser example, you will see that the negative numbers have a problem in newTermQuery. "-" is a control character in QueryParser, which means to do a "NOT" on this term. Because of this the syntax of the query is wrong. To hit the negative num
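A hedged sketch of the kind of QueryParser subclass being discussed (the field name "price" and the int precision are assumptions, not taken from Uwe's actual test case), Lucene 2.9 API:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class NumericAwareQueryParser extends QueryParser {
      public NumericAwareQueryParser(String defaultField) {
        super(Version.LUCENE_29, defaultField, new WhitespaceAnalyzer());
      }

      protected Query getRangeQuery(String field, String part1, String part2, boolean inclusive)
          throws ParseException {
        if ("price".equals(field)) {   // assumed to be indexed as a NumericField with setIntValue()
          return NumericRangeQuery.newIntRange(field,
              Integer.valueOf(part1), Integer.valueOf(part2), inclusive, inclusive);
        }
        return super.getRangeQuery(field, part1, part2, inclusive);
      }
    }

As Uwe notes, a single negative value such as price:-5 still needs the leading '-' escaped (price:\-5), otherwise the parser treats it as a NOT operator before the number ever reaches the term-building code.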

Question about extending the query parser to support NumericField on Lucene 2.9.0

2009-10-22 Thread java8964 java8964
Hi, I have a problem getting NumericField support to work in the query parser. My environment is like this: Windows XP with C:\work\> java -version java version "1.6.0_10" Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing) I am using

RE: How to loop through all the entries for a field

2009-10-22 Thread adviner
Thank you Uwe Schindler wrote: > > Use this one: > > > > String fieldname="BookTitle"; > > > > fieldname = fieldname.intern(); // because of this we need no > String.equals() > > TermEnum te = IndexReader.terms(new Term(fieldname, "")); > > do { > > Term term = te.term(); > >

RE: How to loop through all the entries for a field

2009-10-22 Thread Uwe Schindler
Use this one: String fieldname="BookTitle"; fieldname = fieldname.intern(); // because of this we need no String.equals() TermEnum te = IndexReader.terms(new Term(fieldname, "")); do { Term term = te.term(); if (term == null || term.field() != fieldname) break; System
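For completeness, a full version of the loop quoted above (hedged reconstruction: 'reader' is assumed to be an already open IndexReader, and note that terms() is an instance method), Lucene 2.9 API:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.util.StringHelper;

    public class ListFieldTerms {
      public static void printTerms(IndexReader reader, String field) throws Exception {
        String fieldname = StringHelper.intern(field);        // interned, so the == check below is valid
        TermEnum te = reader.terms(new Term(fieldname, ""));  // positioned at the first term of this field
        try {
          do {
            Term term = te.term();
            if (term == null || term.field() != fieldname) break;   // ran past the last term of the field
            System.out.println(term.text());
          } while (te.next());
        } finally {
          te.close();
        }
      }
    }

The break on term.field() != fieldname is the answer to the "how do I know I'm on the last term" question elsewhere in this thread: the enum simply runs on into the next field, so you stop as soon as the field name changes.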

Re: How to loop through all the entries for a field

2009-10-22 Thread adviner
Never mind, I figured it out. I did this: while ((term = termEnum.Term()) != null) { if (!term.Field().Equals("BookTitle")) break; map = new SearchResultMap(); map.Title = term.Text

Re: XorReader?

2009-10-22 Thread Karl Wettin
On 22 Oct 2009, at 20:00, Chris Hostetter wrote: : I'm thinking a decorator with deletions on top of the original reader, merged : with the clone reader using a MultiReader. But this would still require a new you don't really mean a clone do you? ... you should just need a very small index c

Re: How to loop through all the entries for a field

2009-10-22 Thread adviner
How do you know if you're on your last term? I tried it and it does work, but it continues. How do you know to check if it's the last entry? Thanks Erick Erickson wrote: > > Try something like > TermEnum te = IndexReader.terms(new Term("BookTitle", "")); > do { > Term term = te.term(); > if

Re: How to loop through all the entries for a field

2009-10-22 Thread Erick Erickson
Try something like TermEnum te = IndexReader.terms(new Term("BookTitle", "")); do { Term term = te.term(); if (! term.field().equals("BookTitle")) break; System.out.println(term.text()); } while (te.next()); Note that next() will merrily continue beyond the last term for the field "Bo

Re: XorReader?

2009-10-22 Thread Chris Hostetter
: I'm thinking a decorator with deletions on top of the original reader, merged : with the clone reader using a MultiReader. But this would still require a new you don't really mean a clone do you? ... you should just need a very small index containing the new versions of the docs, in a MultiRea
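A rough sketch of what Hostetter describes here, assuming every document carries a unique "id" field and the new versions have already been written to a small separate index (paths and names are placeholders), Lucene 2.9 API:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class OverlaySearch {
      public static IndexSearcher open(String mainPath, String deltaPath, String[] updatedIds) throws Exception {
        // Writable reader over the big original index so the superseded documents can be masked out.
        IndexReader main = IndexReader.open(FSDirectory.open(new File(mainPath)), false);
        for (int i = 0; i < updatedIds.length; i++) {
          main.deleteDocuments(new Term("id", updatedIds[i]));   // hide the stale version
        }
        // Small index holding only the new versions of those documents.
        IndexReader delta = IndexReader.open(FSDirectory.open(new File(deltaPath)), true);
        return new IndexSearcher(new MultiReader(new IndexReader[] { main, delta }));
      }
    }

Note that deletions made through a writable IndexReader are persisted when that reader is closed, so this is only an approximation of the non-destructive "decorator with deletions" Karl is asking about.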

Re: Maximum index file size

2009-10-22 Thread Chris Hostetter
: Subject: Maximum index file size : References: : In-Reply-To: http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the

How to loop through all the entries for a field

2009-10-22 Thread adviner
I have a field called BookTitle. I want to loop through all the entries without doing a search; I just want to get the list of BookTitles that are in this field. I tried IndexReader, but MaxDocs() doesn't work because it returns everything, and I have other fields in there which are a lot bigger

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
This is basically what LuSql does. The time improvements ("8h to 30 min") are similar, usually on the order of an order of magnitude. Oh, the comments suggesting most of the interaction is with the database? The answer is: it depends. With large Lucene documents, Lucene is the limiting factor (worsen

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Thomas Becker
Profile your application first and find out where the bottlenecks really are during indexing. For me it was clearly the database calls that took most of the time, due to a very complex SQL query. I applied the Producer-Consumer pattern and put a blocking queue in between. I have a threadpo
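A rough sketch of that producer/consumer layout (the JDBC URL, SQL, field names, queue size and pool size are all placeholders), Lucene 2.9 API plus java.util.concurrent:

    import java.io.File;
    import java.sql.*;
    import java.util.concurrent.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class QueueIndexer {
      public static void main(String[] args) throws Exception {
        final BlockingQueue<String[]> queue = new ArrayBlockingQueue<String[]>(1000);
        final IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/idx")),
            new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);

        // Consumers: turn rows into Documents and add them; IndexWriter is safe to share between threads.
        int consumerCount = 4;
        ExecutorService consumers = Executors.newFixedThreadPool(consumerCount);
        for (int i = 0; i < consumerCount; i++) {
          consumers.submit(new Runnable() {
            public void run() {
              try {
                String[] row;
                while ((row = queue.take())[0] != null) {     // a {null} entry acts as the poison pill
                  Document doc = new Document();
                  doc.add(new Field("id", row[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
                  doc.add(new Field("title", row[1], Field.Store.YES, Field.Index.ANALYZED));
                  writer.addDocument(doc);
                }
              } catch (Exception e) { e.printStackTrace(); }
            }
          });
        }

        // Producer: one thread streams rows out of the database into the bounded queue.
        Connection conn = DriverManager.getConnection("jdbc:...");  // placeholder JDBC URL
        ResultSet rs = conn.createStatement().executeQuery("SELECT id, title FROM books");
        while (rs.next()) {
          queue.put(new String[] { rs.getString(1), rs.getString(2) });
        }
        for (int i = 0; i < consumerCount; i++) {
          queue.put(new String[] { null });                         // one poison pill per consumer
        }
        consumers.shutdown();
        consumers.awaitTermination(1, TimeUnit.HOURS);
        writer.close();
        conn.close();
      }
    }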

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Marcelo Ochoa
Hi Paul: Most of the time indexing big tables is spent on the full table scan and network data transfer. Please take a quick look at my OOW08 presentation about the Oracle-Lucene integration: http://docs.google.com/present/view?id=ddgw7sjp_156gf9hczxv especially slides 13 and 14 wh

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Paul Taylor
Glen Newton wrote: You might want to consider using LuSql, which is a high performance, multithreaded, well documented tool designed specifically for moving data from a JDBC database into Lucene (you didn't say if it was a JDBC-accessible db...) http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswik

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Erick Erickson
Besides the other suggestions, I'd really, really, really put some instrumentation in the code and see where you're spending your time. For a fast hint, put a cumulative timer around your indexing part only. This will indicate whether the time is consumed in querying your database or indexing...
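A trivial illustration of the cumulative-timer suggestion (moreRows(), fetchNextRow() and indexRow() are hypothetical stand-ins for the real database and Lucene calls):

    long luceneMillis = 0;
    while (moreRows()) {                  // hypothetical: loop over database results
      Object row = fetchNextRow();        // database side, deliberately outside the timer
      long t0 = System.currentTimeMillis();
      indexRow(row);                      // the Lucene addDocument() path
      luceneMillis += System.currentTimeMillis() - t0;
    }
    System.out.println("time spent in Lucene: " + luceneMillis + " ms");

Comparing that figure with the total wall-clock time tells you immediately whether to tune the SQL side or the Lucene side.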

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Ian Lea
See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed. That includes some info on merge and buffer factors, and recommends multiple threads. When I've done this sort of thing in the past it has tended to be the database that is the problem, but maybe your database is faster than mine.
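For reference, a few of the knobs that wiki page covers, applied to a 2.9 IndexWriter (the values shown are illustrative only, not recommendations; imports as in the QueueIndexer sketch above):

    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/idx")),
        new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(128);        // flush by RAM usage rather than by document count
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
    writer.setMergeFactor(20);             // fewer, larger merges while bulk loading
    writer.setUseCompoundFile(false);      // faster indexing at the cost of more open files
    // ... addDocument() from several threads, then, once at the very end:
    writer.optimize();                     // optional
    writer.close();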

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
You might want to consider using LuSql, which is a high performance, multithreaded, well documented tool designed specifically for moving data from a JDBC database into Lucene (you didn't say if it was a JDBC-accessible db...) http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql Di

Performance tips when creating a large index from database.

2009-10-22 Thread Paul Taylor
I'm building a Lucene index from a database, creating about 1 million documents; unsurprisingly this takes quite a long time. I do this by sending a query to the db over a range of ids (10,000 records), adding these results to Lucene, then getting the next batch, and so on. When completed indexing I the
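A sketch of the id-range paging loop Paul describes (table and column names, the batch size, and the surrounding conn/writer/minId/maxId variables are placeholders):

    PreparedStatement ps = conn.prepareStatement(
        "SELECT id, title FROM books WHERE id >= ? AND id < ?");
    int batch = 10000;
    for (long lo = minId; lo <= maxId; lo += batch) {
      ps.setLong(1, lo);
      ps.setLong(2, lo + batch);
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        Document doc = new Document();
        doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", rs.getString("title"), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);             // shared IndexWriter, closed once at the end
      }
      rs.close();
    }

As the replies above point out, it is worth timing the executeQuery() and addDocument() parts separately before tuning either.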

Re: Searching slow *after* an optimize (lucene.net 2.3.2)

2009-10-22 Thread Erick Erickson
Check this out with the .net port folks, but in the Java world, when you open an IndexReader (which I presume you do after optimizing), the first few queries fill various caches etc. and do run slowly. One solution is to fire a few warmup queries at the newly opened reader before letting your main a
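A minimal warm-up sketch along those lines (index path, field and query terms are placeholders), Lucene 2.9 API:

    IndexReader newReader = IndexReader.open(FSDirectory.open(new File("/tmp/idx")), true);
    IndexSearcher newSearcher = new IndexSearcher(newReader);
    String[] warmup = { "history", "java", "lucene" };
    for (int i = 0; i < warmup.length; i++) {
      // If real traffic sorts on a field, warm up with that Sort too so the FieldCache is populated.
      newSearcher.search(new TermQuery(new Term("body", warmup[i])), 10);
    }
    // Only now swap newSearcher in for the old one, and close the old reader once it is idle.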

Re: Handling + as a special character in Lucene search

2009-10-22 Thread Koji Sekiguchi
Or you can use MappingCharFilter if you are using Lucene 2.9. You can convert "c++" into "cplusplus" prior to running the Tokenizer. Koji -- http://www.rondhuit.com/en/ Ian Lea wrote: You need to make sure that these terms are getting indexed, by using an analyzer that won't drop them and using
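A sketch of how that mapping could be wired into an analyzer (the analyzer itself is an assumption; only the c++ -> cplusplus mapping comes from the mail), Lucene 2.9:

    import java.io.Reader;
    import org.apache.lucene.analysis.*;

    public class CppAnalyzer extends Analyzer {
      private final NormalizeCharMap map = new NormalizeCharMap();
      public CppAnalyzer() {
        map.add("c++", "cplusplus");   // rewritten before tokenization, at index and query time
        map.add("C++", "cplusplus");   // the char filter runs before any lower-casing
      }
      public TokenStream tokenStream(String fieldName, Reader reader) {
        CharStream cs = new MappingCharFilter(map, CharReader.get(reader));
        return new LowerCaseFilter(new WhitespaceTokenizer(cs));
      }
    }

The same analyzer has to be used at both index and query time, otherwise a query for "c++" will never match the stored "cplusplus" term.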

Special Characters and QueryParser

2009-10-22 Thread Amin Mohammed-Coleman
Hi, I am looking at handling special characters in the query, as using certain characters causes an exception. I looked at QueryParser.escape(..) to handle this. It works to a certain extent: for example, using '!' doesn't cause an exception. However, when I use a wildcard, the wildcard is ignored.
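An illustration of the behaviour Amin describes, assuming a plain whitespace analyzer: QueryParser.escape() neutralises every special character, including a '*' you may have wanted to keep as a wildcard, so it has to be applied selectively.

    String userInput = "c++ program*";
    String escaped = QueryParser.escape(userInput);               // -> c\+\+ program\*
    QueryParser qp = new QueryParser(Version.LUCENE_29, "body", new WhitespaceAnalyzer());
    Query q1 = qp.parse(escaped);                                 // '*' is now a literal character
    Query q2 = qp.parse(QueryParser.escape("c++") + " program*"); // escape only the risky part; '*' stays a wildcard

(qp.parse() throws ParseException, so in real code this sits inside a method that declares or handles it.)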

Re: 2.9 per segment searching/caching

2009-10-22 Thread Mark Miller
Bill Au wrote: > Since Lucene 2.9 has per segment searching/caching, does query performance > degrade less than before (2.9) as more segments are added to the index? > Bill > > I think non-sorting cases are actually faster now over multiple segments - though you will still see performance degrad

RE: Searching slow *after* an optimize (lucene.net 2.3.2)

2009-10-22 Thread George Aroush
Please post your question to: lucene-net-user[AT]incubator.apache.org for Lucene.Net related topics. See http://incubator.apache.org/lucene.net/ for subscription info. -- George -Original Message- From: ShibbyUK [mailto:lewis_...@hotmail.com] Sent: Thursday, October 22, 2009 7:17 AM To:

Maximum index file size

2009-10-22 Thread Hrishikesh Agashe
Hi, I am running Ubuntu 9.04 on 64 bit machine with NAS of 100 TB capacity. JVM is running with 2.5 GB Xmx. Can I create an index file with very large size, like 1 TB or so? Is there any limit on how large index file one can create? Also, will I be able to search on this 1 TB index file at all

Searching slow *after* an optimize (lucene.net 2.3.2)

2009-10-22 Thread ShibbyUK
Hi, We're having some odd performance problems. Recently, searching our index has become slow *after* performing an optimize. This is counterintuitive, as usually the optimize has the opposite effect! We're using lucene.net 2.3.2 and have an index of 250,000 documents and about 500 queries per

Re: Resolving Lucene Index error

2009-10-22 Thread Michael McCandless
Can you provide more details? Which version of Lucene, Java, OS are you using? Is there a small test case? Hideously, it looks like your path was supposed to be c:\Indexes\_z3_1.del, but somehow the \ was lost. Mike On Wed, Oct 21, 2009 at 9:50 PM, mitu2009 wrote: > > Hi, > > Why do I

Re: Handling + as a special character in Lucene search

2009-10-22 Thread Ian Lea
You need to make sure that these terms are getting indexed, by using an analyzer that won't drop them and using Luke to check. Then, if you are using QueryParser, you'll need to escape the special characters e.g. c\+\+. See http://lucene.apache.org/java/2_9_0/queryparsersyntax.html#Escaping%20Spe

Re: 2.9 per segment searching/caching

2009-10-22 Thread Simon Willnauer
Bill, per-segment search does not replace index optimisation, nor does it prevent the performance degradation if your number of segments is increasing. Depending on how your index changes, it can give you a performance improvement when reopening the index, and it will certainly prevent one or another GC
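A sketch of the reopen pattern this refers to (variable names are mine): IndexReader.reopen() shares the unchanged segments with the old reader, so only new segments have to be loaded and warmed.

    IndexReader current = IndexReader.open(FSDirectory.open(new File("/tmp/idx")), true);
    IndexSearcher searcher = new IndexSearcher(current);
    // ... later, after a writer has committed more documents:
    IndexReader reopened = current.reopen();
    if (reopened != current) {          // something actually changed
      current.close();                  // release the old reader once in-flight searches finish
      current = reopened;
      searcher = new IndexSearcher(current);
    }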