Re: Is the new version of the Lucene book available in any form?

2007-01-26 Thread Otis Gospodnetic
We'll see, the blind men said. Otis - Original Message From: Chris Hostetter <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, January 27, 2007 6:28:16 AM Subject: Re: Is the new version of the Lucene book available in any form? : LIA2 will happen, but Lucene is under

Re: lucense index/document architecture

2007-01-26 Thread Otis Gospodnetic
A single index with an id field sounds like a fine approach here. Otis - Original Message From: Joost Schouten <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, January 27, 2007 6:40:51 AM Subject: lucense index/document architecture Hi, I'm setting up lucene to work

Re : lucene document id's

2007-01-26 Thread saikrishna venkata pendyala
Hi, I was trying to store document IDs externally. I have found that Lucene generates document IDs sequentially starting from 0, and that they do not change until a document is deleted. But it did work for me. Is the above correct? If not, how could I store document IDs exte

Re: lucense index/document architecture

2007-01-26 Thread Erick Erickson
To steal a phrase from Mr. Hatcher... it depends. I'd try keeping it all in one index at the start until you get some clue how big the index will eventually grow and whether your searching is acceptable. Do you have any idea how big the raw data you're going to ask the index to hold is? 1M? 1G?,

lucense index/document architecture

2007-01-26 Thread Joost Schouten
Hi, I'm setting up lucene to work with our webapp to index a database. My db holds files which can belong to a user or a company or both. I want the option for my users to search across all content, but also search within the files for one user or company. What is the best architecture approach fo

Re: What type of query best for OR with high score?

2007-01-26 Thread Chris Hostetter
regular Lucene BooleanQueries should work fine for this ... but you may want to customize your Similarity so that the idf and lengthNorms aren't a factor .. you may want to take the tf out of the picture too (if you care more about matching lots of terms and less about matching one term lots of ti

Re: Is the new version of the Lucene book available in any form?

2007-01-26 Thread Chris Hostetter
: LIA2 will happen, but Lucene is undergoing a lot of changes, so Erik and : I are going to wait a little more for development to calm down : (utopia?). you're waiting for Lucene development to calm down? ... that could be a long wait. -Hoss ---

Re: Problem with Custom Filter

2007-01-26 Thread Paul Lynch
Thanks for that Erick, it was a great help in clearing up how the mechanism works. I have it working now, here is the changed bits method (I would appreciate any advice you/anyone might have particularly around efficiency - thanks again): public BitSet bits(IndexReader reader) throws IOExcept

Re: Multiword Highlighting

2007-01-26 Thread markharw00d
This is a deficiency in the highlighter functionality that has been discussed several times before. The summary is - not a trivial fix. See here for background: http://marc2.theaimsgroup.com/?l=lucene-user&m=114631181214303&w=1 http://www.gossamer-threads.com/lists/engine?do=post_view_printa

Write my own crawler VS use nutch?

2007-01-26 Thread spamsucks
I am successfully using lucene in our application to index 12 different types of objects located in a database, and their relationships to each other to provide some nice search functionality for our website. We are building lots of lucene queries programmatically to filter based upon categori

Multiword Highlighting

2007-01-26 Thread Anne Conger
Hi, I'm wondering what the best way is to do highlighting of multiword phrases. For example, if a search is for "president kennedy", how can I make sure that "president" is only highlighted if it is next to "kennedy" and "president" in "president clinton" is not. I haven't figured out where in the

Re: NO_NORMS and TOKENIZED?

2007-01-26 Thread Otis Gospodnetic
Funny, I was looking to do the same thing the other day and gave up thinking it wasn't possible, not being aware of setOmitNorms(). Yeah, a javadoc patch would be welcome. Otis - Original Message From: Nadav Har'El <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, Jan

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread Otis Gospodnetic
It really all depends... right, Erik? On the hardware you are using, complexity of queries, query concurrency, query latency you are willing to live with, the size of the index, etc. A few million sounds small even for average/cheap hw. I have several multi-million document indices that are con

Re: Is the new version of the Lucene book available in any form?

2007-01-26 Thread Otis Gospodnetic
Hi, I believe CLucene (C++, not C) is getting a lot of exercise, but you should really ask about production usage on its list. LIA2 will happen, but Lucene is undergoing a lot of changes, so Erik and I are going to wait a little more for development to calm down (utopia?). Otis - Original

Re: Extending scoring to eliminate sorting on timestamp

2007-01-26 Thread Chris Hostetter
: I used String because the timestamp is a Long and there wasn't any : SortField.LONG (I guess I should have used SortField.CUSTOM). In this : case, what should the indexing call look like? Currently, I have: : doc.add(new Field("timestamp",Long.toString(timestamp),Field.Store.NO,Field.Index
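A side note on the quoted indexing call, offered as a sketch rather than anything stated in this thread: when a numeric value is indexed with `Long.toString(...)` and later compared as a string, lexicographic order matches numeric order only if all values have the same number of digits. Zero-padding to a fixed width is one common way to guarantee that. The class and method names below are hypothetical, and negative values are deliberately not handled.

```java
import java.util.Arrays;
import java.util.Locale;

public class PaddedTimestamp {

    /**
     * Pads a non-negative long to 19 digits (the width of Long.MAX_VALUE)
     * so that string order equals numeric order.
     */
    static String pad(long timestamp) {
        if (timestamp < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%019d", timestamp);
    }

    public static void main(String[] args) {
        // Unpadded: "10" sorts before "9" because '1' < '9'.
        String[] raw = {Long.toString(9L), Long.toString(10L)};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));

        // Padded: string order now agrees with numeric order.
        String[] padded = {pad(10L), pad(9L)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```

Millisecond timestamps from the current era all have the same digit count, so the problem may not show up in practice, but padding removes the dependency on that coincidence.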

Re: Problem with Custom Filter

2007-01-26 Thread Erick Erickson
I think you're only setting one bit in your filter. Your docs array is only one cell long, and your termDocs.read reads up to the length of docs (exactly one in this case) entries. So, you're getting only one doc ID. And setting it. Even if you made your array larger, you would only set one bec
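The point about reading in a loop can be illustrated with a minimal sketch. It does not use Lucene: `DocIdSource` is a hypothetical stand-in for a batch reader, assumed to fill a caller-supplied buffer and return how many IDs it wrote (0 once exhausted), and `fromArray` exists only to feed the example. The sketch shows that the buffer size does not matter as long as the read is repeated and every returned ID gets its bit set.

```java
import java.util.BitSet;

public class FilterBitsSketch {

    /** Hypothetical stand-in for a doc-ID reader; not a Lucene type. */
    interface DocIdSource {
        /** Fills buffer with up to buffer.length IDs; returns the count, 0 when exhausted. */
        int read(int[] buffer);
    }

    /** In-memory source used only for this illustration. */
    static DocIdSource fromArray(final int[] ids) {
        return new DocIdSource() {
            private int pos = 0;

            public int read(int[] buffer) {
                int n = Math.min(buffer.length, ids.length - pos);
                System.arraycopy(ids, pos, buffer, 0, n);
                pos += n;
                return n;
            }
        };
    }

    /** Sets one bit per doc ID, reading in batches until the source is exhausted. */
    static BitSet collect(DocIdSource source, int maxDoc, int batchSize) {
        BitSet bits = new BitSet(maxDoc);
        int[] buffer = new int[batchSize];
        int count;
        while ((count = source.read(buffer)) > 0) {
            // Set a bit for every ID returned in this batch, not just the first.
            for (int i = 0; i < count; i++) {
                bits.set(buffer[i]);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Even a one-cell buffer collects all four IDs because the read is looped.
        BitSet bits = collect(fromArray(new int[] {2, 5, 7, 11}), 16, 1);
        System.out.println(bits);
    }
}
```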

Re: Is the new version of the Lucene book available in any form?

2007-01-26 Thread Erick Erickson
The current LIA book, while written to the 1.4 codebase, is a very good place to start. There will be some incompatibilities with the 2.0 codebase, but they're relatively minor. I guess I'm really recommending that you go ahead and spend the bucks on the current version; it'll be money well spent

Re: Extending scoring to eliminate sorting on timestamp

2007-01-26 Thread Chiradeep Vittal
Thanks for the insight Chris. You are right-- I was trying to avoid the FieldCache hit. Because the index is updated frequently, we have to keep discarding our IndexSearcher. I used String because the timestamp is a Long and there wasn't any SortField.LONG (I guess I should have used SortField.

Problem with Custom Filter

2007-01-26 Thread Paul Lynch
Hi, I am going mad trying to find out what I am doing wrong with my custom filter implementation (almost an exact copy of SpecialsFilter from LIA). I have put together a quick sample to illustrate my problem, if some kind soul has 2 minutes to take a quick look and tell me where I am being so s

Is the new version of the Lucene book available in any form?

2007-01-26 Thread Bill Taylor
I notice that the Lucene book offered by Amazon was published in 2004. I saw some mail on the subject of a new edition. Is the new edition available in any form? I promise to buy the new edition as soon as it comes out even if I get some of the material early. I wrote a book which was publ

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread Chiradeep Vittal
Grant, is that on a single machine? If so, what kind of hardware specs does the machine have? I guess you're using a 64-bit JVM? A slightly unrelated question: if a query matches all the documents in the index, does that cause the entire index to get loaded into RAM? - Original Message

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread Pustovalov Mike
In my applications, the JVM throws [java.lang.OutOfMemoryError: Java heap space] when too many Java classes have been loaded and/or when I use some bytecode manipulation libraries (Hibernate, ASM, cglib, for example): the JVM has no more memory to compile bytecode. On Fri, 26 Jan 2007 19:46:06

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread maureen tanuwidjaja
oh thanks then:) Пустовалов Михаил <[EMAIL PROTECTED]> wrote: in your java command line, of course :) Example : java -Xms128m -Xmx1024m -server -Djava.awt.headless=true -XX:MaxPermSize=128m protei.Starter On Fri, 26 Jan 2007 19:39:13 +0300, maureen tanuwidjaja wrote: >

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread Пустовалов Михаил
In your java command line, of course :) Example: java -Xms128m -Xmx1024m -server -Djava.awt.headless=true -XX:MaxPermSize=128m protei.Starter On Fri, 26 Jan 2007 19:39:13 +0300, maureen tanuwidjaja <[EMAIL PROTECTED]> wrote: E...where shall I put that "-XX:MaxPermSize=128m"? Th
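A small sketch for checking that the flags in the example command took effect. On JVMs of this era, `-Xmx` sets the maximum heap (the area that "Java heap space" errors refer to), while `-XX:MaxPermSize` sizes the separate permanent generation. The class name `HeapCheck` is hypothetical; `Runtime.maxMemory()`, `totalMemory()` and `freeMemory()` are standard JDK calls.

```java
public class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit (or the JVM default if none was given).
        System.out.println("max heap (MiB):       " + toMiB(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved for the heap.
        System.out.println("committed heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free in heap (MiB):   " + toMiB(rt.freeMemory()));
    }
}
```

Running it with the same options as the application, for example `java -Xmx1024m HeapCheck`, should report a maximum close to the requested value; the exact figure can be slightly lower depending on the JVM.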

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread maureen tanuwidjaja
E...where shall I put that "-XX:MaxPermSize=128m"? Thanks Pustovalov Regards, Maureen Пустовалов Михаил <[EMAIL PROTECTED]> wrote: try this : -XX:MaxPermSize=128m On Fri, 26 Jan 2007 19:32:45 +0300, maureen tanuwidjaja wrote: > Hi Mike and Eric

Re: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread Пустовалов Михаил
try this : -XX:MaxPermSize=128m On Fri, 26 Jan 2007 19:32:45 +0300, maureen tanuwidjaja <[EMAIL PROTECTED]> wrote: Hi Mike and Erick and all, I have fixed my code and yes, indexing is much faster than previously when I do such "hammering" with IndexWriter. However, I am now encountering th

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2007-01-26 Thread maureen tanuwidjaja
Hi Mike and Erick and all, I have fixed my code and yes, indexing is much faster than previously when I do such "hammering" with IndexWriter. However, I am now encountering the error while indexing: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space This error n

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread Grant Ingersoll
I just indexed a collection w/ 15+ million docs in one index. Index size is roughly 42 gb. On Jan 26, 2007, at 12:45 AM, Bill Taylor wrote: I have used Lucene to index a small collection - only a few hundred documents. I have a potential client who wants to index a collection which will

Re: Lucene Indexing

2007-01-26 Thread Grant Ingersoll
I don't believe there is any b-tree strategy in Lucene. I would say that it is segment based, I guess, in that it indexes documents in memory based on your merge factors and then flushes to disk; at the end you can choose to merge the segments together via optimize(). I find it to have a

Re: Lucene Indexing

2007-01-26 Thread Sairaj Sunil
I went through that document. It says that Lucene's indexing algorithm is incremental. So, can I say that it uses a combination of segment-based and b-tree-based strategies? If I am wrong, please correct me. On 1/26/07, Damien McCarthy <[EMAIL PROTECTED]> wrote: This

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread mark harwood
I'm aware of a single index with the following characteristics: single index size = 33.2GB; documents: 263 million; searchable fields = 7; query response times: <1 second for a single-term search, anything from 5-20 seconds for more complex searches (e.g. fuzzy matching on multiple fields). This is

RE: Lucene Indexing

2007-01-26 Thread Damien McCarthy
This document should contain the information you need : http://lucene.sourceforge.net/talks/inktomi/ Damien. -Original Message- From: Sairaj Sunil [mailto:[EMAIL PROTECTED] Sent: 26 January 2007 03:22 To: java-user@lucene.apache.org Subject: Re: Lucene Indexing Hi I was asking what exac

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread Andrzej Bialecki
Bill Taylor wrote: I have used Lucene to index a small collection - only a few hundred documents. I have a potential client who wants to index a collection which will start at about a million documents and could easily grow to two million. Has anyone used Lucene with an index that large? I

Re: How many documents in the biggest Lucene index to date?

2007-01-26 Thread karl wettin
On 26 Jan 2007, at 06:45, Bill Taylor wrote: I have used Lucene to index a small collection - only a few hundred documents. I have a potential client who wants to index a collection which will start at about a million documents and could easily grow to two million. The maximum number of d