Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Thanks for that info. These indexes will be large, in the 10s of millions. id field is unique and is 29 bytes. I guess that's still a lot of data to trawl through to get to the term. Have you tested how long it takes to look up docs from your id? Not in indexes that size in a live environme

Re: I got the score "0.3044460713863373" for the cosine similarity of two document with the same text content !!

2009-05-05 Thread Grant Ingersoll
What is SimilarityQueries? I'd try the explain capabilities to see more. On May 5, 2009, at 2:23 PM, Kamal Najib wrote: Hi all, I got the similarity score 0.3044460713863373 between two docs which have the same text content. Is that correct? I expected 1.0. Here is my result line: doc:"

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Michael McCandless
On Tue, May 5, 2009 at 7:24 PM, Antony Bowesman wrote: > Michael McCandless wrote: >> >> Lucene doesn't provide any way to do this, except opening a reader. >> >> Opening a reader is not "that" expensive if you use it for this >> purpose.  EG neither norms nor FieldCache will be loaded if you just

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Michael McCandless wrote: Lucene doesn't provide any way to do this, except opening a reader. Opening a reader is not "that" expensive if you use it for this purpose. EG neither norms nor FieldCache will be loaded if you just enumerate the term docs. Thanks for that info. These indexes will

I got the score "0.3044460713863373" for the cosine similarity of two document with the same text content !!

2009-05-05 Thread Kamal Najib
Hi all, I got the similarity score 0.3044460713863373 between two docs which have the same text content. Is that correct? I expected 1.0. Here is my result line: doc:"this expression of galectin-1 in blood vessel walls was correlated with vascular" doc2 :"this expression of galectin-1 in blood v
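For reference, a textbook cosine similarity over raw term counts does come out at 1.0 (up to floating-point rounding) for two identical texts, so a value near 0.30 suggests the reported score is not a plain cosine. The sketch below is illustrative only: the class and method names are hypothetical, it tokenizes on whitespace, and it makes no claim about how Lucene or SimilarityQueries computes its score.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSketch {
    // Count how often each lower-cased, whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Plain cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        // The first document text quoted in the message, compared with itself.
        String doc = "this expression of galectin-1 in blood vessel walls was correlated with vascular";
        double score = cosine(termFrequencies(doc), termFrequencies(doc));
        System.out.printf(Locale.ROOT, "cosine = %.4f%n", score);
    }
}
```

If the two indexed documents really are identical, comparing the explain output for the score is probably the quickest way to see which factors pull it away from 1.0.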

Re: How to het the score in percentage

2009-05-05 Thread Radha Sreedharan
I have a similar requirement: I need the percentage match. The way I am going about it is doing 2 searches, e.g. if my search string is "pizza cheese" and my document has "pizza cheese ketchup", percentage match = ( score of searching "pizza cheese" in "pizza cheese ketchup") / ( score of

Lucene/Solr Meetup / May 20th, Reston VA, 6-8:30 pm

2009-05-05 Thread Erik Hatcher
Lucene/Solr Meetup / May 20th, Reston VA, 6-8:30 pm http://www.meetup.com/NOVA-Lucene-Solr-Meetup/ Join us for an evening of presentations and discussion on Lucene/Solr, the Apache Open Source Search Engine/Platform, featuring: Erik Hatcher, Lucid Imagination, Apache Lucene/Solr PMC: Solr power

Re: get term neighbours

2009-05-05 Thread Grant Ingersoll
There isn't a very clean way to do this just yet, but it is doable. Index with positions (you might find offsets useful too) and then use the TermVectorMapper and TermVector API call on the IndexReader (not the termPositions). Then, you will need to implement a TermVectorMapper that takes

Lucene 1.2 and JDK 1.5

2009-05-05 Thread D D
Hello, I would like to ask if anyone has tried running Lucene 1.2 with JDK 1.5? So far I could not find any documentation stating incompatibility and/or known bugs. Can anyone chime in with similar experience? (Running an older Lucene library with a newer JDK) Thanks, Dave

Posting List Encoding: Group Varint Encoding

2009-05-05 Thread Renaud Delbru
Hi, I know that a new encoding technique, PFOR, is being implemented in the Lucene project [1]. Have you heard about the "Group Varint" encoding technique from Google? There is a technical explanation in Jeffrey Dean's talk, "Challenges in Building Large-Scale Information Retrieval Syst
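As a rough illustration of the Group Varint idea mentioned above: four 32-bit values are written as one tag byte followed by the values themselves, each stored in 1 to 4 bytes, with the tag holding four 2-bit fields that record (byte length - 1) for each value. The sketch below is a minimal version under stated assumptions: the class name is hypothetical, the bit order inside the tag and the little-endian value layout are choices made here rather than details taken from the talk, and the input length must be a multiple of four. It is not Lucene code.

```java
import java.io.ByteArrayOutputStream;

final class GroupVarintSketch {
    // Bytes (1..4) needed to hold v when treated as an unsigned 32-bit value.
    static int byteLength(int v) {
        if ((v & 0xFFFFFF00) == 0) return 1;
        if ((v & 0xFFFF0000) == 0) return 2;
        if ((v & 0xFF000000) == 0) return 3;
        return 4;
    }

    // Encode values in groups of four: one tag byte, then the four values.
    static byte[] encode(int[] values) {
        if (values.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i += 4) {
            int tag = 0;
            for (int j = 0; j < 4; j++) {
                // Assumed layout: value j uses tag bits 2j and 2j+1.
                tag |= (byteLength(values[i + j]) - 1) << (2 * j);
            }
            out.write(tag);
            for (int j = 0; j < 4; j++) {
                int v = values[i + j];
                int len = byteLength(v);
                for (int k = 0; k < len; k++) {
                    out.write((v >>> (8 * k)) & 0xFF); // little-endian bytes
                }
            }
        }
        return out.toByteArray();
    }

    // Decode 'count' values (a multiple of 4) produced by encode().
    static int[] decode(byte[] data, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i += 4) {
            int tag = data[pos++] & 0xFF;
            for (int j = 0; j < 4; j++) {
                int len = ((tag >>> (2 * j)) & 0x3) + 1;
                int v = 0;
                for (int k = 0; k < len; k++) {
                    v |= (data[pos++] & 0xFF) << (8 * k);
                }
                values[i + j] = v;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        int[] gaps = {1, 300, 70000, 20000000};
        byte[] encoded = encode(gaps);
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("round trip ok: "
                + java.util.Arrays.equals(gaps, decode(encoded, gaps.length)));
    }
}
```

Compared with a byte-at-a-time varint, the decoder reads all four lengths from a single tag byte, so it has fewer per-byte branches; whether that pays off for posting lists would need measuring.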

Re: How to het the score in percentage

2009-05-05 Thread Peter Keegan
Maybe joseph means 'percentage of the theoretical maximum score' for the query? See this thread: http://www.gossamer-threads.com/lists/lucene/java-user/61075?search_string=theoretical%20maximum%20score;#61075 Peter On Tue, May 5, 2009 at 8:36 AM, Erick Erickson wrote: > But to echo Chris, what

Re: How to het the score in percentage

2009-05-05 Thread Erick Erickson
But to echo Chris, what does percentage mean? The percent of the words that matched? So, in your example, would document one match 75%, doc two 50% and doc three 100%? And what would that mean to a user? I think it would help if you backed up and told us *why* you want these percentages. A higher
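One of the readings raised above, "the percent of the words that matched", can be sketched without any scoring at all. The example below is a minimal illustration under assumptions: the class and method names are hypothetical, tokenization is plain whitespace splitting, and it ignores everything a Lucene score takes into account.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class WordOverlapSketch {
    // Distinct lower-cased, whitespace-separated words of a text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Percentage of distinct query words that also occur in the document.
    static double percentOfQueryWordsMatched(String query, String document) {
        Set<String> queryWords = words(query);
        if (queryWords.isEmpty()) {
            return 0.0;
        }
        Set<String> docWords = words(document);
        int matched = 0;
        for (String w : queryWords) {
            if (docWords.contains(w)) {
                matched++;
            }
        }
        return 100.0 * matched / queryWords.size();
    }

    public static void main(String[] args) {
        System.out.println(percentOfQueryWordsMatched("pizza cheese", "pizza cheese ketchup"));
        System.out.println(percentOfQueryWordsMatched("chicken pizza", "mixed vegetable cheese pizza"));
    }
}
```

A number like this is easy to explain to a user, but it is a different quantity from a relevance score, which is part of why the question of what "percentage" should mean matters.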

Re: Which is more efficient

2009-05-05 Thread Michael McCandless
They should be very nearly the same. Under the hood, when you call updateDocument, IndexWriter buffers up the deleted terms, and flushes them periodically. Mike On Tue, May 5, 2009 at 7:42 AM, Antony Bowesman wrote: > Just wondered which was more efficient under the hood > >  for (int i = 0; i

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Michael McCandless
Lucene doesn't provide any way to do this, except opening a reader. Opening a reader is not "that" expensive if you use it for this purpose. EG neither norms nor FieldCache will be loaded if you just enumerate the term docs. But, you can let Lucene do the same thing for you by just always using

Which is more efficient

2009-05-05 Thread Antony Bowesman
Just wondered which was more efficient under the hood for (int i = 0; i < size; i++) terms[i] = new Term("id", doc_key[i]); This writer.deleteDocuments(terms); for (int i = 0; i < size; i++) writer.addDocument(doc[i]); Or this for (int i = 0; i < size; i++) writer.updateDoc

Re: Missing RegexQuery class

2009-05-05 Thread Huntsman84
It was in contrib, thank you! Ian Lea wrote: > > You can find them in the source tarball. And maybe elsewhere > (contrib?) but I'm not sure about that. > > > -- > Ian. > > > On Tue, May 5, 2009 at 9:40 AM, Huntsman84 wrote: >> >> Hi, >> >> Does anybody know why at lucene API documentati

How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
I'm adding Documents in batches to an index with IndexWriter. In certain circumstances, I do not want to add the Document if it already exists, where existence is determined by field id=myId. Is there any way to do this with IndexWriter or do I have to open a reader and look for the term id:X

Re: How to het the score in percentage

2009-05-05 Thread joseph.christopher
joseph.christopher wrote: > > > Thanks for the reply. > > By percentage, I mean how closely the retrieved > result matches the search query. > > For example: if I have 3 indexed documents like > > 1) chicken onion cheese pizza > > 2) mixed vegetable cheese pizza > > 3

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Andrew Purtell
Hi Bradford, Your mail reminds me of something I recently came across: http://svn.apache.org/repos/asf/labs/clouds/apache_cloud_computing_edition.pdf Perhaps if you have slides accompanying your talk, you might consider making them publicly available. I for one would love to see them. Best rega

RE: Unable to remove from Lucene index

2009-05-05 Thread Enrico Goosen
Please elaborate... Here's a code snippet, as you can see I'm not trying to remove or requesting to remove anything. //Perform indexing for (Class entityType : entityTypes){ //read the data from the database //Scrollable res

Re: Unable to remove from Lucene index

2009-05-05 Thread Manish
I guess you are trying to remove, or requesting removal of, a null-referenced object. Manish B. Joshi (Adserving Team) On Tue, May 5, 2009 at 1:58 PM, Enrico Goosen wrote: > Hi, > > > > I’m new to Lucene, and I’m getting an exception while trying to do a manual > indexing operation on one of my ent

RE: Missing RegexQuery class

2009-05-05 Thread Uwe Schindler
These query types are in the contrib package lucene-regex.jar. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Huntsman84 [mailto:tpgarci...@gmail.com] > Sent: Tuesday, May 05, 2009 10:41 AM > To: java-u

RE: Lucene Index Encryption

2009-05-05 Thread Uwe Schindler
> Lucene needs to be able to ask a RAF opened for writing what its > current "position" is during indexing, which it then stores away, and > later during searching it needs to ask a RAF opened for reading to > seek back to that position so it can read bytes from there. Would the > encryption APIs
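The access pattern described in the quoted text (ask for the current position while writing, store it, seek back to it later while reading) looks roughly like this with the JDK's java.io.RandomAccessFile. This is a sketch only: the class name, the length-prefixed record layout, and the temp-file round trip are assumptions made for illustration, and no encryption layer is involved.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class RafPositionSketch {
    // Write a length-prefixed record and return the position where it starts.
    static long writeRecord(RandomAccessFile raf, byte[] record) throws IOException {
        long position = raf.getFilePointer(); // the "position" to store away
        raf.writeInt(record.length);
        raf.write(record);
        return position;
    }

    // Seek back to a stored position and read the record found there.
    static byte[] readRecord(RandomAccessFile raf, long position) throws IOException {
        raf.seek(position);
        byte[] record = new byte[raf.readInt()];
        raf.readFully(record);
        return record;
    }

    // Write two records, remember where the second starts, reopen read-only and fetch it.
    static String roundTrip(String first, String second) {
        try {
            File file = File.createTempFile("raf-sketch", ".bin");
            try {
                long secondPosition;
                try (RandomAccessFile writer = new RandomAccessFile(file, "rw")) {
                    writeRecord(writer, first.getBytes(StandardCharsets.UTF_8));
                    secondPosition = writeRecord(writer, second.getBytes(StandardCharsets.UTF_8));
                }
                try (RandomAccessFile reader = new RandomAccessFile(file, "r")) {
                    return new String(readRecord(reader, secondPosition), StandardCharsets.UTF_8);
                }
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("alpha", "beta"));
    }
}
```

Any encrypting layer placed under such a file would have to keep these two operations meaningful, i.e. a position recorded during writing must still address the same plaintext bytes when reading.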

Re: Missing RegexQuery class

2009-05-05 Thread Ian Lea
You can find them in the source tarball. And maybe elsewhere (contrib?) but I'm not sure about that. -- Ian. On Tue, May 5, 2009 at 9:40 AM, Huntsman84 wrote: > > Hi, > > Does anybody know why, in the Lucene API documentation, you can find the package > regex and its classes (RegexQuery, RegexTermE

Re: Lucene Index Encryption

2009-05-05 Thread Michael McCandless
Would you encrypt at the file level? Ie, the encryption would live "under" a RandomAccessFile (RAF) and otherwise feel "normal" to Lucene? (I think I remember others exploring encryption at the individual term level, which is interesting but does leak information in that you can see individual te

Missing RegexQuery class

2009-05-05 Thread Huntsman84
Hi, Does anybody know why, in the Lucene API documentation, you can find the package regex and its classes (RegexQuery, RegexTermEnum, SpanRegexQuery...), but they don't exist in the jar (so you can't use them)? Can I find them somewhere? Thank you so much! -- View this message in context: http://w

Unable to remove from Lucene index

2009-05-05 Thread Enrico Goosen
Hi, I'm new to Lucene, and I'm getting an exception while trying to do a manual indexing operation on one of my entities. It works fine for the Product entity, but fails for the ProductInfo entity (see attached). Versions: hibernate-search 3.0.1.GA Lucene 2.3 10:26:57,167 ERROR [Indexer

Re: Lucene Index Encryption

2009-05-05 Thread Danil ŢORIN
If you store data so sensitive that you are thinking about index encryption, then I would suggest simply isolating the host with the Lucene index: - ssh only, VERY limited set of users allowed to log in - provide Solr over https to search the index (avoid in-transit interception) - set up firewall rules This way Lu