Syns2Index utility: version of Lucene and Java

2006-11-27 Thread Risov, Maria
I am trying to use the Syns2Index utility to convert WordNet into a Lucene 
index.  First I downloaded the latest JDK and Lucene 2.0, but soon realized 
that both were too new to compile Syns2Index.java.  Next, by deciphering the 
error messages, I worked my way down to j2sdk1.4.2_13 and Lucene 1.4.3.  
(I am running XP SP2.)

I have copied the java\org\apache\lucene directory into the same folder as the 
Syns2Index.java file.  I have a feeling that my classpath is set right (or at 
least close), but I get a huge number of identical compile errors.

Command used:

D:\InfringeDetector\JavaLucene>javac -classpath "D:\Project\JavaLucene;
C:\j2sdk1.4.2_13" D:\Project\JavaLucene\org\apache\lucene\wordnet\Syns2
Index.java

Compile results (posting just a few from the bottom of my screen):

^
C:\j2sdk1.4.2_13\java\nio\DirectByteBuffer.java:843: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBuffer
assert (off <= lim);
^
C:\j2sdk1.4.2_13\java\nio\DirectByteBuffer.java:934: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBuffer
assert (off <= lim);
^
C:\j2sdk1.4.2_13\java\nio\Bits.java:642: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.Bits
assert (reservedMemory > -1);
^
C:\j2sdk1.4.2_13\java\lang\CharacterDataLatin1.java:284: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.lang.CharacterDataLatin1
assert (data.length == (256 * 2));
^
C:\j2sdk1.4.2_13\java\lang\CharacterData.java:956: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.lang.CharacterData
assert (data.length == (678 * 2));
^
C:\j2sdk1.4.2_13\java\nio\DirectByteBufferR.java:165: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBufferR
assert (pos <= lim);
^
C:\j2sdk1.4.2_13\java\nio\DirectByteBufferR.java:479: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBufferR
assert (off <= lim);
^
Note: Some input files use or override a deprecated API.
Note: Recompile with -deprecation for details.
100 errors
206 warnings
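
The errors above suggest that javac is also compiling the JDK's own source
files it finds under C:\j2sdk1.4.2_13 (at javac's default -source 1.3 level,
assert is not a keyword, hence "cannot resolve symbol: method assert").
A sketch of an invocation that leaves that directory off the classpath
entirely (assuming the Lucene 1.4.3 sources copied under D:\Project\JavaLucene
are complete; the JDK's own classes are found automatically):

D:\Project\JavaLucene>javac -classpath D:\Project\JavaLucene D:\Project\JavaLucene\org\apache\lucene\wordnet\Syns2Index.java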

I have to admit that I am fairly new to Java, but past the HelloWorld setups.  
I have been banging my head against the wall and Google for 10 hours.   Please 
help!!!

-marie



Re: How to set query time scoring

2006-11-27 Thread Sajid Khan

Thanks for the instant reply. More specifically, what I am trying to do is: 
 1) show the results which contain the exact query phrase on top, followed
by the ANDed results, followed by the ORed results;  
 2) introduce a new parameter that uses the query phrase to influence the
ranking.

regards
Sajid
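
A minimal sketch of one way to compose such a query with the Lucene 2.0 API
(the field name "contents", the example terms and the boost values are
assumptions, not taken from this thread):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Exact phrase ranked highest, then documents matching all terms, then any term.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "query"));
phrase.add(new Term("contents", "phrase"));
phrase.setBoost(4.0f);

BooleanQuery allTerms = new BooleanQuery();   // the "ANDed" part
allTerms.add(new TermQuery(new Term("contents", "query")), BooleanClause.Occur.MUST);
allTerms.add(new TermQuery(new Term("contents", "phrase")), BooleanClause.Occur.MUST);
allTerms.setBoost(2.0f);

BooleanQuery anyTerm = new BooleanQuery();    // the "ORed" part
anyTerm.add(new TermQuery(new Term("contents", "query")), BooleanClause.Occur.SHOULD);
anyTerm.add(new TermQuery(new Term("contents", "phrase")), BooleanClause.Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.add(phrase, BooleanClause.Occur.SHOULD);
query.add(allTerms, BooleanClause.Occur.SHOULD);
query.add(anyTerm, BooleanClause.Occur.SHOULD);

Whether this produces a strict phrase/AND/OR ordering still depends on the
normal tf/idf scoring; the boosts only bias the ranking in that direction.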


Bhavin Pandya wrote:
> 
> Hi sajid,
> 
> As you already boost data at indexing time, you can also boost the query at 
> search time. E.g., if you are firing a BooleanQuery and a PhraseQuery, you 
> might need to boost the PhraseQuery:
> 
> PhraseQuery pq = new PhraseQuery();
> pq.setBoost(2.0f);
> 
> Thanks.
> Bhavin pandya
> 
> - Original Message - 
> From: "Sajid Khan" <[EMAIL PROTECTED]>
> To: 
> Sent: Monday, November 27, 2006 10:17 AM
> Subject: How to set query time scoring
> 
> 
>>
>> I have already set some scores at index time, and now I want to set some
>> scores at query time, but I am not getting any idea of how to set the
>> score at query time in Lucene.
>> Does anybody have an idea how to do this?
>>
>> Regards
>> Sajid




Re: Question about the "not" in lucene

2006-11-27 Thread hawat23

Thank you for your answer.

But is it possible to group clauses with a "not"? For example:

type:product NOT (name:"toto" OR name:"titi")

Christophe

Mark Miller wrote:
Personally, I think of it not as a 'not' operator, but more as a 'but not' 
or 'and not' operator. That's not totally the case, I believe, but it gives 
you semantics that work. Truly, I think that each part of the query creates 
a score and the NOT query scores 0. That gives a different result than a 
boolean system. More than a few times it has been mentioned that Lucene is 
a scoring system and not a boolean system.


- Mark

christophe leroy wrote:

Hello,

I don't understand how to use "not" with Lucene. I think that it is not a 
boolean not. I read the documentation, but it is not clear enough on how 
the "not" works.

For example, I tried this request:
type:product
--> I got 100 responses, which is normal. Then I tried this request:
type:product AND name:test
--> I got 1 response, which is normal too. And when I tried this request:
type:product AND (name:test OR NOT name:test)
--> I got only 1 response. I should normally get 100 responses if the "not" 
were a boolean not.

Could you explain to me how the "not" works?

Thanks in advance,

Christophe






   

Hits length with no sorting or scoring

2006-11-27 Thread Hirsch Laurence
Hello,

I have an application in which we only need to know the total number of
documents matching a query.  In this case we do not need any sorting or
scoring or to store any reference to the matching documents.  Can you
tell me how to execute such a query with maximum performance?

Thanks

Laurie


Re: Database searching using Lucene....

2006-11-27 Thread Erick Erickson

This has been discussed extensively on this list, so I think you'd get the
fastest answers by searching the mail archive for database, db, etc.

The short answer is "it all depends upon what you want to accomplish and the
characteristics of your problem".

Erick

On 11/27/06, Inderjeet Kalra <[EMAIL PROTECTED]> wrote:


Hi,

I need some input on database searching using Lucene.

Lucene directly supports document searching, but I am unable to find an easy
and fast way to do database searching.

Which option would be better in terms of implementation, performance and
security - stored procedures (SPs) or the Lucene search engine? If anyone has
already done an analysis of this, can you please provide a comparison matrix
or benchmarks?



Thanks in advance



Regards

Inderjeet






Re: Question about the "not" in lucene

2006-11-27 Thread Mark Miller
Yes, I believe that it is entirely possible. You can nest and link 
boolean clauses all you want: your example query would be a boolean query 
with two top-level clauses, one required to be there and one required not to 
be there. The second top-level clause would itself be a boolean query 
with two clauses, both SHOULD. Now, what I think happens (I haven't looked 
myself) is that the type:product clause will score a document positively if 
found, but the NOT clause will score a document to 0 if either of its 
sub-clauses is found. Those 0 scores will not be returned as hits. Notice 
that if you just have "NOT (name:"toto" OR name:"titi")", ALL of the docs 
will score 0 one way or another - the docs not found will be 0 and the docs 
found will be scored 0 by the NOT - so you will not get a result. Now, if 
you use the special query that matches all docs and then add a NOT query, 
the NOT will work as expected: all docs will get a positive score, but the 
NOT query will 0 out those in the MUST_NOT clause.


I am an unclear kind of guy, so I hope that gives some help.

- Mark
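
A rough sketch of that structure with the Lucene 2.0 query API (untested;
field names follow the example above, and MatchAllDocsQuery plays the role of
"the special query that matches all docs"):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// type:product NOT (name:"toto" OR name:"titi")
BooleanQuery names = new BooleanQuery();
names.add(new TermQuery(new Term("name", "toto")), BooleanClause.Occur.SHOULD);
names.add(new TermQuery(new Term("name", "titi")), BooleanClause.Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("type", "product")), BooleanClause.Occur.MUST);
query.add(names, BooleanClause.Occur.MUST_NOT);

// A purely negative query matches nothing, so pair the MUST_NOT clause
// with a match-all clause if exclusion is all you want:
BooleanQuery onlyNot = new BooleanQuery();
onlyNot.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
onlyNot.add(names, BooleanClause.Occur.MUST_NOT);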

hawat23 wrote:

Thanks you for your answer.

But, is it possible to group clauses with a "not". exemple:

type:product NOT (name:"toto" OR name:"titi") ??

Christophe

Mark Miller wrote:
Personally, I think of it as not a 'not' operator, but more a 'but 
not' or 'and not' operator. Thats not totally the case I believe, but 
gives you semantics that work. Truly I think that each part of the 
query creates a score and the NOT query scores 0. That gives a 
different result than a boolean system. More than a few times it has 
been mentioned that Lucene is a scoring system and not a boolean system.


- Mark

christophe leroy wrote:

Hello,

I don't understand how to use "not" with Lucene. I
think that it is not a boolean not. I read the
documentation but it is not clear enough on how the
"not" works.

For example, I tried to do this request:
type:product
--> I got 100 responses. It is normal. Then, I tried
this request:
type:product AND name:test --> I got 1 response. It is normal too. 
And when I

tried this request:
type:product AND (name:test OR NOT name:test)
--> I got 1 response only. I should normally get 100
responses if the "not" was a boolean not.

Could you explain me how the "not" works?

Thank in advance,

Christophe



   
   



StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Suman Ghosh

I was trying to build a Lucene index (Lucene 2.0, JDK 5) with
approximately 15 documents containing about 25 fields each. After
indexing about 45000 documents, the program crashed. It was running as a
batch job and did not log the cause of the crash. In order to identify why
the process crashed, I restarted the job about 50 documents before the crash
point so that I could identify the problem. At this point, the program first
tries to delete the document if it is already present in the index and then
adds it. As soon as I start the program, it aborts with a StackOverflowError
while calling the IndexReader.deleteDocuments(new Term()) method (even for
the document that was indexed earlier). Here is the partial stacktrace:

Exception in thread "main" java.lang.StackOverflowError
    at java.lang.ref.Reference.<init>(Reference.java:207)
    at java.lang.ref.WeakReference.<init>(WeakReference.java:40)
    at java.lang.ThreadLocal$ThreadLocalMap$Entry.<init>(ThreadLocal.java:240)
    at java.lang.ThreadLocal$ThreadLocalMap$Entry.<init>(ThreadLocal.java:235)
    at java.lang.ThreadLocal$ThreadLocalMap.getAfterMiss(ThreadLocal.java:375)
    at java.lang.ThreadLocal$ThreadLocalMap.get(ThreadLocal.java:347)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:225)
    at java.lang.ThreadLocal.get(ThreadLocal.java:127)
    at org.apache.lucene.index.TermInfosReader.getEnum(TermInfosReader.java:79)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:139)
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
    at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
    at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
    at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)

The last line [at
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)]
repeats another 1010 times before the program crashes.

I understand that without the actual index or the documents, it's
nearly impossible to narrow down the cause of the error. However, can
you please point to any theoretical reason why
org.apache.lucene.index.MultiTermDocs.next will go into an infinite
loop?
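
For reference, a sketch of the delete-then-reindex pattern described above
(Lucene 2.0 API; the index path, the "id" field and the docId/doc variables
are hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Remove any existing copy of the document, then add the new one.
IndexReader reader = IndexReader.open("/path/to/index");
reader.deleteDocuments(new Term("id", docId));   // no-op if the term is not present
reader.close();

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.addDocument(doc);
writer.close();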




Re: Searching by bit masks

2006-11-27 Thread Biggy


I have the same problem here. I have an interest bit field, which I
receive from the application backend. I have control over how the documents
are built.
To be specific, the field looks like this:

ID: interest
1 : sport
2 : music
4 : film
8 : clubs

So someone interested in sports and music can be found with "interest & 3",
e.g. when using SQL.

I do not wish to post-filter the results.
On to Lucene: is there a filter which supports this kind of query?

Someone suggested splitting the bits into fields:
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc); 
Is this helpful at all?

Code would be helpful too, as I am a newbie.



ltaylor.employon wrote:
> 
> Hello, 
> 
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about. 
> 
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document. 
> 
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this. 
> 
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future. 
> 
> My preemptive thanks for your suggestions,
> 
> 
> Lawrence Taylor
> Senior Software Engineer
> Employon
> Message was edited by: ltaylor.employon
> 
> 
> 




Re: RAMDirectory vs MemoryIndex

2006-11-27 Thread Wolfgang Hoschek


On Nov 26, 2006, at 8:57 AM, jm wrote:


I tested this. I use a single static analyzer for all my documents,
and the caching analyzer was not working properly. I had to add a
method to clear the cache each time a new document was to be indexed,
and then it worked as expected. I have never looked into lucenes inner
working so I am not sure if what I did is correct.


Makes sense, I've now incorporated that as well by adding a clear()  
method and extracting the functionality into a public class  
AnalyzerUtil.TokenCachingAnalyzer.




I also had to comment some code cause I merged the memory stuff from
trunk with lucene 2.0.

Performance was certainly much better (4 times faster in my very gross
testing), but for my processing that operation is only a very small,
so I will keep the original way, without caching the tokens, just to
be able to use the unmodified lucene 2.0.  I found a data problem in
my tests, but as I was not going to pursue that improvement for now I
did not look into it.


Ok.
Wolfgang.



thanks,
javier

On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

Out of interest, I've checked an implementation of something like
this into AnalyzerUtil SVN trunk:

   /**
 * Returns an analyzer wrapper that caches all tokens generated by the
 * underlying child analyzer's token stream, and delivers those cached
 * tokens on subsequent calls to tokenStream(String fieldName, Reader reader).
 *
 * This can help improve performance in the presence of expensive
 * Analyzer / TokenFilter chains.
 *
 * Caveats:
 * 1) Caching only works if the equals() and hashCode() methods are properly
 *    implemented on the Reader passed to tokenStream(String fieldName, Reader reader).
 * 2) Caching the tokens of large Lucene documents can lead to out of memory exceptions.
 * 3) The Token instances delivered by the underlying child analyzer must be immutable.
 *
 * @param child the underlying child analyzer
 * @return a new analyzer
 */
public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }


Check it out, and let me know if this is close to what you had in  
mind.


Wolfgang.

On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:

> I've never tried it, but I guess you could write an Analyzer and
> TokenFilter that no only feeds into IndexWriter on
> IndexWriter.addDocument(), but as a sneaky side effect also
> simultaneously saves its tokens into a list so that you could later
> turn that list into another TokenStream to be added to MemoryIndex.
> How much this might help depends on how expensive your analyzer
> chain is. For some examples on how to set up analyzers for chains
> of token streams, see MemoryIndex.keywordTokenStream and class
> AnalzyerUtil in the same package.
>
> Wolfgang.
>
> On Nov 22, 2006, at 4:15 AM, jm wrote:
>
>> checking one last thing, just in case...
>>
>> as I mentioned, I have previously indexed the same document in
>> another
>> index (for another purpose), as I am going to use the same  
analyzer,

>> would it be possible to avoid analyzing the doc again?
>>
>> I see IndexWriter.addDocument() returns void, so it does not  
seem to

>> be an easy way to do that no?
>>
>> thanks
>>
>> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>>
>>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>>
>>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>>> enoguh
>>> > I will explore the other options then.
>>>
>>> To get started you can use something like this:
>>>
>>> for each document D:
>>>  MemoryIndex index = createMemoryIndex(D, ...)
>>>  for each query Q:
>>>  float score = index.search(Q)
>>> if (score > 0.0) System.out.println("it's a match");
>>>
>>>
>>>
>>>
>>>private MemoryIndex createMemoryIndex(Document doc, Analyzer
>>> analyzer) {
>>>  MemoryIndex index = new MemoryIndex();
>>>  Enumeration iter = doc.fields();
>>>  while (iter.hasMoreElements()) {
>>>Field field = (Field) iter.nextElement();
>>>index.addField(field.name(), field.stringValue(),  
analyzer);

>>>  }
>>>  return index;
>>>}
>>>
>>>
>>>
>>> >
>>> >
>>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I have to decide between  using a RAMDirectory and
>>> MemoryIndex, but
>>> >> > not sure what approach will work better...
>>> >> >
>>> >> > I have to run many items (tens of thousands) against some
>>> >> queries (100
>>> >> > at most), but I have to do it one item at a time. And I  
already

>>> >> have
>>> >> > the lucene Document associated with each item, from a  
previous

>>> >> > operation I perform.
>>> >> >
>>> >> > From what I read MemoryIndex should be faster, but  
apparently I

>>> >> cannot
>>> >> > reuse the document I already have, and I have to create a  
new

>>> >> > MemoryIndex per item.
>>> >>
>>> >> A MemoryIndex object holds

Re: RAMDirectory vs MemoryIndex

2006-11-27 Thread jm

On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:


On Nov 26, 2006, at 8:57 AM, jm wrote:

> I tested this. I use a single static analyzer for all my documents,
> and the caching analyzer was not working properly. I had to add a
> method to clear the cache each time a new document was to be indexed,
> and then it worked as expected. I have never looked into lucenes inner
> working so I am not sure if what I did is correct.

Makes sense, I've now incorporated that as well by adding a clear()
method and extracting the functionality into a public class
AnalyzerUtil.TokenCachingAnalyzer.

Yes, same here. I could have posted my code, sorry, but I was not sure if it
was even correct...
When there is a new Lucene 2.1 or whatever, I'll incorporate that optimization
into my code. Thanks.



>
> I also had to comment some code cause I merged the memory stuff from
> trunk with lucene 2.0.
>
> Performance was certainly much better (4 times faster in my very gross
> testing), but for my processing that operation is only a very small,
> so I will keep the original way, without caching the tokens, just to
> be able to use the unmodified lucene 2.0.  I found a data problem in
> my tests, but as I was not going to pursue that improvement for now I
> did not look into it.

Ok.
Wolfgang.

>
> thanks,
> javier
>
> On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> Out of interest, I've checked an implementation of something like
>> this into AnalyzerUtil SVN trunk:
>>
>>/**
>> * Returns an analyzer wrapper that caches all tokens generated by
>> the underlying child analyzer's
>> * token stream, and delivers those cached tokens on subsequent
>> calls to
>> * tokenStream(String fieldName, Reader reader).
>> * 
>> * This can help improve performance in the presence of expensive
>> Analyzer / TokenFilter chains.
>> * 
>> * Caveats:
>> * 1) Caching only works if the methods equals() and hashCode()
>> methods are properly
>> * implemented on the Reader passed to tokenStream(String
>> fieldName, Reader reader).
>> * 2) Caching the tokens of large Lucene documents can lead to out
>> of memory exceptions.
>> * 3) The Token instances delivered by the underlying child
>> analyzer must be immutable.
>> *
>> * @param child
>> *the underlying child analyzer
>> * @return a new analyzer
>> */
>>public static Analyzer getTokenCachingAnalyzer(final Analyzer
>> child) { ... }
>>
>>
>> Check it out, and let me know if this is close to what you had in
>> mind.
>>
>> Wolfgang.
>>
>> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
>>
>> > I've never tried it, but I guess you could write an Analyzer and
>> > TokenFilter that no only feeds into IndexWriter on
>> > IndexWriter.addDocument(), but as a sneaky side effect also
>> > simultaneously saves its tokens into a list so that you could later
>> > turn that list into another TokenStream to be added to MemoryIndex.
>> > How much this might help depends on how expensive your analyzer
>> > chain is. For some examples on how to set up analyzers for chains
>> > of token streams, see MemoryIndex.keywordTokenStream and class
>> > AnalzyerUtil in the same package.
>> >
>> > Wolfgang.
>> >
>> > On Nov 22, 2006, at 4:15 AM, jm wrote:
>> >
>> >> checking one last thing, just in case...
>> >>
>> >> as I mentioned, I have previously indexed the same document in
>> >> another
>> >> index (for another purpose), as I am going to use the same
>> analyzer,
>> >> would it be possible to avoid analyzing the doc again?
>> >>
>> >> I see IndexWriter.addDocument() returns void, so it does not
>> seem to
>> >> be an easy way to do that no?
>> >>
>> >> thanks
>> >>
>> >> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>> >>>
>> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>> >>> enoguh
>> >>> > I will explore the other options then.
>> >>>
>> >>> To get started you can use something like this:
>> >>>
>> >>> for each document D:
>> >>>  MemoryIndex index = createMemoryIndex(D, ...)
>> >>>  for each query Q:
>> >>>  float score = index.search(Q)
>> >>> if (score > 0.0) System.out.println("it's a match");
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>private MemoryIndex createMemoryIndex(Document doc, Analyzer
>> >>> analyzer) {
>> >>>  MemoryIndex index = new MemoryIndex();
>> >>>  Enumeration iter = doc.fields();
>> >>>  while (iter.hasMoreElements()) {
>> >>>Field field = (Field) iter.nextElement();
>> >>>index.addField(field.name(), field.stringValue(),
>> analyzer);
>> >>>  }
>> >>>  return index;
>> >>>}
>> >>>
>> >>>
>> >>>
>> >>> >
>> >>> >
>> >>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>> >>> >>
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I have to decide between  using a RAMDirectory and
>> >>> MemoryInd

Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Yonik Seeley

On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:

The last line [at
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)]
repeats another 1010 times before the program crashes.

I understand that without the actual index or the documents, it's
nearly impossible to narrow down the cause of the error. However, can
you please point to any theoretical reason why
org.apache.lucene.index.MultiTermDocs.next will go into an infinite
loop?


MultiTermDocs.next() is a recursive function.  From what I can see of
it though, it shouldn't recurse greater than the number of segments in
the index.

How many segments do you have in your index?  What IndexWriter
settings have you changed (mergeFactor, maxMergeDocs, etc)?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Suman Ghosh

Here are the values:

mergeFactor=10
maxMergeDocs=10
minMergeDocs=100

And I see your point. At the time of the crash, I have over 5000
segments. I'll try some conservative number and try to rebuild the
index.
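
For reference, a sketch of how those settings are applied on a Lucene 2.0
IndexWriter (1.4's minMergeDocs corresponds to setMaxBufferedDocs; the writer
construction itself is illustrative):

IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
writer.setMergeFactor(10);       // how many segments are merged at a time
writer.setMaxMergeDocs(10);      // upper bound on documents in a merged segment
writer.setMaxBufferedDocs(100);  // docs buffered in RAM before a segment is flushed (was minMergeDocs)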


On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:
> The last line [at
> org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)]
> repeats another 1010 times before the program crashes.
>
> I understand that without the actual index or the documents, it's
> nearly impossible to narrow down the cause of the error. However, can
> you please point to any theoretical reason why
> org.apache.lucene.index.MultiTermDocs.next will go into an infinite
> loop?

MultiTermDocs.next() is a recursive function.  From what I can see of
it though, it shouldn't recurse greater than the number of segments in
the index.

How many segments do you have in your index?  What IndexWriter
settings have you changed (mergeFactor, maxMergeDocs, etc)?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Yonik Seeley

On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:

Here are the values:

mergeFactor=10
maxMergeDocs=10
minMergeDocs=100

And I see your point. At the time of the crash, I have over 5000
segments. I'll try some conservative number and try to rebuild the
index.


Although I don't see how those settings can produce 5000 segments,
I've developed a non-recursive patch you might want to try:
https://issues.apache.org/jira/browse/LUCENE-729

The patch is to the Lucene trunk (current devel version), so if you
want to stick with Lucene 2.0, you might have to patch by hand.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Hits length with no sorting or scoring

2006-11-27 Thread Paul Elschot
On Monday 27 November 2006 14:30, Hirsch Laurence wrote:
> Hello,
> 
> I have an application in which we only need to know the total number of
> documents matching a query.  In this case we do not need any sorting or
> scoring or to store any reference to the matching documents.  Can you
> tell me how to execute such a query with maximum performance?

A fairly quick way is to implement your own HitCollector to count,
and then use the appropriate methods of IndexSearcher.
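
A minimal counting collector along those lines (untested sketch; Lucene 2.0
API, with s the IndexSearcher and query the Query, as below):

import org.apache.lucene.search.HitCollector;

// Count matches without sorting, score ordering, or keeping doc references.
final int[] count = new int[1];
s.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        count[0]++;
    }
});
System.out.println(count[0] + " matching documents");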

If you really need maximum performance, this bit of code
avoids computing the score values and invoking the
HitCollector (untested):

// s is the IndexSearcher, query the Query
org.apache.lucene.search.Scorer scorer =
   query.weight(s).scorer(s.getIndexReader());
int count = 0;
while (scorer.next()) count++;

Regards,
Paul Elschot





Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Suman Ghosh

Yonik,

Thanks for the pointer. I'll try the nightly build once the change is committed.

Suman

On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:
> Here are the values:
>
> mergeFactor=10
> maxMergeDocs=10
> minMergeDocs=100
>
> And I see your point. At the time of the crash, I have over 5000
> segments. I'll try some conservative number and try to rebuild the
> index.

Although I don't see how those settings can produce 5000 segments,
I've developed a non-recursive patch you might want to try:
https://issues.apache.org/jira/browse/LUCENE-729

The patch is to the Lucene trunk (current devel version), so if you
want to stick with Lucene 2.0, you might have to patch by hand.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: RAMDirectory vs MemoryIndex

2006-11-27 Thread Wolfgang Hoschek


On Nov 27, 2006, at 9:57 AM, jm wrote:


On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:


On Nov 26, 2006, at 8:57 AM, jm wrote:

> I tested this. I use a single static analyzer for all my documents,
> and the caching analyzer was not working properly. I had to add a
> method to clear the cache each time a new document was to be  
indexed,
> and then it worked as expected. I have never looked into lucenes  
inner

> working so I am not sure if what I did is correct.

Makes sense, I've now incorporated that as well by adding a clear()
method and extracting the functionality into a public class
AnalyzerUtil.TokenCachingAnalyzer.

yes, same here, I could have posted my code, sorry,  but I was not
sure if it was even correct...
When theres is a new lucene 2.1 or whatever I'll incorporate to that
optimization into my code. thanks



Actually, now I'm considering reverting back to the version without a  
public clear() method. The rationale is that this would be less  
complex and more consistent with the AnalyzerUtil design (simple  
methods generating simple anonymous analyzer wrappers). If desired,  
you can still (re)use a single static "child" analyzer instance. It's  
cheap and easy to create a new caching analyzer on top of the static  
analyzer, and to do so before each document. The old one will simply  
be gc'd.
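
A sketch of what that per-document usage might look like (the StandardAnalyzer
child and the surrounding loop are assumptions, not from this thread):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.AnalyzerUtil;

static final Analyzer CHILD = new StandardAnalyzer();  // single shared child analyzer

// per document:
Analyzer caching = AnalyzerUtil.getTokenCachingAnalyzer(CHILD);
// ... use `caching` for everything done with this one document ...
// then discard the wrapper; it is simply gc'd and the next document gets a fresh one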


Let me know if that'd work for you.

Wolfgang.




Re: Searching by bit masks

2006-11-27 Thread Erick Erickson

Well, you really have the code already. From the top...

1> There's no good way to support searching bit fields directly. If you wanted,
you could probably store the mask as a small integer and then search on it, but
that's waaay more complicated than you want.

2> Add the fields like in the snippet you quoted, something like:

Document doc = new Document();
if ((bitsFromDb & 1) != 0) {   // sport bit
    doc.add(new Field("sport", "y", Field.Store.NO, Field.Index.TOKENIZED));
}
if ((bitsFromDb & 2) != 0) {   // music bit
    doc.add(new Field("music", "y", Field.Store.NO, Field.Index.TOKENIZED));
}
// ... one field per flag bit ...
writer.addDocument(doc);       // writer is your IndexWriter



Now, when searching, search on things like new TermQuery(new Term("sport", "y")),
and you'll only get the documents that have that bit set.

Watch out for capitalization: Y may not be equivalent to y. It depends on
the analyzer you use at index AND search time.

You can OR as many of these together as you want. In your example, you could
have up to 4 sub-clauses just for the bitmask equivalents.
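
For the search side, a sketch of ORing several of those flag fields together
(Lucene 2.0 API; searcher is assumed to be an IndexSearcher over the index
built above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Match documents with the sport bit OR the music bit set.
BooleanQuery interests = new BooleanQuery();
interests.add(new TermQuery(new Term("sport", "y")), BooleanClause.Occur.SHOULD);
interests.add(new TermQuery(new Term("music", "y")), BooleanClause.Occur.SHOULD);
Hits hits = searcher.search(interests);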

NOTE: the documents won't all have the same fields. A document may not have,
for instance, the "sport" field. This is OK in Lucene, but it is not the first
thing folks with their DB hat on think of.

Get a copy of Luke (google lucene luke) and get familiar with it for
examining your index and the effects of various analyzers. Really, really,
really get a copy of Luke. Really.

Do you have a copy of "Lucene In Action"? If not, I highly recommend it. It
has tons of useful examples as well as a good introduction to many of the
concepts. It's written to the 1.4 codebase, so be warned that there are some
incompatibilities that are, for the most part, minor.


Best
Erick
On 11/27/06, Biggy <[EMAIL PROTECTED]> wrote:




i have the same problem here. I have an interest bit field, which i
receive from the applciation backend. I have control over how the
docuemtns
are built.
To be specific, the field looks like this:

ID: interest
1 : sport
2 : music
4 : film
8 : clubs

So someone interested in sports and music can be found by "interest & 3"
=>
e.g. when using SQL.

I do not wish to Post-Filter the results
On to Lucene, Is there a filter which supports this kind of query ?

Someone suggested splitting the bits into fields:
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);
Is this helpful at all ?

Code would be helpful too as i am a newbie



ltaylor.employon wrote:
>
> Hello,
>
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about.
>
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document.
>
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this.
>
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future.
>
> My preemptive thanks for your suggestions,
>
>
> Lawrence Taylor
> Senior Software Engineer
> Employon
> Message was edited by: ltaylor.employon
>
>
>





Re: RAMDirectory vs MemoryIndex

2006-11-27 Thread jm

Yes, that would be OK for me, as long as I can reuse my child analyzer.

On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:


On Nov 27, 2006, at 9:57 AM, jm wrote:

> On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>
>> On Nov 26, 2006, at 8:57 AM, jm wrote:
>>
>> > I tested this. I use a single static analyzer for all my documents,
>> > and the caching analyzer was not working properly. I had to add a
>> > method to clear the cache each time a new document was to be
>> indexed,
>> > and then it worked as expected. I have never looked into lucenes
>> inner
>> > working so I am not sure if what I did is correct.
>>
>> Makes sense, I've now incorporated that as well by adding a clear()
>> method and extracting the functionality into a public class
>> AnalyzerUtil.TokenCachingAnalyzer.
> yes, same here, I could have posted my code, sorry,  but I was not
> sure if it was even correct...
> When theres is a new lucene 2.1 or whatever I'll incorporate to that
> optimization into my code. thanks


Actually, now I'm considering reverting back to the version without a
public clear() method. The rationale is that this would be less
complex and more consistent with the AnalyzerUtil design (simple
methods generating simple anonymous analyzer wrappers). If desired,
you can still (re)use a single static "child" analyzer instance. It's
cheap and easy to create a new caching analyzer on top of the static
analyzer, and to do so before each document. The old one will simply
be gc'd.

Let me know if that'd work for you.

Wolfgang.




Re: RAMDirectory vs MemoryIndex

2006-11-27 Thread Wolfgang Hoschek

Ok. I reverted back to the version without a public clear() method.
Wolfgang.

On Nov 27, 2006, at 12:17 PM, jm wrote:


yes that would be ok for my, as long as I can reuse my child analyzer.

On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:


On Nov 27, 2006, at 9:57 AM, jm wrote:

> On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>
>> On Nov 26, 2006, at 8:57 AM, jm wrote:
>>
>> > I tested this. I use a single static analyzer for all my  
documents,
>> > and the caching analyzer was not working properly. I had to  
add a

>> > method to clear the cache each time a new document was to be
>> indexed,
>> > and then it worked as expected. I have never looked into lucenes
>> inner
>> > working so I am not sure if what I did is correct.
>>
>> Makes sense, I've now incorporated that as well by adding a  
clear()

>> method and extracting the functionality into a public class
>> AnalyzerUtil.TokenCachingAnalyzer.
> yes, same here, I could have posted my code, sorry,  but I was not
> sure if it was even correct...
> When theres is a new lucene 2.1 or whatever I'll incorporate to  
that

> optimization into my code. thanks


Actually, now I'm considering reverting back to the version without a
public clear() method. The rationale is that this would be less
complex and more consistent with the AnalyzerUtil design (simple
methods generating simple anonymous analyzer wrappers). If desired,
you can still (re)use a single static "child" analyzer instance. It's
cheap and easy to create a new caching analyzer on top of the static
analyzer, and to do so before each document. The old one will simply
be gc'd.

Let me know if that'd work for you.

Wolfgang.




Re: Querying performance decrease in 1.9.1 and 2.0.0

2006-11-27 Thread Paul Elschot
Stanislav,

On Wednesday 22 November 2006 09:52, Stanislav Jordanov wrote:
> Paul,
> We are working on delivering the next release by the end of the week so 
> I have to take care of 2 or 3 issues before I try the nightly build.
> I promise to try it and report the results here.

I have made a first attempt at restoring the old query performance here:
http://issues.apache.org/jira/browse/LUCENE-730

Regards,
Paul Elschot




Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Michael McCandless

Suman Ghosh wrote:


On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:
> Here are the values:
>
> mergeFactor=10
> maxMergeDocs=10
> minMergeDocs=100
>
> And I see your point. At the time of the crash, I have over 5000
> segments. I'll try some conservative number and try to rebuild the
> index.

Although I don't see how those settings can produce 5000 segments,
I've developed a non-recursive patch you might want to try:
https://issues.apache.org/jira/browse/LUCENE-729


Suman, I'd really like to understand how you're getting so many
segments in your index.  Is this (getting 5000 segments) easy to
reproduce?  Are you closing / reopening your writer every so often (eg
to delete documents or something)?

Mike




Re: Searching by bit masks

2006-11-27 Thread Daniel Noll

Erick Erickson wrote:

Well, you really have the code already . From the top...

1> there's no good way to support searching bitfields If you wanted, you
could probably store it as a small integer and then search on it, but 
that's

waaay too complicated than you want.

2> Add the fields like you have the snippet from, something like
Document doc = new Document.
if (bitsfromdb & 1) {
   doc.add("sport", "y");
}
if (bitsfromdb & 2) {
   doc.add("music", "y");
}


Beware that if there are a large number of bits, this is going to impact 
memory usage due to there being more fields.


Perhaps a better way would be to use a single "bits" field and store the 
words "sport", "music", ... in that field.


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())

2006-11-27 Thread Suman Ghosh

Mike,
I've not tried it yet, but I think the problem can be reproduced.
However, it'll take a few hours to reach that threshold, since my code
also needs to extract text from some very large PDF documents to store
in the index.

I'll post the pseudo-code tomorrow. Maybe that'll help point to mistakes
I'm making in the logic.

Suman

On 11/27/06, Michael McCandless <[EMAIL PROTECTED]> wrote:

Suman Ghosh wrote:

> On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote:
>> > Here are the values:
>> >
>> > mergeFactor=10
>> > maxMergeDocs=10
>> > minMergeDocs=100
>> >
>> > And I see your point. At the time of the crash, I have over 5000
>> > segments. I'll try some conservative number and try to rebuild the
>> > index.
>>
>> Although I don't see how those settings can produce 5000 segments,
>> I've developed a non-recursive patch you might want to try:
>> https://issues.apache.org/jira/browse/LUCENE-729

Suman, I'd really like to understand how you're getting so many
segments in your index.  Is this (getting 5000 segments) easy to
reproduce?  Are you closing / reopening your writer every so often (eg
to delete documents or something)?

Mike
