Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik
Hi

 

I am getting this issue in Lucene 2.4 when I try to merge multiple
IndexWriters (generally 6).

 

sh-3.2# Exception in thread "Lucene Merge Thread #5"
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
        at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:228)
        at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4260)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

 

 

Is this a known issue, and has any fix been provided for it? I would
appreciate any help.

 

Regards

Sumanta

 

 



RE: Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik
I checked http://issues.apache.org/jira/browse/LUCENE-1282;
SegmentMerger.java has this code:

TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
termVectorsWriter.addAllDocVectors(vectors);

so this issue appears in spite of this fix.

I am using Java version "1.6.0_07". Is it fixed in JDK 6u10 and above
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044)?

Regards
Sumanta


-Original Message-
From: Sumanta Bhowmik [mailto:sumanta.bhow...@guavus.com] 
Sent: Thursday, August 20, 2009 1:15 PM
To: java-user@lucene.apache.org
Subject: Merge Exception in Lucene 2.4

Hi

 

I am getting this issue in Lucene 2.4 when I try to merge multiple
IndexWriters (generally 6).

 

sh-3.2# Exception in thread "Lucene Merge Thread #5"
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
        at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:228)
        at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4260)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

 

 

Is this a known issue, and has any fix been provided for it? I would
appreciate any help.

 

Regards

Sumanta

 

 







-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Merge Exception in Lucene 2.4

2009-08-20 Thread Michael McCandless
You should definitely upgrade to the latest JDK 1.6 to get the fix for
the JRE bug in LUCENE-1282, but, I don't think you are hitting that
bug (read past EOF during merge is a different exception).

Can you describe more detail on how you merge 6 IndexWriters?

Mike

On Thu, Aug 20, 2009 at 5:21 AM, Sumanta
Bhowmik wrote:
> I checked at http://issues.apache.org/jira/browse/LUCENE-1282
> SegmentMerger.java has this code
>
> TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
> termVectorsWriter.addAllDocVectors(vectors);
>
> so this issue appears inspite of this fix.
>
> I am using java version "1.6.0_07". Is it fixed in jdk6u10 and above
> (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044) ?
>
> Regards
> Sumanta
>
>
> -Original Message-
> From: Sumanta Bhowmik [mailto:sumanta.bhow...@guavus.com]
> Sent: Thursday, August 20, 2009 1:15 PM
> To: java-user@lucene.apache.org
> Subject: Merge Exception in Lucene 2.4
>
> Hi
>
>
>
> I am getting this issue in Lucene2.4 when I try to merge multiple
> IndexWriters(generally 6)
>
>
>
> sh-3.2# Exception in thread "Lucene Merge Thread #5"
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> read past EOF
>
>        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
>        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
> Caused by: java.io.IOException: read past EOF
>        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
>        at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:228)
>        at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
>        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
>        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4260)
>        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
>        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
>        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
>
>
>
>
>
> Is this a known issue and has any fix been provided for it ? Would
> appreciate any help.
>
>
>
> Regards
>
> Sumanta
>
>
>
>
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik
We put the directories of all the IndexWriters into an array, defined as

final Directory[] finalDir;

We also declare an indexer as

private volatile static Indexer indexer;

final Indexer finalIndexer = indexer;

Next we call the merge in a new thread:

Thread thread = new Thread() {
    public void run() {
        try {
            logger.debug("starts merging w/o optimization");
            finalIndexer.getWriter().addIndexesNoOptimize(finalDir);
            logger.debug("ends merging w/o optimization");
        } catch (CorruptIndexException e) {
            logger.notice("", e);
        } catch (IOException e) {
            logger.notice("", e);
        }
    }
};

Sumanta
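
For reference, a minimal sketch of how addIndexesNoOptimize is usually driven: the source writers are closed first (so their segments are fully flushed to their directories) and the merge thread is explicitly started. The names sourceWriters and targetWriter are hypothetical, and this is only an illustration of the pattern, not a claimed fix for the exception above.

// Sketch only; assumes the usual org.apache.lucene.index / org.apache.lucene.store imports.
// "sourceWriters" is hypothetical; "finalIndexer" is the indexer from the snippet above.
final IndexWriter targetWriter = finalIndexer.getWriter();
final Directory[] finalDir = new Directory[sourceWriters.length];
for (int i = 0; i < sourceWriters.length; i++) {
    finalDir[i] = sourceWriters[i].getDirectory();
    sourceWriters[i].close();              // flush pending docs and release the source index
}

Thread merger = new Thread() {
    public void run() {
        try {
            targetWriter.addIndexesNoOptimize(finalDir);   // copy the source segments in
        } catch (IOException e) {
            // log and handle
        }
    }
};
merger.start();                            // the merge thread must actually be started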




-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, August 20, 2009 3:07 PM
To: java-user@lucene.apache.org
Subject: Re: Merge Exception in Lucene 2.4

You should definitely upgrade to the latest JDK 1.6 to get the fix for
the JRE bug in LUCENE-1282, but, I don't think you are hitting that
bug (read past EOF during merge is a different exception).

Can you describe more detail on how you merge 6 IndexWriters?

Mike

On Thu, Aug 20, 2009 at 5:21 AM, Sumanta
Bhowmik wrote:
> I checked at http://issues.apache.org/jira/browse/LUCENE-1282
> SegmentMerger.java has this code
>
> TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
> termVectorsWriter.addAllDocVectors(vectors);
>
> so this issue appears inspite of this fix.
>
> I am using java version "1.6.0_07". Is it fixed in jdk6u10 and above
> (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044) ?
>
> Regards
> Sumanta
>
>
> -Original Message-
> From: Sumanta Bhowmik [mailto:sumanta.bhow...@guavus.com]
> Sent: Thursday, August 20, 2009 1:15 PM
> To: java-user@lucene.apache.org
> Subject: Merge Exception in Lucene 2.4
>
> Hi
>
>
>
> I am getting this issue in Lucene2.4 when I try to merge multiple
> IndexWriters(generally 6)
>
>
>
> sh-3.2# Exception in thread "Lucene Merge Thread #5"
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> read past EOF
>
>        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
>        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
> Caused by: java.io.IOException: read past EOF
>        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
>        at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:228)
>        at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
>        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
>        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4260)
>        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
>        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
>        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
>
>
>
>
>
> Is this a known issue and has any fix been provided for it ? Would
> appreciate any help.
>
>
>
> Regards
>
> Sumanta
>
>
>
>
>
>
>
>
>
>
>
> -----
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Extending Sort/FieldCache

2009-08-20 Thread Shai Erera
Hi

I'd like to extend Lucene's FieldCache such that it will read native values
from a different place (in my case, payloads). That is, instead of iterating
on a field's terms and parsing each String to long (for example), I'd like
to iterate over one term (sort:long, again - an example) and decode each
payload value to long, and store it in the cache. The reason I want to
extend Lucene's FieldCache is because I'd like Lucene to take care of
updating this cache when necessary (such as after reopen for example). This
will allow me to use Lucene's Sort option more easily.

I noticed Sort can be extended by providing a CUSTOM SortField, but that
forces me to create Comparable objects, which is much more expensive than
native longs (40 bytes vs. 8 bytes).

I didn't find a way, though, to extend FieldCache or ExtendedFieldCache -->
even though both are extendable, I can't find the place where they're passed
as input to TopFieldDocCollector, FieldSortedHitQueue, etc. Perhaps I'm
missing it?

I need to do it on top of 2.4.1, but if it's not possible and something was
done in 2.9 (which I missed) in that regard I'd be happy to get a pointer to
that.

Today, I use this payload to implement sorting, which is memory efficient
for very large indexes, but for small indexes I think Lucene's built-in sort
will perform better.

BTW, if it interests anyone -- augmenting Lucene's sort with reading values
from a payload, or doing a complete payload-based sort -- I can work up
a patch ...

Thanks
Shai
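
For what it's worth, a rough sketch of the decoding side described above -- walking the postings of a single marker term and reading each document's payload into a native long[] -- could look roughly like this on 2.4. The field/term name ("sort"/"long"), the 8-byte big-endian encoding, and the class name are assumptions, and the FieldCache-style reload-after-reopen wiring is not shown.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PayloadLongCache {
    /** Decodes one 8-byte payload per document for the marker term sort:long. */
    public static long[] load(IndexReader reader) throws IOException {
        long[] values = new long[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("sort", "long"));
        try {
            byte[] buf = new byte[8];
            while (tp.next()) {
                tp.nextPosition();                  // the marker term occurs once per doc
                if (tp.isPayloadAvailable() && tp.getPayloadLength() == 8) {
                    tp.getPayload(buf, 0);
                    long v = 0;
                    for (int i = 0; i < 8; i++) {
                        v = (v << 8) | (buf[i] & 0xFF);   // big-endian decode, by convention
                    }
                    values[tp.doc()] = v;
                }
            }
        } finally {
            tp.close();
        }
        return values;
    }
}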


Re: custom scorer

2009-08-20 Thread Chris Salem
No. I take it I have to use it for both?  Is there anything else I need
to do?
Sincerely,
Chris Salem 


- Original Message - 
To: java-user@lucene.apache.org
From: Grant Ingersoll 
Sent: 8/19/2009 7:17:45 PM
Subject: Re: custom scorer


Are you setting the Similarity before indexing, too, on the IndexWriter?

On Aug 19, 2009, at 4:20 PM, Chris Salem wrote:

> Hello,
> I'm trying to write a custom scorer that only uses the term
> frequency function from the DefaultSimilarity class. The problem is
> that documents with lower frequencies are returning with higher
> scores than documents with higher frequencies. Here's the code:
>
> searcher.setSimilarity(new DefaultSimilarity() {
>     public float lengthNorm(String field, int numTerms) {
>         return 1;
>     }
>     public float idf(int docFreq, int numDocs) {
>         return 1;
>     }
>     public float coord(int overlap, int maxoverlap) {
>         return 1;
>     }
>     public float queryNorm(float sumOfSquaredWeights) {
>         return 1;
>     }
>     public float sloppyFreq(int distance) {
>         return 1;
>     }
> });
>
> Any idea why this wouldn't be working?
> Sincerely,
> Chris Salem

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
using Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


Re: custom scorer

2009-08-20 Thread Simon Willnauer
You could simply set Similarity.setDefault(yourSimilarity) to make
sure it is used all over the place.

Simon
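
A minimal sketch of that, assuming a hypothetical MySimilarity subclass like the anonymous one quoted below, an open Directory "directory", and the usual org.apache.lucene.* imports:

// Install the custom Similarity once, before any indexing or searching,
// so both IndexWriter and IndexSearcher pick it up.
Similarity.setDefault(new MySimilarity());

IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
// ... add documents, close the writer ...

IndexSearcher searcher = new IndexSearcher(directory);
// no explicit setSimilarity() call is needed; the default is used everywhere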

On Thu, Aug 20, 2009 at 3:25 PM, Chris Salem wrote:
> No, I take it I have to use it for both?  Is there anything else I should 
> have to do?
> Sincerely,
> Chris Salem
>
>
> - Original Message -
> To: java-user@lucene.apache.org
> From: Grant Ingersoll 
> Sent: 8/19/2009 7:17:45 PM
> Subject: Re: custom scorer
>
>
> Are you setting the Similarity before indexing, too, on the IndexWriter?
>
> On Aug 19, 2009, at 4:20 PM, Chris Salem wrote:
>
>> Hello,
>> I'm trying to write a custom scorer that only uses the term
>> frequency function from the DefaultSimilarity class, the problem is
>> that documents with lower frequencies are returning with higher
>> scores than documents with higher frequencies. Here's the code:
>> searcher.setSimilarity(new DefaultSimilarity(){
>> public float lengthNorm(String field, int numTerms){
>> return 1;
>> }
>> public float idf(int docFreq, int numDocs){
>> return 1;
>> }
>> public float coord(int overlap, int maxoverlap){
>> return 1;
>> }
>> public float queryNorm(float sumOfSquaredWeights){
>> return 1;
>> }
>> public float sloppyFreq(int distance){
>> return 1;
>> }
>> });
>> Any idea why this wouldn't be working?
>> Sincerely,
>> Chris Salem
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery

Hi all, 

I am trying to tune Lucene to respect tokens like C++, C#, and .NET.

The task is known to the Lucene community, but surprisingly I can't google up
much good info on it.

Of course, I tried to re-use Lucene's building blocks for the Tokenizer. Here
we go:

  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
C#, .NET" ends up as "c c net". Too bad.

  2) WhitespaceTokenizer gives me a lot of lexemes that actually should
have been chopped into smaller pieces. Example: "C/C++" comes out as a
single lexeme. If I follow this way I end up with "tokenization of tokens" --
that sounds a bit odd, doesn't it?

  3) CharTokenizer allows me to add '/' as another token-emitting
char, but then '/' gets immediately lost, like the whitespace chars. As a
result, "SAP R/3" ends up as "SAP" "R" "3", and one will need to search the
original char stream for the "/" char to re-build the "SAP R/3" term as a
whole.

Do you see any other relevant building blocks that I have missed?

Also, people around here have suggested that such a problem should be solved
by a synonym dictionary. However, this hint sheds no light on which
tokenization strategy would be more appropriate *before* the synonym step.

So, it looks like I have to take the CharTokenizer class as a starting
point and write my own Tokenizer anew. This Tokenizer should also react to
delimiting characters and emit the token. However, it should distinguish
between delimiters like whitespace and ";,?" and delimiters like "./&".

Indeed, delimiters like whitespace and ";,?" should be thrown away at the
lexeme level, whereas token-emitting characters like "./&" should be kept at
the lexeme level.

Your comments, gurus?

regards,
Valery

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery,

One thing you could try would be to create a JFlex-based tokenizer,
specifying a grammar with the rules you want.
You could use the source code & grammar of StandardTokenizer as a
starting point.


On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:
>
> Hi all,
>
> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>
> The task is known for Lucene community, but surprisingly I can't google out
> somewhat good info on it.
>
> Of course, I tried to re-use Lucene's  building blocks for Tokenizer. Here
> we go:
>
>  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
> C#, .NET" ends up with "c c net". Too bad.
>
>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
> have been chopped into smaller pieces. Example: "C/C++" comes out like a
> single lexem. If I follow this way I end-up with "Tokenization of tokens" --
> that sounds a bit odd, doesn't it?
>
>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
> char, but then '/' gets immediately lost like those whitespace chars. In
> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search the
> original char stream for the "/" char to re-build "SAP R/3" term as a whole.
>
> Do you see any other relevant building blocks missed by me?
>
> Also, people around there have meant that such problem should be solved by a
> synonym dictionary. However this hint sheds no light on which tokenization
> strategy should be more appropriate *before* the synonym step.
>
> So, it looks like I have to take the class CharTokenizer as for the starting
> point and write anew my own Tokenizer. This Tokenizer should also react on
> delimiting characters and emit the token. However, it should distinguish
> between delimiters like whitespaces along with ";,?" and the delimiters like
> "./&".
>
> Indeed, the delimiters like whitespaces and ";,?" should be thrown away from
> Lexem level,
> whereas the token emitting characters like "./&" should be kept in Lexem
> level.
>
> Your comments, gurus?
>
> regards,
> Valery
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery

Hi Robert, 

thanks for the hint. 

Indeed, a natural way to go -- especially if one builds a Tokenizer with the
level of quality of StandardTokenizer.

OTOH, do you mean that the out-of-the-box stuff is indeed not customizable for
this task?..

regards
Valery



Robert Muir wrote:
> 
> Valery,
> 
> One thing you could try would be to create a JFlex-based tokenizer,
> specifying a grammar with the rules you want.
> You could use the source code & grammar of StandardTokenizer as a
> starting point.
> 
> 
> On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:
>>
>> Hi all,
>>
>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>
>> The task is known for Lucene community, but surprisingly I can't google
>> out
>> somewhat good info on it.
>>
>> Of course, I tried to re-use Lucene's  building blocks for Tokenizer.
>> Here
>> we go:
>>
>>  1) StandardTokenizer -- oh, this option would be just fantastic, but
>> "C++,
>> C#, .NET" ends up with "c c net". Too bad.
>>
>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
>> have been chopped into smaller pieces. Example: "C/C++" comes out like a
>> single lexem. If I follow this way I end-up with "Tokenization of tokens"
>> --
>> that sounds a bit odd, doesn't it?
>>
>>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
>> char, but then '/' gets immediately lost like those whitespace chars. In
>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search
>> the
>> original char stream for the "/" char to re-build "SAP R/3" term as a
>> whole.
>>
>> Do you see any other relevant building blocks missed by me?
>>
>> Also, people around there have meant that such problem should be solved
>> by a
>> synonym dictionary. However this hint sheds no light on which
>> tokenization
>> strategy should be more appropriate *before* the synonym step.
>>
>> So, it looks like I have to take the class CharTokenizer as for the
>> starting
>> point and write anew my own Tokenizer. This Tokenizer should also react
>> on
>> delimiting characters and emit the token. However, it should distinguish
>> between delimiters like whitespaces along with ";,?" and the delimiters
>> like
>> "./&".
>>
>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away
>> from
>> Lexem level,
>> whereas the token emitting characters like "./&" should be kept in Lexem
>> level.
>>
>> Your comments, gurus?
>>
>> regards,
>> Valery
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
> 
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery, oh I think there might be other ways to solve this.

But you provided some examples such as C/C++ and SAP R/3.
In these two examples you want the "/" to behave differently depending
upon context, so my first thought was that a grammar might be a good
way to ensure it does what you want.

On Thu, Aug 20, 2009 at 11:09 AM, Valery wrote:
>
> Hi Robert,
>
> thanks for the hint.
>
> Indeed, a natural way to go. Especially if one builds a Tokenizer of the
> level of quality like StandardTokenizer's.
>
> OTOH, you mean that the out-of-the-box stuff is indeed not customizable for
> this task?..
>
> regards
> Valery
>
>
>
> Robert Muir wrote:
>>
>> Valery,
>>
>> One thing you could try would be to create a JFlex-based tokenizer,
>> specifying a grammar with the rules you want.
>> You could use the source code & grammar of StandardTokenizer as a
>> starting point.
>>
>>
>> On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>
>>> The task is known for Lucene community, but surprisingly I can't google
>>> out
>>> somewhat good info on it.
>>>
>>> Of course, I tried to re-use Lucene's  building blocks for Tokenizer.
>>> Here
>>> we go:
>>>
>>>  1) StandardTokenizer -- oh, this option would be just fantastic, but
>>> "C++,
>>> C#, .NET" ends up with "c c net". Too bad.
>>>
>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
>>> have been chopped into smaller pieces. Example: "C/C++" comes out like a
>>> single lexem. If I follow this way I end-up with "Tokenization of tokens"
>>> --
>>> that sounds a bit odd, doesn't it?
>>>
>>>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
>>> char, but then '/' gets immediately lost like those whitespace chars. In
>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search
>>> the
>>> original char stream for the "/" char to re-build "SAP R/3" term as a
>>> whole.
>>>
>>> Do you see any other relevant building blocks missed by me?
>>>
>>> Also, people around there have meant that such problem should be solved
>>> by a
>>> synonym dictionary. However this hint sheds no light on which
>>> tokenization
>>> strategy should be more appropriate *before* the synonym step.
>>>
>>> So, it looks like I have to take the class CharTokenizer as for the
>>> starting
>>> point and write anew my own Tokenizer. This Tokenizer should also react
>>> on
>>> delimiting characters and emit the token. However, it should distinguish
>>> between delimiters like whitespaces along with ";,?" and the delimiters
>>> like
>>> "./&".
>>>
>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away
>>> from
>>> Lexem level,
>>> whereas the token emitting characters like "./&" should be kept in Lexem
>>> level.
>>>
>>> Your comments, gurus?
>>>
>>> regards,
>>> Valery
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Ken Krugler

Hi Valery,

From our experience at Krugle, we wound up having to create our own
tokenizers (actually a kind of specialized parser) for the different
languages. It didn't seem like a good option to try to twist one of
the existing tokenizers into something that would work well enough. We
wound up using ANTLR for this.


-- Ken


On Aug 20, 2009, at 8:09am, Valery wrote:



> Hi Robert,
>
> thanks for the hint.
>
> Indeed, a natural way to go. Especially if one builds a Tokenizer of the
> level of quality like StandardTokenizer's.
>
> OTOH, you mean that the out-of-the-box stuff is indeed not customizable for
> this task?..
>
> regards
> Valery
>
>
> Robert Muir wrote:
>>
>> Valery,
>>
>> One thing you could try would be to create a JFlex-based tokenizer,
>> specifying a grammar with the rules you want.
>> You could use the source code & grammar of StandardTokenizer as a
>> starting point.
>>
>>
>> On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>
>>> The task is known for Lucene community, but surprisingly I can't google out
>>> somewhat good info on it.
>>>
>>> Of course, I tried to re-use Lucene's building blocks for Tokenizer. Here
>>> we go:
>>>
>>>  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
>>> C#, .NET" ends up with "c c net". Too bad.
>>>
>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
>>> have been chopped into smaller pieces. Example: "C/C++" comes out like a
>>> single lexem. If I follow this way I end-up with "Tokenization of tokens" --
>>> that sounds a bit odd, doesn't it?
>>>
>>>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
>>> char, but then '/' gets immediately lost like those whitespace chars. In
>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search the
>>> original char stream for the "/" char to re-build "SAP R/3" term as a whole.
>>>
>>> Do you see any other relevant building blocks missed by me?
>>>
>>> Also, people around there have meant that such problem should be solved by a
>>> synonym dictionary. However this hint sheds no light on which tokenization
>>> strategy should be more appropriate *before* the synonym step.
>>>
>>> So, it looks like I have to take the class CharTokenizer as for the starting
>>> point and write anew my own Tokenizer. This Tokenizer should also react on
>>> delimiting characters and emit the token. However, it should distinguish
>>> between delimiters like whitespaces along with ";,?" and the delimiters like
>>> "./&".
>>>
>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away from
>>> Lexem level,
>>> whereas the token emitting characters like "./&" should be kept in Lexem
>>> level.
>>>
>>> Your comments, gurus?
>>>
>>> regards,
>>> Valery
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> View this message in context:
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Ken Krugler
TransPac Software, Inc.

+1 530-210-6378


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery

Hi Ken, 

thanks for the comments. Well, Terence's ANTLR was and is a good piece of
work.

Do you mean that you used ANTLR to generate a Tokenizer (lexeme parser),

or

did you proceed even further and use ANTLR to generate higher-level parsers
to overrule Lucene's TokenFilters?

or maybe even both?..

regards,
Valery



Ken Krugler wrote:
> 
> Hi Valery,
> 
>  From our experience at Krugle, we wound up having to create our own  
> tokenizers (actually kind of  specialized parser) for the different  
> languages. It didn't seem like a good option to try to twist one of  
> the existing tokenizers into something that would work well enough. We  
> wound up using ANTLR for this.
> 
> -- Ken
> 
> 
> On Aug 20, 2009, at 8:09am, Valery wrote:
> 
>>
>> Hi Robert,
>>
>> thanks for the hint.
>>
>> Indeed, a natural way to go. Especially if one builds a Tokenizer of  
>> the
>> level of quality like StandardTokenizer's.
>>
>> OTOH, you mean that the out-of-the-box stuff is indeed not  
>> customizable for
>> this task?..
>>
>> regards
>> Valery
>>
>>
>>
>> Robert Muir wrote:
>>>
>>> Valery,
>>>
>>> One thing you could try would be to create a JFlex-based tokenizer,
>>> specifying a grammar with the rules you want.
>>> You could use the source code & grammar of StandardTokenizer as a
>>> starting point.
>>>
>>>
>>> On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:

>>>> Hi all,
>>>>
>>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>>
>>>> The task is known for Lucene community, but surprisingly I can't google out
>>>> somewhat good info on it.
>>>>
>>>> Of course, I tried to re-use Lucene's building blocks for Tokenizer. Here
>>>> we go:
>>>>
>>>>  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
>>>> C#, .NET" ends up with "c c net". Too bad.
>>>>
>>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
>>>> have been chopped into smaller pieces. Example: "C/C++" comes out like a
>>>> single lexem. If I follow this way I end-up with "Tokenization of tokens" --
>>>> that sounds a bit odd, doesn't it?
>>>>
>>>>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
>>>> char, but then '/' gets immediately lost like those whitespace chars. In
>>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search the
>>>> original char stream for the "/" char to re-build "SAP R/3" term as a whole.
>>>>
>>>> Do you see any other relevant building blocks missed by me?
>>>>
>>>> Also, people around there have meant that such problem should be solved by a
>>>> synonym dictionary. However this hint sheds no light on which tokenization
>>>> strategy should be more appropriate *before* the synonym step.
>>>>
>>>> So, it looks like I have to take the class CharTokenizer as for the starting
>>>> point and write anew my own Tokenizer. This Tokenizer should also react on
>>>> delimiting characters and emit the token. However, it should distinguish
>>>> between delimiters like whitespaces along with ";,?" and the delimiters like
>>>> "./&".
>>>>
>>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away from
>>>> Lexem level,
>>>> whereas the token emitting characters like "./&" should be kept in Lexem
>>>> level.
>>>>
>>>> Your comments, gurus?
>>>>
>>>> regards,
>>>> Valery
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org


>>>
>>>
>>>
>>> -- 
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
> 
> --
> Ken Krugler
> TransPac Software, Inc.
> 
> +1 530-210-6378
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-ma

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery

Hi Robert, 

so, would you expect a Tokenizer to consider '/' in both cases as a separate
Token?

Personally, I see no problem if the Tokenizer would do the following job:

"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"} 
and come up with "C" and "C++" tokens after processing through the
downstream tokenfilters.

Similarly:

"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"} 
and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.

I try to follow the spirit that a token (or its lexeme) usually should never be
parsed again. One can build more complex (compound) things from the tokens.
However, one usually never chops a lexeme into smaller pieces.

What do you think, Robert?

regards,
Valery

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery, I think it all depends on how you want your search to work.

when I say this, I mean for example: if a document only contains "C++"
do you want searches on just "C" to match or not?

another thing I would suggest is to take a look at the capabilities of
Solr: it has some analysis stuff that might be beneficial for your
needs.
wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


On Thu, Aug 20, 2009 at 1:46 PM, Valery wrote:
>
> Hi Robert,
>
> so, would you expect a Tokenizer to consider '/' in both cases as a separate
> Token?
>
> Personally, I see no problem if Tokenzer would do the following job:
>
> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
> and come up with "C" and "C++" tokens after processing through the
> downstream tokenfilters.
>
> Similarly:
>
> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>
> I try to follow a spirit that a token (or its lexem) usually should never be
> parsed again. One can build  more complex (compound) things from the tokens.
> However, usually one never chops a lexem into smaller pieces.
>
> What do you think, Robert?
>
> regards,
> Valery
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Simon Willnauer
Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
do any further processing in a custom TokenFilter?!

simon
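
To illustrate the idea, a sketch only (the keep-list and the trimming rules are assumptions, not an established Lucene filter): a TokenFilter sitting behind WhitespaceTokenizer could preserve a known set of symbol-bearing terms and strip trailing punctuation from everything else. Splitting compounds like "C/C++" would still need extra handling in the tokenizer or another filter.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class KeepSymbolTermsFilter extends TokenFilter {

    // hypothetical keep-list; in practice this would come from configuration
    private static final Set<String> KEEP =
            new HashSet<String>(Arrays.asList("c++", "c#", ".net", "r/3"));

    public KeepSymbolTermsFilter(TokenStream input) {
        super(input);
    }

    public Token next(final Token reusableToken) throws IOException {
        Token token = input.next(reusableToken);
        if (token == null) {
            return null;
        }
        char[] buf = token.termBuffer();
        int len = token.termLength();

        // drop sentence punctuation glued to the end of the word: "C++," -> "C++"
        while (len > 0 && ".,;:?!".indexOf(buf[len - 1]) >= 0) {
            len--;
        }
        token.setTermLength(len);

        String text = new String(buf, 0, len).toLowerCase();
        if (KEEP.contains(text)) {
            return token;                  // keep "C++", "C#", ".NET", ... intact
        }

        // for ordinary words, also drop trailing symbol chars such as '+' or '#'
        // (an all-punctuation token would come out empty; a real filter might skip it)
        while (len > 0 && !Character.isLetterOrDigit(buf[len - 1])) {
            len--;
        }
        token.setTermLength(len);
        return token;
    }
}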

On Thu, Aug 20, 2009 at 8:48 PM, Robert Muir wrote:
> Valery, I think it all depends on how you want your search to work.
>
> when I say this, I mean for example: if a document only contains "C++"
> do you want searches on just "C" to match or not?
>
> another thing I would suggest is to take a look at the capabilities of
> Solr: it has some analysis stuff that might be beneficial for your
> needs.
> wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
>
> On Thu, Aug 20, 2009 at 1:46 PM, Valery wrote:
>>
>> Hi Robert,
>>
>> so, would you expect a Tokenizer to consider '/' in both cases as a separate
>> Token?
>>
>> Personally, I see no problem if Tokenzer would do the following job:
>>
>> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
>> and come up with "C" and "C++" tokens after processing through the
>> downstream tokenfilters.
>>
>> Similarly:
>>
>> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
>> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>>
>> I try to follow a spirit that a token (or its lexem) usually should never be
>> parsed again. One can build  more complex (compound) things from the tokens.
>> However, usually one never chops a lexem into smaller pieces.
>>
>> What do you think, Robert?
>>
>> regards,
>> Valery
>>
>> --
>> View this message in context: 
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Hi,

This question is going to be a little complicated to explain, but let me try.

I have implemented an indexer app based on the demo IndexFiles app, and a web 
app based on the luceneweb web app for the searching.

In my case, the "Documents" that I'm indexing are a proprietary file type, and 
each document has a kind of "sub-document" structure.  So, in my indexer, I parse each of 
the sub-documents, and, for a given "Document", I build a long string 
containing terms that I extracted from each of the sub-documents, then I do:

doc.add(new Field("contents", longstring, Field.Store.YES, 
Field.Index.ANALYZED));

I also add the longstring to another non-indexed field, summary:

doc.add(new Field("summary", longstring, Field.Store.YES, Field.Index.NO));

The modified luceneweb web app that I use is pretty vanilla, and originally, 
what I was asked to do was to be able to search just for a Document, i.e., 
given a query like "X and Y" (document containing both term=X and term=Y), 
return the file path+name for the document.  I also was displaying the terms 
associated with each sub-document by parsing the 'summary' string.

So, for example, if "Document1" contained 3 sub-documents (which contained 
(term1, term2), (term1a, term2a), and (term1b, term2b), respectively), and if I 
queried for "term1a AND term2a", the web app would display something like:

Document1 subdoc1 term1 term2
  subdoc2 term1a term2a
  subdoc3 term1b term2b

However, I've now been asked to implement the ability to query the 
sub-documents. 

In other words, rather than the web app displaying what I showed above, they 
want it to return something like just:

Document1 subdoc2 term1a term2a

Right now, the web app gets the 'summary' (again, in a long string), then just 
breaks it into subdoc1, subdoc2, and subdoc3 lines, just for display purposes, 
so to do what I've been asked, I need to query the 3 sub-strings from the 
'summary', i.e., run the "term1a AND term2a" query against the following 
strings:

subdoc1 term1 term2
subdoc2 term1a term2a
subdoc3 term1b term2b

I guess that I can write a method to do that, but I want to make sure that the 
sub-document/string query "duplicates" the behavior of the Lucene query.

It seems like there should be a way to duplicate the Lucene query logic by 
using something (methods) in Lucene itself??

I've been reviewing the Javadocs, but I'm still fairly new to Lucene, so I was 
hoping that someone could point me in the right direction?

My apologies for the longish post, but I hope that I've been able to explain 
clearly :)!!

Thanks,
Jim

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Hi,

I guess, that, in short, what I'm really trying to find out is:

If I construct a Lucene query, can I (somehow) use that to query a String 
object that I have, rather than querying against a Lucene index?

Thanks,
Jim


 oh...@cox.net wrote: 
> Hi,
> 
> This question is going to be a little complicated to explain, but let me try.
> 
> I have implemented an indexer app based on the demo IndexFiles app, and a web 
> app based on the luceneweb web app for the searching.
> 
> In my case, the "Documents" that I'm indexing are a proprietary file type, 
> and each document has kind of "sub-documents".  So, in my indexer, I parse 
> each of the sub-documents, and, for a given "Document", I build a long string 
> containing terms that I extracted from each of the sub-documents, then I do:
> 
> doc.add(new Field("contents", longstring, Field.Store.YES, 
> Field.Index.ANALYZED));
> 
> I also add the longstring to another non-indexed field, summary:
> 
> doc.add(new Field("summary", longstring, Field.Store.YES, Field.Index.NO));
> 
> The modified luceneweb web app that I use is pretty vanilla, and originally, 
> what I was asked to do was to be able to search just for a Document, i.e., 
> given a query like "X and Y" (document containing both term=X and term=Y), 
> return the file path+name for the document.  I also was displaying the terms 
> associated with each sub-document by parsing the 'summary' string.
> 
> So, for example, if "Document1" contained 3 sub-documents (which contained 
> (term1, term2), (term1a, term2a), and (term1b, term2b), respectively), and if 
> I queried for "term1a AND term2a", the web app would display something like:
> 
> Document1 subdoc1 term1 term2
>   subdoc2 term1a term2a
>   subdoc3 term1b term2b
> 
> However, I've now been asked to implement the ability to query the 
> sub-documents. 
> 
> In other words, rather than the web app displaying what I showed above, they 
> want it to return something like just:
> 
> Document1 subdoc2 term1a term2a
> 
> Right now, the web app gets the 'summary' (again, in a long string), then 
> just breaks it into subdoc1, subdoc2, and subdoc3 lines, just for display 
> purposes, so to do what I've been asked, I need to query the 3 sub-strings 
> from the 'summary', i.e., run the "term1a AND term2a" query against the 
> following strings:
> 
> subdoc1 term1 term2
> subdoc2 term1a term2a
> subdoc3 term1b term2b
> 
> I guess that I can write a method to do that, but I want to make sure that 
> the sub-document/string query "duplicates" the behavior of the Lucene query.
> 
> It seems like there should be a way to duplicate the Lucene query logic by 
> using something (methods) in Lucene itself??
> 
> I've been reviewing the Javadocs, but I'm still fairly new to Lucene, so I 
> was hoping that someone could point me in the right direction?
> 
> My apologies for the longish post, but I hope that I've been able to explain 
> clearly :)!!
> 
> Thanks,
> Jim
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread Paul Cowan

oh...@cox.net wrote:

Document1 subdoc1 term1 term2
  subdoc2 term1a term2a
  subdoc3 term1b term2b

However, I've now been asked to implement the ability to query the sub-documents. 


In other words, rather than the web app displaying what I showed above, they 
want it to return something like just:

Document1 subdoc2 term1a term2a


Just checking here... you only want to match where the terms are in 
specific sub-documents? That is, if someone searches for 'term1a AND 
term2b', what do you want to see? Nothing (because no sub-document 
matches both terms)? Or subdoc2 and subdoc3, because they're both part 
of the reason that Document1 matched?


If the former, then just indexing each sub-doc as a separate document 
(duplicating the document-level information) may be the simplest option.


Cheers,

Paul

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya

 Paul Cowan  wrote: 
> oh...@cox.net wrote:
> > Document1 subdoc1 term1 term2
> >   subdoc2 term1a term2a
> >   subdoc3 term1b term2b
> >
> > However, I've now been asked to implement the ability to query the 
> > sub-documents. 
> >
> > In other words, rather than the web app displaying what I showed above, 
> > they want it to return something like just:
> >
> > Document1 subdoc2 term1a term2a
> 
> Just checking here... you only want to match where the terms are in 
> specific sub-documents? That is, if someone searches for 'term1a AND 
> term2b', what do you want to see? Nothing (because no sub-document 
> matches both terms)? Or subdoc2 and subdoc3, because they're both part 
> of the reason that Document1 matched?
> 
> If the former, then just indexing each sub-doc as a separate document 
> (duplicating the document-level information) may be the simplest option.
> 
> Cheers,
> 
> Paul
>


Hi Paul,

Hah!

Yes, it's the former I think...

The "Hah!" was because I was googling, and just ran across this:

http://javatechniques.com/blog/lucene-in-memory-text-search-example/

which, I think, creates an in-memory index, then searches it.

I was reading through that, as I saw your message.

As I was reading, though, I started wondering: this seems like it would create an 
awful lot of overhead?

In other words:

- I'd have to create a (very small) index, for each sub-document, where I do 
the Document.add() with just the (for example) two terms, then
- Run a query against the 1-entry index, which
- Would either give me a "yes" or "no" (for that sub-document)

As I said, I'm concerned about overhead.  Some of the documents are quite 
large, containing >20K sub-documents.  That means that, for such a document, 
I'd have to create >20K indexes.

Is there really no other way to do this?  I guess that, in my mind, I keep 
thinking about somehow "redirecting" Lucene to do a search on a single String 
object (that was just a kind of metaphor)?

Comments?

Thanks for your response!

Jim
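
For the record, the in-memory route doesn't have to mean a full RAMDirectory per sub-document: Lucene's contrib "memory" module has MemoryIndex, which builds a throwaway single-document index and scores one Query against it. A hedged sketch follows; the field name "contents", the analyzer choice, and the class name are assumptions, and whether this is fast enough for >20K sub-documents per document would need measuring.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class SubDocMatcher {

    private final Analyzer analyzer = new StandardAnalyzer();

    /** Returns true if the already-parsed query matches this sub-document's text. */
    public boolean matches(Query query, String subDocText) {
        MemoryIndex index = new MemoryIndex();
        // the field name must match the field the query was parsed against
        index.addField("contents", subDocText, analyzer);
        return index.search(query) > 0.0f;   // MemoryIndex returns a score; 0 means no match
    }
}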



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread Paul Cowan

oh...@cox.net wrote:

- I'd have to create a (very small) index, for each sub-document, where I do 
the Document.add() with just the (for example) two terms, then
- Run a query against the 1-entry index, which
- Would either give me a "yes" or "no" (for that sub-document)

As I said, I'm concerned about overhead.  Some of the documents are quite large, 
containing >20K sub-documents.  That means that, for such a document, I'd have to 
create >20K indexes.


No, I'm talking about a separate document in the same index.

There are a few approaches here:

1) Index each sub-document separately. So if you have fields 'doc#', 
'docname', 'subdoc#', and 'subdocterms', you might do:


   for (Doc parent : docs) {
     for (SubDoc child : parent.subDocs()) {
       Document luceneDoc = new Document();
       luceneDoc.add(new Field("doc#", parent.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
       luceneDoc.add(new Field("docname", parent.name(), Field.Store.YES, Field.Index.ANALYZED));
       luceneDoc.add(new Field("subdoc#", child.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
       luceneDoc.add(new Field("subdocterms", child.data(), Field.Store.NO, Field.Index.ANALYZED));
       writer.addDocument(luceneDoc);   // one Lucene document per sub-document
     }
   }

This means that in your index after indexing 2 docs with 2 subdocs each, 
you'll have

   (Lucene #)   doc#   docname   subdoc#   subdocterms
   ----------------------------------------------------
   0            100    Foo       101       subdoc1 terms here
   1            100    Foo       102       subdoc2 terms
   2            200    Bar       201       subdoc1 terms from doc2
   3            200    Bar       202       some more subdoc text

So the search you're doing is actually on the subdoc level. This can get 
complicated, especially as subdocs from the same parent doc may come 
back out of order, etc, depending on scoring/sorting.


Also, if there is a lot of data at the parent level, you're obviously 
duplicating it. This can get nasty.


2) Maintain a (logically) separate subdoc index. You could have 
something like:

   doc#   docname   bigblobofdocdata
   ----------------------------------
   100    Foo       lots of data here...
   200    Bar       and lots more here..
in one index, and
   doc#   subdoc#   subdocterms
   ----------------------------------
   100    101       subdoc1 terms here
   100    102       subdoc2 terms
   200    201       subdoc1 terms from doc2
   200    202       some more subdoc text

Then you can FIRST search on the doc index to do any matches on 
'docname' etc, then use the IDs you find to filter the subdoc index -- 
so if the user searches for 'docname=foo' and 'subdocterms=text', you 
first do the docname search to get the docname-matching doc (100), then 
do a search on the second index for 'subdocterms', but also filter where 
doc#=100.


Note they don't HAVE to be separate indexes -- you can actually keep 
these in the same physical index, with some sort of discriminator (all 
docs in an index don't have to have the same fields).


3) Do some really hardcore tricks with spanqueries. This is what I'm 
working on at the moment, so it's near and dear to my heart. It's not 
for the faint-hearted, though, and if you're new to Lucene may scare you 
off, sorry! Basically Lucene has the concept of 'positions' for terms -- 
metadata about where in the document the term can be found. This lets 
you do 'near' queries, etc.


We're taking advantage of that to do some many-to-one stuff like your 
problem. Using the first example, with term positions indicated in [], 
we position terms from different subdocs with a large gap between them, 
like so:


   (Lucene #)   doc#   docname   subdoc#    subdocterms
   -----------------------------------------------------------------
   0            100    Foo       101[0]     subdoc1[0] terms[1] here[2]
                                 102[100]   subdoc2[100] terms[101]

   1            200    Bar       201[0]     subdoc1[0] terms[1] from[2] doc2[3]
                                 202[100]   some[100] more[101] subdoc[102] text[103]

So in each doc, subdoc #1's terms start at 0, #2's at 100, #3s at 200, 
etc. Then when we search we can say 'the terms you're looking for must 
be in the same 100-position block' to find only subdocs that match all 
subdoc-related subqueries. This is pretty hairy but is working well for 
us -- massively reduces our indexing and search times compared to the 
duplicated document way I mentioned above.
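
A rough sketch of that trick on 2.4 (the gap of 100 and the field name come from the example above; the analyzer, slop value, and helper names like subDocTexts are assumptions, and the actual implementation may well differ):

// The analyzer inserts a 100-position gap between repeated field instances,
// so each sub-document occupies its own position block. Any Analyzer subclass
// works; StandardAnalyzer is just used here for illustration.
Analyzer gapAnalyzer = new StandardAnalyzer() {
    public int getPositionIncrementGap(String fieldName) {
        return 100;
    }
};

// Indexing: one Lucene Document per parent doc, one field value per sub-document.
Document doc = new Document();
for (int i = 0; i < subDocTexts.length; i++) {
    doc.add(new Field("subdocterms", subDocTexts[i], Field.Store.NO, Field.Index.ANALYZED));
}
writer.addDocument(doc);                     // writer constructed with gapAnalyzer

// Searching: slop well below 100 keeps both terms inside one sub-document block.
SpanQuery bothInOneSubdoc = new SpanNearQuery(
        new SpanQuery[] {
            new SpanTermQuery(new Term("subdocterms", "term1a")),
            new SpanTermQuery(new Term("subdocterms", "term2a")) },
        50, false);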


Cheers,

Paul

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya

 Paul Cowan  wrote: 
> oh...@cox.net wrote:
> > - I'd have to create a (very small) index, for each sub-document, where I 
> > do the Document.add() with just the (for example) two terms, then
> > - Run a query against the 1-entry index, which
> > - Would either give me a "yes" or "no" (for that sub-document)
> > 
> > As I said, I'm concerned about overhead.  Some of the documents are quite 
> > large, containing >20K sub-documents.  That means that, for such a 
> > document, I'd have to create >20K indexes.
> 
> No, I'm talking about a separate document in the same index.
> 
> There are a few approaches here:
> 
> 1) Index each sub-document separately. So if you have fields 'doc#', 
> 'docname', 'subdoc#', and 'subdocterms', you might do:
> 
> for (Doc parent : Docs) {
>   for (SubDoc child : parent.subDocs()) {
> Document luceneDoc = new Document();
> doc.add(new Field("doc#", parent.number()));
> doc.add(new Field("docname", parent.name()));
> doc.add(new Field("subdoc#", child.number()));
> doc.add(new Field("subdocterms", child.data()));
>   }
> }
> 
> This means that in your index after indexing 2 docs with 2 subdocs each, 
> you'll have
> (Lucene #)   doc#   docname   subdoc#   subdocterms
> 
> 0100Foo   101   subdoc1 terms here
> 1100Foo   102   subdoc2 terms
> 2200Bar   201   subdoc1 terms from doc2
> 3200Bar   202   some more subdoc text
> 
> So the search you're doing is actually on the subdoc level. This can get 
> complicated, especially as subdocs from the same parent doc may come 
> back out of order, etc, depending on scoring/sorting.
> 
> Also, if there is a lot of data at the parent level, you're obviously 
> duplicating it. This can get nasty.
> 
> 2) Maintain a (logically) separate subdoc index. You could have 
> something like:
> doc#   docname  bigblobofdocdata
> -
> 100Foo  lots of data here...
> 200Bar  and lots more here..
> in one index, and
> doc#   subdoc#  subdocterms
> -
> 100101   subdoc1 terms here
> 100102   subdoc2 terms
> 200201   subdoc1 terms from doc2
> 200202   some more subdoc text
> 
> Then you can FIRST search on the doc index to do any matches on 
> 'docname' etc, then use the IDs you find to filter the subdoc index -- 
> so if the user searches for 'docname=foo' and 'subdocterms=text', you 
> first do the docname search to get the docname-matching doc (100), then 
> do a search on the second index for 'subdocterms', but also filter where 
> doc#=100.
> 
> Note they don't HAVE to be separate indexes -- you can actually keep 
> these in the same physical index, with some sort of discriminator (all 
> docs in an index don't have to have the same fields).
> 
> 3) Do some really hardcore tricks with spanqueries. This is what I'm 
> working on at the moment, so it's near and dear to my heart. It's not 
> for the faint-hearted, though, and if you're new to Lucene may scare you 
> off, sorry! Basically Lucene has the concept of 'positions' for terms -- 
> metadata about where in the document the term can be found. This lets 
> you do 'near' queries, etc.
> 
> We're taking advantage of that to do some many-to-one stuff like your 
> problem. Using the first example, with term positions indicated in [], 
> we position terms from different subdocs with a large gap between them, 
> like so:
> 
> (Lucene #)   doc#   docname   subdoc#   subdocterms
> 
> 0100Foo   101[0]subdoc1[0] terms[1] here[2]
>   102[100]  subdoc2[100] terms[101]
> 
> 1200Bar   201[0]subdoc1[0] terms[1] from[2]
>   202[100]  doc2[3] some[100] more[101]
> subdoc[102] text[103]
> 
> So in each doc, subdoc #1's terms start at 0, #2's at 100, #3s at 200, 
> etc. Then when we search we can say 'the terms you're looking for must 
> be in the same 100-position block' to find only subdocs that match all 
> subdoc-related subqueries. This is pretty hairy but is working well for 
> us -- massively reduces our indexing and search times compared to the 
> duplicated document way I mentioned above.
> 
> Cheers,
> 
> Paul


Paul,

Oh boy, you've given me a LOT to chew on :)!!

At first read, I like your #1 approach, maybe because it's easiest for me to 
understand.  I have to think about it, but my first thought is that we might 
not need/want the sub-doc index to persist after they're used (maybe!), so 
create the sub-doc index "on-the-fly" for each Document, maybe using that 
example I linked as the template, do the query

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-20 Thread Chris Hostetter

: But in that case, I assume Solr does a commit per document added.

not at all ... it computes a signature and then uses that as a unique key.  
IndexWriter.updateDocument does all the hard work.


-Hoss
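
A minimal sketch of that pattern (the "signature" field name and the computeSignature() helper are hypothetical; "writer" is an open IndexWriter):

// The signature acts as the unique key; updateDocument atomically deletes any
// previously indexed document with the same signature and adds the new one.
String sig = computeSignature(rawContent);   // e.g. an MD5 hash of the content

Document doc = new Document();
doc.add(new Field("signature", sig, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", rawContent, Field.Store.NO, Field.Index.ANALYZED));

writer.updateDocument(new Term("signature", sig), doc);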


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-20 Thread Yonik Seeley
On Fri, Aug 21, 2009 at 12:49 AM, Chris
Hostetter wrote:
>
> : But in that case, I assume Solr does a commit per document added.
>
> not at all ... it computes a signature and then uses that as a unique key.
> IndexWriter.updateDocument does all the hard work.

Right - Solr used to do that hard work, but we handed that over to
Lucene when that capability was added.  It involves batching either
way (but letting Lucene handle it at a lower level is "better" since
it can prevent inconsistencies from crashes).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org