ParallelMultiSearcher

2006-09-21 Thread Yura Smolsky
Hello, java-user.

Does anyone here use ParallelMultiSearcher for searching large amounts
of data? I have some questions about PrefixQuery searches.

Thanks in advance.

--
Yura Smolsky,
http://altervisionmedia.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ParallelMultiSearcher

2006-09-21 Thread Ronnie Kolehmainen
Don't ask to ask, just ask! ;)



Citerar Yura Smolsky <[EMAIL PROTECTED]>:

> Hello, java-user.
> 
> Does anyone here uses ParallelMultiSearcher for searching big arrays
> of data? I have some questions about PrefixQuery search..
> 
> Thanks in advance.
> 
> --
> Yura Smolsky,
> http://altervisionmedia.com/





Re[2]: ParallelMultiSearcher

2006-09-21 Thread Yura Smolsky
Hello, Ronnie.

RK> Dont ask to ask, just ask! ;)

OK. I have a big issue when I try to search with ParallelMultiSearcher
using a PrefixQuery. The query is rewritten to a BooleanQuery during the
search, which causes Similarity to calculate docFreq for each Term in the
BooleanQuery. So if a PrefixQuery expands to many terms, we end up with
many calls to the docFreq method of the Searchable object passed
to ParallelMultiSearcher. In my case this Searchable object lives on
another computer, so the search becomes very slow because of all those
docFreq calls over the network.

I am not sure whether this question belongs on the users mailing list,
but I have spent about three days trying to fix this problem and I do
not see any solution.

Maybe the Lucene developers could suggest something...

Thanks, and sorry for my bad English.

--
Yura Smolsky,
http://altervisionmedia.com/






Re: Re[2]: ParallelMultiSearcher

2006-09-21 Thread Yonik Seeley

On 9/21/06, Yura Smolsky <[EMAIL PROTECTED]> wrote:


OK. I have a big issue when I try to search with ParallelMultiSearcher
using a PrefixQuery. The query is rewritten to a BooleanQuery during the
search, which causes Similarity to calculate docFreq for each Term in the
BooleanQuery. So if a PrefixQuery expands to many terms, we end up with
many calls to the docFreq method of the Searchable object passed
to ParallelMultiSearcher.


IDF often does not make sense for auto-expanding queries (range, prefix, etc).
If you don't need the idf factor that makes rarer terms count more,
then use a PrefixFilter wrapped in a ConstantScoreQuery.

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/ConstantScoreQuery.html
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/PrefixFilter.html
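As a sketch of the suggestion above (against the Lucene/Solr 2.0-era APIs; the field name "contents" and the prefix "luc" are placeholders, not from the original thread):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.solr.search.PrefixFilter;

// Matches the same documents as a PrefixQuery on contents:luc*, but gives
// every hit the same score, so no per-term docFreq calls are needed and
// nothing is rewritten into a large BooleanQuery.
Query q = new ConstantScoreQuery(new PrefixFilter(new Term("contents", "luc")));
```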

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




RE: Analysis/tokenization of compound words

2006-09-21 Thread Binkley, Peter
Aspell has some support for compound words that might be useful to look
at:

http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]




 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 19, 2006 10:22 AM
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g.
German)?  I took a look at GermanAnalyzer hoping to see how one can deal
with that, but it turns out GermanAnalyzer doesn't treat compounds in
any special way at all.

One way to go about this is to have a word dictionary and a tokenizer
that processes input one character at a time, looking for a word match
in the dictionary after each processed character.  Then,
CompoundWordLikeThis could be broken down into multiple tokens/words and
returned as a set of tokens at the same position.  However, somehow this
doesn't strike me as a very smart and fast approach.
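As a rough illustration of that dictionary-driven idea (plain Java, not Lucene code; the greedy longest-match strategy and the class name are my own assumptions for the sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// A minimal dictionary-based compound splitter: scan left to right and,
// at each position, take the longest dictionary word that starts there.
public class CompoundSplitter {
    private final Set<String> dictionary; // known simple words, lowercase

    public CompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // e.g. "Wortliste" with a dictionary {"wort", "liste"} -> [wort, liste]
    public List<String> split(String compound) {
        List<String> parts = new ArrayList<String>();
        String s = compound.toLowerCase(Locale.ROOT);
        int pos = 0;
        while (pos < s.length()) {
            int matchEnd = -1;
            // Prefer the longest dictionary word starting at pos.
            for (int end = s.length(); end > pos; end--) {
                if (dictionary.contains(s.substring(pos, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd == -1) {
                pos++; // no dictionary word starts here; skip one character
            } else {
                parts.add(s.substring(pos, matchEnd));
                pos = matchEnd;
            }
        }
        return parts;
    }
}
```

In an analyzer, each returned part would be emitted as a token with position increment 0 so the sub-words stack at the compound's position.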
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd
love to hear about it.

Thanks,
Otis






writer.minMergeDocs in lucene 2.0

2006-09-21 Thread Ismail Siddiqui

Hi all,
I am trying to index a database, and indexing is taking quite a long
time. To tune it, I am trying to increase minMergeDocs. In Lucene 1.4.3
there is a field called writer.minMergeDocs. I found the
writer.setMaxMergeDocs() method in Lucene 2.0, but no method called
writer.setMinMergeDocs().


Can anyone help? How can I increase the minimum document merge size?



Thanks all


Ismail Siddiqui


Re: writer.minMergeDocs in lucene 2.0

2006-09-21 Thread Yonik Seeley

On 9/21/06, Ismail Siddiqui <[EMAIL PROTECTED]> wrote:

In Lucene 1.4.3 there is a field called writer.minMergeDocs. I found the
writer.setMaxMergeDocs() method in Lucene 2.0, but no method called
writer.setMinMergeDocs().


Try setMaxBufferedDocs()
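For example (a sketch against the Lucene 2.0 API; indexDir and analyzer are assumed to be in scope, and 1000 is only an illustrative value):

```java
import org.apache.lucene.index.IndexWriter;

// setMaxBufferedDocs() is the 2.0 counterpart of the old minMergeDocs
// field: it controls how many documents are buffered in memory before
// a segment is flushed to the directory.
IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
writer.setMaxBufferedDocs(1000); // default is 10; larger values trade RAM for indexing speed
```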


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




analyzer to populate more than one field of Lucene document

2006-09-21 Thread Boris Galitsky

I need to create fields for Lucene documents populated:
1) by numbers
2) by other strings
3) by values of another specific format

What kind of Analyzer would do it?

With the customized analyzer, the current code is like:

IndexWriter indexWriter = new IndexWriter(indexDir, analyzer, true);
Document doc = new Document();
doc.add(new Field("numeric_contents", new FileReader(f))); // numeric tokens
doc.add(new Field("other_contents", new FileReader(f)));   // the same file, but the non-numeric tokens


Thanks
--
Boris Galitsky.




Re: analyzer to populate more than one field of Lucene document

2006-09-21 Thread Erick Erickson

I think you want a PerFieldAnalyzerWrapper. It allows you to make a
different analyzer for each field in your document. You'll have to write the
code to extract the file contents in your desired formats for each field,
but you probably do that already ...

You can instantiate your IndexWriter with an instance of a
PerFieldAnalyzerWrapper and it all "just happens" after that.



From the javadoc for PerFieldAnalyzerWrapper...

<<< This analyzer is used to facilitate scenarios where different fields
require different analysis techniques.>>>
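A minimal sketch (Lucene 2.0-era API; the analyzer choices are placeholders, and the field names are taken from the question):

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Fields not registered below fall back to the default analyzer.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("numeric_contents", new WhitespaceAnalyzer());
wrapper.addAnalyzer("other_contents", new SimpleAnalyzer());

// The wrapper picks the right analyzer per field at index time.
IndexWriter writer = new IndexWriter(indexDir, wrapper, true);
```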

Best
Erick

On 9/21/06, Boris Galitsky <[EMAIL PROTECTED]> wrote:


I need to create two fields for Lucene documents populated
1) by numbers
2) by other strings
3) by values of another specific format

What kind of Analyzer would do it?

Using the customized analyzer, the current code is like

IndexWriter indexWriter = new IndexWriter(indexDir, analyzer, true);
Document doc = new Document();
doc.add(new Field("numeric_contents", new FileReader(f))); // numeric tokens
doc.add(new Field("other_contents", new FileReader(f)));   // the same file, but the non-numeric tokens

Thanks
--
Boris Galitsky.





Re: analyzer to populate more than one field of Lucene document

2006-09-21 Thread Boris Galitsky

Thanks a lot Erick
Boris

* Erick Erickson <[EMAIL PROTECTED]> [Thu, 21 Sep 2006 20:53:42 -0400]:

I think you want a PerFieldAnalyzerWrapper. It allows you to make a
different analyzer for each field in your document. You'll have to write
the code to extract the file contents in your desired formats for each
field, but you probably do that already ...

You can instantiate your IndexWriter with an instance of a
PerFieldAnalyzerWrapper and it all "just happens" after that.

--
Boris Galitsky.
