weird error with SVN of Lucene

2006-06-27 Thread Yura Smolsky
Hello.

I have encountered a weird error when checking out Lucene while trying
to build PyLucene:

[EMAIL PROTECTED] /cygdrive/d/workshop/PyLucene
$ make
svn co -r 417135 http://svn.apache.org/repos/asf/lucene/java/trunk lucene-java-2.0.0-417135
svn: REPORT request failed on '/repos/asf/!svn/bc/417505/lucene/java/trunk'
svn: REPORT of '/repos/asf/!svn/bc/417505/lucene/java/trunk': 400 Bad Request (http://svn.apache.org)
make: *** [lucene-java-2.0.0-417135] Error 1

I can check out PyLucene, but not Lucene. Can you?

Thanks for answering such a foolish question. :)

--
Yura Smolsky,
http://altervisionmedia.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[2]: weird error with SVN of Lucene

2006-06-27 Thread Yura Smolsky
Hello, Chris.

CH> : svn co -r 417135 http://svn.apache.org/repos/asf/lucene/java/trunk lucene-java-2.0.0-417135
CH> : svn: REPORT request failed on '/repos/asf/!svn/bc/417505/lucene/java/trunk'
CH> : svn: REPORT of '/repos/asf/!svn/bc/417505/lucene/java/trunk': 400 Bad Request (http://svn.apache.org)
CH> : make: *** [lucene-java-2.0.0-417135] Error 1
CH> :
CH> : I can CO PyLucene, but I can't Lucene. Can you?

CH> yep, perhaps just a temporary glitch on the server? .. can you reproduce?

I have been trying to check it out or update it for 2 hours... Can you perform updates or checkouts?

--
Yura Smolsky,
http://altervisionmedia.com/






Re[3]: weird error with SVN of Lucene

2006-06-27 Thread Yura Smolsky
Hello, Chris.

I have tried checking it out over HTTPS, and it works. Very weird... I
don't use any proxies or firewalls, and checkouts from other sites work
fine. Can anyone explain this weird behavior of TortoiseSVN 1.3.3,
Build 6219? :)

CH> : i am trying to CO or update it for 2 hours... can you perform updates or COs?

CH> both.  I tried the exact revision checkout you specified in your original
CH> email...

CH> svn co -r 417135 http://svn.apache.org/repos/asf/lucene/java/trunk lucene-java-2.0.0-417135

CH> ...and had no problems.

CH> perhaps there is something odd about your SVN client?


CH> -Hoss







--
Yura Smolsky,
http://altervisionmedia.com/






ParallelMultiSearcher and docFreq

2006-09-14 Thread Yura Smolsky
Hello.

Here is the situation. I have a ParallelMultiSearcher object
initialized with two or more RemoteSearchables.

I run a PrefixQuery search on a keyword field, say "link". When I
search for just the letter "w" (link:w*), I should get about 5k
results.

As far as I know, when I search with ParallelMultiSearcher the query is
rewritten first. So my prefix query is rewritten into
"link:wordlist.com link:web.com" and so on, about 2-3k terms. Then, as I
understand from debugging, ParallelMultiSearcher performs a docFreq
request to the RemoteSearchables for each such term (2-3k calls). So we
have many calls to the docFreq method, and these operations take about
95% of the total search time.
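
A minimal sketch of the rewrite step described above, written in plain Python rather than against the Lucene API. The term dictionary and the helper name expand_prefix are hypothetical; the point is only that one prefix turns into one clause per matching term, and each clause later needs its own document frequency.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Return every term in the sorted list that starts with the prefix."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


# Hypothetical term dictionary for the "link" field.
link_terms = sorted(
    ["apache.org", "web.com", "wiki.org", "wordlist.com", "xkcd.com"]
)

# link:w* becomes one clause per matching term.
clauses = ["link:" + term for term in expand_prefix(link_terms, "w")]
print(clauses)
```

With 2-3k matching terms and one docFreq request per clause per remote index, the number of network round trips grows as terms times remotes.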

I see that RemoteSearchable has a docFreqs method, but it is not
being used.

Is there any way to get rid of those multiple calls to docFreq?

Maybe I am wrong; if so, please tell me what's wrong.

Thanks.

--
Yura Smolsky,
http://altervisionmedia.com/






Re: ParallelMultiSearcher and docFreq

2006-09-15 Thread Yura Smolsky
Hello, Yura.

Does anyone understand my email? Maybe my English is too bad...

Thanks.

YS> Here is the situation. I have a ParallelMultiSearcher object
YS> initialized with two or more RemoteSearchables.

YS> I run a PrefixQuery search on a keyword field, say "link". When I
YS> search for just the letter "w" (link:w*), I should get about 5k
YS> results.

YS> As far as I know, when I search with ParallelMultiSearcher the query is
YS> rewritten first. So my prefix query is rewritten into
YS> "link:wordlist.com link:web.com" and so on, about 2-3k terms. Then, as I
YS> understand from debugging, ParallelMultiSearcher performs a docFreq
YS> request to the RemoteSearchables for each such term (2-3k calls). So we
YS> have many calls to the docFreq method, and these operations take about
YS> 95% of the total search time.

YS> I see that RemoteSearchable has a docFreqs method, but it is not
YS> being used.

YS> Is there any way to get rid of those multiple calls to docFreq?



--
Yura Smolsky,
http://altervisionmedia.com/






Re[2]: ParallelMultiSearcher and docFreq

2006-09-15 Thread Yura Smolsky
Hello, Ronald.

What I have found is that nothing except createWeight uses the
docFreqs(Term[]) method...
Maybe I need to parallelize it... But there is something I don't understand.

When is MultiSearcher.createWeight() called? Only this method uses
docFreqs, and it builds a HashMap of the terms' docFreqs. Is this
method used for rewriting the query inside
ParallelMultiSearcher?

Also, since this method calls docFreqs on the RemoteSearchables, I
should be receiving docFreqs(Term[]) calls on the RemoteSearchable
objects, but I do not. Can somebody explain this?

And where are those multiple calls to the docFreq method coming
from?

Thanks.

HRCLD> I understand...because I've experienced it.  I think the answer is to
HRCLD> 'parallelize' the docFreq process...and or try to make use of the
HRCLD> docFreq(Terms[]).  By passing an Array of Terms, you can avoid the 'call
HRCLD> per Term' per remote and just make a single docFreq call per remote.

HRCLD> You might have to extend the ParallelMultiSearcher and create a threaded
HRCLD> docFreq method.
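
The suggestion quoted above can be sketched as follows. This is plain Python and not the Lucene API: FakeRemote, doc_freq, doc_freqs and the two helper functions are hypothetical stand-ins that only count round trips. It contrasts one call per term per remote with one batched call per remote, issued from a thread pool so the remotes are queried in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeRemote:
    """Hypothetical stand-in for a remote index; counts round trips."""

    def __init__(self, freqs):
        self.freqs = freqs
        self.round_trips = 0

    def doc_freq(self, term):
        # One network round trip per term.
        self.round_trips += 1
        return self.freqs.get(term, 0)

    def doc_freqs(self, terms):
        # One network round trip for the whole array of terms.
        self.round_trips += 1
        return [self.freqs.get(term, 0) for term in terms]


def doc_freqs_per_term(remotes, terms):
    """Naive approach: terms x remotes round trips."""
    return {t: sum(r.doc_freq(t) for r in remotes) for t in terms}


def doc_freqs_batched(remotes, terms):
    """Batched approach: one round trip per remote, issued in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(remotes))) as pool:
        per_remote = list(pool.map(lambda r: r.doc_freqs(terms), remotes))
    return {t: sum(row[i] for row in per_remote) for i, t in enumerate(terms)}
```

Both functions return the same totals; only the number of round trips differs (terms times remotes versus one per remote).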


--
Yura Smolsky,
http://altervisionmedia.com/






ParallelMultiSearcher

2006-09-21 Thread Yura Smolsky
Hello, java-user.

Does anyone here use ParallelMultiSearcher for searching large
data sets? I have some questions about PrefixQuery search.

Thanks in advance.

--
Yura Smolsky,
http://altervisionmedia.com/






Re[2]: ParallelMultiSearcher

2006-09-21 Thread Yura Smolsky
Hello, Ronnie.

RK> Dont ask to ask, just ask! ;)

OK. I have a big issue when I run a PrefixQuery against a
ParallelMultiSearcher. The query is rewritten to a BooleanQuery during
search. This causes Similarity to calculate docFreq for each Term in the
BooleanQuery. So if a PrefixQuery has a lot of results, we get a lot of
calls to the docFreq method of each Searchable object passed to
ParallelMultiSearcher. In my case the Searchable object lives on
another computer (over the network). Search becomes very slow because
of those many docFreq calls over the network.

I am not sure whether this question belongs on the users mailing list,
but I have spent about 3 days trying to fix this problem and I do not
see any solution.

Maybe the Lucene developers could suggest something...

Thanks, and sorry for my bad English.

--
Yura Smolsky,
http://altervisionmedia.com/






Re[4]: ParallelMultiSearcher

2006-09-22 Thread Yura Smolsky
Hello, Yonik.

>> ok. I have big issue when I try to search ParallelMultiSearcher for
>> PrefixQuery. This query is being rewritten to BooleanQuery during
>> search. This causes Similarity to calculate docFreq for each Term in the
>> BooleanQuery. So if we have a lot of results for some PrefixQuery then
>> we have a lot of calls to docFreq method of Searchable object passed
>> to ParallelMultiSearcher.

YS> IDF often does not make sense for auto-expanding queries (range, prefix, etc).
YS> If you don't need the idf factor that makes rarer terms count more,
YS> then use a PrefixFilter wrapped in a ConstantScoreQuery.

YS> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/ConstantScoreQuery.html
YS> http://incubator.apache.org/solr/docs/api/org/apache/solr/search/PrefixFilter.html

Thank you so much! PrefixFilter has eliminated the idf calculations. I
wonder why it is not part of Lucene yet :)
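
A rough sketch of why the filter approach avoids the docFreq calls, in plain Python rather than the Lucene or Solr classes named above. The postings dictionary and both function names are hypothetical. The filter only collects the set of matching documents, and every match gets the same score, so no per-term statistics are ever requested.

```python
def prefix_filter(postings, prefix):
    """Collect ids of documents holding any term that starts with the prefix."""
    matching = set()
    for term, doc_ids in postings.items():
        if term.startswith(prefix):
            matching.update(doc_ids)
    return matching


def constant_score_search(postings, prefix, score=1.0):
    """Give every matching document the same score; no idf is needed."""
    return {doc_id: score for doc_id in sorted(prefix_filter(postings, prefix))}


# Hypothetical postings for the "link" field: term -> document ids.
postings = {"web.com": [1, 4], "wordlist.com": [2], "apache.org": [3]}
print(constant_score_search(postings, "w"))
```

The trade-off is the one stated in the quoted advice: rarer terms no longer count more than common ones.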

--
Yura Smolsky,
http://altervisionmedia.com/






how to enhance speed of sorted search

2006-09-25 Thread Yura Smolsky
Hello, java-user.

I have a set of documents with two fields:
1. "summary", which is tokenized and stored. It contains some text.
2. "date", which is untokenized and stored. It contains seconds since
the epoch, right-aligned and padded with zeroes on the left.

I run searches sorted by the "date" field.
My query is a phrase query: summary:"some text bla bla"
My sort is SortField("modified", SortField.INT, True)
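
For readers unfamiliar with the padding trick mentioned above, here is a small sketch in plain Python. The function name and the width of 10 digits are assumptions, not taken from the original index. Left-padding with zeroes makes the string order of the values agree with their numeric order.

```python
def encode_seconds(seconds, width=10):
    """Right-align a non-negative epoch value and pad it with zeroes."""
    text = str(seconds)
    if seconds < 0 or len(text) > width:
        raise ValueError("value does not fit the fixed width")
    return text.zfill(width)


stamps = [1159171140, 999999999, 5]

# Unpadded strings sort lexicographically, which is not numeric order.
unpadded = sorted(stamps, key=str)

# Padded strings sort in the same order as the numbers themselves.
padded = sorted(stamps, key=encode_seconds)
print(unpadded, padded)
```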

I don't need the documents ranked by relevance on the "summary" field. I
think that by default Lucene still spends some work scoring documents
against the query.

Is there any way to speed up the queries? Maybe I should replace the
query with a filter?

The problem is that I have a very big index, about 80 GB, and a lot of
different queries (about 480 per minute).

--
Yura Smolsky





Re[2]: how to enhance speed of sorted search

2006-09-25 Thread Yura Smolsky
Hello, Aviran.

But searcher.explain shows that idfs and scores are still computed.
Correct me if I am wrong.

MAENN> AFAIK when you sort Lucene does not calculate the relevance score.

MAENN> Aviran
MAENN> http://www.aviransplace.com 

MAENN> -----Original Message-----
MAENN> From: Yura Smolsky [mailto:[EMAIL PROTECTED] 
MAENN> Sent: Monday, September 25, 2006 4:39 AM
MAENN> To: java-user@lucene.apache.org
MAENN> Subject: how to enhance speed of sorted search

MAENN> Hello, java-user.

MAENN> I have a set of documents with two fields:
MAENN> 1. "summary" which is tokenized, stored. it contains some text
MAENN> 2. "date", which is untokenized, stored. it contains seconds from epoch
MAENN> aligned to the right and padded with zeroes on the left

MAENN> I perform searches which are sorted by "date" field.
MAENN> my query is phrase one - summary:"some text bla bla"
MAENN> my sort is SortField("modified", SortField.INT, True)

MAENN> I don't need to sort documents using "summary" field by relevance. I
MAENN> think that by default Lucene spends some operations to score documents
MAENN> using query.

MAENN> Is there any way to speed up queries? Maybe should I replace query with
MAENN> filter?

MAENN> The problem that I have very big index, like 80 Gb and I have a lot of
MAENN> different queries (about 480 per minute).

MAENN> --
MAENN> Yura Smolsky







--
Yura Smolsky,
http://altervisionmedia.com/






Re[3]: how to enhance speed of sorted search

2006-09-26 Thread Yura Smolsky
Hello, Chris.

CH> 3) most likely, if you are seeing "slow" performance from sorted searches,
CH> the time spent "scoring" the results isn't the biggest contributor to how
CH> long the search takes -- it tends to be negligible for most queries.  A
CH> better question is: are you reusing the exact same IndexReader /
CH> IndexSearcher instance for every query? ... if not, that right there is
CH> going to be your biggest problem, because it will prevent you from being
CH> able to reuse the "FieldCache" needed when sorting results.

Sure, I do reuse the IndexSearcher :) and the second query is always
faster than the first one...
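
The effect described in the quoted reply can be illustrated with a small sketch. This is plain Python, not Lucene's FieldCache: FakeReader, load_field and cached_field are hypothetical. The sort values are loaded once per reader and reused afterwards, so a fresh reader has to pay the loading cost again.

```python
import weakref


class FakeReader:
    """Hypothetical stand-in for an index reader; counts expensive loads."""

    def __init__(self, dates):
        self.dates = dates
        self.loads = 0

    def load_field(self, field):
        # Pretend this walks the whole index to read one value per document.
        self.loads += 1
        return list(self.dates)


# Cache keyed by reader, so values vanish when the reader is discarded.
_cache = weakref.WeakKeyDictionary()


def cached_field(reader, field):
    """Return the per-document values, loading them at most once per reader."""
    per_reader = _cache.setdefault(reader, {})
    if field not in per_reader:
        per_reader[field] = reader.load_field(field)
    return per_reader[field]
```

This matches the observation that the second sorted query on the same searcher is faster than the first.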

I wonder whether this would be faster:

# searcher is the reused IndexSearcher instance
query = QueryParser("text", StandardAnalyzer()).parse("good boy")
searcher.search(
  ConstantScoreQuery(QueryFilter(query)),
  sortByIntField)

than the usual search:

searcher.search(
  query,
  sortByIntField)

Is there any way I could use a filter to get the results I need from
the query?

--
Yura Smolsky,
http://altervisionmedia.com/






Re[4]: how to enhance speed of sorted search

2006-09-28 Thread Yura Smolsky
Hello, Chris.

CH> : I am thinking should be this faster
CH> The ConstantScoreQuery wrapped around the QueryFilter might in fact be
CH> faster than the raw query -- have you tried it to see?

I will try it when PrefixFilter is added to Lucene, because I have some
technical issues and I use PyLucene. But on my local 400 MB index
it is not slower than usual :)

CH> you might be able to shave a little bit of speed off by accessing the bits
CH> from the Filter directly and iterating over them yourself to check the
CH> FieldCache and build up your sorted list of the first "N" -- i think that
CH> would save you one method call per match (the score method of
CH> ConstantScoreQuery)

What is the "sorted list of the first "N""?
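
One possible reading of the "sorted list of the first N", sketched in plain Python under stated assumptions: the matching document ids stand in for the set bits of the filter, field_values stands in for the cached per-document sort values, and top_n_by_field is a hypothetical helper. Only the N best documents are kept while iterating, instead of sorting every match.

```python
import heapq


def top_n_by_field(matching_doc_ids, field_values, n, descending=True):
    """Keep only the n best documents according to the cached field value."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(n, matching_doc_ids, key=lambda doc_id: field_values[doc_id])


# Hypothetical cached "date" value for each document id 0..4.
field_values = [50, 10, 40, 30, 20]

# Hypothetical ids of the documents that passed the filter.
matching = [0, 1, 3, 4]

print(top_n_by_field(matching, field_values, 2))
```

With 7k to 100k matches and only the first page of results needed, keeping a bounded heap avoids a full sort of the matches.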

CH> At some point you just have to wonder if it's fast enough?

:)

CH> how long does a typically sorted query take for you right now?

It takes from 0.2 to 0.4. This is only the search itself, without
iterating over hits and fetching docs.

CH> how many documents are in your index?

about 30 million

CH> how many matches do you typically have?

it varies greatly. from 7k to 100k


--
Yura Smolsky,
http://altervisionmedia.com/






highlighter and phrase search

2005-03-10 Thread Yura Smolsky
Hello, java-user.

I have two documents:
1. content:A V A B
2. content:A B C D
When I search for content:"A B" (exact phrase search) with
StandardAnalyzer() and then use Highlighter, I receive the following
highlighted results:

1. _A_ V _A B_
2. _A B_ C D

Actually, the first "A" in the first result should not be highlighted,
because the requested phrase was "A B".

Is this a bug, or do I misunderstand something?
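
The difference between the two behaviors can be sketched in plain Python. This is not the Highlighter implementation; term_hits, phrase_hits and render are hypothetical helpers. Marking every occurrence of a query term reproduces the output above, while marking only complete occurrences of the phrase gives the expected one.

```python
def term_hits(tokens, phrase):
    """Flag every token that is one of the query terms."""
    wanted = set(phrase)
    return [token in wanted for token in tokens]


def phrase_hits(tokens, phrase):
    """Flag only tokens that belong to a full, in-order phrase occurrence."""
    flags = [False] * len(tokens)
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            for j in range(i, i + n):
                flags[j] = True
    return flags


def render(tokens, flags):
    """Wrap each run of flagged tokens in underscores."""
    out, i = [], 0
    while i < len(tokens):
        if flags[i]:
            j = i
            while j < len(tokens) and flags[j]:
                j += 1
            out.append("_" + " ".join(tokens[i:j]) + "_")
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "A V A B".split()
phrase = ["A", "B"]
print(render(tokens, term_hits(tokens, phrase)))
print(render(tokens, phrase_hits(tokens, phrase)))
```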

Yura Smolsky.





exact match

2005-04-04 Thread Yura Smolsky
Hello, java-user.

I have documents with a tokenized, indexed and stored field. This field
usually contains one or two words. I need to be able to search for
exact matches of the whole field.
For example, a search for "John" should return only documents whose
field contains just "John", not "John Doe" or "John Foo".

Any ideas?

Yura Smolsky.





Re[2]: exact match

2005-04-05 Thread Yura Smolsky
Hello, Erik.


EH> On Apr 4, 2005, at 4:34 PM, Yura Smolsky wrote:
>> Hello, java-user.
>>
>> I have documents with tokenized, indexes and stored field. This field
>> contain one-two words usually. I need to be able to search exact
>> matches for two words.
>> For example search "John" should return documents with field
>> containing "John" only, not "John Doe" or "John Foo".
>>
>> Any ideas?
EH> Use an untokenized field to search on in the case of finding an exact
EH> match.

Is there no other way to achieve this?
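
To illustrate the advice quoted above, here is a sketch in plain Python (not the Lucene API; build_index is a hypothetical helper). A tokenized field is indexed word by word, so "John" matches every value containing that word. An untokenized field is indexed as one whole value, so only the exact value matches.

```python
from collections import defaultdict


def build_index(values, tokenized):
    """Map each index key to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        keys = value.split() if tokenized else [value]
        for key in keys:
            index[key].add(doc_id)
    return index


authors = ["John", "John Doe", "John Foo"]

tokenized_index = build_index(authors, tokenized=True)
untokenized_index = build_index(authors, tokenized=False)

print(sorted(tokenized_index.get("John", set())))
print(sorted(untokenized_index.get("John", set())))
```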

Yura Smolsky.






Re[4]: exact match

2005-04-06 Thread Yura Smolsky
Hello, Erik.

I have a very large index (200 GB) with a large number of documents.
I have an "author" field that stores a name; this field is tokenized,
indexed and stored.

This field contains values like the following:
"John"
"John Doe"
"Bill"
"Bill Gates"

I do not want to reindex all the documents, and I want to search this
field for exact matches. For example, if I search for "John", it should
return documents whose "author" field contains "John" only and not
"John Doe". Right now it returns every value containing the word John,
and I want only the documents whose field is exactly "John".

The solution might not be simple, but I can manage that. What I cannot
manage is reindexing all the documents. :)
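
One possible workaround, offered only as an assumption and not something suggested in the thread: since the field is stored, the normal tokenized search could be run first and its hits then compared against the stored value. The sketch below is plain Python; stored_author, tokenized_search and exact_matches are hypothetical names.

```python
# Hypothetical stored "author" values, keyed by document id.
stored_author = {0: "John", 1: "John Doe", 2: "Bill", 3: "Bill Gates"}


def tokenized_search(stored, word):
    """Stand-in for the existing search: any value containing the word."""
    return [doc_id for doc_id, value in stored.items() if word in value.split()]


def exact_matches(stored, word):
    """Post-filter the candidate hits by comparing the whole stored value."""
    candidates = tokenized_search(stored, word)
    return [doc_id for doc_id in candidates if stored[doc_id] == word]


print(tokenized_search(stored_author, "John"))
print(exact_matches(stored_author, "John"))
```

The cost is one stored-field lookup per candidate hit, which may be significant on an index of this size when a common name matches many documents.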

Is it clear to you?



EH> On Apr 5, 2005, at 5:44 AM, Yura Smolsky wrote:
>> EH> On Apr 4, 2005, at 4:34 PM, Yura Smolsky wrote:
>>>> Hello, java-user.
>>>>
>>>> I have documents with tokenized, indexes and stored field. This field
>>>> contain one-two words usually. I need to be able to search exact
>>>> matches for two words.
>>>> For example search "John" should return documents with field
>>>> containing "John" only, not "John Doe" or "John Foo".
>>>>
>>>> Any ideas?
>> EH> Use an untokenized field to search on in the case of finding an exact
>> EH> match.
>>
>> And no other ways to reach this?

EH> Not that I know of.  Could you give us a more concrete example of what
EH> you're trying to achieve?

EH> Erik







Yura Smolsky.






Re[2]: Search performance under high load

2005-04-07 Thread Yura Smolsky
Hello, mark.

mh> 2) My app uses long queries, some of which include
mh> very common terms. Using the "MoreLikeThis" query to
mh> drop common terms drastically improved performance. If
mh> your "killer queries" are long ones you could spot
mh> them and service them with a MoreLikeThis or simply
mh> limit the number of allowed terms in the query string.

Can you please explain what the "MoreLikeThis" query in Lucene is?
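
The two ideas in the quoted reply, dropping very common terms and limiting the number of terms in a query, can be sketched as follows. This is plain Python and not the MoreLikeThis implementation; prune_terms, the thresholds and the sample frequencies are all hypothetical.

```python
def prune_terms(terms, doc_freq, num_docs, max_df_ratio=0.1, max_terms=25):
    """Drop terms found in too many documents, then keep the rarest few."""
    unique = list(dict.fromkeys(terms))  # remove repeats, keep order
    limit = max_df_ratio * num_docs
    rare = [t for t in unique if doc_freq.get(t, 0) <= limit]
    rare.sort(key=lambda t: doc_freq.get(t, 0))  # rarest terms first
    return rare[:max_terms]


# Hypothetical document frequencies in an index of 1000 documents.
doc_freq = {"the": 900, "lucene": 40, "search": 300, "docfreq": 5}
query_terms = ["the", "lucene", "search", "docfreq", "lucene"]

print(prune_terms(query_terms, doc_freq, num_docs=1000))
```

Common terms have the longest postings lists, so removing them is where most of the saving would be expected to come from.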

Yura Smolsky.


