date:20070715

Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread kumarlimbu

Hi Everyone,

We are using lucene,nutch and spring framework to create a specialized
search engine. Due to growing traffic we are trying to scale. By doing some
tests we found out that the bottle neck was lucene search. We used some
heavy traffic simulation and logged the time taken by each portion of the
server response and found out that the bulk of the time was spent in
searching from lucene index.

In order to accomodate higher traffic we are planning on splitting our
application in 2 portions:
1. Web application (on 1 machine)
2. Search application (one more than 1 machine)

Each one of the application will reside on a (possibly) separate machines.
We are looking forward to scaling by adding more than 1 machine dedicated to
searching as lucene search seems to be the bottleneck.

Web application will provide the front-end to the user. All static pages,
images and the style information will reside on this machine. It will also
serve dynamic pages but all the searching will take place on the search
application. Web application will send search parameters to the search
application and after searching, it will send back the results to the web
application which will format it and display it to the user.

What we are unable to decide is whether to use RMI (
http://www.soft-amis.com/index.html?return=http://www.soft-amis.com/cluster4spring/index.html
cluster4spring )or simple HTTP POST/GET request with response in XML
format. We did some research and found out that RMI (with clustering
support) might be more suitable for our needs. Unfortunately our team is not
familiar with RMI and so we don't know if there will be any issues with it
during implementation.
Advantage of using simple GET/POST is we have more control which searcher
app to use and when. An important criteria for us is to disable searching
from the searcher app who's index is being updated. This is very important
for us. Is there anyway in RMI to inform the web (client) application that a
particular server is unavailable (due to index being updated).

We would also like to know if anybody has implemented lucene searching in a
similar fashion. Thanks for your help.

Kumar Limbu

--
View this message in context:
http://www.nabble.com/Serving-remote-lucene-client---RMI-vs-HTTP-tf4084167.html#a11608209
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread Ken Krugler


Hi Everyone,

We are using lucene,nutch and spring framework to create a specialized
search engine. Due to growing traffic we are trying to scale. By doing some
tests we found out that the bottle neck was lucene search. We used some
heavy traffic simulation and logged the time taken by each portion of the
server response and found out that the bulk of the time was spent in
searching from lucene index.

In order to accomodate higher traffic we are planning on splitting our
application in 2 portions:
1. Web application (on 1 machine)
2. Search application (one more than 1 machine)


[snip]

Nutch already supports distributed Lucene searchers, using Hadoop RPC.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread qi wu

Hi Kumar,
I am also interested in this question. I have used simple HTTP POST/GET request 
with open search,and it requires about extra 15ms for  transferring search 
results between two machines.And want to know whether RMI can give a better 
performance.

Thanks
-Qi
- Original Message - 
From: "kumarlimbu" <[EMAIL PROTECTED]>
To: 
Sent: Monday, July 16, 2007 10:10 AM
Subject: Serving remote lucene client - RMI vs HTTP


> 
> Hi Everyone,
> 
> We are using lucene,nutch and spring framework to create a specialized
> search engine. Due to growing traffic we are trying to scale. By doing some
> tests we found out that the bottle neck was lucene search. We used some
> heavy traffic simulation and logged the time taken by each portion of the
> server response and found out that the bulk of the time was spent in
> searching from lucene index.
> 
> In order to accomodate higher traffic we are planning on splitting our
> application in 2 portions:
> 1. Web application (on 1 machine)
> 2. Search application (one more than 1 machine)
> 
> Each one of the application will reside on a (possibly) separate machines.
> We are looking forward to scaling by adding more than 1 machine dedicated to
> searching as lucene search seems to be the bottleneck. 
> 
> Web application will provide the front-end to the user. All static pages,
> images and the style information will reside on this machine. It will also
> serve dynamic pages but all the searching will take place on the search
> application. Web application will send search parameters to the search
> application and after searching, it will send back the results to the web
> application which will format it and display it to the user.
> 
> What we are unable to decide is whether to use RMI (
> http://www.soft-amis.com/index.html?return=http://www.soft-amis.com/cluster4spring/index.html
> cluster4spring  )or simple HTTP POST/GET request with response in XML
> format. We did some research and found out that RMI (with clustering
> support) might be more suitable for our needs. Unfortunately our team is not
> familiar with RMI and so we don't know if there will be any issues with it
> during implementation. 
> Advantage of using simple GET/POST is we have more control which searcher
> app to use and when. An important criteria for us is to disable searching
> from the searcher app who's index is being updated. This is very important
> for us. Is there anyway in RMI to inform the web (client) application that a
> particular server is unavailable (due to index being updated).
> 
> We would also like to know if anybody has implemented lucene searching in a
> similar fashion. Thanks for your help.
> 
> Kumar Limbu
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Serving-remote-lucene-client---RMI-vs-HTTP-tf4084167.html#a11608209
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread Grant Ingersoll


Hi Kumar,

I am curious about where the bulk of time was spent in Lucene.  I am  
not doubting your numbers or analysis, just want to know if there was  
anything that could be improved in Lucene and make sure you are  
solidly in need of going to multiple machines as it adds an extra  
level of complexity.  Can you provide info about number of documents,  
# of updates, # of queries etc. ?  What kind of analysis, etc. are  
you doing?  That being said, putting the web app on one machine and  
the search application on the other makes sense much in the same way  
it makes sense in most cases to do this for a database.


As for POST/GET vs. RMI, have a look at Solr, esp. its replication  
capabilities for load balancing search servers.  I think it answers  
the question in favor of the POST/GET approach.  In fact, you may be  
able to drop in Solr to your situation w/o too much work.


Cheers,
Grant


On Jul 15, 2007, at 10:10 PM, kumarlimbu wrote:



Hi Everyone,

We are using lucene,nutch and spring framework to create a specialized
search engine. Due to growing traffic we are trying to scale. By  
doing some
tests we found out that the bottle neck was lucene search. We used  
some
heavy traffic simulation and logged the time taken by each portion  
of the

server response and found out that the bulk of the time was spent in
searching from lucene index.

In order to accomodate higher traffic we are planning on splitting our
application in 2 portions:
1. Web application (on 1 machine)
2. Search application (one more than 1 machine)

Each one of the application will reside on a (possibly) separate  
machines.
We are looking forward to scaling by adding more than 1 machine  
dedicated to

searching as lucene search seems to be the bottleneck.

Web application will provide the front-end to the user. All static  
pages,
images and the style information will reside on this machine. It  
will also
serve dynamic pages but all the searching will take place on the  
search

application. Web application will send search parameters to the search
application and after searching, it will send back the results to  
the web

application which will format it and display it to the user.

What we are unable to decide is whether to use RMI (
http://www.soft-amis.com/index.html?return=http://www.soft-amis.com/ 
cluster4spring/index.html

cluster4spring  )or simple HTTP POST/GET request with response in XML
format. We did some research and found out that RMI (with clustering
support) might be more suitable for our needs. Unfortunately our  
team is not
familiar with RMI and so we don't know if there will be any issues  
with it

during implementation.
Advantage of using simple GET/POST is we have more control which  
searcher
app to use and when. An important criteria for us is to disable  
searching
from the searcher app who's index is being updated. This is very  
important
for us. Is there anyway in RMI to inform the web (client)  
application that a

particular server is unavailable (due to index being updated).

We would also like to know if anybody has implemented lucene  
searching in a

similar fashion. Thanks for your help.

Kumar Limbu

--
View this message in context: http://www.nabble.com/Serving-remote- 
lucene-client---RMI-vs-HTTP-tf4084167.html#a11608209

Sent from the Lucene - Java Users mailing list archive at Nabble.com.


--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread kumarlimbu


Hi EVeryone,

Thank you all for your replies.

And reply to your questions Grant:
We have more than 3 Million document in our index.
We get more than 150,000 searches (queries) per day. We expect this no to go
up. This count only includes actual user search and doesn't include robot
searches and other internal searches (for sending emails to user etc).
And our final queries sent to lucene are quite complicated. This is because
we need to confirm to a lot of criteria (some are set by users and some are
internal logics). I don't think we can simplify our queries.

And importantly we are looking to scale to a larger no. of traffic. Our
current performance in terms of speed is ok. 

We are not sure about implementing using Solr because we crawl only specific
type of sites and our crawling mechanism has proven to be quite stable. We
are only looking to scale to larger no. of users and doing a new
implementation using Solr might be an overkill as time is a huge factor.

Best Regards,
Kumar


Grant Ingersoll-6 wrote:
> 
> Hi Kumar,
> 
> I am curious about where the bulk of time was spent in Lucene.  I am  
> not doubting your numbers or analysis, just want to know if there was  
> anything that could be improved in Lucene and make sure you are  
> solidly in need of going to multiple machines as it adds an extra  
> level of complexity.  Can you provide info about number of documents,  
> # of updates, # of queries etc. ?  What kind of analysis, etc. are  
> you doing?  That being said, putting the web app on one machine and  
> the search application on the other makes sense much in the same way  
> it makes sense in most cases to do this for a database.
> 
> As for POST/GET vs. RMI, have a look at Solr, esp. its replication  
> capabilities for load balancing search servers.  I think it answers  
> the question in favor of the POST/GET approach.  In fact, you may be  
> able to drop in Solr to your situation w/o too much work.
> 
> Cheers,
> Grant
> 
> 
> On Jul 15, 2007, at 10:10 PM, kumarlimbu wrote:
> 
>>
>> Hi Everyone,
>>
>> We are using lucene,nutch and spring framework to create a specialized
>> search engine. Due to growing traffic we are trying to scale. By  
>> doing some
>> tests we found out that the bottle neck was lucene search. We used  
>> some
>> heavy traffic simulation and logged the time taken by each portion  
>> of the
>> server response and found out that the bulk of the time was spent in
>> searching from lucene index.
>>
>> In order to accomodate higher traffic we are planning on splitting our
>> application in 2 portions:
>> 1. Web application (on 1 machine)
>> 2. Search application (one more than 1 machine)
>>
>> Each one of the application will reside on a (possibly) separate  
>> machines.
>> We are looking forward to scaling by adding more than 1 machine  
>> dedicated to
>> searching as lucene search seems to be the bottleneck.
>>
>> Web application will provide the front-end to the user. All static  
>> pages,
>> images and the style information will reside on this machine. It  
>> will also
>> serve dynamic pages but all the searching will take place on the  
>> search
>> application. Web application will send search parameters to the search
>> application and after searching, it will send back the results to  
>> the web
>> application which will format it and display it to the user.
>>
>> What we are unable to decide is whether to use RMI (
>> http://www.soft-amis.com/index.html?return=http://www.soft-amis.com/ 
>> cluster4spring/index.html
>> cluster4spring  )or simple HTTP POST/GET request with response in XML
>> format. We did some research and found out that RMI (with clustering
>> support) might be more suitable for our needs. Unfortunately our  
>> team is not
>> familiar with RMI and so we don't know if there will be any issues  
>> with it
>> during implementation.
>> Advantage of using simple GET/POST is we have more control which  
>> searcher
>> app to use and when. An important criteria for us is to disable  
>> searching
>> from the searcher app who's index is being updated. This is very  
>> important
>> for us. Is there anyway in RMI to inform the web (client)  
>> application that a
>> particular server is unavailable (due to index being updated).
>>
>> We would also like to know if anybody has implemented lucene  
>> searching in a
>> similar fashion. Thanks for your help.
>>
>> Kumar Limbu
>>
>> -- 
>> View this message in context: http://www.nabble.com/Serving-remote- 
>> lucene-client---RMI-vs-HTTP-tf4084167.html#a11608209
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> --
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [

Re: search through all fields

2007-07-15 Thread Mohammad Norouzi


On 7/14/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


I think he means index all your different fields into a single field
named "all".  Not sure what makes it special, it is just like any
other field.




but that really impossible ! because I have near millions records to be
indexed so this job will decrease the time of indexing and increase the
index size

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread Chris Hostetter


: And our final queries sent to lucene are quite complicated. This is because
: we need to confirm to a lot of criteria (some are set by users and some are
: internal logics). I don't think we can simplify our queries.

in my experience, your queries can always be made simpler by making your
indexing more complex -- that's not neccessarily a good tradeoff to make
in every situation, but it's usually possible to "denormalize" your index
to help decrease complexity.

: We are not sure about implementing using Solr because we crawl only specific
: type of sites and our crawling mechanism has proven to be quite stable. We

whether or not you want to use Solr is really independent of what kidn of
crawler you currently have -- you'd still use hte same crawler, but
dpeneding on how you wanted to leverage Solr either you'd change your
crawler to POST your new documents to Solr instead of writing the the
index directly, or you could keep writing to the index directly and just
use Solr to serve the searches (assuming you configure the Solr schema.xml
to match up with the fields/analyzers your crawler uses when writing to
the index so it can query on the right fields)

if you have no interest in using Solr to manage your index, even if you
have no interest in using Solr to search your index over HTTP, you might
want to take a look at the distribution scripts that come with Solr to
provide replication.  At the core they are incredibly simple, just
taking advantage of rsync, hard links, and properties of the lucene
fileformat to help minimize the amount of data that needs to go over the
wire when you want to index on one box and then replicate that index to 10
other boxes to distribute the search load.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Serving remote lucene client - RMI vs HTTP

Re: Serving remote lucene client - RMI vs HTTP

Re: Serving remote lucene client - RMI vs HTTP

Re: Serving remote lucene client - RMI vs HTTP

Re: Serving remote lucene client - RMI vs HTTP

Re: search through all fields

Re: Serving remote lucene client - RMI vs HTTP

7 matches

Site Navigation

Mail list logo

Footer information