Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
Kevin, I would be curious to know more about your merging issues.  As I
mentioned, I am concerned about merge time, and in my case it is against a filer,
which of course has high latency.  The other issue is that I effectively index
things with a primary key.  I need an efficient way of preventing old
records from trampling on new records; this can happen because of potentially
out-of-order writes to the index from multiple nodes in a processing farm.
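
A minimal sketch of one way to do that version check, assuming each record
is indexed with a "pk" keyword field and a stored "version" field (both
names hypothetical), against the Lucene 1.4-era IndexReader API:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Returns true only if no record with this primary key and an
// equal-or-newer version is already in the index. The caller then deletes
// the old record via reader.delete(new Term("pk", pk)) and re-adds the
// new one through an IndexWriter.
static boolean shouldReplace(IndexReader reader, String pk, long incomingVersion)
    throws IOException {
  TermDocs td = reader.termDocs(new Term("pk", pk));
  try {
    while (td.next()) {
      Document existing = reader.document(td.doc());
      long v = Long.parseLong(existing.get("version"));
      if (v >= incomingVersion) return false; // a newer write already landed
    }
  } finally {
    td.close();
  }
  return true;
}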

C

--- Kevin Burton <[EMAIL PROTECTED]> wrote:

> Bill Au wrote:
> 
> >Optimize is disk I/O bound.  So I am not sure what multiple CPUs will buy
> you.
> >  
> >
> 
> Now on my system with large indexes... I often have the CPU at 100%...
> 
> Kevin
> 



Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Bill Au
That's not true in my case.  The CPU never went over 50%.  I/O wait is
often greater than the CPU usage and can be as high as 90%.

Bill

On 6/10/05, Kevin Burton <[EMAIL PROTECTED]> wrote:
> Bill Au wrote:
> 
> >Optimize is disk I/O bound.  So I am not sure what multiple CPUs will buy 
> >you.
> >
> >
> 
> Now on my system with large indexes... I often have the CPU at 100%...
> 
> Kevin
> 



SimilarityDelegator examples?

2005-06-10 Thread Robichaud, Jean-Philippe
Hi Everyone, 

 

 I've been using Lucene a lot and I would like to know how the
SimilarityDelegator should be used.  I would like to override only the
lengthNorm member of DefaultSimilarity, and I understand that this is
exactly the purpose of SimilarityDelegator.  Am I right?  Does this class
have another usage?  A simple code example would be fine!
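
For reference, a minimal sketch of what that might look like, assuming the
Lucene 1.4-era Similarity API (the class name here is made up):

import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.SimilarityDelegator;

// Delegates every scoring factor to the default Similarity and overrides
// only lengthNorm, here flattening it so field length is ignored.
public class FlatLengthNormSimilarity extends SimilarityDelegator {
  public FlatLengthNormSimilarity() {
    super(Similarity.getDefault()); // normally a DefaultSimilarity
  }
  public float lengthNorm(String fieldName, int numTokens) {
    return 1.0f; // constant norm: field length no longer affects scores
  }
}

// Install it on both sides so index-time norms and search-time scoring
// agree, e.g. writer.setSimilarity(...) and searcher.setSimilarity(...).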

 

Thanks, 

 

Jp

 



Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
Yes, that would line up with being pretty much CPU bound.  So if you were to
have two Xeons with HT, you have almost 4 resources (hardware threads) of
execution you could take advantage of.

From my current tests, where I have multiple threads producing work for an
index and one index writer (one thread doing addDocument), I am seeing that I
am CPU bound on the indexer.  Since I am on a dual Xeon with HT, if I were
using 4 indices I could improve my throughput by more than 1x but less than 4x.

C 

--- Kevin Burton <[EMAIL PROTECTED]> wrote:

> Bill Au wrote:
> 
> >Optimize is disk I/O bound.  So I am not sure what multiple CPUs will buy
> you.
> >  
> >
> 
> Now on my system with large indexes... I often have the CPU at 100%...
> 
> Kevin
> 



Re: Optimizing indexes with multiple processors?

2005-06-10 Thread John Haxby

Chris Collins wrote:


> Ok, that part isn't surprising.  However, only about 1% of 30% of the merge was
> spent in the OS.flush call (not very I/O bound at all with this controller).
 

On Linux, at least, measuring the time taken in OS.flush is not a good 
way to determine if you're I/O bound -- all that does is transfer the 
data to the kernel.   Later, possibly much later, the kernel will 
actually write the data to the disk.


The upshot of this is that if the size of the index is around the size 
of physical memory in the system, optimizing will appear CPU bound.   
Once the index exceeds the size of physical memory, you'll see the 
effects of I/O.   OS.flush will still probably be very quick, but you'll 
see a lot of I/O wait if you run, say, top.
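
A tiny self-contained illustration of the distinction (plain java.io, not
Lucene code): flush() merely hands the bytes to the kernel, while
FileDescriptor.sync() is what actually forces them out to disk.

import java.io.FileOutputStream;

public class FlushVsSync {
  public static void main(String[] args) throws Exception {
    FileOutputStream out = new FileOutputStream("/tmp/flush-test");
    out.write(new byte[8 * 1024 * 1024]);
    out.flush();        // returns quickly: bytes are only in the page cache
    out.getFD().sync(); // blocks until the kernel has written them to disk
    out.close();
  }
}

Timing the two calls separately makes the write-back behaviour described
above easy to see.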


jch




Question on lucene sandbox highlighter

2005-06-10 Thread Terence Lai
Hi all,

I have a couple of questions regarding the Highlighter.

Question 1:
===
I downloaded the highlighter source files. When I compile the code, I get the 
following error:


org/apache/lucene/search/highlight/TokenSources.java [19:1] cannot resolve 
symbol
symbol  : class TermVectorOffsetInfo 
location: package index
import org.apache.lucene.index.TermVectorOffsetInfo;


Note that I have the Lucene 1.4.2 jar file in my classpath. However, it does not 
contain org.apache.lucene.index.TermVectorOffsetInfo. Does anyone know whether I 
am missing some other jar files?


Question 2:
===
I use Lucene to search HTML documents. Before I create the search index, I 
used another open source parser to remove all the HTML tags from the search 
field contents so that the HTML tags would not be part of the searchable values.

Now, I would like to apply the highlighter to my original HTML document. Is 
there any way for me to ignore the HTML tags while I perform the highlighting? For 
example, if my search criterion is "html", I don't want the highlighter to 
highlight the "<html>" tag.
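
One approach, sketched below on the assumption of the 1.4-era sandbox
Highlighter API: run the highlighter over the same tag-stripped text that
was indexed, rather than over the raw HTML, so markup can never be
highlighted (the field name "contents" is hypothetical):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class StrippedHighlight {
  // strippedText is the same tag-free content that went into the index,
  // so the returned fragments can contain no HTML tags at all.
  public static String highlight(Query query, String strippedText) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokens = analyzer.tokenStream("contents", new StringReader(strippedText));
    return highlighter.getBestFragments(tokens, strippedText, 3, " ... ");
  }
}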


Thanks,
Terence
   







Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
Hi John, your comments are correct.  But given that we know our box has almost
80 MB/s of sustainable bandwidth and very low latency to disk, and observing
that the I/O we are doing in Lucene is small by comparison, I am reasonably
confident that the time measurement is not far out (for this run).

As I may have mentioned before, in my case we have reasonably fast hardware
RAID.  At that point the bottlenecks of course change.  We also have the case
where we write to a filer, which has good bandwidth but high latency.  Here we
see that the merge is I/O bound, as you would expect.  That's why I assume changing
the buffer sizes of the FS streams could help, assuming the merge operations
read and write the segments in a linear fashion.  In this case, the latency is
not really a function of the disks, but of the latency in the RPC
between the client (indexer) and the filer.  By increasing the buffer sizes we
would reduce the number of RPCs.

From an I/O-bound point of view, one needs to consider whether you have saturated
the device or you are just stuck waiting for the disk to rotate around.  Long gone
are the days when the elevator algorithm was the preferred disk optimization :-}; disks
can take many commands and re-order them to minimize latency.  If it is
a latency issue, and not necessarily bandwidth, then using overlapping I/O can
improve throughput (splitting the index and having multiple writer threads
would give you that; see the sketch below).  In fact, in my silly filer example, having multiple writers
does show a good effect.  Of course, this depends on whether you can finagle your
application to allow you to split the indices.

Further, I have done longer runs to plot throughput over time (16M-doc crawls).
I only profiled 4k docs since I didn't want to wait forever with JProbe.  Not
sure what the correct jargon is here, so excuse my description: the in-memory
objects were merged out to disk, but we didn't get the second-order effect of the
maybeMerge function finding enough segments at one level to trigger the
merging of multiple segments into the next tier (segments * mergefactor).
Indexer throughput is of course not constant; over time, the time to index one
document does increase when you take into account the cost of the merges.  But
due to the pyramid effect of how the merger works, the larger-order merges of
course happen less and less often.

Back to my observations.  On the CPU side of indexing, the inversion aspect
is dwarfed by the standard tokenizer.  My hat is off to Doug (what is hogging the
CPU is auto-generated code :-}).  Given multiple cores / HT / SMP, you certainly
can capitalize on them if you are willing to write the code.  Not all I/O-bound
problems are created equal: if it is merely latency, then you still have room to
improve throughput if you massage your indexing approach.  Using a single
indexing thread and seeing you are I/O bound should not be a reason to give up :-}

As you can tell, I have two indexing worlds, one where my disk is fast (CPU
bound) and one where it is slow (I/O bound).  I have to work with the
effects of both to get my job done, and each of them has distinctive
challenges.
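
A minimal sketch of that split-index, multiple-writer idea, assuming the
Lucene 1.4-era API (the paths and the nextDocument() source are hypothetical
stand-ins):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// N writer threads, each feeding its own sub-index; a final single-threaded
// merge combines them on the destination (e.g. the filer).
public class SplitIndexer {
  public static void main(String[] args) throws Exception {
    final int N = 4; // roughly one per hardware thread
    Directory[] parts = new Directory[N];
    Thread[] workers = new Thread[N];
    for (int i = 0; i < N; i++) {
      parts[i] = FSDirectory.getDirectory("/tmp/part" + i, true);
      final Directory dir = parts[i];
      workers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true);
            for (Document d = nextDocument(); d != null; d = nextDocument()) {
              w.addDocument(d); // each thread owns its writer: no contention
            }
            w.close();
          } catch (IOException e) {
            throw new RuntimeException(e.toString());
          }
        }
      });
      workers[i].start();
    }
    for (int i = 0; i < N; i++) workers[i].join();
    IndexWriter merged = new IndexWriter("/filer/index", new StandardAnalyzer(), true);
    merged.addIndexes(parts); // IndexWriter.addIndexes(Directory[])
    merged.optimize();
    merged.close();
  }
  // Hypothetical shared document source; a real one must be thread-safe.
  static synchronized Document nextDocument() { return null; }
}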

Regards

Chris
--- John Haxby <[EMAIL PROTECTED]> wrote:

> Chris Collins wrote:
> 
> >Ok, that part isn't surprising.  However, only about 1% of 30% of the merge was
> >spent in the OS.flush call (not very I/O bound at all with this controller).
> >  
> >
> On Linux, at least, measuring the time taken in OS.flush is not a good 
> way to determine if you're I/O bound -- all that does is transfer the 
> data to the kernel.   Later, possibly much later, the kernel will 
> actually write the data to the disk.
> 
> The upshot of this is that if the size of the index is around the size 
> of physical memory in the system, optimizing will appear CPU bound.   
> Once the index exceeds the size of physical memory, you'll see the 
> effects of I/O.   OS.flush will still probably be very quick, but you'll 
> see a lot of I/O wait if you run, say, top.
> 
> jch
> 



Usenet Bridge or LSA Support?

2005-06-10 Thread Mike Winter
Pardon me if this has been asked before, but I was wondering if there 
exists a Lucene -> Usenet bridge or support for latent semantic 
scoring?  Thanks for any information.





Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Peter A. Friend


On Jun 9, 2005, at 11:52 PM, Chris Collins wrote:

> In that case I have a different performance issue: FSInputStream and
> FSOutputStream inherit the buffer size of 1k from OutputStream and
> InputStream. This would be useful to increase, to reduce the number of
> RPCs to the filer when doing merges, assuming that reads and writes are
> sequential (CIFS supports a 64k block and NFS supports up to, I think,
> 32k). I haven't spent much time on this so far, so it's not like I know
> it's hard to do. From preliminary experiments it's obvious that changing
> the base OutputStream buffer size is not the thing to do.
>
> If anyone has successfully increased the FSOutputStream and FSInputStream
> buffers and got it not to blow up on array copies, I would love to know
> the short cut.


I just started up with Lucene, and I have been looking at the NFS
issues. Since the OS doesn't report the block size in use by the
NetApp, EMC, whatever, you need to tweak it manually. I found this in
src/java/org/apache/lucene/store/OutputStream.java:


/** Abstract class for output to a file in a Directory.  A random-access output
 * stream.  Used for all Lucene index output operations.
 * @see Directory
 * @see InputStream
 */
public abstract class OutputStream {
  static final int BUFFER_SIZE = 1024;

I changed that value to 8k and, based on the truss output from an
index run, it is working. I haven't gotten much beyond that to see if
it causes problems elsewhere. The value also needs to be altered on
the read end of things. Ideally, this would be made settable via a
system property.
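
A one-line sketch of that system-property idea (the property name is
hypothetical):

// In OutputStream.java, and likewise on the read side in InputStream.java:
static final int BUFFER_SIZE =
    Integer.getInteger("org.apache.lucene.BufferSize", 1024).intValue();

Running with, e.g., java -Dorg.apache.lucene.BufferSize=8192 would then let
you experiment without recompiling.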


Peter





Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
How many documents did you try to index?  I am using a relatively large
minMergeDocs that causes me to run out of memory when I make such a change (I
am using 1/2 GB of heap, btw).  I believe changing it in the OutputStream class
means that a lot of in-memory-only objects use that size too... I assume that
the real bang for the buck is in FSOutputStream and FSInputStream.
Unravelling that case drops me into array copy issues that I have to debug.

I don't know that I would have used truss in this regard; it only points out what
size hit the kernel, not what went over the wire.  I would suggest using
ethereal to ensure that's how it's ending up on the wire.  As for what goes over
the wire, that's something the CIFS/NFS client negotiates with the server.  I
believe NetApp, for instance, supports up to 32k on NFS and almost 64k with CIFS.


Regards


--- "Peter A. Friend" <[EMAIL PROTECTED]> wrote:

> 
> On Jun 9, 2005, at 11:52 PM, Chris Collins wrote:
> 
> > In that case I have a different performance issue: FSInputStream and
> > FSOutputStream inherit the buffer size of 1k from OutputStream and
> > InputStream. This would be useful to increase, to reduce the number of
> > RPCs to the filer when doing merges, assuming that reads and writes are
> > sequential (CIFS supports a 64k block and NFS supports up to, I think,
> > 32k). I haven't spent much time on this so far, so it's not like I know
> > it's hard to do. From preliminary experiments it's obvious that changing
> > the base OutputStream buffer size is not the thing to do.
> >
> > If anyone has successfully increased the FSOutputStream and FSInputStream
> > buffers and got it not to blow up on array copies, I would love to know
> > the short cut.
> 
> I just started up with Lucene, and I have been looking at the NFS  
> issues. Since the OS doesn't report the block size in use by the  
> Netapp, EMC, whatever, you need to tweak it manually. I found this in  
> src/java/org/apache/lucene/store/OutputStream.java:
> 
> /** Abstract class for output to a file in a Directory.  A random-access output
>  * stream.  Used for all Lucene index output operations.
>  * @see Directory
>  * @see InputStream
>  */
> public abstract class OutputStream {
>   static final int BUFFER_SIZE = 1024;
> 
> I changed that value to 8k, and based on the truss output from an  
> index run, it is working. Haven't gotten much beyond that to see if  
> it causes problems elsewhere. The value also needs to be altered on  
> the read end of things. Ideally, this will be made settable via a  
> system property.
> 
> Peter
> 
> 



view index file

2005-06-10 Thread avrootshell

Hi,

  I'm curious to know if there is any way to view the .cfs file (the
index file created).

Someone please shed some light on this.

Thanks in advance.





Re: view index file

2005-06-10 Thread Nader Henein
The one browsing utility I've come across for browsing through Lucene
indices is Luke (I use it successfully to debug index issues). Check it
out: http://www.getopt.org/luke/


Hope this answers your question

Nader Henein

avrootshell wrote:


Hi,

  I'm curious to know if there is any way to view the .cfs file (the
index file created).

Someone please shed some light on this.

Thanks in advance.











--
Nader S. Henein
Senior Applications Developer

Bayt.com








Re: view index file

2005-06-10 Thread Aalap Parikh
Hi,

Use Luke. It's an excellent tool and everybody in the
Lucene community uses that.

http://www.getopt.org/luke/

Aalap.

--- avrootshell <[EMAIL PROTECTED]> wrote:

> Hi,
> 
>I'm curious to know if there is any way to view
> the .cfs file (the index file created).
> 
> Someone please shed some light on this.
> 
> Thanks in advance.
> 
> 







Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Peter A. Friend


On Jun 10, 2005, at 9:33 AM, Chris Collins wrote:


> How many documents did you try to index?

Only about 4000 at the moment.


> I am using a relatively large
> minMergeDocs that causes me to run out of memory when I make such a
> change. (I am using 1/2 GB of heap, btw.)

I was running out of memory as well until I gave Java a larger heap
to work with. I am assuming that a dedicated indexing machine (as
well as search) is going to need a mountain of memory. I figure I
will be giving Java gigs to play with.



> I believe changing it in the OutputStream class
> means that a lot of in-memory-only objects use that size too.

This I need to look into. At a guess, I would think that there would
be an OutputStream object for each open segment, and each file in
that segment. A consolidated index *might* use less, but of course we
are trying to improve performance here, and the consolidated index
does incur a cost. Assuming 10 segments and 10 files within each
segment, that's 100 OutputStream objects, or 819,200 bytes at 8k each;
that will grow quickly with merge tweaks. Those larger writes do save a
bunch of system calls and make (maybe) better use of your filer's block
size. Of course this could be utterly incorrect; I need to look into
this a bit more carefully.


> I don't know that I would have used truss in this regard; this only
> points out what size hit the kernel, not what went over the wire. I
> would suggest using ethereal to ensure that's how it's ending up on
> the wire.

True, hadn't gotten that far yet. :-)

Peter






Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Kevin Burton

Chris Collins wrote:


> Well, I am currently looking at merging too. In my application, merging
> will occur against a filer (read: higher-latency device). I am currently
> working on how to stage indices on local disk before moving them to a
> filer. Assume I must move to a filer eventually for whatever crazy
> reason... don't ask, it ain't funny :-}
>
> In that case I have a different performance issue: FSInputStream and
> FSOutputStream inherit the buffer size of 1k from OutputStream and
> InputStream. This would be useful to increase, to reduce the number of
> RPCs to the filer when doing merges, assuming that reads and writes are
> sequential (CIFS supports a 64k block and NFS supports up to, I think,
> 32k).

Yeah... I already did this, actually... on local disks the performance 
benefit wasn't noticeable.  The variables are private/final... I made 
them public and non-final and it worked.


Note that OutputStream has a bug when I set it higher... I don't have 
the trace I'm afraid...



> I haven't spent much time on this so far, so it's not like I know it's
> hard to do. From preliminary experiments it's obvious that changing the
> base OutputStream buffer size is not the thing to do.
>
> If anyone has successfully increased the FSOutputStream and FSInputStream
> buffers and got it not to blow up on array copies, I would love to know
> the short cut.


Maybe that was my problem...

Kevin




Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Kevin Burton

Peter A. Friend wrote:



> I changed that value to 8k and, based on the truss output from an
> index run, it is working. I haven't gotten much beyond that to see if
> it causes problems elsewhere. The value also needs to be altered on
> the read end of things. Ideally, this would be made settable via a
> system property.

Has anyone tried to tweak this on a RAID array on XFS?  It's confusing to
figure out the ideal read size.

My performance benchmarks didn't show any benefit to setting this variable
higher, but I'm worried that is due to caching.

I tried to flush the caches by creating a 5 GB file and then cat'ing it to
/dev/null, but I have no way to verify that this actually works.

I just made the BUFFER_SIZE variables non-final so that I can set them at
any time.


Kevin




Re: Question on lucene sandbox highlighter

2005-06-10 Thread Erik Hatcher


On Jun 10, 2005, at 11:28 AM, Terence Lai wrote:

> Hi all,
>
> I have a couple of questions regarding the Highlighter.
>
> Question 1:
> ===
> I downloaded the highlighter source files. When I compile the code, I
> get the following error:
>
> org/apache/lucene/search/highlight/TokenSources.java [19:1] cannot
> resolve symbol
> symbol  : class TermVectorOffsetInfo
> location: package index
> import org.apache.lucene.index.TermVectorOffsetInfo;
>
> Note that I have the Lucene 1.4.2 jar file in my classpath. However,
> it does not contain org.apache.lucene.index.TermVectorOffsetInfo. Does
> anyone know whether I am missing some other jar files?


The latest Highlighter source code is now specific to the TRUNK of
the core Lucene API (which will be Lucene 1.9/2.0).  You will need to
pull a previous version somehow (I'm not sure if the Subversion
repository for contrib goes back that far, or you'll need to get at
the CVS attic for jakarta-lucene-sandbox).

You can get a binary of a 1.4-compatible Highlighter JAR from the
source code that comes with Lucene in Action at
http://www.lucenebook.com


Erik





Re: Lucene in clustered environment (Tomcat)

2005-06-10 Thread Nader Henein
Considering you have all your servers on one machine, a simple memory failure
and the whole thing goes south. But you're right: we have an independent Lucene
index sitting next to each one of our webservers on each machine, but they are
all updated from a central location, powered and organized by an application
that accesses our persistent store on an Oracle database and creates XML files,
which are then copied to each of the Lucene servers and indexed. If the central
utility fails, the backup kicks in; at worst the indices aren't up to date for
as long as it takes to point the webservers to the Oracle standby.

I wrote a preliminary paper (I will send it to you separately because the
mailing list doesn't allow attachments) about Lucene strategies in a clustered
environment. It is about 6 months old; I've come a long way since, and I'm
finalizing a newer version which I hope to publish, so as to offer a solid case
study to anyone out there taking that step. Once again, this paper is old, but
it should get you going.
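
As a rough illustration, the per-server half of such a pipeline could be as
small as the sketch below, which picks up the dropped XML files and indexes
them (all names are hypothetical, and parseXml stands in for whatever
XML-to-Document mapping is used):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DropDirIndexer {
  public static void index(File dropDir, String indexPath) throws Exception {
    // Open (don't re-create) the server's local index.
    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
    File[] files = dropDir.listFiles();
    for (int i = 0; i < files.length; i++) {
      writer.addDocument(parseXml(files[i]));
      files[i].delete(); // consume the drop file once indexed
    }
    writer.close();
  }
  // Placeholder: a real implementation would map the XML fields.
  static Document parseXml(File f) {
    Document d = new Document();
    d.add(Field.Keyword("path", f.getName()));
    return d;
  }
}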


Nader Henein



Ben wrote:

Wouldn't it defeat the purpose of clustering if you have a single
server to manage a single index? What would happen if this server
failed?

Cheers,
Ben

On 6/8/05, Ben <[EMAIL PROTECTED]> wrote:

How about using JavaGroups to notify other nodes in the cluster about
the changes?

Essentially, each node has the same index stored in a different
location. When one node updates/deletes a record, the other nodes get
a notification about the changes and update their indexes accordingly.
By using this method, I don't have to modify my Lucene code; I just
need to add additional code to notify the other nodes. I believe this
method also scales better.

Cheers,
Ben

On 6/7/05, Nader Henein <[EMAIL PROTECTED]> wrote:

I realize I've already asked you this question, but do you need 100%
real time? Because you could run them as a batch every 2 minutes. And
concerning parallel search: unless you really need it, it's overkill in
this case; a communal index will serve you well and will be much easier
to maintain. You have to weigh requirements vs. complexity/debug time.

Nader Henein

Ben wrote:

> When you say your cluster is on a single machine, do you mean that you have
> multiple webservers on the same machine all of which search a single Lucene
> index?

Yes, this is my case.

> Do you use Lucene as your persistent store or do you have a DB back there?

I use Lucene to search for data stored in a PostgreSQL server.

> What is your current update/delete strategy? Because real time inserts from the
> webservers directly to the index will not work, since you can't have multiple
> writers.

I have to do this in real time; what are the available solutions? My
application has the ability to do batch updates/deletes to a Lucene
index, but I would like to do this in real time.

One solution I am thinking of is to have each cluster node hold its own
index and use parallel search. This makes my application even more complex.

> I strongly recommend Quartz, it's rock solid and really versatile.

I am using Quartz; it is really great and supports clustering.

Thanks,
Ben

On 6/7/05, Nader Henein <[EMAIL PROTECTED]> wrote:

When you say your cluster is on a single machine, do you mean that you
have multiple webservers on the same machine all of which search a
single Lucene index? Because if that's the case, your solution is
simple, as long as you persist to a single DB and then designate one of
your servers (or even another server) to update/delete the index. Do you
use Lucene as your persistent store or do you have a DB back there? And
what is your current update/delete strategy? Real time inserts from the
webservers directly to the index will not work because you can't have
multiple writers. Updating a dirty flag on rows that need to be
indexed/deleted, or using a table for this task and then batching
your updates, would be ideal. And if you're using server-specific
scheduling, I strongly recommend Quartz; it's rock solid and really
versatile.

My two cents.

Nader Henein

Ben wrote:

My cluster is on a single machine and I am using an FS index.

I have already integrated Lucene into my web application for use in a
non-clustered environment. I don't know what I need to do to make it
work in a clustered environment.

Thanks,
Ben

On 6/7/05, Nader Henein <[EMAIL PROTECTED]> wrote:

IMHO, issues that you need to consider:

 * Atomicity of updates and deletes if you are using multiple indexes
   on multiple machines (the case if your cluster is over a wide network)
 * Scheduled index-to-core-data comparison and sanitization
   (intensive)

This all depends on what the volume of change is on your index and
whether you'll be using a memory-resident index or an FS index.

This should start the ball rolling; we've been using Lucene successfully
on a distributed cluster

Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
Don't forget that when a document is indexed, it starts life in its own segment.
If you have a min merge of 4k, you could have an awful lot of one-doc segments on
the segment stack; that's why I run out of memory.  In that case, if each of
these at some point has a buffer of 8k or, say, 64k, you blow up pretty
quickly (4,096 buffered segments at 8 KB is already 32 MB for a single buffer
apiece, and each segment has several files).

regards

C

--- "Peter A. Friend" <[EMAIL PROTECTED]> wrote:

> 
> On Jun 10, 2005, at 9:33 AM, Chris Collins wrote:
> 
> > How many documents did you try to index?
> 
> Only about 4000 at the moment.
> 
> >   I am using a relatively large
> > minMergeDocs that causes me to run out of memory when I make such a
> > change. (I am using 1/2 GB of heap, btw.)
> 
> I was running out of memory as well until I gave Java a larger heap  
> to work with. I am assuming that a dedicated indexing machine (as  
> well as search) is going to need a mountain of memory. I figure I  
> will be giving Java gigs to play with.
> 
> > I believe changing it in the OutputStream class
> > means that a lot of in-memory-only objects use that size too.
> 
> This I need to look into. At a guess, I would think that there would
> be an OutputStream object for each open segment, and each file in
> that segment. A consolidated index *might* use less, but of course we
> are trying to improve performance here, and the consolidated index
> does incur a cost. Assuming 10 segments and 10 files within each
> segment, that's 100 OutputStream objects, or 819,200 bytes at 8k each;
> that will grow quickly with merge tweaks. Those larger writes do save a
> bunch of system calls and make (maybe) better use of your filer's block
> size. Of course this could be utterly incorrect; I need to look into
> this a bit more carefully.
> 
> > I don't know that I would have used truss in this regard; this only
> > points out what size hit the kernel, not what went over the wire. I
> > would suggest using ethereal to ensure that's how it's ending up on
> > the wire.
> 
> True, hadn't gotten that far yet. :-)
> 
> Peter
> 
> 
> 



Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Chris Collins
Yeah, I think the bug is related to an array copy that expects 1k blocks (if I
recall, it was RAMDirectory or something like that).

C

--- Kevin Burton <[EMAIL PROTECTED]> wrote:

> Chris Collins wrote:
> 
> >Well, I am currently looking at merging too.  In my application, merging
> >will occur against a filer (read: higher-latency device).  I am currently
> >working on how to stage indices on local disk before moving them to a
> >filer.  Assume I must move to a filer eventually for whatever crazy
> >reason... don't ask, it ain't funny :-}
> >
> >In that case I have a different performance issue: FSInputStream and
> >FSOutputStream inherit the buffer size of 1k from OutputStream and
> >InputStream.  This would be useful to increase, to reduce the number of
> >RPCs to the filer when doing merges, assuming that reads and writes are
> >sequential (CIFS supports a 64k block and NFS supports up to, I think,
> >32k).
> >
> Yeah... I already did this, actually... on local disks the performance
> benefit wasn't noticeable.  The variables are private/final... I made
> them public and non-final and it worked.
> 
> Note that OutputStream has a bug when I set it higher... I don't have
> the trace, I'm afraid...
> 
> > I haven't spent much time on this so far, so it's not like I know it's
> > hard to do.  From preliminary experiments it's obvious that changing the
> > base OutputStream buffer size is not the thing to do.
> >
> > If anyone has successfully increased the FSOutputStream and FSInputStream
> > buffers and got it not to blow up on array copies, I would love to know
> > the short cut.
> >
> Maybe that was my problem...
> 
> Kevin
> 