Frustrated is the word :) I have looked at Solr...what I am worried about there is this: Solr says it requires an OS that supports hard links, and to my knowledge Windows currently does not. Someone commented that Windows could be supported...from what I know I don't think so. Not a deal breaker per se, but then there is this: I have done a lot with the Lucene API. I have built a custom query language on top of the Lucene query parser, I have modified the standard parser, and I have made heavy use of MultiSearchers. I am really tied into the Lucene API, and I am worried about how easy it would be to integrate all of that into Solr. Perhaps I could just grab the distributed part of Solr, but I do not know. I have so much to do that worrying about distributed search seems like too big a scope for now. It seemed to me that breaking up the index behind an RMI searcher was the easiest approach anyway.

In the end...I would really like to stay on one server. This server will probably have multiple procs...should I make sure I incorporate a parallel searcher option?
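For concreteness, here is roughly what I mean by breaking up the index with a parallel searcher option. This is just a minimal sketch against the Lucene 2.0 API as I understand it...the slice paths and the "body"/"myDateField" field names are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Sort;

public class SliceSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per on-disk index slice (hypothetical paths).
        Searchable[] slices = new Searchable[] {
            new IndexSearcher("/indexes/slice0"),
            new IndexSearcher("/indexes/slice1"),
            new IndexSearcher("/indexes/slice2"),
            new IndexSearcher("/indexes/slice3")
        };

        // ParallelMultiSearcher searches each slice in its own thread and
        // merges the results, so multiple procs actually get used.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(slices);

        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("some query");

        // Sort by a field rather than relevance, as our searches will.
        Hits hits = searcher.search(query, new Sort("myDateField"));
        System.out.println(hits.length() + " hits");

        searcher.close();
    }
}

The appeal for me is that the same slices could presumably be served later via RemoteSearchable over RMI without touching the query side, if one box ever proves not enough.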

Really, I am just hoping for some more insight into this exact question:

Can I index 30 million+ docs that range in size from 2-10 KB on a single server in a Windows environment (access to a max of about 1.5 gig of RAM)? The average search will need to be sorted by field, not relevancy.
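To put rough numbers on the sort side of it (my own back-of-envelope, so correct me if I am wrong): as I understand it, sorting on a field fills a FieldCache entry of about 4 bytes per document for an int field, so 30 million docs works out to roughly 120 MB per sorted numeric field. A String sort is worse...one ord per document plus all the unique term values, which could plausibly run to several hundred MB at this scale. Against a 1.5 gig heap, even two or three cached sort fields start to look scary.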

Do you think it's possible, or a pipe dream? I realize I need to test to find out...but I am looking for someone with experience to pipe in before I get to that point.

Thanks for the responses so far...I love the Lucene mailing list.

Thanks,
Mark


Ray Tsang wrote:
i've indexed 80m records and now up to 200m.. it can be done, and
could've been done better. like the others said, architecture is
important. have you considered looking into solr? i haven't kept up
with it (and many of the mailing lists...), but it looks very interesting.

ray,

On 8/12/06, Jason Polites <[EMAIL PROTECTED]> wrote:

Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
engineering and business rarely see eye-to-eye.  Just focus on the fact
that what you have learnt from the process will help you, and they paid
for it ;)

On the issue at hand...Lucene should scale to this level, but you need a
good architecture behind it.  Google has good indexing tech, but it's
their architecture that allows them to spread the index across thousands
of servers which really gives it grunt (to the point that they invented
their own RAID-style file system).

Just think very carefully about the architecture underpinning the index.
Lucene is core-tech.  It's up to you to provide the framework to make it
hum.

On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:
>
> Tomi NA wrote:
> > On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:
> >> I've made a nice little archive application with lucene. I made it to
> >> handle our largest need: 2.5 million docs or so on a single server. Now
> >> the powers that be say: let's use it for a 30+ million document archive
> >> on a single server! (each doc size maybe 10k max...as small as 1 or
> >> 2k) Please tell me why we are in trouble...please tell me why we are
> >> not. I have tested up to 2 million docs without much trouble but 30
> >> million...the average search will include a sort on a field as
> >> well...can I search 30+ million docs with a sort? Man am I worried
> >> about that. Maybe the server will have 8 procs and 12 billion gigs of
> >> RAM. Maybe. Even still, Tomcat seems to be able to launch with a max
> >> of 1.5 or 1.6 gig of RAM in Windows. What do you think? 30 million+
> >> sounds like too much of a load to me for a single server. Not that
> >> they care what I think...I only wrote the thing (man I hate my job,
> >> offer me a new one :) )...please...comments?
> >>
> >> Cheers,
> >>
> >> Miserable Mark
> >
> > I don't really understand what you're so worried about. Either it'll
> > work well with the setup you have, or it won't. It's really the size
> > of it. ;)
> > Seriously, you have a number of relatively cheap possibilities at hand
> > to improve search performance: storing the index on a RAID 5 disk
> > array will let you read the indices very fast, using multicore CPUs,
> > adding memory and even if all that isn't good enough, you can always
> > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> > filled with a GB of RAM. You don't have to keep them inside the
> > regular UPS/backup/vault-protected area as the indices can always be
> > rebuilt (unlike e.g. data in transactional systems) and between 4 of
> > them they might cost like an entry-level server.
> > I'll let the experts speak now. :)
> >
> > t.n.a.
> >
> Thanks for the tip...I am not too worried...I am miserable because I
> live in Dilbert land, not this particular incident. Spreading to
> multiple servers is a possibility but one I want to avoid...I wrote this
> app on the side since our current product is crap...it still needs a lot
> of work, and thinking about distributing lucene at this point is a
> little much...I never even have time to work on this project as it is
> because I am currently tasked with porting the crap old project to
> Windows. I need to do a bunch to shore up what I have. No one cares
> though...they think that I have done nothing (or can't understand what
> I have done) while at the same time they want to use what I haven't
> done to do what I made it for, as well as this new super archive of 30
> million+ docs...in the end I'll be looking for a new job...still
> curious about lucene scaling to 30 million docs with a sort on every
> search though (yes I know the sort is cached...that worries me too
> though...the sort will be on multiple and different fields depending on
> what the searcher wants...uggg...the size of the caches....)