Frustrated is the word :) I have looked at Solr...what I am worried about there is this: Solr says it requires an OS that supports hard links, and to my knowledge Windows currently does not. Someone commented that Windows could be supported...from what I know I don't think so. Not a deal breaker per se, but then there is this: I have done a lot with the Lucene API. I have built a custom query language on top of the Lucene query parser, I have modified the standard parser, and I have made heavy use of MultiSearchers. I am really tied into the Lucene API, and I am worried about how easy it would be to integrate all of that into Solr. Perhaps I could just grab the distributed part of Solr, but I do not know. I have so much to do that worrying about distributed search seems like too big a scope for now. It seemed to me that breaking up the index behind an RMI searcher was the easiest approach anyway.

In the end...I would really like to stay on one server. This server will probably have multiple procs...should I make sure I incorporate a parallel searcher option?
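For concreteness, here is roughly what I mean by breaking up the index with a parallel searcher option. This is just a minimal sketch against the Lucene 2.0 API as I understand it...the slice paths and the "body"/"myDateField" field names are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Sort;

public class SliceSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per on-disk index slice (hypothetical paths).
        Searchable[] slices = new Searchable[] {
            new IndexSearcher("/indexes/slice0"),
            new IndexSearcher("/indexes/slice1"),
            new IndexSearcher("/indexes/slice2"),
            new IndexSearcher("/indexes/slice3")
        };

        // ParallelMultiSearcher searches each slice in its own thread and
        // merges the results, so multiple procs actually get used.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(slices);

        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("some query");

        // Sort by a field rather than relevance, as our searches will.
        Hits hits = searcher.search(query, new Sort("myDateField"));
        System.out.println(hits.length() + " hits");

        searcher.close();
    }
}

The appeal for me is that the same slices could presumably be served later via RemoteSearchable over RMI without touching the query side, if one box ever proves not enough.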

Really, I am just hoping for some more insight into this exact question:

Can I index 30 million+ docs that range in size from 2-10 KB on a single server in a Windows environment (access to a max of about 1.5 gig of RAM)? The average search will need to be sorted by field, not relevancy.
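To put rough numbers on the sort side of it (my own back-of-envelope, so correct me if I am wrong): as I understand it, sorting on a field fills a FieldCache entry of about 4 bytes per document for an int field, so 30 million docs works out to roughly 120 MB per sorted numeric field. A String sort is worse...one ord per document plus all the unique term values, which could plausibly run to several hundred MB at this scale. Against a 1.5 gig heap, even two or three cached sort fields start to look scary.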

Do you think it's possible, or a pipe dream? I realize I need to test to find out...but I am looking for someone with experience to pipe in before I get to that point.

Thanks for the responses so far...I love the Lucene mailing list.

Thanks,
Mark


Ray Tsang wrote:
i've indexed 80m records and now up to 200m.. it can be done, and
could've been done better. like the others said, architecture is
important. have you considered looking into solr? i haven't kept up
with it (and many of the mailing lists...), but it looks very interesting.

ray,

On 8/12/06, Jason Polites <[EMAIL PROTECTED]> wrote:

Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
engineering and business rarely see eye-to-eye.  Just focus on the fact
that what you have learnt from the process will help you, and they paid
for it ;)

On the issue at hand...Lucene should scale to this level, but you need a
good architecture behind it.  Google has good indexing tech, but it's
their architecture that allows them to spread the index across thousands
of servers which really gives it grunt (to the point that they invented
their own RAID-style file system).

Just think very carefully about the architecture underpinning the index.
Lucene is core-tech.  It's up to you to provide the framework to make it
hum.

On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:
>
> Tomi NA wrote:
> > On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:
> >> I've made a nice little archive application with lucene. I made it to
> >> handle our largest need: 2.5 million docs or so on a single server. Now
> >> the powers that be say: let's use it for a 30+ million document archive
> >> on a single server! (each doc size maybe 10k max...as small as 1 or
> >> 2k) Please tell me why we are in trouble...please tell me why we are
> >> not. I have tested up to 2 million docs without much trouble but 30
> >> million...the average search will include a sort on a field as
> >> well...can I search 30+ million docs with a sort? Man am I worried
> >> about that. Maybe the server will have 8 procs and 12 billion gigs of
> >> RAM. Maybe. Even still, Tomcat seems to be able to launch with a max
> >> of 1.5 or 1.6 gig of RAM in Windows. What do you think? 30 million+
> >> sounds like too much of a load to me for a single server. Not that
> >> they care what I think...I only wrote the thing (man I hate my job,
> >> offer me a new one :) )...please...comments?
> >>
> >> Cheers,
> >>
> >> Miserable Mark
> >
> > I don't really understand what you're so worried about. Either it'll
> > work well with the setup you have, or it won't. It's really the size
> > of it. ;)
> > Seriously, you have a number of relatively cheap possibilities at hand
> > to improve search performance: storing the index on a RAID 5 disk
> > array will let you read the indices very fast, using multicore CPUs,
> > adding memory and even if all that isn't good enough, you can always
> > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> > filled with a GB of RAM. You don't have to keep them inside the
> > regular UPS/backup/vault-protected area as the indices can always be
> > rebuilt (unlike e.g. data in transactional systems) and between 4 of
> > them they might cost like an entry-level server.
> > I'll let the experts speak now. :)
> >
> > t.n.a.
> >
> Thanks for the tip...I am not too worried...I am miserable because I
> live in Dilbert land, not this particular incident. Spreading to
> multiple servers is a possibility but one I want to avoid...I wrote this
> app on the side since our current product is crap...it still needs a lot
> of work, and thinking about distributing lucene at this point is a
> little much...I never even have time to work on this project as it is
> because I am currently tasked with porting the crap old project to
> Windows. I need to do a bunch to shore up what I have. No one cares
> though...they think that I have done nothing (or can't understand what
> I have done) while at the same time they want to use what I haven't
> done to do what I made it for, as well as this new super archive of 30
> million+ docs...in the end I'll be looking for a new job...still
> curious about lucene scaling to 30 million docs with a sort on every
> search though (yes I know the sort is cached...that worries me too
> though...the sort will be on multiple and different fields depending on
> what the searcher wants...uggg...the size of the caches....)