Re: Wikipedia Index

2012-06-19 Thread Greg Bowyer
the documents into content and title? Thanks, Shaimaa From: luc...@mikemccandless.com Date: Tue, 19 Jun 2012 19:48:24 -0400 Subject: Re: Wikipedia Index To: java-user@lucene.apache.org I have the index locally ... but it's really impractical to send it, especially if you already have the source

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
I only have the source text in a MySQL database. Do you know where I can download it in XML, and is it possible to split the documents into content and title? Thanks, Shaimaa > From: luc...@mikemccandless.com > Date: Tue, 19 Jun 2012 19:48:24 -0400 > Subject: Re: Wikipedia Index > T
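As a rough illustration of the "split the documents into content and title" question, here is a minimal Python sketch that streams a MediaWiki-style XML export and yields one (title, content) pair per page. It assumes the usual `<page>` / `<title>` / `<revision>` / `<text>` layout; the helper names `local_name` and `iter_pages` and the tiny sample document are made up for this example, and none of this comes from the thread itself.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, content) for each <page> element in a MediaWiki-style export."""
    title, content = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            content = elem.text or ""
        elif name == "page":
            yield title, content
            title, content = None, None
            elem.clear()  # drop the finished page's children to limit memory use


# Tiny hand-written sample in the shape of a dump; not real Wikipedia data.
# The namespace URI is illustrative; the parser above ignores it.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example</title>
    <revision><text>Body of the example page.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump, the `io.BytesIO(SAMPLE)` argument would be replaced by an open file object; the two fields of each pair could then go into separate title and content fields of the index.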

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
title. If there is any place where I can get > the index, that will save me great time > Regards, Shaimaa > >> From: luc...@mikemccandless.com >> Date: Tue, 19 Jun 2012 16:29:39 -0400 >> Subject: Re: Wikipedia Index >> To: java-user@lucene.apache.org >> >>

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
3 GB RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark uses a 2 GB heap). Having only 2 cores just means it'll take longer than with more cores... just use 2 indexing threads. Mike McCandless http://blog.mikemccandless.com On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara wrote: > Could it be possib
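The "just use 2 indexing threads" advice amounts to a fixed-size worker pool feeding one shared index. The sketch below shows only that shape, in Python rather than against Lucene's Java API: `index_document` and the `indexed` list are hypothetical stand-ins for adding a document to a real index, and the documents are synthetic. In Lucene itself, a single `IndexWriter` can be shared across threads in a similar way.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

NUM_INDEXING_THREADS = 2  # one per core, following the advice above

indexed = []  # hypothetical stand-in for a shared index
indexed_lock = threading.Lock()


def index_document(doc):
    """Hypothetical stand-in for adding one document to the index."""
    title, content = doc
    with indexed_lock:
        indexed.append((title, len(content)))


# Synthetic documents, only to exercise the pool.
docs = [("Page %d" % i, "text " * i) for i in range(10)]

with ThreadPoolExecutor(max_workers=NUM_INDEXING_THREADS) as pool:
    # list() waits for all work to finish and re-raises any worker exception.
    list(pool.map(index_document, docs))
```

Capping the pool at the number of cores avoids oversubscribing a small machine; the heap size is a separate setting and is not modelled here.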

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
rom: luc...@mikemccandless.com > Date: Tue, 19 Jun 2012 16:29:39 -0400 > Subject: Re: Wikipedia Index > To: java-user@lucene.apache.org > > Likely the bottleneck is pulling content from the database? Maybe > test just that and see how long it takes? > > 24 hours is w

Re: Wikipedia Index

2012-06-19 Thread Reyna Melara
Could it be possible to index Wikipedia on a 2-core machine with 3 GB of RAM? I have had the same problem trying to index it. I've tried with a dump from April 2011. Thanks Reyna CIC-IPN Mexico 2012/6/19 Michael McCandless > Likely the bottleneck is pulling content from the database? Maybe >

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
Likely the bottleneck is pulling content from the database? Maybe test just that and see how long it takes? 24 hours is way too long to index all of Wikipedia. For example, we index Wikipedia every night for our trunk/4.0 performance tests, here: http://people.apache.org/~mikemccand/luceneb
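One way to "test just that" is to read every row and throw it away, timing only the fetch so that indexing cost is excluded. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL database mentioned in the thread, and the table name `articles`, its columns, and the helper `time_fetch` are hypothetical.

```python
import sqlite3
import time


def time_fetch(conn, query, batch_size=1000):
    """Read every row of `query` and discard it; return (row_count, seconds)."""
    start = time.perf_counter()
    cursor = conn.execute(query)
    rows = 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows += len(batch)
    return rows, time.perf_counter() - start


# Stand-in data: an in-memory SQLite table instead of the real MySQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Page %d" % i, "body text " * 50) for i in range(5000)],
)

rows, seconds = time_fetch(conn, "SELECT title, content FROM articles")
print("fetched %d rows in %.3f s" % (rows, seconds))
```

If the fetch alone accounts for most of the 24 hours, the database read is the bottleneck rather than the indexer.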