It depends on what you want, but the Wikipedia data dumps can be found here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
On 19/06/12 17:03, Elshaimaa Ali wrote:
I only have the source text in a MySQL database.
Do you know where I can download it in XML, and is it possible to split the
documents into content and title?
thanks,
shaimaa
> From: luc...@mikemccandless.com
> Date: Tue, 19 Jun 2012 19:48:24 -0400
> Subject: Re: Wikipedia Index
> To: java-user@lucene.apache.org
I have the index locally ... but it's really impractical to send it,
especially if you already have the source text locally.
Maybe index directly from the source text instead of via a database?
Lucene's benchmark contrib/module has code to decode the XML into
documents...
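Something along these lines, if you'd rather roll it yourself -- a rough
sketch using a plain SAX parse of the pages-articles dump rather than the
benchmark module's own classes (the paths are placeholders):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Stream the enwiki pages-articles XML dump and index each <page>
// as a Lucene document with "title" and "content" fields.
public class IndexWikipediaDump {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
        new StandardAnalyzer(Version.LUCENE_36));
    final IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/index")), cfg);

    DefaultHandler handler = new DefaultHandler() {
      private final StringBuilder buf = new StringBuilder();
      private String title;
      private boolean inTitle, inText;

      public void startElement(String uri, String local, String qName,
                               Attributes atts) {
        if ("title".equals(qName)) { inTitle = true; buf.setLength(0); }
        else if ("text".equals(qName)) { inText = true; buf.setLength(0); }
      }

      public void characters(char[] ch, int start, int len) {
        if (inTitle || inText) buf.append(ch, start, len);
      }

      public void endElement(String uri, String local, String qName) {
        try {
          if ("title".equals(qName)) {
            title = buf.toString(); inTitle = false;
          } else if ("text".equals(qName)) {
            // one document per page: stored title, indexed body
            Document doc = new Document();
            doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("content", buf.toString(),
                Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            inText = false;
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    };

    SAXParserFactory.newInstance().newSAXParser().parse(
        new File("/path/to/enwiki-pages-articles.xml"), handler);
    writer.close();
  }
}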
Mike McCandless
http://blog.mikemccandless.com
3 GB RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark
uses a 2 GB heap).
2 cores just means it'll take longer than with more cores... just use 2
indexing threads.
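For what it's worth, IndexWriter is thread-safe, so it's just a matter of
sharing one writer -- a minimal sketch (the queue of pre-built Documents is
the hypothetical part; fill it however you load your articles):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class TwoThreadIndexer {
  // Drain a shared queue of Documents with two threads adding to
  // the same IndexWriter.
  public static void index(final IndexWriter writer,
                           final BlockingQueue<Document> queue)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    for (int i = 0; i < 2; i++) {
      pool.submit(new Runnable() {
        public void run() {
          Document doc;
          // poll() returns null once the (pre-filled) queue is empty
          while ((doc = queue.poll()) != null) {
            try {
              writer.addDocument(doc);
            } catch (Exception e) {
              throw new RuntimeException(e);
            }
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}

With a 2 GB heap you can also give the writer a bigger RAM buffer before
indexing, e.g. cfg.setRAMBufferSizeMB(256), so it flushes segments less
often.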
Mike McCandless
http://blog.mikemccandless.com
On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara wrote:
> Could it be possible to index Wikipedia on a 2-core machine with 3 GB of RAM?
Hmm which Lucene version are you using? For 3.x before 3.4, there was
a bug (https://issues.apache.org/jira/browse/LUCENE-3418) where we
failed to actually fsync...
More below:
On Tue, Jun 19, 2012 at 4:54 PM, Chris Gioran wrote:
> On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless wrote:
Thanks Mike for the prompt reply. Do you have a fully indexed version of
Wikipedia? I mainly need two fields for each document: the indexed content
of the Wikipedia articles and the title. If there is any place where I can
get the index, that will save me great time.
regards,
shaimaa
> From: luc...@mikemccandless.com
Could it be possible to index Wikipedia on a 2-core machine with 3 GB of
RAM? I have had the same problem trying to index it.
I've tried with a dump from April 2011.
Thanks
Reyna
CIC-IPN
Mexico
2012/6/19 Michael McCandless
> Likely the bottleneck is pulling content from the database? Maybe
> test just that and see how long it takes?
On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless
wrote:
> This shouldn't normally happen, even on crash, kill -9, power loss, etc.
>
> It can only mean either there is a bug in Lucene, or there's something
> wrong with your hardware/IO system, or the fsync operation doesn't
> actually work on the IO system.
Likely the bottleneck is pulling content from the database? Maybe
test just that and see how long it takes?
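A quick way to measure that, with no Lucene in the loop at all -- the
table and column names here are made up, so substitute your schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeDbPull {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/wiki", "user", "pass");
    Statement st = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    st.setFetchSize(Integer.MIN_VALUE);  // MySQL driver's streaming mode
    long start = System.currentTimeMillis();
    long rows = 0;
    ResultSet rs = st.executeQuery("SELECT title, content FROM articles");
    while (rs.next()) {
      rs.getString(1);  // touch both columns so they are actually read
      rs.getString(2);
      rows++;
    }
    System.out.println(rows + " rows in "
        + (System.currentTimeMillis() - start) / 1000.0 + " sec");
    conn.close();
  }
}

If that alone takes hours, the database (or fetching row-by-row without
streaming) is your bottleneck, not Lucene.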
24 hours is way too long to index all of Wikipedia. For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:
http://people.apache.org/~mikemccand/lucenebench/
Hi everybody,
I'm using Lucene 3.6 to index Wikipedia, which is over 3 million articles.
The data is in a MySQL database and it is taking more than 24 hours so far.
Do you know any tips that can speed up the indexing process?
Here is my code:
public static void main(String[] args) {
This shouldn't normally happen, even on crash, kill -9, power loss, etc.
It can only mean either there is a bug in Lucene, or there's something
wrong with your hardware/IO system, or the fsync operation doesn't
actually work on the IO system.
You can run CheckIndex to see what's broken (then, add -fix if you want it
to drop the broken segments, but any documents in them are lost).
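From Java it looks roughly like this (the same thing is available from the
command line as java org.apache.lucene.index.CheckIndex /path/to/index):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class CheckMyIndex {
  public static void main(String[] args) throws Exception {
    CheckIndex checker = new CheckIndex(
        FSDirectory.open(new File("/path/to/index")));
    checker.setInfoStream(System.out);  // print per-segment details
    CheckIndex.Status status = checker.checkIndex();
    if (!status.clean) {
      // fixIndex drops any broken segments -- their documents are LOST
      checker.fixIndex(status);
    }
  }
}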
Hello everyone,
I am having a problem with a Lucene store. When starting an
IndexWriter on it, it throws the following exception:
Caused by: java.io.IOException: read past EOF:
MMapIndexInput(path="/path/to/index/_drs.cfs")
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte
Based on this link http://www2002.org/CDROM/refereed/643/node6.html, I'm
calculating Okapi similarity between the query document and another
document as below using Lucene.
I have indexed the documents using 3 fields. I want to give higher weight
to field 2 and field 3, and I can't use Lucene's boost.
I found this is the correct way of calculating the average document length
of a document having three fields:
byte[] normsDocLengthArrField1 = indexReader.norms("field1");
byte[] normsDocLengthArrField2 = indexReader.norms("field2");
byte[] normsDocLengthArrField3 = indexReader.norms("field3");
double s
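For reference, one way to turn those norm bytes back into approximate
lengths -- this is a sketch that assumes the default length norm
(1/sqrt(numTerms)) and no index-time boosts; the byte encoding is lossy,
so the lengths are approximate:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class FieldLengths {
  // Approximate average length (in terms) of one field across the index.
  public static double averageFieldLength(IndexReader reader, String field)
      throws Exception {
    byte[] norms = reader.norms(field);
    if (norms == null) return 0;  // field omits norms
    double total = 0;
    int counted = 0;
    for (byte b : norms) {
      float norm = Similarity.decodeNorm(b);
      if (norm > 0) {  // 0 means an empty field for that doc
        total += 1.0 / (norm * norm);  // invert 1/sqrt(len) -> len
        counted++;
      }
    }
    return counted == 0 ? 0 : total / counted;
  }
}

You can call this once per field and then combine the three per-field
averages according to whatever weighting your scheme gives each field.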