Re: Format of Wikipedia Index

2018-01-22 Thread Will Martin
From the javadoc for DocMaker:
* *doc.stored* - specifies whether fields should be stored (default *false*).
* *doc.body.stored* - specifies whether the body field should be stored (default = *doc.stored*).
So out of the box you won't get the content stored. Does this help? regards -will
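To store the body field you would flip those switches in the benchmark .alg file. A minimal sketch, assuming the stock contrib/benchmark setup; the docs.file path is an assumption, and the shipped conf/ examples carry many more properties:

# minimal sketch of an .alg for indexing the enwiki dump with stored fields
content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
docs.file=temp/enwiki-latest-pages-articles-multistream.xml.bz2
doc.stored=true
# doc.body.stored defaults to doc.stored; shown here only for clarity
doc.body.stored=true

CreateIndex
{ AddDoc } : 1000
CloseIndex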

Format of Wikipedia Index

2018-01-22 Thread Armins Stepanjans
Hi, I have a question regarding the format of the index created by DocMaker from EnwikiContentSource. After creating the index from a dump of all Wikipedia's articles (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2), I'm having trouble understanding th

Re: Wikipedia Index

2012-06-19 Thread Greg Bowyer
the documents into content and title? thanks shaimaa From: luc...@mikemccandless.com Date: Tue, 19 Jun 2012 19:48:24 -0400 Subject: Re: Wikipedia Index To: java-user@lucene.apache.org I have the index locally ... but it's really impractical to send it, especially if you already have the source

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
I only have the source text in a MySQL database. Do you know where I can download it in XML, and is it possible to split the documents into content and title? thanks shaimaa > From: luc...@mikemccandless.com > Date: Tue, 19 Jun 2012 19:48:24 -0400 > Subject: Re: Wikipedia Index > T

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
title. if there is any place where I can get > the index, that will save me great time > regards shaimaa > >> From: luc...@mikemccandless.com >> Date: Tue, 19 Jun 2012 16:29:39 -0400 >> Subject: Re: Wikipedia Index >> To: java-user@lucene.apache.org >>

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
3 GB RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark uses a 2 GB heap). Having only 2 cores just means it'll take longer than with more cores... just use 2 indexing threads. Mike McCandless http://blog.mikemccandless.com On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara wrote: > Could it be possib
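For concreteness, a minimal sketch of that two-indexing-threads setup against the Lucene 3.6 API used in this thread; fetchNextArticle() is a hypothetical stand-in for whatever supplies (title, body) pairs:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TwoThreadIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        final IndexWriter writer =
                new IndexWriter(FSDirectory.open(new File("wiki-index")), cfg);

        // IndexWriter is thread-safe, so both workers share one instance.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int t = 0; t < 2; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    String[] article;
                    while ((article = fetchNextArticle()) != null) {
                        Document doc = new Document();
                        doc.add(new Field("title", article[0],
                                Field.Store.YES, Field.Index.ANALYZED));
                        doc.add(new Field("body", article[1],
                                Field.Store.NO, Field.Index.ANALYZED));
                        try {
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        writer.close();
    }

    // Hypothetical source: returns {title, body}, or null when exhausted.
    static synchronized String[] fetchNextArticle() { return null; }
}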

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
From: luc...@mikemccandless.com > Date: Tue, 19 Jun 2012 16:29:39 -0400 > Subject: Re: Wikipedia Index > To: java-user@lucene.apache.org > > Likely the bottleneck is pulling content from the database? Maybe > test just that and see how long it takes? > > 24 hours is w

Re: Wikipedia Index

2012-06-19 Thread Reyna Melara
Could it be possible to index Wikipedia on a 2-core machine with 3 GB of RAM? I have had the same problem trying to index it; I've tried with a dump from April 2011. Thanks Reyna CIC-IPN Mexico 2012/6/19 Michael McCandless > Likely the bottleneck is pulling content from the database? Maybe >

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
Likely the bottleneck is pulling content from the database? Maybe test just that and see how long it takes? 24 hours is way too long to index all of Wikipedia. For example, we index Wikipedia every night for our trunk/4.0 performance tests, here: http://people.apache.org/~mikemccand/luceneb
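One way to test just the database pull, per the suggestion above: time a streaming read of every row without touching Lucene at all. A sketch assuming MySQL Connector/J; the JDBC URL and the table/column names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbPullTimer {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wiki", "user", "pass");
        Statement st = conn.createStatement();
        // MySQL driver hint: stream rows instead of buffering all 3M in memory.
        st.setFetchSize(Integer.MIN_VALUE);
        long start = System.currentTimeMillis();
        ResultSet rs = st.executeQuery("SELECT title, body FROM articles");
        long rows = 0;
        while (rs.next()) {
            rs.getString(1); // touch the columns so the data is actually read
            rs.getString(2);
            rows++;
        }
        System.out.println(rows + " rows in "
                + (System.currentTimeMillis() - start) + " ms");
        rs.close(); st.close(); conn.close();
    }
}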

Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
Hi everybody, I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million articles; the data is in a MySQL database and it has been taking more than 24 hours so far. Do you know any tips that can speed up the indexing process? Here is my code: public static void main(String[] args) {
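The posted code is cut off by the archive; for reference, here is a minimal single-threaded sketch of the same task against the Lucene 3.6 API, not the poster's actual program. The JDBC URL and table/column names are assumptions; raising the RAM buffer above the 16 MB default and streaming the MySQL result set are two cheap speedups:

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WikiDbIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        cfg.setRAMBufferSizeMB(256); // flush far less often than the 16 MB default
        IndexWriter writer =
                new IndexWriter(FSDirectory.open(new File("wiki-index")), cfg);

        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wiki", "user", "pass");
        Statement st = conn.createStatement();
        st.setFetchSize(Integer.MIN_VALUE); // stream rows; don't buffer 3M articles
        ResultSet rs = st.executeQuery("SELECT title, body FROM articles");
        while (rs.next()) {
            Document doc = new Document();
            doc.add(new Field("title", rs.getString("title"),
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", rs.getString("body"),
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        rs.close(); st.close(); conn.close();
        writer.close();
    }
}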