It depends on what you want, but the Wikipedia data dumps can be found here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
On 19/06/12 17:03, Elshaimaa Ali wrote:
I only have the source text in a MySQL database.
Do you know where I can download it in XML, and is it possible to split the
documents into content and title?
thanks,
shaimaa
> From: luc...@mikemccandless.com
> Date: Tue, 19 Jun 2012 19:48:24 -0400
> Subject: Re: Wikipedia Index
> To: java-user@lucene.apache.org
I have the index locally ... but it's really impractical to send it,
especially if you already have the source text locally.
Maybe index directly from the source text instead of via a database?
Lucene's benchmark contrib/module has code to decode the XML into
documents...
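Something along these lines, if you'd rather roll it yourself -- a rough
sketch using a plain SAX parse of the pages-articles dump rather than the
benchmark module's own classes (the paths are placeholders):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Stream the enwiki pages-articles XML dump and index each <page>
// as a Lucene document with "title" and "content" fields.
public class IndexWikipediaDump {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
        new StandardAnalyzer(Version.LUCENE_36));
    final IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/index")), cfg);

    DefaultHandler handler = new DefaultHandler() {
      private final StringBuilder buf = new StringBuilder();
      private String title;
      private boolean inTitle, inText;

      public void startElement(String uri, String local, String qName,
                               Attributes atts) {
        if ("title".equals(qName)) { inTitle = true; buf.setLength(0); }
        else if ("text".equals(qName)) { inText = true; buf.setLength(0); }
      }

      public void characters(char[] ch, int start, int len) {
        if (inTitle || inText) buf.append(ch, start, len);
      }

      public void endElement(String uri, String local, String qName) {
        try {
          if ("title".equals(qName)) {
            title = buf.toString(); inTitle = false;
          } else if ("text".equals(qName)) {
            // one document per page: stored title, indexed body
            Document doc = new Document();
            doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("content", buf.toString(),
                Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            inText = false;
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    };

    SAXParserFactory.newInstance().newSAXParser().parse(
        new File("/path/to/enwiki-pages-articles.xml"), handler);
    writer.close();
  }
}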
Mike McCandless
http://blog.mikemccandless.com
3 GB RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark
uses a 2 GB heap).
2 cores just means it'll take longer than with more cores... just use 2
indexing threads.
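For what it's worth, IndexWriter is thread-safe, so it's just a matter of
sharing one writer -- a minimal sketch (the queue of pre-built Documents is
the hypothetical part; fill it however you load your articles):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class TwoThreadIndexer {
  // Drain a shared queue of Documents with two threads adding to
  // the same IndexWriter.
  public static void index(final IndexWriter writer,
                           final BlockingQueue<Document> queue)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    for (int i = 0; i < 2; i++) {
      pool.submit(new Runnable() {
        public void run() {
          Document doc;
          // poll() returns null once the (pre-filled) queue is empty
          while ((doc = queue.poll()) != null) {
            try {
              writer.addDocument(doc);
            } catch (Exception e) {
              throw new RuntimeException(e);
            }
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}

With a 2 GB heap you can also give the writer a bigger RAM buffer before
indexing, e.g. cfg.setRAMBufferSizeMB(256), so it flushes segments less
often.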
Mike McCandless
http://blog.mikemccandless.com
On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara wrote:
> Could it be possible to index Wikipedia on a 2-core machine with 3 GB of RAM?
Hmm which Lucene version are you using? For 3.x before 3.4, there was
a bug (https://issues.apache.org/jira/browse/LUCENE-3418) where we
failed to actually fsync...
More below:
On Tue, Jun 19, 2012 at 4:54 PM, Chris Gioran wrote:
> On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless wrote:
Thanks Mike for the prompt reply. Do you have a fully indexed version of
Wikipedia? I mainly need two fields for each document: the indexed content
of the Wikipedia articles and the title. If there is any place where I can
get the index, that will save me great time.
regards,
shaimaa
> From: luc...@mikemccandless.com
Could it be possible to index Wikipedia on a 2-core machine with 3 GB of
RAM? I have had the same problem trying to index it.
I've tried with a dump from April 2011.
Thanks
Reyna
CIC-IPN
Mexico
2012/6/19 Michael McCandless
> Likely the bottleneck is pulling content from the database? Maybe
> test just that and see how long it takes?
On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless
wrote:
> This shouldn't normally happen, even on crash, kill -9, power loss, etc.
>
> It can only mean either there is a bug in Lucene, or there's something
> wrong with your hardware/IO system, or the fsync operation doesn't
> actually work on the IO system.
Likely the bottleneck is pulling content from the database? Maybe
test just that and see how long it takes?
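A quick way to measure that, with no Lucene in the loop at all -- the
table and column names here are made up, so substitute your schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeDbPull {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/wiki", "user", "pass");
    Statement st = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    st.setFetchSize(Integer.MIN_VALUE);  // MySQL driver's streaming mode
    long start = System.currentTimeMillis();
    long rows = 0;
    ResultSet rs = st.executeQuery("SELECT title, content FROM articles");
    while (rs.next()) {
      rs.getString(1);  // touch both columns so they are actually read
      rs.getString(2);
      rows++;
    }
    System.out.println(rows + " rows in "
        + (System.currentTimeMillis() - start) / 1000.0 + " sec");
    conn.close();
  }
}

If that alone takes hours, the database (or fetching row-by-row without
streaming) is your bottleneck, not Lucene.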
24 hours is way too long to index all of Wikipedia. For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:
http://people.apache.org/~mikemccand/lucenebench/
Hi everybody,
I'm using Lucene 3.6 to index Wikipedia, which is over 3 million articles.
The data is in a MySQL database and it is taking more than 24 hours so far.
Do you know any tips that can speed up the indexing process?
Here is my code:
public static void main(String[] args) {
This shouldn't normally happen, even on crash, kill -9, power loss, etc.
It can only mean either there is a bug in Lucene, or there's something
wrong with your hardware/IO system, or the fsync operation doesn't
actually work on the IO system.
You can run CheckIndex to see what's broken (then, add -fix if you want it
to drop the broken segments, but any documents in them are lost).
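From Java it looks roughly like this (the same thing is available from the
command line as java org.apache.lucene.index.CheckIndex /path/to/index):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class CheckMyIndex {
  public static void main(String[] args) throws Exception {
    CheckIndex checker = new CheckIndex(
        FSDirectory.open(new File("/path/to/index")));
    checker.setInfoStream(System.out);  // print per-segment details
    CheckIndex.Status status = checker.checkIndex();
    if (!status.clean) {
      // fixIndex drops any broken segments -- their documents are LOST
      checker.fixIndex(status);
    }
  }
}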
Hello everyone,
I am having a problem with a Lucene store. When starting an
IndexWriter on it, it throws the following exception:
Caused by: java.io.IOException: read past EOF:
MMapIndexInput(path="/path/to/index/_drs.cfs")
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte
Based on this link http://www2002.org/CDROM/refereed/643/node6.html, I'm
calculating Okapi similarity between the query document and another
document as below using Lucene.
I have indexed the documents using 3 fields. I want to give higher weight
to field 2 and field 3, and I can't use Lucene's boost.
I found this is the correct way of calculating the average document length
of a document having three fields:
byte[] normsDocLengthArrField1 = indexReader.norms("field1");
byte[] normsDocLengthArrField2 = indexReader.norms("field2");
byte[] normsDocLengthArrField3 = indexReader.norms("field3");
double s
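For reference, one way to turn those norm bytes back into approximate
lengths -- this is a sketch that assumes the default length norm
(1/sqrt(numTerms)) and no index-time boosts; the byte encoding is lossy,
so the lengths are approximate:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class FieldLengths {
  // Approximate average length (in terms) of one field across the index.
  public static double averageFieldLength(IndexReader reader, String field)
      throws Exception {
    byte[] norms = reader.norms(field);
    if (norms == null) return 0;  // field omits norms
    double total = 0;
    int counted = 0;
    for (byte b : norms) {
      float norm = Similarity.decodeNorm(b);
      if (norm > 0) {  // 0 means an empty field for that doc
        total += 1.0 / (norm * norm);  // invert 1/sqrt(len) -> len
        counted++;
      }
    }
    return counted == 0 ? 0 : total / counted;
  }
}

You can call this once per field and then combine the three per-field
averages according to whatever weighting your scheme gives each field.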