Re: Indexing Wikipedia dumps

2008-01-02 Thread Grant Ingersoll
I put up a patch and would appreciate testing/feedback. It's not perfect, but it handles most things, I think. -Grant On Dec 28, 2007, at 12:19 PM, Grant Ingersoll wrote: See https://issues.apache.org/jira/browse/LUCENE-1103 On Dec 18, 2007, at 1:31 PM, Marcelo Ochoa wrote: Hi All: Just ...

Re: Indexing Wikipedia dumps

2007-12-28 Thread Grant Ingersoll
See https://issues.apache.org/jira/browse/LUCENE-1103 On Dec 18, 2007, at 1:31 PM, Marcelo Ochoa wrote: Hi All: Just to add a simple hack, I posted on my blog an entry named "Uploading WikiPedia Dumps to Oracle databases": http://marceloochoa.blogspot.com/2007_12_01_archive.html with instructions ...

Re: Indexing Wikipedia dumps

2007-12-18 Thread Marcelo Ochoa
Hi All: Just to add a simple hack, I posted on my blog an entry named "Uploading WikiPedia Dumps to Oracle databases": http://marceloochoa.blogspot.com/2007_12_01_archive.html with instructions for uploading Wikipedia dumps to Oracle XML DB, which means transforming an XML file into an object-relational ...

Re: Indexing Wikipedia dumps

2007-12-14 Thread Dawid Weiss
Good pointers, thanks. I asked because I did have a problem like this a few months ago -- none of the existing parsers solved it for me (back then). D. Petite Abeille wrote: On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote: Just incidentally -- do you know of something that would parse the wikipedia markup (to plain text, for example)? ...

Re: Indexing Wikipedia dumps

2007-12-13 Thread Petite Abeille
On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote: Just incidentally -- do you know of something that would parse the wikipedia markup (to plain text, for example)? If you find out, let us know :) You may want to check the partial ANTLR grammar for Wikitext: http://www.mediawiki.org/wiki/User ...
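
Lacking a full grammar, a crude regex pass gets most articles to rough plain text for indexing. Below is a minimal sketch in Java; the class name and patterns are my own illustration (not taken from the ANTLR grammar above) and deliberately ignore tables, <ref> contents, and other corner cases of MediaWiki syntax.

import java.util.regex.Pattern;

public final class WikitextStripper {

    // Innermost {{templates}} only; the loop below peels nested ones.
    private static final Pattern TEMPLATE   = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    private static final Pattern PIPED_LINK = Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]]*)\\]\\]");
    private static final Pattern EXT_LINK   = Pattern.compile("\\[http[^\\]\\s]*\\s?([^\\]]*)\\]");
    private static final Pattern HTML_TAG   = Pattern.compile("<[^>]+>");
    private static final Pattern EMPHASIS   = Pattern.compile("'{2,}");
    private static final Pattern HEADING    = Pattern.compile("={2,}");

    public static String strip(String wikitext) {
        String s = wikitext, prev;
        do {                                            // templates nest: repeat until stable
            prev = s;
            s = TEMPLATE.matcher(s).replaceAll("");
        } while (!s.equals(prev));
        s = PIPED_LINK.matcher(s).replaceAll("$1");     // [[target|label]] -> label
        s = PLAIN_LINK.matcher(s).replaceAll("$1");     // [[target]]       -> target
        s = EXT_LINK.matcher(s).replaceAll("$1");       // [url label]      -> label
        s = HTML_TAG.matcher(s).replaceAll(" ");        // <ref>, <br/>, ...
        s = EMPHASIS.matcher(s).replaceAll("");         // '''bold''', ''italic''
        s = HEADING.matcher(s).replaceAll(" ");         // == Section ==
        return s;
    }
}

For search indexing this kind of lossy stripping is often good enough, since the analyzer discards most residual punctuation anyway.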

Re: Indexing Wikipedia dumps

2007-12-12 Thread Dawid Weiss
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. Just incidentally -- do you know of something that would parse the wikipedia markup (to plain text, for example)? D.

Re: Indexing Wikipedia dumps

2007-12-12 Thread Andy Goodell
My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (English and non-English) Wikipedia XML dumps into Lucene-style "documents and fields." We use Wikipedia to test our language-specific code, so we've probably indexed 20 Wikipedia dumps. - andy g On Dec 11, 2007 9:35 PM, Otis ...
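
The approach is easy to sketch. Assuming the MediaWiki export schema (each article is a <page> holding a <title> and a <revision> with <text>) and the Lucene 2.x Field API, a stripped-down version of such a StAX loop might look like this; class and field names are mine, and error handling is omitted:

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class WikipediaStaxIndexer {

    public static void index(InputStream in, IndexWriter writer) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder buf = new StringBuilder();
        String title = null, body = null;
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    buf.setLength(0);                   // start collecting leaf text
                    break;
                case XMLStreamConstants.CHARACTERS:
                    buf.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = r.getLocalName();
                    if ("title".equals(name)) {
                        title = buf.toString();
                    } else if ("text".equals(name)) {
                        body = buf.toString();
                    } else if ("page".equals(name) && title != null && body != null) {
                        Document doc = new Document(); // one Lucene doc per article
                        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
                        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
                        writer.addDocument(doc);
                        title = null;
                        body = null;
                    }
                    break;
            }
        }
        r.close();
    }
}

Because StAX pulls events off the stream, memory stays flat no matter how large the dump is, which is the main reason to prefer it over DOM for multi-gigabyte files.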

Re: Indexing Wikipedia dumps

2007-12-12 Thread Karl Wettin
On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote: I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want ...

RE: Indexing Wikipedia dumps

2007-12-12 Thread Steven Parkes
...ed an acceptable analyzer, which StandardAnalyzer might not be. -Original Message- From: Michael McCandless [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 12, 2007 2:29 AM To: java-user@lucene.apache.org Subject: Re: Indexing Wikipedia dumps I haven't actually tried it, but I ...
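
For the archive, here is roughly what that swap might look like against the Lucene 2.x API. This is only a sketch: SnowballAnalyzer ships in contrib/snowball, and both the "German" stemmer name and the index path are placeholders for whichever dump is being indexed.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OpenNonEnglishWriter {
    public static void main(String[] args) throws Exception {
        // A stemming analyzer for the dump's language instead of StandardAnalyzer.
        Analyzer analyzer = new SnowballAnalyzer("German");
        IndexWriter writer = new IndexWriter("/tmp/dewiki-index", analyzer, true);
        // ... feed parsed pages to writer.addDocument(...) here ...
        writer.close();
    }
}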

Re: Indexing Wikipedia dumps

2007-12-12 Thread Grant Ingersoll
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. -Grant On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote: I haven't actually tried it, but I think very likely the current ...

Re: Indexing Wikipedia dumps

2007-12-12 Thread Michael McCandless
I haven't actually tried it, but I think very likely the current code in contrib/benchmark might be able to extract a non-English Wikipedia dump as well? Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you just change the docs.file to reference your downloaded XML file ...
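
The edit in question is a single property. A hypothetical excerpt (only the docs.file property name comes from the message above; the path is a placeholder for your own download):

docs.file=temp/dewiki-latest-pages-articles.xml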

Re: Indexing Wikipedia dumps

2007-12-12 Thread mark harwood
Subject: Re: Indexing Wikipedia dumps Database? I imagine I can avoid that: Wiki dump.gz -> gunzip -> parse -> index, no? Otis - Original Message From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, December 12, 2007 1:55:02 AM Subject: Re: Indexing Wikipedia dumps ...

Re: Indexing Wikipedia dumps

2007-12-12 Thread Otis Gospodnetic
Database? I imagine I can avoid that: Wiki dump.gz -> gunzip -> parse -> index, no? Otis - Original Message From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, December 12, 2007 1:55:02 AM Subject: Re: Indexing Wikipedia dumps For ...
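
That pipeline needs no database and no intermediate file: in Java, the gunzip step is just a stream wrapper around the dump. A sketch, where the file name is a placeholder and the parsing step stands in for whatever XML reader feeds Lucene (e.g. the StAX approach mentioned elsewhere in this thread):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class StreamDump {
    public static void main(String[] args) throws Exception {
        // gunzip on the fly: dump.gz -> XML stream, no extracted copy on disk
        InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream("pages-articles.xml.gz")));
        try {
            // parse -> index: hand `in` to the XML parser that feeds Lucene
        } finally {
            in.close();
        }
    }
}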

Re: Indexing Wikipedia dumps

2007-12-11 Thread Chris Lu
For a quick Java approach, give yourself 3 minutes and try DBSight to access the database. You can simply use "select * from mw_searchindex" as a starting point. It'll build the index for you. However, you may need to plug in your custom analyzer for MediaWiki's format (or maybe not). -- Chris ...

Re: Indexing Wikipedia dumps

2007-12-11 Thread Matt Kangas
Otis, if you're willing to use some non-Java code for your task... 1) Wikipedia uses Lucene for its full-text search, and the module is part of MediaWiki. You could use this as follows: - Install MediaWiki - Load your Wikipedia dump into MW (and MySQL) - Build a search index for the Lucene ...