I put up a patch and would appreciate testing/feedback. It's not
perfect, but it handles most things, I think.
-Grant
On Dec 28, 2007, at 12:19 PM, Grant Ingersoll wrote:
See https://issues.apache.org/jira/browse/LUCENE-1103
On Dec 18, 2007, at 1:31 PM, Marcelo Ochoa wrote:
Hi All:
Just to add simple hack, I had posted at my Blog an entry named
"Uploading WikiPedia Dumps to Oracle databases":
http://marceloochoa.blogspot.com/2007_12_01_archive.html
with instructions to upload WikiPedia Dumps to Oracle XMLDB, which
means transforming an XML file to an object-relational structure.
Good pointers, thanks. I asked because I did have a problem like this a few
months ago -- none of the existing parsers solved it for me (back then).
D.
Petite Abeille wrote:
On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:
Just incidentally -- do you know of something that would parse the
wikipedia markup (to plain text, for example)?
If you find out, let us know :)
You may want to check the partial ANTLR grammar for Wikitext:
http://www.mediawiki.org/wiki/User
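If a full grammar is overkill, a crude regex-based stripper gets surprisingly far for "plain text" extraction. This is only a sketch (the class and method names are made up, and real Wikitext has nesting, tables, and refs that a few regexes cannot handle correctly):

```java
import java.util.regex.Pattern;

public class WikiStrip {
    // Order matters: templates first, then piped links, then plain links, then quote markup.
    private static final Pattern TEMPLATE   = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    private static final Pattern PIPED_LINK = Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]]*)\\]\\]");
    private static final Pattern QUOTES     = Pattern.compile("'{2,}");

    public static String strip(String wiki) {
        String s = wiki;
        // Remove {{templates}} repeatedly so one level of nesting also disappears.
        String prev;
        do {
            prev = s;
            s = TEMPLATE.matcher(s).replaceAll("");
        } while (!s.equals(prev));
        s = PIPED_LINK.matcher(s).replaceAll("$1"); // [[target|label]] -> label
        s = PLAIN_LINK.matcher(s).replaceAll("$1"); // [[target]] -> target
        s = QUOTES.matcher(s).replaceAll("");       // '''bold''' / ''italic'' -> plain
        return s;
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "'''Lucene''' is a [[search engine|search]] library{{citation needed}}."));
    }
}
```

Good enough for feeding an analyzer during testing, but for anything serious the grammar-based route above is the right one.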
Just incidentally -- do you know of something that would parse the wikipedia
markup (to plain text, for example)?
D.
My firm uses a parser based on javax.xml.stream.XMLStreamReader to
break (English and non-English) Wikipedia XML dumps into Lucene-style
"documents and fields." We use Wikipedia to test our
language-specific code, so we've probably indexed 20 Wikipedia dumps.
- andy g
On Dec 11, 2007 9:35 PM, Otis Gospodnetic wrote:
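A minimal sketch of that kind of StAX loop, assuming the standard dump layout (each <page> holds a <title> and a <revision><text>); the class name and the choice to collect title/text pairs are illustrative, not the actual parser described above. Each pair would then become a Lucene Document with a title Field and a body Field:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class DumpReader {

    /** Returns {title, text} pairs for every page in a MediaWiki XML dump. */
    public static List<String[]> pages(InputStream in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String[]> docs = new ArrayList<String[]>();
        String title = null;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                if ("title".equals(r.getLocalName())) {
                    title = r.getElementText();   // remember until we reach <text>
                } else if ("text".equals(r.getLocalName())) {
                    docs.add(new String[] { title, r.getElementText() });
                }
            }
        }
        r.close();
        return docs;
    }
}
```

Because StAX is a pull parser, this streams the dump and never loads the whole file, which matters since the uncompressed dumps run to gigabytes.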
On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote:
I need to index a Wikipedia dump. I know there is code in contrib/
benchmark for indexing *English* Wikipedia for benchmarking
purposes. However, I'd like to index a non-English dump, and I
actually don't need it for benchmarking, I just want to index it,
provided an acceptable analyzer, which StandardAnalyzer might not be.
-----Original Message-----
From: Michael McCandless [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 12, 2007 2:29 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing Wikipedia dumps
I haven't actually tried it, but I think very likely the current code
in contrib/benchmark might be able to extract non-English Wikipedia
dumps as well?
Note that the current code doesn't actually do anything with the wiki
syntax, but I would think as long as the other language is in the same
format you should be fine.
-Grant
On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote:
I haven't actually tried it, but I think very likely the current code
in contrib/benchmark might be able to extract non-English Wikipedia
dumps as well?
I haven't actually tried it, but I think very likely the current code
in contrib/benchmark might be able to extract non-English Wikipedia
dumps as well?
Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think
if you just change docs.file to reference your downloaded XML file.
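Concretely, that would be a one-line edit to the .alg file; the dump path below is just an example, and the rest of the file stays as shipped:

```
# contrib/benchmark/conf/extractWikipedia.alg (excerpt)
# Point docs.file at your downloaded dump:
docs.file=temp/frwiki-latest-pages-articles.xml
```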
Database? I imagine I can avoid that Wiki dump.gz -> gunzip -> parse ->
index no?
Otis
----- Original Message -----
From: Chris Lu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 1:55:02 AM
Subject: Re: Indexing Wikipedia dumps
For a quick java approach, give yourself 3 minutes and try to use
DBSight to access the database. You can simply use "select * from
mw_searchindex" as a starting point. It'll build the index for you.
However, you may need to plug in your custom analyzer for MediaWiki's
format (or maybe not).
--
Chris Lu
Otis, if you're willing to use some non-Java code for your task...
1) Wikipedia uses Lucene for their full-text searches, and the module
is part of Mediawiki. You could use this as follows:
- Install Mediawiki
- Load your Wikipedia dump into MW (and MySQL)
- Build a search index for the Lucene search module