Re: Indexing Wikipedia dumps

2007-12-12 Thread Dawid Weiss
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. Just incidentally -- do you know of something that would parse the wikipedia markup (to plain text, for example)? D.

Re: Searching on plurals and phrases in a single field

2007-12-12 Thread Lucifer Hammer
Hi Erick, Thanks for the great idea, it's exactly the kind of suggestion I was looking for! Lucifer On Dec 12, 2007 2:34 PM, Erick Erickson <[EMAIL PROTECTED]> wrote: > I faced a very similar requirement and solved it by indexing multiple > tokens at the same place. For instance, say you're ind

Re: Boost One Term Query

2007-12-12 Thread Jens Grivolla
Erick Erickson wrote: I don't believe you can compare scores across queries in any meaningful way. I actually investigated this to some degree in my thesis, comparing different participating systems from the TREC campaigns. It turns out that some systems' scores (e.g. the top scores for a gi

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
Michael McCandless wrote: Ruslan Sivak wrote: This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searc

Re: Handling Indexed, Stored and Tokenized fields

2007-12-12 Thread Doron Cohen
Seems that PerFieldAnalyzerWrapper would be convenient here? Doron On Dec 12, 2007 10:41 PM, ts01 <[EMAIL PROTECTED]> wrote: > > Hi, > > We have a requirement to index as well as store multiple fields in a > document, each with its own special tokenizer. The following seems to > provide a way to
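[Editor's note] Doron's suggestion might come out roughly like the sketch below, written against the Lucene 2.x API. The index path and the field names "sku" and "tags" are invented for illustration:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // Fall back to StandardAnalyzer for any field not listed below.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("sku", new KeywordAnalyzer());     // whole value as one token
        wrapper.addAnalyzer("tags", new WhitespaceAnalyzer()); // split on spaces only

        IndexWriter writer = new IndexWriter("/path/to/index", wrapper, true);
        // writer.addDocument(...) -- each field is analyzed by its own tokenizer
        writer.close();
    }
}
```

One wrapper, one IndexWriter: the per-field dispatch happens inside the analyzer, so the rest of the indexing code does not change.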

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
Ruslan Sivak wrote: This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searcher.close(); reader.close(

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searcher.close(); reader.close(); Iterator i=hits.itera

Re: Accessing parsed content in Nutch

2007-12-12 Thread Doron Cohen
You would probably get a better and quicker answer on the Nutch mailing lists: http://lucene.apache.org/nutch/mailing_lists.html Doron On Dec 12, 2007 11:16 PM, Developer Developer <[EMAIL PROTECTED]> wrote: > I believe nutch stores parsed content somewhere. Can you please let me > know > how I can

Accessing parsed content in Nutch

2007-12-12 Thread Developer Developer
I believe Nutch stores parsed content somewhere. Can you please let me know how I can access the parsed content given a URL? Thanks!

Handling Indexed, Stored and Tokenized fields

2007-12-12 Thread ts01
Hi, We have a requirement to index as well as store multiple fields in a document, each with its own special tokenizer. The following seems to provide a way to index multiple fields each with its own tokenizer: Field(String name, Reader reader) The following seems to provide a way to Index and

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
You need to keep a reader open so long as you plan to use any of its methods from any thread. The reader does close exactly when you ask it to (when you call reader.close()). You should not have to "open a new reader for every method call" -- you only need to open a new reader (and in y
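[Editor's note] Michael's rule of thumb might be sketched like this against the Lucene 2.x API (index path and field name are placeholders). The key point is that Hits lazily re-reads documents from the underlying IndexReader, so everything must be consumed before anything is closed:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchOnce {
    static void search(Query query) throws IOException {
        IndexReader reader = IndexReader.open("/path/to/index");
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            Hits hits = searcher.search(query);
            // Consume every hit you need BEFORE closing anything.
            for (int i = 0; i < hits.length(); i++) {
                String title = hits.doc(i).get("title"); // safe: reader still open
            }
        } finally {
            searcher.close();
            reader.close(); // only now is it safe to close
        }
    }
}
```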

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
Thank you to everyone for your comments. I didn't realize that readers need to be kept open for as long as their results are in use. I have restructured my code to keep the RAMDirectory cached, and to open a new reader for every method call. This seems to be working fine. Russ Erick

extract info after indexing

2007-12-12 Thread spk spk
Hi All, I would like to extract some information about a given word in a field. Below is the info I would like to have: 1. the frequency count of that word 2. the word after it has been analyzed... Any chance I can use Lucene to do that? spking
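[Editor's note] Both pieces of information are reachable through the Lucene 2.x API; a rough sketch, with the index path, field name, and sample word all invented:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class WordInfo {
    public static void main(String[] args) throws Exception {
        // 1. In how many documents does the word occur?
        IndexReader reader = IndexReader.open("/path/to/index");
        int df = reader.docFreq(new Term("contents", "fox"));
        System.out.println("document frequency: " + df);
        reader.close();

        // 2. What does the word look like after analysis?
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("contents", new StringReader("Foxes"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println("analyzed form: " + t.termText());
        }
    }
}
```

Note that docFreq() counts documents containing the term; if you need total occurrences, you would sum freq() over reader.termDocs(term) instead.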

Re: Searching on plurals and phrases in a single field

2007-12-12 Thread Erick Erickson
I faced a very similar requirement and solved it by indexing multiple tokens at the same place. For instance, say you're indexing the word "foxes". Index something like fox$ and foxes at the same position (see SynonymAnalyzer in Lucene In Action for an example). You probably MUST index the multiple
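[Editor's note] Erick's same-position trick usually comes down to setPositionIncrement(0) on the extra token. A sketch against the Lucene 2.x TokenFilter API -- the "$" marker follows his example, and the stemmer here is a toy stand-in for a real one:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits the stemmed form on top of the original token, so that e.g.
// "foxes" and "fox$" occupy the same position in the index.
public class StackedStemFilter extends TokenFilter {
    private Token pending; // stemmed twin waiting to be emitted

    public StackedStemFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String stemmed = stem(t.termText());
        if (!stemmed.equals(t.termText())) {
            pending = new Token(stemmed + "$", t.startOffset(), t.endOffset());
            pending.setPositionIncrement(0); // same position as the original
        }
        return t;
    }

    private String stem(String s) { // toy stand-in for a real stemmer
        return s.endsWith("es") ? s.substring(0, s.length() - 2) : s;
    }
}
```

Exact-phrase queries then match the original tokens, while plural-insensitive queries target the "$"-marked stems, all within one field.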

Searching on plurals and phrases in a single field

2007-12-12 Thread Lucifer Hammer
Hi, We've got a requirement that we need to give our users the ability to search on exact phrases within a field, or, if they prefer, they can match on plurals (either via stems, or another plural algorithm). However, the cases are mutually exclusive; for example, given the following field in the

Re: Indexing Wikipedia dumps

2007-12-12 Thread Andy Goodell
My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (english and nonenglish) wikipedia xml dumps into lucene-style "documents and fields." We use wikipedia to test our language-specific code, so we've probably indexed 20 wikipedia dumps. - andy g On Dec 11, 2007 9:35 PM, Oti
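[Editor's note] A stripped-down version of such a StAX loop, using only the JDK. The element names follow the MediaWiki dump schema; everything else is illustrative:

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WikiDumpReader {
    // Collects a (title, text) map per <page> element of a MediaWiki dump.
    public static List<Map<String, String>> readPages(Reader in) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Map<String, String>> pages = new ArrayList<Map<String, String>>();
        Map<String, String> page = null;
        String field = null; // "title" or "text" while inside that element
        StringBuilder buf = new StringBuilder();
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if ("page".equals(name)) {
                    page = new LinkedHashMap<String, String>();
                } else if ("title".equals(name) || "text".equals(name)) {
                    field = name;
                    buf.setLength(0);
                }
            } else if (event == XMLStreamConstants.CHARACTERS && field != null) {
                buf.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals(field)) {
                    page.put(field, buf.toString());
                    field = null;
                } else if ("page".equals(name)) {
                    pages.add(page);
                    page = null;
                }
            }
        }
        return pages;
    }
}
```

Each returned map would then become one Lucene Document with a field per entry. Since StAX is streaming, this works on multi-gigabyte dumps without loading them into memory (for a full dump you would feed pages to the index writer as they appear rather than collecting them in a list).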

Re: Advice regarding fuzzy phrase searching

2007-12-12 Thread Jose Luna
Mark, Russ, thanks for the replies. Mark, this looks great, I think it's exactly what I was looking for. I think this should definitely be added to Lucene when it is stable enough. I suspect there are others that would find it useful. JLuna Mark Miller wrote: Take a look at: https://issue

Re: Indexing Wikipedia dumps

2007-12-12 Thread Karl Wettin
On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote: I need to index a Wikipedia dump. I know there is code in contrib/ benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want

Re: Refreshing RAMDirectory

2007-12-12 Thread Erick Erickson
Even if you could tell a reader is closed, you'd wind up with unmaintainable code. I envision you have a bunch of places where you'd do something like if (reader.isClosed()) { reader = create a new reader. } But practically, you'd be opening a new reader someplace, closing it someplace else,

RE: Indexing Wikipedia dumps

2007-12-12 Thread Steven Parkes
Probably want a combination of extractWikipedia.alg and wikipedia.alg? You want the EnwikiDocMaker from extractWikipedia.alg which reads the uncompressed xml file but rather than using WriteLineDoc, you want to go ahead and index as wikipedia.alg does. (Ditch the query part.) You'll need an accep
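[Editor's note] Steven's recipe might come out roughly like the following .alg fragment, pieced together from conf/extractWikipedia.alg and conf/wikipedia.alg (untested; the docs.file value is a placeholder, and the exact task names should be checked against the contrib/benchmark documentation):

```
doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker
docs.file=temp/frwiki-latest-pages-articles.xml

ResetSystemErase
CreateIndex
{ AddDoc } : *
CloseIndex
```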

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote: > I've now made trial runs with no norms on the two indexed fields, and > also tried with varying TermIndexIntervals. Omitting the norms saves > about 4MB on 50 million entries, much less than I expected. Seems there's a reason we still use
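[Editor's note] For reference, the two knobs Lars mentions are set like this in the Lucene 2.x API (index path, field name, and value are invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LeanIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // Keep every 512th term in RAM instead of every 128th (the default):
        // less memory for the in-memory term index, slightly slower term lookups.
        writer.setTermIndexInterval(512);

        Document doc = new Document();
        Field url = new Field("url", "http://example.org/",
                              Field.Store.YES, Field.Index.UN_TOKENIZED);
        url.setOmitNorms(true); // drops the one-byte-per-document norm for this field
        doc.add(url);
        writer.addDocument(doc);
        writer.close();
    }
}
```

Note that setTermIndexInterval() only affects the in-memory term index built when a reader opens the segments written by this writer, which is why it must be set at indexing time.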

Re: Indexing Wikipedia dumps

2007-12-12 Thread Grant Ingersoll
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. -Grant On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote: I haven't actually tried it, but I think very likely the cu

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote: > Increasing > the TermIndexInterval by a factor of 4 gave no measurable savings. Following up on myself because I'm not 100% sure that the indexes have the term index intervals I expect, and I'd like to check. Where can I see what term ind

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Tue, 2007-11-13 at 07:26 -0800, Chris Hostetter wrote: > : > Can it be right that memory usage depends on size of the index rather > : > than size of the result? > : > : Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to > : the JVM now? > > and in general: yes. Luc

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
Ruslan Sivak wrote: Michael McCandless wrote: Ruslan Sivak wrote: I have an index of about 10mb. Since it's so small, I would like to keep it loaded in memory, and reload it about every minute or so, assuming that it has changed on disk. I have the following code, which works, except
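[Editor's note] One hedged sketch of the reload-if-changed pattern under discussion, using IndexReader.getCurrentVersion() to detect on-disk changes (Lucene 2.x API; path handling and error handling are simplified):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.RAMDirectory;

public class CachedRamIndex {
    private final String diskPath;
    private RAMDirectory ramDir;
    private long loadedVersion = -1;

    public CachedRamIndex(String diskPath) {
        this.diskPath = diskPath;
    }

    // Copy the on-disk index into RAM only when its version has changed.
    public synchronized IndexReader getReader() throws IOException {
        long current = IndexReader.getCurrentVersion(diskPath);
        if (ramDir == null || current != loadedVersion) {
            ramDir = new RAMDirectory(diskPath); // snapshot of the disk index
            loadedVersion = current;
        }
        return IndexReader.open(ramDir); // caller closes when done with it
    }
}
```

Each caller gets its own reader over the shared RAMDirectory and closes it when finished, which matches the "open a new reader per method call" approach Ruslan settled on elsewhere in this thread.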

Re: Indexing Wikipedia dumps

2007-12-12 Thread Michael McCandless
I haven't actually tried it, but I think very likely the current code in contrib/benchmark might be able to extract non-English Wikipedia dump as well? Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you just change the docs.file to reference your downloaded XML f

Re: Indexing Wikipedia dumps

2007-12-12 Thread mark harwood
Otis, I've used this to index wikipedia from XML before now: http://schmidt.devlib.org/software/lucene-wikipedia.html Cheers Mark - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 12 December, 2007 8:18:49 AM Subject: Re: Inde

Basic Named Entity Indexing

2007-12-12 Thread chris.b
I'm not even sure if it can be considered Named Entity Recognition, but what the hell... so here's my problem: I was asked to retrieve the named entities from a collection of documents, and I've thought of two ways of doing so (not sure if either of them works)... a) index the documents by w

(~) operator query....

2007-12-12 Thread Shakti_Sareen
Hi All, I am parsing this query: "Auto* machine"~4. Will it work? If it should, it currently isn't working for me. Can anyone help with this? Thanks & Regards Shakti Sareen
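[Editor's note] Lucene's standard QueryParser does not expand wildcards inside quoted phrases, so "Auto* machine"~4 is unlikely to do what is intended. A hand-rolled alternative with MultiPhraseQuery might look like this sketch (Lucene 2.x API; the field name is invented):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

public class WildcardPhrase {
    // "auto* machine"~4, built by hand: expand the prefix ourselves.
    static MultiPhraseQuery build(IndexReader reader) throws IOException {
        MultiPhraseQuery q = new MultiPhraseQuery();
        q.add(expand(reader, "contents", "auto")); // all terms starting with "auto"
        q.add(new Term[] { new Term("contents", "machine") });
        q.setSlop(4);
        return q;
    }

    // Enumerate every indexed term in `field` that starts with `prefix`.
    static Term[] expand(IndexReader reader, String field, String prefix)
            throws IOException {
        List<Term> out = new ArrayList<Term>();
        TermEnum te = reader.terms(new Term(field, prefix));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals(field)
                        || !t.text().startsWith(prefix)) break;
                out.add(t);
            } while (te.next());
        } finally {
            te.close();
        }
        return out.toArray(new Term[out.size()]);
    }
}
```

Beware that a very common prefix can expand into thousands of terms, so this is best used with reasonably selective prefixes.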

Re: Indexing Wikipedia dumps

2007-12-12 Thread Otis Gospodnetic
Database? I imagine I can avoid that: Wiki dump.gz -> gunzip -> parse -> index, no? Otis - Original Message From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, December 12, 2007 1:55:02 AM Subject: Re: Indexing Wikipedia dumps For a quick java approa