Re: Indexing Wikipedia dumps

2007-12-12 Thread Dawid Weiss
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. Just incidentally -- do you know of something that would parse the wikipedia markup (to plain text, for example)? D.

Re: Searching on plurals and phrases in a single field

2007-12-12 Thread Lucifer Hammer
Hi Erick, Thanks for the great idea, it's exactly the kind of suggestion I was looking for! Lucifer On Dec 12, 2007 2:34 PM, Erick Erickson <[EMAIL PROTECTED]> wrote: > I faced a very similar requirement and solved it by indexing multiple > tokens at the same place. For instance, say you're ind

Re: Boost One Term Query

2007-12-12 Thread Jens Grivolla
Erick Erickson wrote: I don't believe you can compare scores across queries in any meaningful way. I actually investigated this to some degree in my thesis, comparing different participating systems from the TREC campaigns. It turns out that some systems' scores (e.g. the top scores for a gi

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
Michael McCandless wrote: Ruslan Sivak wrote: This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searc

Re: Handling Indexed, Stored and Tokenized fields

2007-12-12 Thread Doron Cohen
Seems that PerFieldAnalyzerWrapper would be convenient here? Doron On Dec 12, 2007 10:41 PM, ts01 <[EMAIL PROTECTED]> wrote: > > Hi, > > We have a requirement to index as well as store multiple fields in a > document, each with its own special tokenizer. The following seems to > provide a way to
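[Editor's note] Doron's suggestion might come out roughly like the sketch below, written against the Lucene 2.x API. The index path and the field names "sku" and "tags" are invented for illustration:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // Fall back to StandardAnalyzer for any field not listed below.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("sku", new KeywordAnalyzer());     // whole value as one token
        wrapper.addAnalyzer("tags", new WhitespaceAnalyzer()); // split on spaces only

        IndexWriter writer = new IndexWriter("/path/to/index", wrapper, true);
        // writer.addDocument(...) -- each field is analyzed by its own tokenizer
        writer.close();
    }
}
```

One wrapper, one IndexWriter: the per-field dispatch happens inside the analyzer, so the rest of the indexing code does not change.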

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
Ruslan Sivak wrote: This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searcher.close(); reader.close(

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
This seems to be problematic though. There are other things that depend on the reader that is not so obvious. For example, IndexReader reader=getReader(); IndexSearcher searcher=new IndexSearcher(reader); Hits hits=searcher.search(query); searcher.close(); reader.close(); Iterator i=hits.itera

Re: Accessing parsed content in Nutch

2007-12-12 Thread Doron Cohen
You would probably get a better and quicker answer on the Nutch mailing lists: http://lucene.apache.org/nutch/mailing_lists.html Doron On Dec 12, 2007 11:16 PM, Developer Developer <[EMAIL PROTECTED]> wrote: > I believe nutch stores parsed content somewhere. Can you please let me > know > how I can

Accessing parsed content in Nutch

2007-12-12 Thread Developer Developer
I believe Nutch stores parsed content somewhere. Can you please let me know how I can access the parsed content given a URL? Thanks!

Handling Indexed, Stored and Tokenized fields

2007-12-12 Thread ts01
Hi, We have a requirement to index as well as store multiple fields in a document, each with its own special tokenizer. The following seems to provide a way to index multiple fields each with its own tokenizer: Field(String name, Reader reader) The following seems to provide a way to Index and

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
You need to keep a reader open so long as you plan to use any of its methods from any thread. The reader does close exactly when you ask it to (when you call reader.close()). You should not have to "open a new reader for every method call" -- you only need to open a new reader (and in y
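[Editor's note] Michael's rule of thumb might be sketched like this against the Lucene 2.x API (index path and field name are placeholders). The key point is that Hits lazily re-reads documents from the underlying IndexReader, so everything must be consumed before anything is closed:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchOnce {
    static void search(Query query) throws IOException {
        IndexReader reader = IndexReader.open("/path/to/index");
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            Hits hits = searcher.search(query);
            // Consume every hit you need BEFORE closing anything.
            for (int i = 0; i < hits.length(); i++) {
                String title = hits.doc(i).get("title"); // safe: reader still open
            }
        } finally {
            searcher.close();
            reader.close(); // only now is it safe to close
        }
    }
}
```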

Re: Refreshing RAMDirectory

2007-12-12 Thread Ruslan Sivak
Thank you to everyone for your comments. I didn't realize that readers need to be kept open for as long as their results are in use. I have restructured my code to keep the RAMDirectory cached, and to open a new reader for every method call. This seems to be working fine. Russ Erick

extract info after indexing

2007-12-12 Thread spk spk
Hi All, I would like to extract some information about a given word in a field. Below is the info I would like to have: 1. the frequency count of that word 2. the word after it has been analyzed... Any chance I can use Lucene to do that? spking
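[Editor's note] Both pieces of information are reachable through the Lucene 2.x API; a rough sketch, with the index path, field name, and sample word all invented:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class WordInfo {
    public static void main(String[] args) throws Exception {
        // 1. In how many documents does the word occur?
        IndexReader reader = IndexReader.open("/path/to/index");
        int df = reader.docFreq(new Term("contents", "fox"));
        System.out.println("document frequency: " + df);
        reader.close();

        // 2. What does the word look like after analysis?
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("contents", new StringReader("Foxes"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println("analyzed form: " + t.termText());
        }
    }
}
```

Note that docFreq() counts documents containing the term; if you need total occurrences, you would sum freq() over reader.termDocs(term) instead.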

Re: Searching on plurals and phrases in a single field

2007-12-12 Thread Erick Erickson
I faced a very similar requirement and solved it by indexing multiple tokens at the same place. For instance, say you're indexing the word "foxes". Index something like fox$ and foxes at the same position (see SynonymAnalyzer in Lucene In Action for an example). You probably MUST index the multiple
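[Editor's note] Erick's same-position trick usually comes down to setPositionIncrement(0) on the extra token. A sketch against the Lucene 2.x TokenFilter API -- the "$" marker follows his example, and the stemmer here is a toy stand-in for a real one:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits the stemmed form on top of the original token, so that e.g.
// "foxes" and "fox$" occupy the same position in the index.
public class StackedStemFilter extends TokenFilter {
    private Token pending; // stemmed twin waiting to be emitted

    public StackedStemFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String stemmed = stem(t.termText());
        if (!stemmed.equals(t.termText())) {
            pending = new Token(stemmed + "$", t.startOffset(), t.endOffset());
            pending.setPositionIncrement(0); // same position as the original
        }
        return t;
    }

    private String stem(String s) { // toy stand-in for a real stemmer
        return s.endsWith("es") ? s.substring(0, s.length() - 2) : s;
    }
}
```

Exact-phrase queries then match the original tokens, while plural-insensitive queries target the "$"-marked stems, all within one field.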

Searching on plurals and phrases in a single field

2007-12-12 Thread Lucifer Hammer
Hi, We've got a requirement that we need to give our users the ability to search on exact phrases within a field, or, if they prefer, they can match on plurals (either via stems, or another plural algorithm). However, the cases are mutually exclusive; for example, given the following field in the

Re: Indexing Wikipedia dumps

2007-12-12 Thread Andy Goodell
My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (english and nonenglish) wikipedia xml dumps into lucene-style "documents and fields." We use wikipedia to test our language-specific code, so we've probably indexed 20 wikipedia dumps. - andy g On Dec 11, 2007 9:35 PM, Oti
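[Editor's note] A stripped-down version of such a StAX loop, using only the JDK. The element names follow the MediaWiki dump schema; everything else is illustrative:

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WikiDumpReader {
    // Collects a (title, text) map per <page> element of a MediaWiki dump.
    public static List<Map<String, String>> readPages(Reader in) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Map<String, String>> pages = new ArrayList<Map<String, String>>();
        Map<String, String> page = null;
        String field = null; // "title" or "text" while inside that element
        StringBuilder buf = new StringBuilder();
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if ("page".equals(name)) {
                    page = new LinkedHashMap<String, String>();
                } else if ("title".equals(name) || "text".equals(name)) {
                    field = name;
                    buf.setLength(0);
                }
            } else if (event == XMLStreamConstants.CHARACTERS && field != null) {
                buf.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals(field)) {
                    page.put(field, buf.toString());
                    field = null;
                } else if ("page".equals(name)) {
                    pages.add(page);
                    page = null;
                }
            }
        }
        return pages;
    }
}
```

Each returned map would then become one Lucene Document with a field per entry. Since StAX is streaming, this works on multi-gigabyte dumps without loading them into memory (for a full dump you would feed pages to the index writer as they appear rather than collecting them in a list).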

Re: Advice regarding fuzzy phrase searching

2007-12-12 Thread Jose Luna
Mark, Russ, thanks for the replies. Mark, this looks great, I think it's exactly what I was looking for. I think this should definitely be added to Lucene when it is stable enough. I suspect there are others that would find it useful. JLuna Mark Miller wrote: Take a look at: https://issue

Re: Indexing Wikipedia dumps

2007-12-12 Thread Karl Wettin
On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote: I need to index a Wikipedia dump. I know there is code in contrib/ benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want

Re: Refreshing RAMDirectory

2007-12-12 Thread Erick Erickson
Even if you could tell a reader is closed, you'd wind up with unmaintainable code. I envision you have a bunch of places where you'd do something like if (reader.isClosed()) { reader = create a new reader. } But practically, you'd be opening a new reader someplace, closing it someplace else,

RE: Indexing Wikipedia dumps

2007-12-12 Thread Steven Parkes
Probably want a combination of extractWikipedia.alg and wikipedia.alg? You want the EnwikiDocMaker from extractWikipedia.alg which reads the uncompressed xml file but rather than using WriteLineDoc, you want to go ahead and index as wikipedia.alg does. (Ditch the query part.) You'll need an accep
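[Editor's note] Steven's recipe might come out roughly like the following .alg fragment, pieced together from conf/extractWikipedia.alg and conf/wikipedia.alg (untested; the docs.file value is a placeholder, and the exact task names should be checked against the contrib/benchmark documentation):

```
doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker
docs.file=temp/frwiki-latest-pages-articles.xml

ResetSystemErase
CreateIndex
{ AddDoc } : *
CloseIndex
```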

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote: > I've now made trial runs with no norms on the two indexed fields, and > also tried with varying TermIndexIntervals. Omitting the norms saves > about 4MB on 50 million entries, much less than I expected. Seems there's a reason we still use
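[Editor's note] For reference, the two knobs Lars mentions are set like this in the Lucene 2.x API (index path, field name, and value are invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LeanIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // Keep every 512th term in RAM instead of every 128th (the default):
        // less memory for the in-memory term index, slightly slower term lookups.
        writer.setTermIndexInterval(512);

        Document doc = new Document();
        Field url = new Field("url", "http://example.org/",
                              Field.Store.YES, Field.Index.UN_TOKENIZED);
        url.setOmitNorms(true); // drops the one-byte-per-document norm for this field
        doc.add(url);
        writer.addDocument(doc);
        writer.close();
    }
}
```

Note that setTermIndexInterval() only affects the in-memory term index built when a reader opens the segments written by this writer, which is why it must be set at indexing time.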

Re: Indexing Wikipedia dumps

2007-12-12 Thread Grant Ingersoll
Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine. -Grant On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote: I haven't actually tried it, but I think very likely the cu

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote: > Increasing > the TermIndexInterval by a factor of 4 gave no measurable savings. Following up on myself because I'm not 100% sure that the indexes have the term index intervals I expect, and I'd like to check. Where can I see what term ind

Re: OutOfMemoryError on small search in large, simple index

2007-12-12 Thread Lars Clausen
On Tue, 2007-11-13 at 07:26 -0800, Chris Hostetter wrote: > : > Can it be right that memory usage depends on size of the index rather > : > than size of the result? > : > : Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to > : the JVM now? > > and in general: yes. Luc

Re: Refreshing RAMDirectory

2007-12-12 Thread Michael McCandless
Ruslan Sivak wrote: Michael McCandless wrote: Ruslan Sivak wrote: I have an index of about 10mb. Since it's so small, I would like to keep it loaded in memory, and reload it about every minute or so, assuming that it has changed on disk. I have the following code, which works, except
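[Editor's note] One hedged sketch of the reload-if-changed pattern under discussion, using IndexReader.getCurrentVersion() to detect on-disk changes (Lucene 2.x API; path handling and error handling are simplified):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.RAMDirectory;

public class CachedRamIndex {
    private final String diskPath;
    private RAMDirectory ramDir;
    private long loadedVersion = -1;

    public CachedRamIndex(String diskPath) {
        this.diskPath = diskPath;
    }

    // Copy the on-disk index into RAM only when its version has changed.
    public synchronized IndexReader getReader() throws IOException {
        long current = IndexReader.getCurrentVersion(diskPath);
        if (ramDir == null || current != loadedVersion) {
            ramDir = new RAMDirectory(diskPath); // snapshot of the disk index
            loadedVersion = current;
        }
        return IndexReader.open(ramDir); // caller closes when done with it
    }
}
```

Each caller gets its own reader over the shared RAMDirectory and closes it when finished, which matches the "open a new reader per method call" approach Ruslan settled on elsewhere in this thread.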

Re: Indexing Wikipedia dumps

2007-12-12 Thread Michael McCandless
I haven't actually tried it, but I think very likely the current code in contrib/benchmark might be able to extract non-English Wikipedia dump as well? Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you just change the docs.file to reference your downloaded XML f

Re: Indexing Wikipedia dumps

2007-12-12 Thread mark harwood
Otis, I've used this to index wikipedia from XML before now: http://schmidt.devlib.org/software/lucene-wikipedia.html Cheers Mark - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 12 December, 2007 8:18:49 AM Subject: Re: Inde

Basic Named Entity Indexing

2007-12-12 Thread chris.b
I'm not even sure if it can be considered Named Entity Recognition, but what the hell... so here's my problem: I was asked to retrieve the named entities from a collection of documents, and I've thought of two ways of doing so (not sure if either of them works)... a) index the documents by w

(~) operator query....

2007-12-12 Thread Shakti_Sareen
Hi All, I am parsing this query: "Auto* machine"~4. Will it work? If it should, it currently isn't working for me. Can anyone help with this? Thanks & Regards Shakti Sareen
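[Editor's note] Lucene's standard QueryParser does not expand wildcards inside quoted phrases, so "Auto* machine"~4 is unlikely to do what is intended. A hand-rolled alternative with MultiPhraseQuery might look like this sketch (Lucene 2.x API; the field name is invented):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

public class WildcardPhrase {
    // "auto* machine"~4, built by hand: expand the prefix ourselves.
    static MultiPhraseQuery build(IndexReader reader) throws IOException {
        MultiPhraseQuery q = new MultiPhraseQuery();
        q.add(expand(reader, "contents", "auto")); // all terms starting with "auto"
        q.add(new Term[] { new Term("contents", "machine") });
        q.setSlop(4);
        return q;
    }

    // Enumerate every indexed term in `field` that starts with `prefix`.
    static Term[] expand(IndexReader reader, String field, String prefix)
            throws IOException {
        List<Term> out = new ArrayList<Term>();
        TermEnum te = reader.terms(new Term(field, prefix));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals(field)
                        || !t.text().startsWith(prefix)) break;
                out.add(t);
            } while (te.next());
        } finally {
            te.close();
        }
        return out.toArray(new Term[out.size()]);
    }
}
```

Beware that a very common prefix can expand into thousands of terms, so this is best used with reasonably selective prefixes.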

Re: Indexing Wikipedia dumps

2007-12-12 Thread Otis Gospodnetic
Database? I imagine I can avoid that: Wiki dump.gz -> gunzip -> parse -> index, no? Otis - Original Message From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, December 12, 2007 1:55:02 AM Subject: Re: Indexing Wikipedia dumps For a quick java approa