Re: Indexing Wikipedia dumps

2007-12-11 Thread Chris Lu
For a quick Java approach, give yourself 3 minutes and try using DBSight to access the database. You can simply use "select * from mw_searchindex" as a starting point. It'll build the index for you. However, you may need to plug in your custom analyzer for MediaWiki's format (or maybe not). -- C

Re: DEFAULT_OPERATOR_AND globally ?

2007-12-11 Thread Andre Halama
Helmut Jarausch wrote: > I know how to set DEFAULT_OPERATOR_AND for an individual QueryParser > object (after creation) > > Since I always want this to be set, is there a means to set a (global) > option such that any QueryParser object has this de

Re: DEFAULT_OPERATOR_AND globally ?

2007-12-11 Thread Andre Halama
Helmut Jarausch wrote: Hi Helmut, > I know how to set DEFAULT_OPERATOR_AND for an individual QueryParser > object (after creation) > > Since I always want this to be set, is there a means to set a (global) > option such that any QueryParser object

Re: Indexing XML document

2007-12-11 Thread Otis Gospodnetic
Liaqat, Out of curiosity - what are you using to analyze and index Urdu? AraMorph or something else? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Liaqat Ali <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, Decembe

Re: Indexing Wikipedia dumps

2007-12-11 Thread Matt Kangas
Otis, if you're willing to use some non-Java code for your task... 1) Wikipedia uses Lucene for their full-text searches, and the module is part of Mediawiki. You could use this as follows: - Install Mediawiki - Load your Wikipedia dump into MW (and MySQL) - Build a search index for the Lucene

Indexing Wikipedia dumps

2007-12-11 Thread Otis Gospodnetic
Hi, I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index. Any suggestions

Re: Refreshing RAMDirectory

2007-12-11 Thread Ruslan Sivak
Michael McCandless wrote: Ruslan Sivak wrote: I have an index of about 10mb. Since it's so small, I would like to keep it loaded in memory, and reload it about every minute or so, assuming that it has changed on disk. I have the following code, which works, except it doesn't reload the cha

Re: Out of memory?

2007-12-11 Thread Otis Gospodnetic
Bob, Move the following line in your if block: Sort sort = new Sort(sortColumn, desc); That will fix your OOM problem. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Bob Daha <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday,

Re: help required ... ~ operator

2007-12-11 Thread Otis Gospodnetic
Have you tried with ~3 or ~4? Just curious... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Shakti_Sareen <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, December 11, 2007 3:13:47 AM Subject: help required ... ~ operator

Re: like search in NOT operator

2007-12-11 Thread Otis Gospodnetic
Shakti, I think you provided the answer: "sign* NOT Machine" or "sign* -Machine" Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Shakti_Sareen <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, December 11, 2007 4:48:58 AM Subje
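
The semantics of "sign* NOT Machine" can be sketched in plain Java without Lucene. This is only a simplified model, not Lucene code: the class and method names are hypothetical, and it assumes that indexed tokens are lowercased at index time and that both the prefix and the prohibited term are lowercased before comparison.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the query "sign* NOT Machine"; not a Lucene API.
final class PrefixNotSketch {

    /**
     * Returns true when at least one token starts with the required prefix
     * and no token equals the prohibited term.
     * Assumption: docTokens are already lowercased, as an analyzer would do.
     */
    static boolean matches(List<String> docTokens, String requiredPrefix, String prohibitedTerm) {
        String prefix = requiredPrefix.toLowerCase(Locale.ROOT);
        String banned = prohibitedTerm.toLowerCase(Locale.ROOT);
        boolean hasPrefix = false;
        for (String token : docTokens) {
            if (token.equals(banned)) {
                return false; // a prohibited term rules the document out
            }
            if (token.startsWith(prefix)) {
                hasPrefix = true; // e.g. "signals" satisfies "sign*"
            }
        }
        return hasPrefix;
    }

    public static void main(String[] args) {
        // Tokens of the sample text, assuming stop words were dropped.
        List<String> doc = Arrays.asList("signals", "magnets", "different", "strength");
        System.out.println(matches(doc, "sign", "Machine"));
    }
}
```

Under these assumptions the sample text matches, because "signals" starts with "sign" and no token equals "machine".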

Re: Refreshing RAMDirectory

2007-12-11 Thread Michael McCandless
Ruslan Sivak wrote: I have an index of about 10mb. Since it's so small, I would like to keep it loaded in memory, and reload it about every minute or so, assuming that it has changed on disk. I have the following code, which works, except it doesn't reload the changes. protected String

Re: DEFAULT_OPERATOR_AND globally ?

2007-12-11 Thread Daniel Noll
On Wednesday 12 December 2007 03:34:08 Helmut Jarausch wrote: > Hi, > > I know how to set DEFAULT_OPERATOR_AND for an individual QueryParser > object (after creation) > > Since I always want this to be set, is there a means to set a (global) > option such that any QueryParser object has this defaul

Re: Refreshing RAMDirectory

2007-12-11 Thread Ruslan Sivak
The on-disk index gets updated. Something like this: The second indexDoc function is what does the actual indexing, but this should have the relevant content. public void indexDoc(int userId) throws ClassNotFoundException, SQLException, CorruptIndexException, IOException { IndexWri

Re: Advice regarding fuzzy phrase searching

2007-12-11 Thread Mark Miller
Take a look at: https://issues.apache.org/jira/browse/LUCENE-794 This is an extension to the Highlighter that highlights span and proximity queries. If you rewrite the query it will also do fuzzy queries. I am sure you can easily steal some of the code to do what you want. Keep in mind, beca

Re: Refreshing RAMDirectory

2007-12-11 Thread Erick Erickson
I can't speak to the errors, but how is the index being updated? An IndexWriter buffers changes and periodically flushes them out to disk. So the writer may not have flushed your data, depending upon how it's written. Best Erick On Dec 11, 2007 5:37 PM, Ruslan Sivak <[EMAIL PROTECTED]> wrote: >

Re: Advice regarding fuzzy phrase searching

2007-12-11 Thread Ruslan Sivak
Look into SpanNearQuery. It has a slop which lets you say how close you want the terms to be. For a single document, if you are going to be doing a lot of these searches, I recommend using a MemoryIndex. Russ Jose Luna wrote: Hello, I am looking for some advice regarding which tools I migh

Refreshing RAMDirectory

2007-12-11 Thread Ruslan Sivak
I have an index of about 10mb. Since it's so small, I would like to keep it loaded in memory, and reload it about every minute or so, assuming that it has changed on disk. I have the following code, which works, except it doesn't reload the changes. protected String indexName; protected Ind
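
The "keep it in memory and reload when the on-disk copy changes" idea can be sketched as a small holder class in plain Java. This is a generic pattern under stated assumptions, not the poster's code and not a Lucene API: `ReloadingHolder`, `currentVersion` and `loader` are hypothetical names. In a Lucene setting the version supplier would report the on-disk index version and the loader would build a fresh in-memory copy and searcher. That wiring is left out, as is closing the previous value.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: caches a value and reloads it when a version number changes.
final class ReloadingHolder<T> {

    // Immutable pair of the loaded value and the version it was loaded at.
    private static final class Snapshot<V> {
        final long version;
        final V value;

        Snapshot(long version, V value) {
            this.version = version;
            this.value = value;
        }
    }

    private final LongSupplier currentVersion; // assumption: cheap to call
    private final Supplier<T> loader;          // assumption: builds a fresh in-memory copy
    private final AtomicReference<Snapshot<T>> ref;

    ReloadingHolder(LongSupplier currentVersion, Supplier<T> loader) {
        this.currentVersion = currentVersion;
        this.loader = loader;
        this.ref = new AtomicReference<>(
                new Snapshot<>(currentVersion.getAsLong(), loader.get()));
    }

    /** Returns the cached value, reloading first if the version has moved on. */
    T get() {
        Snapshot<T> snapshot = ref.get();
        long version = currentVersion.getAsLong();
        if (version != snapshot.version) {
            Snapshot<T> fresh = new Snapshot<>(version, loader.get());
            // If another thread swapped first, keep its snapshot instead of ours.
            ref.compareAndSet(snapshot, fresh);
            return ref.get().value;
        }
        return snapshot.value;
    }
}
```

Calling `get()` from a timer roughly once a minute, or on each search, gives the periodic refresh described above. A reload only ever happens when the version actually differs from the one the cached value was loaded at.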

Re: Crawling in Nutch

2007-12-11 Thread Developer Developer
Use Luke to explore the index. The content is present in the content field. However, it is not stored, so you can only search on it. On Aug 1, 2007 9:59 AM, Srinivasarao Vundavalli <[EMAIL PROTECTED]> wrote: > Hi, > Where does (in which field) Nutch store the content of a document > while in

Re: Applying SpellChecker to a phrase

2007-12-11 Thread smokey
Thanks for pointing me to the right class to use. On Dec 11, 2007 3:23 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: > Yes that's right, my mistake. > > In fact even after reading your comment I was puzzled > because PhraseScorer indeed requires *all* phrase-positions > to be satisfied in order to m

Advice regarding fuzzy phrase searching

2007-12-11 Thread Jose Luna
Hello, I am looking for some advice regarding which tools I might use to solve my problem. I apologize ahead of time for the long explanation. Problem Description: I would like to index a set of very large HTML documents. I would then be able to run two different kinds of queries: proximi

DEFAULT_OPERATOR_AND globally ?

2007-12-11 Thread Helmut Jarausch
Hi, I know how to set DEFAULT_OPERATOR_AND for an individual QueryParser object (after creation). Since I always want this to be set, is there a means to set a (global) option such that any QueryParser object has this default operator? Many thanks for a hint, Helmut Jarausch Lehrstuhl fuer Nume

RE: Post processing to get around TooManyClauses?

2007-12-11 Thread Beard, Brian
I had a similar problem (I think). Look at using a WildcardFilter (below), possibly wrapped in a CachingWrapperFilter, depending on whether you want to re-use it. I overrode the method QueryParser.getWildcardQuery to customize it. In your case you would probably have to specifically detect for the presenc

Re: Post processing to get around TooManyClauses?

2007-12-11 Thread d33mb33
Ok I'm still struggling with this and a QueryFilter didn't help me one bit :-( I'm trying to query for books by "Charles Dickens" that start with "m". I have constructed a QueryFilter for the author search and a PrefixQuery for the title search. A simplified version of my code is below. '

like search in NOT operator

2007-12-11 Thread Shakti_Sareen
Hi all, I am using StandardAnalyzer() to index the data. Actual data is: "signals by magnets of different strength" I want to search for "sign* NOT Machine". How can I do that? I am using QueryParser. Please help on this issue. Thanks Shakti Sareen DISCLAIMER: This email (incl

Re: Applying SpellChecker to a phrase

2007-12-11 Thread Doron Cohen
Yes that's right, my mistake. In fact even after reading your comment I was puzzled because PhraseScorer indeed requires *all* phrase-positions to be satisfied in order to match. The answer is that the OR logic is taken care of by MultipleTermPositions, so the scorer does not need to be aware of a

help required ... ~ operator

2007-12-11 Thread Shakti_Sareen
Hi all, I am using StandardAnalyzer() to index the data. Actual data is: "signals by magnets of different strength" When I parse the query "signals strength"~2, I get a hit. But when I parse the query "strength signals"~2, I do not get a hit. Why? It should work
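
A rough way to see why the reversed phrase needs a larger slop is to compute the required slop by hand. The sketch below is a simplified plain-Java model, not Lucene code: the class and method names are hypothetical, each term is assumed to occur once, and the stop words "by" and "of" are assumed to be dropped without leaving position gaps (which is consistent with the in-order query matching at ~2). Each term's document position is shifted by its offset within the phrase, and the required slop is the spread between the largest and smallest shifted position.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of sloppy phrase matching; not a Lucene API.
final class PhraseSlopSketch {

    /**
     * Smallest slop at which the phrase would match, or -1 if a term is absent.
     * Simplification: only the first occurrence of each term is considered.
     */
    static int requiredSlop(List<String> docTokens, List<String> phraseTerms) {
        if (phraseTerms.isEmpty()) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < phraseTerms.size(); offset++) {
            int position = docTokens.indexOf(phraseTerms.get(offset));
            if (position < 0) {
                return -1; // term not in the document at all
            }
            int shifted = position - offset; // where the phrase would have to start
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // Assumed token stream after stop-word removal, positions 0..3.
        List<String> doc = Arrays.asList("signals", "magnets", "different", "strength");
        System.out.println(requiredSlop(doc, Arrays.asList("signals", "strength"))); // prints 2
        System.out.println(requiredSlop(doc, Arrays.asList("strength", "signals"))); // prints 4
    }
}
```

In this model the in-order phrase needs a slop of 2, while the reversed phrase needs 4 because the two terms also have to swap places, so "strength signals"~2 would not be expected to match.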