Re: Ignore Words Problem

2007-03-22 Thread Chris Hostetter
What part of Grant and Karl's answers to you the last time you asked this question wasn't clear? Have you tried it? http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9550886.html http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9567881.html : I want to make sure, if this s

MergeFactor and MaxBufferedDocs value should ...?

2007-03-22 Thread SK R
Hi, I've looked at the uses of MergeFactor and MaxBufferedDocs. If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100 segments will be merged in RAMDir when 100 docs have arrived. After the 350th doc is added to the writer, RAMDir has 2 merged segment files + 50 separate segment files

Re: how ungrouped query handled?

2007-03-22 Thread SK R
Thanks for your reply and these useful links. On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: see also the FAQ "Why am I getting no hits / incorrect hits?" which points to... http://wiki.apache.org/lucene-java/BooleanQuerySyntax ...I've just added some more words of wisdom there from p

Ignore Words Problem

2007-03-22 Thread aslam bari
I want to make sure whether this statement is right or not: "I am using StandardAnalyzer for indexing documents. By default it ignores some words when indexing. But when we search for something, Lucene again includes the ignored words in the search." My problem is this: I indexed a word documen

Re: Questions about Indexing

2007-03-22 Thread Daniel Noll
Maryam wrote: Hi, I have three questions about indexing: 1) I am indexing HTML documents, how can I do "stop removal" before indexing, I don't want to index stop words? The same way you would do it for indexing text documents: StopFilter. 2) I can access the terms in one documen
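
For readers who have not met stop-word removal before, here is a minimal plain-Java sketch of the idea behind a stop filter. It is not Lucene's StopFilter; the class name, the method name and the tiny stop-word list are all assumptions made for illustration, and HTML markup would still need to be stripped before this step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: this is NOT Lucene's StopFilter.
final class StopWordSketch {
    // Assumed, deliberately tiny stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    // Lower-cases the text, splits on runs of non-letter/non-digit characters,
    // and keeps only the tokens that are not in STOP_WORDS.
    static List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Only the non-stop-word tokens would be handed to the indexer.
        System.out.println(tokenize("The index of the documents is built"));
    }
}
```

The same filtering has to be applied to query text as well, otherwise a query can contain terms that were never indexed.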

Re: contrib/benchmark questions

2007-03-22 Thread Doron Cohen
Hi Grant, I think you resolved the question already, but just to make sure... Grant Ingersoll <[EMAIL PROTECTED]> wrote on 22/03/2007 20:41:27: > > On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote: > > > I think I see in the ReadTask that it is the res var that is being > > incremented and wou

Re: Combining score from two or more hits

2007-03-22 Thread Chris Hostetter
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector : variant of TopDocHitCollector. The problem is not adjusting the score, it's : what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 : knowing that the original query resulted in hits on H

Re: Reverse search

2007-03-22 Thread karl wettin
On 23 Mar 2007 at 03:07, Melanie Langlois wrote: Thanks Karl, the performance graph is really amazing! I have to say that I would not have thought this way around would be faster, but it sounds nice if I can use this, it makes everything easier to manage. I'm just wondering what you considered when

Re: How can I index Phrases in Lucene?

2007-03-22 Thread karl wettin
On 23 Mar 2007 at 04:25, Ryan McKinley wrote: Is there any way to find frequent phrases without knowing what you are looking for? I think you are looking for association rules. Try searching for Levelwise-Scan. Weka contains GPLed Java code. CiteSeer is your best friend for whitepapers. htt

Re: contrib/benchmark questions

2007-03-22 Thread Grant Ingersoll
On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote: I think I see in the ReadTask that it is the res var that is being incremented and would have to be altered. I guess I can go by elapsed time, but even that seems slightly askew. I think this is due to the withRetrieve() function overh

Re: How can I index Phrases in Lucene?

2007-03-22 Thread Ryan McKinley
Is there any way to find frequent phrases without knowing what you are looking for? I could index "A B C D E" as "A B C", "B C D", "C D E" etc, but that seems kind of clunky particularly if the phrase length is large. Is there any position offset magic that will surface frequent phrases automati
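
The "A B C", "B C D", "C D E" idea in the question above is usually called shingling (word n-grams). A minimal plain-Java sketch of counting such windows follows, assuming the text is already tokenized; the class and method names are hypothetical and nothing here is a Lucene API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: counts word n-grams ("shingles") in a token array.
final class ShingleSketch {
    // Returns each run of `size` consecutive tokens with its number of occurrences,
    // in order of first appearance.
    static Map<String, Integer> countShingles(String[] tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            String shingle = String.join(" ", Arrays.copyOfRange(tokens, i, i + size));
            counts.merge(shingle, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "A B C D E" with size 3 yields the three windows named in the question.
        System.out.println(countShingles("A B C D E".split(" "), 3));
    }
}
```

Sorting the resulting map by count would surface the most frequent phrases of a given length; the cost the poster worries about shows up as one map entry per distinct window, which grows quickly with phrase length.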

Re: contrib/benchmark questions

2007-03-22 Thread Grant Ingersoll
OK, Doron (and other benchmarkers!), on to search: Here's my alg file: #Indexing declaration up here OpenReader { "SrchSameRdr" Search > : 5000 { "SrchTrvSameRdr" SearchTrav > : 5000 { "SrchTrvSameRdrTopTen" SearchTrav(10) > : 5000 { "SrchTrvRetLoadAllSameRdr" SearchTravRet > :

Questions about Indexing

2007-03-22 Thread Maryam
Hi, I have three questions about indexing: 1) I am indexing HTML documents, how can I do "stop removal" before indexing, I don't want to index stop words? 2) I can access the terms in one document, but how can I get the name of the document in which these terms appear? 3)

RE: Reverse search

2007-03-22 Thread Melanie Langlois
Thanks Karl, the performance graph is really amazing! I have to say that I would not have thought this way around would be faster, but it sounds nice if I can use this, it makes everything easier to manage. I'm just wondering what you considered when you built your graph, only the time to run the que

Re: problem in reading an index

2007-03-22 Thread karl wettin
On 23 Mar 2007 at 02:09, Daniel Noll wrote: Maryam wrote: Hi, I have written this piece of code to read the index, mainly to see what terms are in each document and what the frequency of each term in the document is. This piece of code correctly calculates the number of docs in the index, but I d

Re: Reverse search

2007-03-22 Thread karl wettin
On 23 Mar 2007 at 02:12, Melanie Langlois wrote: I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) into the lucene directory, and whenever I receive a new document, I will search all the matching subscriptions to send the documents to

Reverse search

2007-03-22 Thread Melanie Langlois
Hello, I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) into the lucene directory, and whenever I receive a new document, I will search all the matching subscriptions to send the documents to all subscribers. For instance if a user s

Re: problem in reading an index

2007-03-22 Thread Daniel Noll
Maryam wrote: Hi, I have written this piece of code to read the index, mainly to see what terms are in each document and what the frequency of each term in the document is. This piece of code correctly calculates the number of docs in the index, but I don’t know why variable myTermFreq[] is nul

problem in reading an index

2007-03-22 Thread Maryam
Hi, I have written this piece of code to read the index, mainly to see what terms are in each document and what the frequency of each term in the document is. This piece of code correctly calculates the number of docs in the index, but I don’t know why variable myTermFreq[] is null. Would you ple

Re: Extracting formatted text from PDF files

2007-03-22 Thread Daniel Noll
Mike O'Leary wrote: Please forgive the laziness inherent in this question, as I haven't looked through the PDFBox code yet. I am wondering if that code supports extracting text from PDF files while preserving such things as sequences of whitespace between characters and other layout and formattin

Re: Speeding up looping over Hits

2007-03-22 Thread Erick Erickson
Oh yeah.. By only loading the relevant fields, my query times reduced by over 90%. I actually wrote that up on the mailing list if you wanted to try to find it, but it took Andreas' message to remind me... Erick On 3/22/07, Santa Clause <[EMAIL PROTECTED]> wrote: Another thing you may want to

Software Product Development Job Opportunity (Baltimore, MD)

2007-03-22 Thread Lesko, Matt
Official job description & info to submit a resume: http://www.systemsalliance.com/careers/internal-jobs/baltimore/Software_Engineer_MD.html Located 15 minutes North of Baltimore in Sparks, MD Position is on a team, working with myself and others, maintaining and developing an existing co

Re: Speeding up looping over Hits

2007-03-22 Thread Santa Clause
Another thing you may want to look at is the newer version 2.1.0 and getFieldable. I think that will lazy load the data, that way you are only reading the parts of the document that you need at that moment rather than the whole thing. Someone please correct me if I am wrong or point to what

Re: Combining score from two or more hits

2007-03-22 Thread Antony Bowesman
Erick Erickson wrote: Don't know if it's useful or not, but if you used TopDocs instead, you have access to an array of ScoreDoc which you could modify freely. In my app, I used a FieldSortedHitQueue to re-sort things when I needed to. Thanks Erick, I've been using TopDocs, but am playing with

Re: Extracting formatted text from PDF files

2007-03-22 Thread Soeren Pekrul
Mike O'Leary wrote: Please forgive the laziness inherent in this question, as I haven't looked through the PDFBox code yet. I am wondering if that code supports extracting text from PDF files while preserving such things as sequences of whitespace between characters and other layout and formattin

Re: how ungrouped query handled?

2007-03-22 Thread Chris Hostetter
see also the FAQ "Why am I getting no hits / incorrect hits?" which points to... http://wiki.apache.org/lucene-java/BooleanQuerySyntax ...I've just added some more words of wisdom there from past emails. : Date: Thu, 22 Mar 2007 09:51:15 -0400 : From: Erick Erickson <[EMAIL PROTECTED]> : Reply

Re: How can I index Phrases in Lucene?

2007-03-22 Thread Erick Erickson
Well, you don't index phrases, it's done for you. You should try something like the following: create a SpanNearQuery with your terms. Specify an appropriate slop (probably 0 assuming you want them all next to each other). Now call getSpans and count ... You may have to do something with
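
To make the "slop 0" idea concrete outside Lucene, here is a minimal plain-Java sketch that counts how often a phrase's terms occur directly next to each other, in order, in an already tokenized document. It is not the SpanNearQuery or getSpans API; the class and method names are hypothetical.

```java
// Hypothetical illustration: counts in-order, zero-gap occurrences of a phrase.
final class PhraseCountSketch {
    // Returns how many positions in docTokens start an exact, adjacent match
    // of all phrase tokens. Overlapping matches are each counted.
    static int countPhrase(String[] docTokens, String[] phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + phrase.length <= docTokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!docTokens[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] doc = "information retrieval is fun and information retrieval is useful".split(" ");
        String[] phrase = {"information", "retrieval"};
        System.out.println(countPhrase(doc, phrase));
    }
}
```

A positional index answers the same question without rescanning the text, because it already records where each term occurs in each document.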

How can I index Phrases in Lucene?

2007-03-22 Thread Maryam
Hi, I know how to index terms in Lucene, now I want to see how I can index phrases like "information retrieval" in Lucene and calculate the number of times that phrase has appeared in the document. Is there any way to do it in Lucene? Thanks

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Andrzej Bialecki
rubdabadub wrote: On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Nice idea and I can see the benefit of it to you and I don't mean to be a wet blanket on it, I just wonder about the legality of it. So long as it meets the Apache license conditions regarding the distribution it's not f

Re: Speeding up looping over Hits

2007-03-22 Thread Erick Erickson
Your timing differences are probably because of caching. But this has been mentioned many times in the archive, that a Hits object is intended to allow fast, simple retrieval of the first few documents in a result set (100 if memory serves). Every 100 or so calls to next() cause the search to be r

Speeding up looping over Hits

2007-03-22 Thread Andreas Guther
Hi, While looking into performance enhancements for our search feature I noticed a significant difference in Document access time while looping over Hits. I wrote a test application that searches for a list of search terms and then, for each returned Hits object, loops twice over every single hits.doc(i).

Re: how ungrouped query handled?

2007-03-22 Thread Erick Erickson
This is a pretty common issue that I've been grappling with by chance recently. The main point is that the parser is NOT a boolean logic parser. Search the mail archive for the thread "bad query parser bug" and you'll find a good discussion. I tried using PrecedenceQueryParser, but that didn

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub
Good to hear :-) I am curious, how many custom changes are you making to the code that this is even an issue? Perhaps submitting patches and working to get them committed would be a more efficient strategy. Well there are 3 problems I see. 1. There are very good patches on all of the lucene

Re: Combining score from two or more hits

2007-03-22 Thread Erick Erickson
Don't know if it's useful or not, but if you used TopDocs instead, you have access to an array of ScoreDoc which you could modify freely. In my app, I used a FieldSortedHitQueue to re-sort things when I needed to. Erick On 3/22/07, Antony Bowesman <[EMAIL PROTECTED]> wrote: I have indexed obj

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll
On Mar 22, 2007, at 8:16 AM, rubdabadub wrote: On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Nice idea and I can see the benefit of it to you and I don't mean to be a wet blanket on it, I just wonder about the legality of it. People may find it and think it is the official Apache Luce

Re: Spelt, for better spelling correction

2007-03-22 Thread Martin Haye
Otis, I hadn't really thought about this, but it would be easy to build a dictionary from an existing Lucene index. The main caveat is that it would only work with "stored" fields. That's because this spellchecker boosts accuracy using pair frequencies in addition to term frequencies, and Lucene

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Nice idea and I can see the benefit of it to you and I don't mean to be a wet blanket on it, I just wonder about the legality of it. People may find it and think it is the official Apache Lucene, since it is branded that way. I'm not a lawye

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll
Nice idea and I can see the benefit of it to you and I don't mean to be a wet blanket on it, I just wonder about the legality of it. People may find it and think it is the official Apache Lucene, since it is branded that way. I'm not a lawyer, so I don't know for sure. I think you have t

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Is the point of this that you can make "commits" to Lucene so that you don't lose your changes on trunk? Not only that. But I can make as many local branches as I like, for example customer X, customer Y. This way I can support X and Y as th

Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll
Is the point of this that you can make "commits" to Lucene so that you don't lose your changes on trunk? On Mar 22, 2007, at 7:14 AM, rubdabadub wrote: Hi: First of all apology to those friends who follow all the list. Often times I work offline and I do not have any commit rights to any of

bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub
Hi: First of all, apologies to those friends who follow all the lists. Often times I work offline and I do not have any commit rights to any of the projects. All the modifications I make for various clients and trying to keep up to date with latest trunk somehow makes it difficult for me to just sti

how ungrouped query handled?

2007-03-22 Thread SK R
Hi, Can anyone explain how Lucene handles the query below? My query is *field1:source AND (field2:name OR field3:dest)* . I've given this string to the query parser and then searched using a searcher. It returns correct results. Its query.toString() output is: +field1:source +(field2:name f

Re: Querying fragments of a tree structure

2007-03-22 Thread Emanuel Schleussinger
Hi Erick, excellent insight, thanks a lot. As you would expect, this method works a treat. Thanks a lot for your time! Emanuel - Original Message - From: "Erick Erickson" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, March 21, 2007 2:12:49 PM (GMT+0100) Europe/Berl

Re: indexing rss feeds in multiple languages

2007-03-22 Thread Antony Bowesman
Melanie Langlois wrote: Well, thanks, sounds like the best option to me. Does anybody use the PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on performance when using different analyzers. I've not done any specific comparisons between using a single Analyzer and m