SpellChecker Index - remove words?

2007-01-10 Thread Josh Joy
Hi All, The spellchecker api is very nice to use, and I can easily add words to the index. However, because the words I am adding are from another index that is user generated per se (meaning it may have spelling errors), how can I safely remove words from the spell checker index? If I know the

Re: Technology Preview of new Lucene QueryParser

2007-01-10 Thread Chris Hostetter
: It works like this: "A -B -C" would be expressed as "A ! B ! C" : By binary, I mean that each operator must connect two clauses...in that : case A is connected to B and C is connected to A ! B. : I avoid the single prohibit clause issue, -query, by not really allowing so do you convert A ! B !

Re: Text storing design and performance question

2007-01-10 Thread Jason Pump
Renaud, one optimization you can do on this is to try the first 10kb, see if it finds text worth highlighting, if not, with a slight overlap try the next 9.9kb - 19.9kb or just 9.9kb -> end if you're feeling lazy. This assumes that most good matches are at the start of the document, and that th
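The windowing idea above can be sketched in plain Java. This is only an illustration, not Lucene highlighter code: the class and method names are hypothetical, and the 10,000-character window with a 100-character overlap is an assumption standing in for the "10kb" and "9.9kb" figures in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a long text into overlapping windows so that
// highlighting can stop at the first window that contains a match.
final class HighlightWindows {

    /** Returns [start, end) offsets of windows of `size` chars, each overlapping the previous by `overlap`. */
    static List<int[]> windows(int textLength, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<int[]> result = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < textLength; start += step) {
            int end = Math.min(start + size, textLength);
            result.add(new int[] {start, end});
            if (end == textLength) {
                break; // the last window already reaches the end of the text
            }
        }
        return result;
    }

    /** Returns the start offset of the first window containing `term`, or -1 if none does. */
    static int firstWindowContaining(String text, String term, int size, int overlap) {
        for (int[] w : windows(text.length(), size, overlap)) {
            if (text.substring(w[0], w[1]).contains(term)) {
                return w[0];
            }
        }
        return -1;
    }
}
```

The overlap only helps if it is at least as long as the longest fragment you want to highlight; otherwise a match that straddles a window boundary can still be missed.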

Re: Technology Preview of new Lucene QueryParser

2007-01-10 Thread Mark Miller
Hey Hoss, I didn't realize that I had left out the field stuff...I really am still working on a lot with the parser's documentation and I apologize. Mark: I only read your querysyntax.php and didn't dig into the source, but i'm curious about the "There are no unary operators in Qsol syntax" st

Re: Technology Preview of new Lucene QueryParser

2007-01-10 Thread Chris Hostetter
: http://www.myhardshadow.com/qsol.php Mark: I only read your querysyntax.php and didn't dig into the source, but i'm curious about the "There are no unary operators in Qsol syntax" statement. What is the Qsol equivalent of the QueryParser syntax: "A -B -C"? It's also not clear to me how diffe

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Erick Erickson
To answer questions like "what really happens" in terms of a lucene query, I've been helped greatly by two things... query.toString(); and Luke. Of the two, luke (google lucene luke) is quickest. It will show you what lucene request is produced by various query strings etc. Sorry if you alread

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Chris Hostetter
: I should assume, though, that parentheses work as expected? So where I was : doing things like: : ( A OR B ) AND ( C OR D ), that means that +(A B) +(C D) is actually : happening? yes ... anywhere i used a simple example like "A" or "foo" could be replaced with a parenthetical expression whose

Re: Spelling Correction api

2007-01-10 Thread Chris Hostetter
(disclaimer: i've never actually used the SpellChecker contrib, just read the docs) : I've been reading through the spelling correction API and I'm confused. : It looks like you tell it the directory to hold the spelling correction : DB and then give it an IndexReader and a field to retrieve spel

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
On 1/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: I'm guessing there is supposed to be some sort of table structure to the mail you send ... it doesn't work in plain text mail readers so i'm not sure what you were trying to say. My bad, I was using GMail, and it was trying to produce a ver

Spelling Correction api

2007-01-10 Thread Simon Wistow
I've been reading through the spelling correction API and I'm confused. It looks like you tell it the directory to hold the spelling correction DB and then give it an IndexReader and a field to retrieve spelling suggestions from. But then I'd have to redo that operation every time a new document

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
OK, all is well. I had a truly embarrassing logic error that I introduced while getting all the closing and opening straightened out. Thanks for the patience.

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
If all is well with the lifecycle, should IndexReader.numDocs() return an accurate count? -Original Message- From: Doron Cohen [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 10, 2007 5:54 PM To: java-user@lucene.apache.org Subject: RE: isCurrent says no, but contents still invisible

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Chris Hostetter
: This is now my generalized understanding of the parser's operators. Am I : closer? I'm guessing there is supposed to be some sort of table structure to the mail you send ... it doesn't work in plain text mail readers so i'm not sure what you were trying to say. In a nutshell... 1) Lucene's Q

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Doron Cohen
"Benson Margulies" <[EMAIL PROTECTED]> wrote on 10/01/2007 14:26:42: > Oh, boy, what a mistake. I thought I was being clever by creating a > Directory object. All that did was prevent the writer from ever quite > flushing because I wasn't closing THAT. > No need to close the directory object for

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
Oh, boy, what a mistake. I thought I was being clever by creating a Directory object. All that did was prevent the writer from ever quite flushing because I wasn't closing THAT. -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 10, 2007 5:22 PM To:

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
Yea, that part I got. -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 10, 2007 5:22 PM To: java-user@lucene.apache.org Subject: Re: isCurrent says no, but contents still invisible And don't forget that you need to close and re-open the reader to

Re: isCurrent says no, but contents still invisible

2007-01-10 Thread Erick Erickson
And don't forget that you need to close and re-open the reader to pick up the changes... [EMAIL PROTECTED] . On 1/10/07, Benson Margulies <[EMAIL PROTECTED]> wrote: #2 is a possible issue. I stared at the code some more: The test case adds up to : Create all the objects. Add three docs.

Re: Speed of grouped queries

2007-01-10 Thread sdeck
I guess I never saw this request. Here is my answer. Carrot would give me things like this Genre - Horror Scary Movie - (40) - Luke Perry (10) Which is not what I am going for. Basically, think someone clicks on the "Horror" tab and they see all of the articles for every movie/actor w

RE: isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
#2 is a possible issue. I stared at the code some more: The test case adds up to : Create all the objects. Add three docs. Add a fourth doc. Do a query aimed at the fourth doc. isCurrent() returns false. Close reader/searcher, open reader/searcher, numDocs() in the reader returns 3. Not 4. H

RE: Text storing design and performance question

2007-01-10 Thread Renaud Waldura
It could well be. Or maybe not; one could speculate the JDBC overhead (especially over a network) would make DB access slower than disk (local filesystem). YMMV as they say. -Original Message- From: moraleslos [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 10, 2007 12:14 PM To: java

Re: sort on a searchable field

2007-01-10 Thread Erick Erickson
I'm pretty sure you can search UN_TOKENIZED fields, just be sure the analyzers you use for querying don't break your query input up. The javadoc implies so (for the Sort class...): The fields used to determine sort order must be carefully chosen. Documents must contain a single te

Re: isCurrent says no, but contents still invisible

2007-01-10 Thread Doron Cohen
That's strange. Since you don't close the writer, adding the doc would usually not modify the index (unless adding the doc triggered a merge). You may want to check that: 1. writer and reader really opened against the same path; 2. reader isCurrent state also before adding the doc and after re-open

isCurrent says no, but contents still invisible

2007-01-10 Thread Benson Margulies
I'm trying what should be the dumbest possible example of concurrency management with 2.0 in Java with an ordinary FSDirectory. I create an IndexWriter from a pathname, an IndexReader from the same pathname, and an IndexSearcher from the reader. I add one document. I call isCurrent() on

Re: sort on a searchable field

2007-01-10 Thread Yonik Seeley
On 1/10/07, moraleslos <[EMAIL PROTECTED]> wrote: From what I understand about Lucene, one can only sort on a field that is indexed but not tokenized (and hence not searchable). I have content that can be searched by keyword and also a date string, e.g. text:Lucene AND date:[2007-01-01 TO 2007-0

sort on a searchable field

2007-01-10 Thread moraleslos
From what I understand about Lucene, one can only sort on a field that is indexed but not tokenized (and hence not searchable). I have content that can be searched by keyword and also a date string, e.g. text:Lucene AND date:[2007-01-01 TO 2007-01-10] Since my date is searchable, I need to inde

RE: Text storing design and performance question

2007-01-10 Thread moraleslos
Maybe keeping the data in the DB would make it quicker? Seems like the I/O performance would cause most of the performance issues you're seeing. -los Renaud Waldura-5 wrote: > > We used to store a big text field for highlighting purposes too, and it > proved a big pain. The index was gigant

Re: Text storing design and performance question

2007-01-10 Thread moraleslos
So I guess what you're stating is that Lucene is so fast on searches that doing repeated searches for pagination wouldn't make much of a difference (even though it's redundant)... I'll give this a shot. Thanks Mark. -los markrmiller wrote: > > Being stateless should not be much of an issu

Re: Remove Docs from Index

2007-01-10 Thread Doron Cohen
Fernando, this code seems okay, in what sense is it "not working well"? One thing to verify is that the "contentPid" field you are deleting by was added to the index with Index.UN_TOKENIZED, otherwise the analyzer in use while indexing might have broken or lower-cased that term (e.g. "contentPid:C

Re: Remove Docs from Index

2007-01-10 Thread Erick Erickson
When you say it isn't working very well, what do you mean? It's slow? It's not removing what you expect? If the latter, I suspect you're tokenizing the term when you index such that it's not being found correctly. You need to, somehow, get the Lucene document ID to remove. You can either do a

RE: Text storing design and performance question

2007-01-10 Thread Renaud Waldura
We used to store a big text field for highlighting purposes too, and it proved a big pain. The index was gigantic, it took forever to build, and the search performance would sometimes suffer from it (just a hunch). Now we keep this big text field on disk (in a file), and feed it to the highlighter

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Mark Miller
So, if I understand you right, a simple query of NOT ORANGES gets me every document that does not contain the word oranges, while a separate query with -ORANGES added will force the score to zero for all documents in which oranges does not appear. One's a selector, the other is a filter. Not

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
Based on responses from Steven Rowe <[EMAIL PROTECTED]> and Mark Miller < [EMAIL PROTECTED]>: Lucene uses a scoring system that behaves similarly to a boolean system. ... more information in the October 2006 thread "QueryParser is Badly Broken

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Steven Rowe
Walt Stoneburner wrote: > Do I have correct and complete understanding of the two operators? Not entirely complete :) - more information in the October 2006 thread "QueryParser is Badly Broken": -

Re: Text storing design and performance question

2007-01-10 Thread Mark Miller
Being stateless should not be much of an issue. As Erick mentioned, the highlighter just expects you to pass it the query again and the text to be highlighted. So when you show the pagination you just need to keep around what query generated the current page...then shove each piece of relevant tex

Remove Docs from Index

2007-01-10 Thread Fernando G Bernardino
Hi People! My app needs to update documents from index, so I have to remove and insert again, all right? First I wrote this code: --- IndexReader reader = null; try { String index = Webp.getProperty("webp.search.indexFolder"); Directory directory = FS

Re: hithighlighter bug

2007-01-10 Thread Steven Rowe
Jason wrote: > Hi all, > I have come across what I think is a curious but insidious bug with > the java lucene hit highlighter. [...] > when I search for -> Acquisition Plan <- > in my search results I get: > (ancilliary stuff deleted) > attached to the Acquisition > < em>Planand signed >

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
On 1/10/07, Mark Miller <[EMAIL PROTECTED]> wrote: The subtle part is that a scoring system is being used that operates in something of a boolean fashion, but that has subtle difference. Mark, -thank you-. This explains it beautifully. So, if I understand you right, a simple query of NOT OR

Re: Text storing design and performance question

2007-01-10 Thread moraleslos
Hi Mark, Looks like I've got to implement some sort of pagination for my clients. Problem is everything is stateless so looks like there's some work I need to do on my end. Thanks. -los markrmiller wrote: > > Usually a user cannot easily browse 50,000 on a single display, and so > you wou

Re: how can I filter my search to not include items containing a particular field and value?

2007-01-10 Thread Erick Erickson
As luck would have it, there's an explanation of the NOT operator in the thread below posted after your original one... *Getting a Better Understanding of Lucene's Search Operators * On 1/10/07, Erick Erickson <[EMAIL PROTECTED]> wrote: Would something like the following work for you? B

Re: Ingnore Case in Sorting

2007-01-10 Thread Erick Erickson
You could always implement your own sorter, but I really have to ask why you are storing things in your index case-sensitive in the first place. In the above example, if you're searching on "banana", you'll get no hits. That may be your desired behavior. You could also store a sort field in yo
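A case-insensitive ordering can be sketched in plain Java by sorting on a lower-cased key. This is a minimal illustration, not Lucene's Sort API: the class and method names are hypothetical, and the idea of a separate normalized sort key is only one reading of the "sort field" suggestion above, whose text is cut off.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: orders values by a lower-cased key so that
// "Apple" and "apple" sort next to each other, ahead of "Banana".
final class CaseInsensitiveSort {

    /** The normalized value one might keep in a dedicated sort field. */
    static String sortKey(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Returns a new list sorted by the normalized key; ties keep their input order. */
    static List<String> sortIgnoringCase(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing(CaseInsensitiveSort::sortKey));
        return sorted;
    }
}
```

With the input Apple, Banana, apple from the original question, this yields Apple, apple, Banana, because the sort is stable and the two "apple" variants compare as equal.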

Re: Text storing design and performance question

2007-01-10 Thread Mark Miller
Usually a user cannot easily browse 50,000 on a single display, and so you would only highlight the docs as they became visible to the user. This is generally a small amount...often one at a time. - Mark moraleslos wrote: Hi Erik, Would that slow performance a bit? For example, say I receiv
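The point about highlighting only what is visible can be sketched as a paging helper in plain Java. The names below are hypothetical and no Lucene classes are involved; the sketch assumes the search is simply re-run for each request and that only the ids on the requested page are then fetched and highlighted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks the hits that belong to one page so that only
// those documents need to be loaded and passed to a highlighter.
final class VisiblePage {

    /** Returns the hit ids on the given zero-based page; empty if the page is past the end. */
    static List<Integer> pageOf(List<Integer> hitIds, int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        long from = (long) page * pageSize; // long avoids int overflow for large page numbers
        if (from >= hitIds.size()) {
            return new ArrayList<>();
        }
        int to = (int) Math.min(from + pageSize, hitIds.size());
        return new ArrayList<>(hitIds.subList((int) from, to));
    }
}
```

For 50,000 hits and a page size of 10, each request then touches 10 documents rather than 50,000, which fits a stateless setup where only the query string and page number travel with the request.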

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Mark Miller
Lucene uses a scoring system that behaves similarly to a boolean system. Each piece of the query contributes to the score for each document...if a document scores 0, it is not returned in the results. To search for documents that must contain "apples" and may contain "oranges" use the query:
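The "score of 0 means not returned" idea can be illustrated with a deliberately simplified toy model in plain Java. This is not Lucene's scoring formula: the class name, the one-point-per-matching-term score, and the three term sets are all assumptions made for illustration.

```java
import java.util.Set;

// Toy model only: one point per matching required or optional term,
// and a score of 0 means the document is left out of the results.
final class ToyBooleanScore {

    static int score(Set<String> docTerms,
                     Set<String> required,
                     Set<String> optional,
                     Set<String> prohibited) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return 0; // a prohibited term excludes the document outright
            }
        }
        int score = 0;
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return 0; // a missing required term excludes the document
            }
            score++;
        }
        for (String term : optional) {
            if (docTerms.contains(term)) {
                score++; // optional terms only raise the score
            }
        }
        return score;
    }
}
```

In this model a document containing both "apples" (required) and "oranges" (optional) scores 2, one with only "apples" scores 1, and one with only "oranges" scores 0. A query made up solely of prohibited terms scores 0 for every document, since nothing contributes a positive score.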

Re: how can I filter my search to not include items containing a particular field and value?

2007-01-10 Thread Erick Erickson
Would something like the following work for you? BooleanQuery bq = new BooleanQuery(); bq.add(your built-up query, BooleanClause.Occur.MUST); bq.add(your not clause, BooleanClause.Occur.MUST_NOT); Now you can use your bq as your query to search. NOTE: there is continual confusion about what the - syntax really does, you might want to search the

Re: Text storing design and performance question

2007-01-10 Thread moraleslos
Hi Erik, Would that slow performance a bit? For example, say I receive 50,000 hits from a search. From your explanation, I have to retrieve the DB id from each hit, perform a query to the DB using the id to retrieve the full contents for each hit, run highlighter on each content, and then retur

Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
Hello, I'm trying to get a better understanding of Lucene's search operators as described in the documentation at http://lucene.apache.org/java/docs/queryparsersyntax.html The documentation goes out of its way to identify two operators, require and prohibit, but doesn't fully explain them against

Re: Text storing design and performance question

2007-01-10 Thread Erik Hatcher
You don't have to store a field to highlight text. If you've got it in your database, retrieve it from there and pass that string to the highlighter instead. Erik On Jan 10, 2007, at 10:45 AM, moraleslos wrote: I'm running into a little dilemma with Lucene highlighting and ind

Ingnore Case in Sorting

2007-01-10 Thread wawa
Hello, I wonder whether there is any way to ignore case (upper and lower) in sorting. I just used the Sort() method and Lucene sorted the results uppercase first and then lowercase. ex) Apple Banana apple Is there a way to sort like below?: Apple apple Banana

Text storing design and performance question

2007-01-10 Thread moraleslos
I'm running into a little dilemma with Lucene highlighting and indexing. I currently index anything and everything that gets inserted into a database. This database includes all the content that is searched. Now I'll have lots and lots of content, thinking of the range of 50GB+, all stored in t

how can I filter my search to not include items containing a particular field and value?

2007-01-10 Thread Jason
how can I filter my search to not include items containing a particular field and value? I want effectively to add -myfieldname:myvalue to the end of my search query, but I can't see how to do this via the api. I have a complex query built up via the api and just want to filter it based on fie