retrieving matched slop

2007-03-20 Thread Ruslan Sivak
I have an app that searches a single document against many queries. Let's say the document was "The quick brown fox jumped over the lazy dog." and my queries are SpanNearQuery("quick","brown",50) and SpanNearQuery("quick","fox",50). I would like to retrieve the slop or some sort of score that was ma
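
A minimal sketch of the idea behind this question, written in plain Java rather than against the Lucene API. It counts how many tokens sit between the first term and a later occurrence of the second, which roughly corresponds to the slop an in-order near match would need. The class name SlopSketch, the helpers tokenize and minGap, and the simple lowercase-and-split tokenizer are all assumptions made for illustration; none of them come from the thread or from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SlopSketch {
    // Lowercase and split on anything that is not a letter or digit.
    // This is a stand-in for a real analyzer (assumption).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Smallest number of tokens lying between `first` and a later `second`.
    // Returns -1 when `second` never appears after `first`.
    static int minGap(List<String> tokens, String first, String second) {
        int best = -1;
        int lastFirst = -1; // position of the most recent `first` seen so far
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(second) && lastFirst >= 0) {
                int gap = i - lastFirst - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
            if (t.equals(first)) {
                lastFirst = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("The quick brown fox jumped over the lazy dog.");
        System.out.println(minGap(doc, "quick", "brown"));
        System.out.println(minGap(doc, "quick", "fox"));
    }
}
```

For the example document, the gap is 0 for ("quick", "brown") and 1 for ("quick", "fox"), so both would fall well inside a slop of 50.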

Re: TextMining.org Word extractor

2007-03-20 Thread Ryan Ackley
Someone pointed me there already. Looks interesting. Is there a mailing list for the incubator? Does anyone know the status of the proposal? On 3/20/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: If you are thinking about putting textmining library elsewhere, allow me to point out Tika: http:

Re: Spelt, for better spelling correction

2007-03-20 Thread Otis Gospodnetic
Boy, I'm looking forward to this! I read some of the background discussion. I think this might fit as a Lucene contrib, but we'll be able to tell when the code makes it into JIRA. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Searc

Re: TextMining.org Word extractor

2007-03-20 Thread Otis Gospodnetic
If you are thinking about putting textmining library elsewhere, allow me to point out Tika: http://wiki.apache.org/incubator/TikaProposal Better home for your lib, perhaps? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - S

Re: Search Question

2007-03-20 Thread Erick Erickson
I'm betting you can make SpanNearQuery work for you. In the simple case it's a bunch of SpanQuerys (which in its simplest form is just a Span version of TermQuery). The two other parameters are slop (See Lucene In Action for an explanation of this) and whether the terms must appear in the order th

Search Question

2007-03-20 Thread Santa Clause
Hello all, I have a how-to question. I have a field with these tokens in it (a b c f b g a) and I am searching on it with these tokens (a f e g a). So far this is easy: I just set up a BooleanQuery with a bunch of optional TermQueries and get hits on (a f g a) but not (e) which is close to what

Re: question about getting all terms in a section of the documents

2007-03-20 Thread Antony Bowesman
Donna L Gresh wrote: Also, the terms.close() statement is outside the scope of terms. I changed to the following, is this correct and should the FAQ be changed? try { TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));

Re: Spelt, for better spelling correction

2007-03-20 Thread Yonik Seeley
Sounds interesting Martin! Is the dictionary static, or is it generated from the corpus or from user queries? -Yonik On 3/20/07, Martin Haye <[EMAIL PROTECTED]> wrote: As part of XTF, an open source publishing engine that uses Lucene, I developed a new spelling correction engine specifically to

Spelt, for better spelling correction

2007-03-20 Thread Martin Haye
As part of XTF, an open source publishing engine that uses Lucene, I developed a new spelling correction engine specifically to provide "Did you mean..." links for misspelled queries. I and a small group are preparing this for submission as a contrib module to Lucene. And we're inviting interested

Re: TextMining.org Word extractor

2007-03-20 Thread Ryan Ackley
I've been out of the loop for a while. I just saw this recent thread and re-subscribed to the list. In the next month or two I will be able to put some time into the textmining library. Fast saved files are on the list of improvements as well as other features that have been requested. I would al

Re: Sort Performance Question

2007-03-20 Thread Peter W .
Hello, The response time for sorts depends on the number of results. If you don't need all documents returned, you could use a filter. One idea would be to use DateTools to save your dates as Strings and build your query with FilteredQuery, passing in a custom filter to search this field. The filter
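
A small sketch of why storing dates as Strings can work for filtering and sorting: a fixed-width, most-significant-field-first string compares the same way lexicographically as the dates compare chronologically. This uses java.time instead of Lucene's DateTools, and the class name DateKeySketch, the helpers toKey and inRange, and the yyyyMMdd pattern are assumptions chosen for the illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateKeySketch {
    // Day resolution, zero padded, so string order matches date order.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Convert a date to its sortable string key.
    static String toKey(LocalDate date) {
        return date.format(DAY);
    }

    // Inclusive range check done purely on the string keys,
    // which is the comparison a range filter on the field would rely on.
    static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String key = toKey(LocalDate.of(2007, 3, 20));
        System.out.println(key); // prints 20070320
        System.out.println(inRange(key, "20070301", "20070331"));
    }
}
```

A coarser resolution produces fewer distinct keys in the field, which is worth considering when the field is also used for sorting.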

Re: Thank you...

2007-03-20 Thread Cass Costello
Heh - it used to be in my sig ... my bad. Thanks, all. :) http://www.stubhub.com On 3/20/07, bruce <[EMAIL PROTECTED]> wrote: hey cass... anyway you could let us know the site/app that we're powering!!! always good to see what's going on in the world! thanks -Original Message- F

Re: Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Erick Erickson
Well, depending upon your storage requirements, it's actually much easier than that. Assuming you're adding this field (or a duplicate) as UN_TOKENIZED (in this case, no need to store), you can just spin through all the terms for that field with TermDocs/TermEnum. The trick is to have your term st

Re: Sort Performance Question

2007-03-20 Thread Erik Hatcher
In a web application, I have generally cached IndexSearcher in application scope and reused it for all requests. You will have to balance the demand for timeliness of updates with the time it takes to build up the sort caches. You can't really have instantaneous viewing of newly added docu
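
A rough sketch of the trade-off described here, in plain Java: keep one shared instance and reopen it only once it is older than some limit, so that warm caches are reused between reopens. The generic type T stands in for an IndexSearcher. The class name CachedHolder, its constructor parameters, and the injectable clock are assumptions for illustration, and the sketch deliberately does not close the instance it replaces.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CachedHolder<T> {
    private final Supplier<T> opener;   // how to open a fresh instance (assumption)
    private final long maxAgeMillis;    // how stale an instance may get before reopening
    private final LongSupplier clock;   // time source in milliseconds, injectable for testing
    private T current;
    private long openedAt;

    CachedHolder(Supplier<T> opener, long maxAgeMillis, LongSupplier clock) {
        this.opener = opener;
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
    }

    // Return the shared instance, reopening it only when it is too old.
    synchronized T get() {
        long now = clock.getAsLong();
        if (current == null || now - openedAt >= maxAgeMillis) {
            current = opener.get();
            openedAt = now;
        }
        return current;
    }

    public static void main(String[] args) {
        CachedHolder<String> holder =
            new CachedHolder<>(() -> "fresh instance", 60_000L, System::currentTimeMillis);
        // Both calls fall inside the 60 second window, so the same object is returned.
        System.out.println(holder.get() == holder.get());
    }
}
```

In a web application the holder itself would live in application scope. A larger maxAgeMillis means fewer cold sorts but a longer delay before new documents become visible.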

RE: Sort Performance Question

2007-03-20 Thread David Seltzer
Erik, I'm not using a cached IndexSearcher. Is this an option in an environment where the underlying index changes on a second-by-second basis? At what layer would a cached IndexSearcher be cached? At the tomcat layer? Caching at the object layer seems like it might help, but it doesn't address m

RE: Thank you...

2007-03-20 Thread bruce
hey cass... anyway you could let us know the site/app that we're powering!!! always good to see what's going on in the world! thanks -Original Message- From: Cass Costello [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 20, 2007 12:58 PM To: solr-user@lucene.apache.org; java-user@lucene

Re: Sort Performance Question

2007-03-20 Thread Erik Hatcher
Are you using a cached IndexSearcher such that successive sorts on the same field will be more efficient? Erik On Mar 20, 2007, at 3:39 PM, David Seltzer wrote: Hi All, I have a sort performance question: I have a fairly large index consisting of chunks of full-text transcript

Thank you...

2007-03-20 Thread Cass Costello
...to everyone who helps make Lucene and Solr such fantastic tools. I'm the Platform Architect for a leading online event ticket after-marketplace (think eBay for tickets), and we've just completed a 12 month project to rewrite the Browse and Search components of our customer-facing site. Both r

Sort Performance Question

2007-03-20 Thread David Seltzer
Hi All, I have a sort performance question: I have a fairly large index consisting of chunks of full-text transcriptions of television, radio and other media, and I'm trying to make it searchable and sortable by date. The search front-end uses a ParallelMultiSearcher to search up to three

Re: Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Donna L Gresh
Thanks, I see what you are saying. Seems that if I create the field at index time with term vectors stored, then I can iterate through the documents and get both the unique identifier and the terms, right? My original question was imprecise in that I'm going to want to get all the terms for *al

Re: Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Erick Erickson
Sorry, but you have to have the Lucene document ID, which you can get either as part of a Hits or HitCollector or... or by using TermDocs/TermEnum on your unique id (my_id in your example). Erick On 3/20/07, Erick Erickson <[EMAIL PROTECTED]> wrote: You can do a document.get(field), *assuming*

Re: Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Erick Erickson
You can do a document.get(field), *assuming* you have stored the data (Field.Store.YES) at index time, although you may not get stop words. On 3/20/07, Donna L Gresh <[EMAIL PROTECTED]> wrote: My apologies if this is a simple question-- How can I get all the (stemmed and stop words removed, et

Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Donna L Gresh
My apologies if this is a simple question-- How can I get all the (stemmed and stop words removed, etc.) terms in a particular field of a particular document? Suppose my documents each consist of two fields, one with the name "my_id" and a unique identifier, and the other being some text string

Re: Eliminate duplicates

2007-03-20 Thread Doron Cohen
Another option for this might be IndexWriter.updateDocument(). "Erick Erickson" <[EMAIL PROTECTED]> wrote on 18/03/2007 15:28:09: > BTW, instead of searching with a query, it might be faster > to use TermEnum on your unique field. If TermEnum finds > a term like the one you're about to add, you a

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-20 Thread Doron Cohen
Lokeya <[EMAIL PROTECTED]> wrote on 18/03/2007 13:19:45: > > Yep I did that, and now my code looks as follows. > The time taken for indexing one file is now > => Elapsed Time in Minutes :: 0.3531 > which is really great I am jumping in late so apologies if I am missing something. However I don't

Re: can't get docFreq of phrase

2007-03-20 Thread SK R
Thanks a lot. On 3/20/07, karl wettin <[EMAIL PROTECTED]> wrote: 20 mar 2007 kl. 12.14 skrev SK R: > Hi Mark, > Thanks for your reply. > Could I get this match length (docFreq) without using > searcher.search(..) ? > > One more doubt is "Performance for getting search

Re: can't get docFreq of phrase

2007-03-20 Thread mark harwood
>> Could i get this match length (docFreq) without using searcher.search(..) ? Yes, but it's likely to involve more code on your part. TermPositions is class you want to look at. See the PhraseQuery implementation for examples of how to use this. >> One more doubt is "Preformace for getting sea

Re: can't get docFreq of phrase

2007-03-20 Thread karl wettin
20 mar 2007 kl. 12.14 skrev SK R: Hi Mark, Thanks for your reply. Could I get this match length (docFreq) without using searcher.search(..) ? One more doubt is "Performance for getting search length by using searcher.search(...) is same as using reader.docFreq(..)??

Re: can't get docFreq of phrase

2007-03-20 Thread SK R
Hi Mark, Thanks for your reply. Could I get this match length (docFreq) without using searcher.search(..) ? One more doubt is "Performance for getting search length by using searcher.search(...) is same as using reader.docFreq(..)??; On 3/20/07, mark harwood <[EMAIL PROT

Re: can't get docFreq of phrase

2007-03-20 Thread mark harwood
IndexSearcher s=new IndexSearcher("/indexes/myindex"); PhraseQuery pq = new PhraseQuery(); pq.add(new Term("contents","test")); pq.add(new Term("contents","under")); int df=s.search(pq).length(); Cheers Mark - Original Message From: SK R <[EMAIL PRO

can't get docFreq of phrase

2007-03-20 Thread SK R
Hi, I can get the docFreq of a single term like (f1:test) by using indexReader.docFreq(new Term("f1","test")). But I can't get the docFreq of a phrase like (f2:"test under") by the same method. Is anything wrong in this code? Please help me to resolve this problem. Thanks & Regards RSK
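
A plain-Java sketch of why a phrase needs more work than a single term: a per-term document count can be kept as a simple tally, while a phrase count has to check that the words appear next to each other inside each document. This does not use the Lucene API. The class name PhraseDocFreqSketch, the helpers containsPhrase and phraseDocFreq, and the representation of documents as token lists are assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.List;

class PhraseDocFreqSketch {
    // True when the phrase tokens occur consecutively somewhere in the document.
    static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    // Number of documents containing the phrase at least once.
    static int phraseDocFreq(List<List<String>> docs, List<String> phrase) {
        int count = 0;
        for (List<String> doc : docs) {
            if (containsPhrase(doc, phrase)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("test", "under", "load"),
            Arrays.asList("under", "test"),
            Arrays.asList("a", "test", "under"));
        // Only the first and third documents have "test" directly followed by "under".
        System.out.println(phraseDocFreq(docs, Arrays.asList("test", "under")));
    }
}
```

The second document contains both words but in the wrong order, so it is not counted. Counting each word separately would not be able to make that distinction.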

Re: Common Words ignoring problem

2007-03-20 Thread karl wettin
20 mar 2007 kl. 07.40 skrev thomas arni: You can adapt the source code of StopAnalyzer.java in the analysis package, or I suppose you can use the default constructor with an empty stop word list (but please check this). I often do this: analyzer = new ...Analyzer(Collections.EMPTY_SET); an