Re: Extremely Large Strings Comparison (slightly off-topic)

2008-11-14 Thread Aaron Schon
Thanks for responding, Jonathan. I will look into the k-grams approach. The objects could differ by small local changes. To provide some business context, the application requires indexing email messages and attachments. If the attachments differ by some threshold (users making edits/reviews), the
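
A minimal sketch of the k-grams idea mentioned above, assuming similarity is measured as the Jaccard overlap of character k-gram sets. The function names and the default `k = 5` are illustrative choices, not anything from the thread.

```python
def kgrams(text: str, k: int = 5) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    if len(text) < k:
        # Strings shorter than k contribute themselves as a single gram.
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the k-gram sets of a and b, in [0.0, 1.0]."""
    grams_a, grams_b = kgrams(a, k), kgrams(b, k)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def closest_match(query: str, candidates: list[str], k: int = 5) -> str:
    """Return the candidate whose k-gram set is most similar to the query's."""
    return max(candidates, key=lambda c: jaccard_similarity(query, c, k))
```

Small local edits change only the k-grams that overlap the edited region, so near-duplicates keep a high score. One caveat: inserting or deleting bytes in the underlying content shifts the Base64 alignment of everything after the edit, so it is probably safer to run this on the decoded content than on the Base64 text itself.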

Extremely Large Strings Comparison (slightly off-topic)

2008-11-14 Thread Aaron Schon
Hi, I need to compare two Base64 string representations of some MIME content that I am storing within a Lucene index. I need to compare them efficiently to find the closest match to a query Base64 string after the Lucene query has run. I am not sure of the best way to approach this; could I compare the hash

Re: Storing part of the field

2008-11-14 Thread Chris Hostetter
: The application which uses the index expects this in same field. So, can't use : two fields. be careful about terminology here ... there are "org.apache.lucene.document.Field" objects, and then there are "fields" or "field names" you can index a Document containing multiple "Field" objects

Re: LUCENE-831 (complete cache overhaul) -> mem use

2008-11-14 Thread Mark Miller
It's hard to predict the future of LUCENE-831. I would bet that it will end up in Lucene at some point in one form or another, but it's hard to say if that form will be what's in the available patches (I'm a contrib committer so I won't have any real say in that, so take that prediction with a gra

Re: Storing part of the field

2008-11-14 Thread Erick Erickson
H, I don't understand payloads, but it seems to me that it *might* apply. Search the mail list for "payload" and/or look at the docs. Payloads were added after the last time I had to really dig into Lucene. But from what I've seen going by on the thread, it may be what you need. But then I cou

Re: LUCENE-831 (complete cache overhaul) -> mem use

2008-11-14 Thread Pablo Saavedra
I have the same problem with the cache and too many sorted fields, and had to implement a big workaround to be able to plug my own cache implementation into Lucene 2.3.2. What I'd really like to see in the new cache implementation is easier pluggability and extension of the Lucene classes, which is curre

Re: Phrase query-like query that doesn't require all the terms?

2008-11-14 Thread Yonik Seeley
On Fri, Nov 14, 2008 at 12:05 PM, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote: > My problem with Phrase Query is that it requires > existence of all the terms in documents. I want it to be more > permissive. I want it to match with a lower score. > Does dismax also require all the terms? The mandato

LUCENE-831 (complete cache overhaul) -> mem use

2008-11-14 Thread Britske
Hi, I recently saw activity on LUCENE-831 (Complete overhaul of FieldCache API/Implementation) which I have interest in. I posted previously on this with my concern that given the current default cache I sometimes get OOM-errors because I have a lot of fields which are sorted on, which ultimate

Re: Storing part of the field

2008-11-14 Thread Ravi L
Thanks Erick! The application which uses the index expects this in the same field. So, can't use two fields. Anyway, thank you guys for your quick responses! thanks ravi On 14-Nov-08, at 6:38 PM, Erick Erickson wrote: As far as I know you can't do this with just one field. Why do you care?

RE: Phrase query-like query that doesn't require all the terms?

2008-11-14 Thread Teruhiko Kurosaka
Yonik, Thank you for your reply. My problem with Phrase Query is that it requires existence of all the terms in documents. I want it to be more permissive. I want it to match with a lower score. Does dismax also require all the terms? > Solr's dismax parser can generate queries that do most of > t

RE: Multi-threaded indexing of large number of PDF documents

2008-11-14 Thread Sudarsan, Sithu D.
Hi All, Based on your valuable input, we tried a few experiments with the number of threads. The observation is that if the number of threads is one less than the number of cores (we have 'main' as a separate thread, so including 'main' the number of threads equals the number of cores), the indexi
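
The sizing rule described above (one worker thread fewer than the number of cores, leaving a core for 'main') can be sketched as follows. This is only an illustration of the rule, not the poster's code: `index_document` is a hypothetical placeholder for the real PDF parsing and indexing step, and the original application would be Java rather than Python.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(cores=None):
    """Workers = cores - 1 (one core left for 'main'), but never fewer than 1."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores - 1)


def index_document(path):
    """Hypothetical placeholder for parsing and indexing one PDF."""
    return f"indexed {path}"


def index_all(paths, cores=None):
    """Index every path using a pool sized by the cores - 1 rule."""
    with ThreadPoolExecutor(max_workers=worker_count(cores)) as pool:
        # map() preserves input order in its results.
        return list(pool.map(index_document, paths))
```

Whether this rule is the best setting will depend on how much of the work is I/O-bound versus CPU-bound, so it is worth measuring on the actual corpus.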

Re: Scoped Search and Facets generation using Lucene

2008-11-14 Thread Otis Gospodnetic
Hi Mayur, Solr has built-in support for facets. I don't understand what you mean by scoped searches. Could you please give a concrete example? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: "Bapat, Mayur" <[EMAIL PROTECTED]> To: ja

Re: Phrase query-like query that doesn't require all the terms?

2008-11-14 Thread Yonik Seeley
Solr's dismax parser can generate queries that do most of this... it's a combination of term queries and sloppy phrase queries. Simplest example: +(DEF GHI) "DEF GHI"~10^5 The only thing that it doesn't work for is the terms out of order (they will still be matched). You could use span queries i
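
To show how the example query string above is put together, here is a small sketch that composes it from a list of terms. It is not Solr's dismax implementation; `relaxed_phrase_query` is a hypothetical helper, and it assumes the terms contain no query-syntax special characters that would need escaping.

```python
def relaxed_phrase_query(terms, slop=10, boost=5):
    """Build a query string of the form: +(t1 t2) "t1 t2"~slop^boost

    The first clause is mandatory and holds the individual terms; the
    second is an optional sloppy phrase that is boosted when it matches.
    """
    joined = " ".join(terms)
    return f'+({joined}) "{joined}"~{slop}^{boost}'


query = relaxed_phrase_query(["DEF", "GHI"])
```

With the default OR operator, the mandatory clause matches documents containing any of the terms, while the boosted sloppy phrase raises the score of documents where the terms occur close together.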

Phrase query-like query that doesn't require all the terms?

2008-11-14 Thread Teruhiko Kurosaka
PhraseQuery requires that all the terms in the phrase exist in the field being searched. I am looking for a more permissive version of PhraseQuery which is sensitive to the order of the terms but allows missing terms, which would lower the score but still match. For example, query "DEF GHI" would

Re: About counting term hits

2008-11-14 Thread Michael McCandless
I think to do this efficiently you'd need to modify Lucene's builtin query classes (eg TermQuery) such that during the scoring process, in addition to simply computing its contribution to the document's score, it would also record further information like total number of occurrences of ea

Re: Storing part of the field

2008-11-14 Thread Erick Erickson
As far as I know you can't do this with just one field. Why do you care? Storing two fields, one indexed but not stored and one stored but not indexed shouldn't use very many resources. Best Erick On Fri, Nov 14, 2008 at 3:06 AM, Ravi L <[EMAIL PROTECTED]> wrote: > Thanks Anshum! > > This can be

Re: [ANN] Luke 0.9 released

2008-11-14 Thread mark harwood
>>BTW, if you have a small test index with multiple commit points could you >>please send it to me off the list? See the "setup" method in the junit test "TestTransactionRollbackCapability2" attached here: https://issues.apache.org/jira/browse/LUCENE-1449 Cheers, Mark - Original Message

Re: [ANN] Luke 0.9 released

2008-11-14 Thread Andrzej Bialecki
mark harwood wrote: Hi Andrzej, Thanks for the update. Looks like you've been busy adding some great new features! I think you may have a bug in opening an index with prior commit points, though. I want to keep these in my index and so I opened it in Luke selecting the "open read only" and "kee

Re: [ANN] Luke 0.9 released

2008-11-14 Thread mark harwood
Hi Andrzej, Thanks for the update. Looks like you've been busy adding some great new features! I think you may have a bug in opening an index with prior commit points, though. I want to keep these in my index and so I opened it in Luke selecting the "open read only" and "keep all commit points

Re: Storing part of the field

2008-11-14 Thread Ravi L
Thanks Anshum! That is possible. But what I am looking for is a way to do this with only one field. thanks ravi On 14-Nov-08, at 1:32 PM, Anshum wrote: Hi Ravi, In that case, you could have 2 fields. One of them would be indexed (i.e. "foo bar") and you could use the other only to store as p

Scoped Search and Facets generation using Lucene

2008-11-14 Thread Bapat, Mayur
Hi, Does Lucene support Scoped Searches? My intention is to index an XML String and search for a matching element/attribute value from that XML by specifying scope(path). Also is there any direct support for Facets building in Lucene? Regards, Mayur -

Re: Storing part of the field

2008-11-14 Thread Anshum
Hi Ravi, In that case, you could have 2 fields. One of them would be indexed (i.e. "foo bar") and you could use the other only to store as per your logic. Hope this solves your purpose. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opin