Re: SimpleFragmenter docs

2008-01-23 Thread Mike Klaas
Indeed--this is why the associated parameter is called maxAnalyzedChars in Solr. -Mike On 14-Jan-08, at 2:33 PM, Mark Miller wrote: I think you're right, and that's not the only place...the whole handling of maxDocBytesToAnalyze in the main Highlighter class shares this issue. I guess the id

Re: Wikia search goes live today

2008-01-08 Thread Mike Klaas
On 7-Jan-08, at 11:49 PM, Lukas Vlcek wrote: This would be great! I am particularly interested how they are going about customized search (if they have a plan to do it). I mean if they can reorder raw search results based on some kind of collective knowledge (which is probably kept outsid

Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas
On 17-Dec-07, at 11:39 AM, Beyer,Nathan wrote: Would using Field.Index.UN_TOKENIZED be the same as tokenizing a field into one token? Indeed. -Mike -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Monday, December 17, 2007 12:53 PM To: java-user

Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas
ext (class name) - "org.apache.lucene.document.Document" Queries that would match - "org.apache", "org.apache.lucene.document" Queries that DO NOT match - "apache", "lucene", "document" -Nathan -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Mon

Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas
On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote: I have a few fields that use package names and class names and I've been looking for some suggestions for analyzing these fields. A few examples - Text (class name) - "org.apache.lucene.document.Document" Queries that would match - "org.apache" ,
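One way to get the matching behaviour described in the question is to index every leading dot-delimited prefix of the class name as its own untokenized term. The snippet below is a minimal plain-Python sketch of that idea, not the approach recommended in the thread (the reply is cut off here); `package_prefixes` is a hypothetical helper name.

```python
def package_prefixes(class_name):
    """Return every leading dot-delimited prefix of a qualified class name."""
    parts = class_name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


# Index each prefix as a single exact-match term.
tokens = set(package_prefixes("org.apache.lucene.document.Document"))


def would_match(query):
    """A query matches only if it equals one of the indexed prefixes."""
    return query in tokens
```

With this scheme "org.apache" and "org.apache.lucene.document" are indexed terms, while "apache", "lucene" and "document" are not, which mirrors the examples given in the question.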

Re: index and access to lines of a CSV file

2007-12-13 Thread Mike Klaas
On 13-Dec-07, at 3:26 PM, Tobias Rothe wrote: I got a quick question. I am handling huge CSV files. They start with a key in the first column and are followed by data. I need to retrieve this data randomly based on the key. So it is kind of a search where I give a unique key and ideally ac

Re: Custom query parser

2007-11-22 Thread Mike Klaas
On 22-Nov-07, at 8:49 AM, Nicolas Lalevée wrote: On Thursday 22 November 2007, Matthijs Bierman wrote: Hi Nicolas, Why can't you extend the QueryParser and override the methods you want to modify? Because the query parser I would like to have is a very basic user one, à la Google. The s

Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas
On 6-Nov-07, at 3:02 PM, Paul Elschot wrote: On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote: Wait--shouldn't the outer-most BooleanQuery provide most of this speedup already (since it should be skipTo'ing between the nested BooleanQueries and the outermost). Is it the indir

Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas
On 29-Oct-07, at 9:43 AM, Paul Elschot wrote: On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e) where the second one is a result from BooleanQuery in BooleanQuery
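The two query forms quoted above select the same documents, because a conjunction of required clauses is associative. As a rough illustration, the plain-Python sketch below flattens a nested structure of all-required clauses into the single-level form. The nested-list representation and the name `flatten_required` are assumptions made for this example, not Lucene API, and the flattening is only valid when every clause at every level is required (+); scores may also differ between the two forms.

```python
def flatten_required(clauses):
    """Flatten nested groups in which every clause is required (+).

    `clauses` is a list whose items are either a term string such as
    "prop1:a" or a nested list of the same shape.
    """
    flat = []
    for clause in clauses:
        if isinstance(clause, list):
            flat.extend(flatten_required(clause))
        else:
            flat.append(clause)
    return flat


# (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)
nested = [[[["prop1:a", "prop2:b"], "prop3:c"], "prop4:d"], "prop5:e"]
flat_query = " ".join("+" + term for term in flatten_required(nested))
```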

Re: Generalized proximity query performance

2007-10-05 Thread Mike Klaas
On 5-Oct-07, at 11:27 AM, Chris Hostetter wrote: that's what i thought first too, and it is a problem i'd eventually like to tackle ... it was the part about "c" being in a different field from "a" and "b" that confused me ... i don't know what that exactly is being suggested here. I'm

Re: Generalized proximity query performance

2007-10-05 Thread Mike Klaas
On 5-Oct-07, at 10:54 AM, Chris Hostetter wrote: : I am using a hand rolled query of the following form (implemented with : SpanNearQuery, not a sloppy PhraseQuery): : a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5 : : The obvious solution, "a b c"~5, is not applicable for my issues, becaus
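The rewrite rule quoted above (all terms required, plus one sloppy pair for each adjacent pair of terms) can be written out mechanically. The sketch below only reproduces the query-string notation used in the message; the original poster builds the clauses with SpanNearQuery rather than from a string, and `proximity_query` is a hypothetical helper name.

```python
def proximity_query(terms, slop=5):
    """Build: +(t1 AND t2 AND ...) OR "t1 t2"~slop OR "t2 t3"~slop ..."""
    required = "+(" + " AND ".join(terms) + ")"
    # One sloppy phrase per pair of adjacent terms.
    pairs = ['"%s %s"~%d' % (a, b, slop) for a, b in zip(terms, terms[1:])]
    return " OR ".join([required] + pairs)


query = proximity_query(["a", "b", "c"])
```

For n terms this produces n - 1 proximity clauses, so the query grows linearly with the number of terms.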

Re: BoostingTermQuery performance

2007-10-02 Thread Mike Klaas
On 2-Oct-07, at 3:44 PM, Peter Keegan wrote: I have been experimenting with payloads and BoostingTermQuery, which I think are excellent additions to Lucene core. Currently, BoostingTermQuery extends SpanQuery. I would suggest changing this class to extend TermQuery and refactor the current v

Re: Tokenization question

2007-09-13 Thread Mike Klaas
On 13-Sep-07, at 12:37 PM, Dan Luria wrote: What I do is Doc1 = source_doc Doc2 = new Document() foreach (field f in doc1.getfields) { Doc2.Add(new Field(doc1.getField(key), doc1.getField(value)); } but when i pull the fields from Doc1, i never get the tokenized field.. it just doesn't appea

Re: Storing Host and IP Information in Lucene

2007-09-11 Thread Mike Klaas
On 10-Sep-07, at 8:37 PM, AnkitSinghal wrote: But i think the query like host:example* will not work in this case Actually it was typo in my question. I want to search for above type of query only. Hosts are best stored in reverse domain format: xyz.example.com -> com.example.xyz Then yo
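The reverse-domain transformation mentioned above is a simple label reversal. The sketch below shows it in plain Python; the rest of the reply is cut off, so the prefix-matching step is an assumption about why the reversed form is useful, and both function names are hypothetical.

```python
def reverse_host(host):
    """xyz.example.com -> com.example.xyz"""
    return ".".join(reversed(host.split(".")))


def in_domain(indexed_value, domain):
    """True if the reversed host lies in `domain` (given in normal order)."""
    prefix = reverse_host(domain)
    # Match the domain itself or any subdomain, but not e.g. example.commerce.
    return indexed_value == prefix or indexed_value.startswith(prefix + ".")


indexed = reverse_host("xyz.example.com")
```

Stored this way, "everything under example.com" becomes a prefix match on `com.example`, which a term index can typically answer far more cheaply than a leading-wildcard match.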

Re: Indexing Speed using Java Lucene 2.0 and Lucene.NET 2.0

2007-09-10 Thread Mike Klaas
On 10-Sep-07, at 5:59 AM, Laxmilal Menaria wrote: Hello Everyone, I have created a Index Application using Java lucene 2.0 in java and Lucene.Net 2.0 in VB.net. Both application have same logic. But when I have indexed a database with 14000 rows from both application and same machine, I sur

Re: Extract terms not by reader, but by documents

2007-09-06 Thread Mike Klaas
On 6-Sep-07, at 11:48 AM, Grant Ingersoll wrote: On Sep 6, 2007, at 1:32 PM, Rafael Rossini wrote: Karl, I'm aware of IndexReader.getTermFreqVector, with this I can get all terms of a document, but I want all terms of a document that matched a query. Grant, Yes, I think I understand.

Re: Search performance question

2007-09-06 Thread Mike Klaas
On 6-Sep-07, at 4:41 AM, makkhar wrote: Hi, I have an index which contains more than 20K documents. Each document has the following structure : field : ID (Index and store) typical value - "1000" field : parameterName(index and store) typical value

Re: Highlighter that works with phrase and span queries

2007-08-29 Thread Mike Klaas
some auxiliary helper classes) for the old contrib Highlighter. Since the contrib Highlighter is pretty hardened at this point, I figured that was the best way to go. Or do you mean something different? - Mark Mike Klaas wrote: Mark, I'm still interested in integrating this into Solr-

Re: Lucille, a (new) Python port of Lucene

2007-08-28 Thread Mike Klaas
Not to mention Lupy. Hasn't it been relatively well-established that trying to create a performant search engine in a dynamic interpreted language is a show-stopper? After several failed ports of lucene (I can add to this my own, unreleased, attempt) I just don't see the point, except as a

Re: Indexing time linear?

2007-08-28 Thread Mike Klaas
On 23-Aug-07, at 2:48 AM, Barry Forrest wrote: Hi list, I'm trying to estimate how long it will take to index 10 million documents. If I measure how long it takes to index say 10,000 documents, can I extrapolate? Will it take roughly 1000 times longer to do the whole set? Segment mergin

Re: Highlighter that works with phrase and span queries

2007-08-27 Thread Mike Klaas
Mark, I'm still interested in integrating this into Solr--this is a feature that has been requested a few times. It would be easier to do so if it were a contrib/... thanks for the great work, -Mike On 27-Aug-07, at 4:21 AM, Mark Miller wrote: I am a bit unclear about your question. The

Re: Indexing

2007-08-24 Thread Mike Klaas
Note that Solr is expressly designed for this kind of thing: every time you commit, a new searcher is opened in the background, warmed, and then swapped with the current one. It also supports autocommit after X updates, or after the oldest update passes X milliseconds without being commit

Re: speedup indexing

2007-08-06 Thread Mike Klaas
On 6-Aug-07, at 5:49 PM, Chris Lu wrote: Seems this issue,LUCENE-834, is about query payload https://issues.apache.org/jira/browse/LUCENE-834 Can it help on indexing speed? That should be: https://issues.apache.org/jira/browse/LUCENE-843 On 8/6/07, testn <[EMAIL PROTECTED]> wrote: 2.

Re: Getting only the Ids, not the whole documents.

2007-08-03 Thread Mike Klaas
You still have a disk seek per doc if the index can't fit in memory (usually more costly than reading the fields). Why not use FieldCache? -Mike On 2-Aug-07, at 5:41 PM, Mark Miller wrote: If you are just retrieving your custom id and you have more stored fields (and they are not tiny) yo

Re: Performance improvements using writer.delete vs reader.delete

2007-08-03 Thread Mike Klaas
On 3-Aug-07, at 3:27 AM, Mark Miller wrote: Also, IndexWriter probably buffers better than you would. If you buffer a delete with IndexWriter and then add a document that would be removed by that delete right after, when the buffered deletes are flushed, your latest doc will not be removed

Re: More IP/MAC indexing questions

2007-08-01 Thread Mike Klaas
On 1-Aug-07, at 11:34 AM, Joe Attardi wrote: On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote: Use a SpanNearQuery with a slop of 0 and specify true for ordering. What that will do is require that the segments you specify must appear in order with no gaps. You have to construct this your

Re: Lucene Field score value

2007-07-31 Thread Mike Klaas
You can boost any clause of a query: http://lucene.apache.org/java/docs/queryparsersyntax.html title:foo^5 header:foo^2 body:foo On 31-Jul-07, at 1:00 PM, Askar Zaidi wrote: I'll have to use StringBuffer and get the Explanation in it as a String. Then parse StringBuffer to get the scores of

Re: Delete corrupted doc

2007-07-26 Thread Mike Klaas
On 26-Jul-07, at 10:18 AM, Rafael Rossini wrote: Yes, I optimized, but in the with SOLR. I don't know why, but when optimize an index with SOLR, it leaves you with about 15 files, instead of the 3... You are probably not using the compound file format. Try setting: true in solrconfig

Re: Index partitioning by term

2007-07-04 Thread Mike Klaas
On 4-Jul-07, at 5:31 AM, Ndapa Nakashole wrote: I am considering using Lucene in my mini Grid-based search engine. I would like to partition my index by term as opposed to partition by document. From what i have read in the mailing list so far, it seems like partition by term is impossible

Re: product based term combination for BooleanQuery?

2007-07-04 Thread Mike Klaas
On 3-Jul-07, at 4:43 PM, Tim Sturge wrote: Here's the explain output I currently get for "George Bush" "George W Bush", "John Kerry" "John Denver" and "John Bush". (there are others in between, but they follow very much the same pattern; an enormous score for one of "John" or "Bush" and a v

Re: product based term combination for BooleanQuery?

2007-07-03 Thread Mike Klaas
Try out: http://issues.apache.org/jira/browse/LUCENE-850 If this is useful to you, be sure to add a comment to the issue. -Mike On 3-Jul-07, at 10:51 AM, Tim Sturge wrote: I'm following myself up here to ask if anyone has experience or code with a BooleanQuery that weights the terms it encou

Re: Highlighter that works with phrase and span queries

2007-06-20 Thread Mike Klaas
On 19-Jun-07, at 3:39 PM, Mark Miller wrote: I have been working on extending the Highlighter with a new Scorer that correctly scores phrase and span queries. The highlighter is working great for me, but could really use some more banging on. If you have a need or an interest in a more accu

Re: documents with large numbers of fields

2007-05-18 Thread Mike Klaas
On 18-May-07, at 1:01 PM, charlie w wrote: So now I have the idea to invert the field name and value thusly: foo=tag ^2 bar=tag ^1.2 foobar=tag^1.8 and search "foo:tag". Intuitively, I would expect Lucene to be optimized for searching the values of fields, and not really the names

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Mike Klaas
On 17-May-07, at 6:43 AM, Andreas Guther wrote: I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned

Re: Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

2007-04-30 Thread Mike Klaas
On 4/30/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Thanks for your reply. We are still using Lucene v1.4.3 and I'm not sure if upgrading is an option. Is there another way of disabling length normalization/document boosts to get rid of those files? Why not raise the limit of open files

Re: adding a field at index-time

2007-04-19 Thread Mike Klaas
On 4/18/07, William Mee <[EMAIL PROTECTED]> wrote: I'd like to add metadata which I get *after* indexing a document's contents to the index. To be more specific: I'm implementing shingling (detection of near-duplicate documents) and want to add the document fingerprint (which is based on the s

Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote: > Unicode characters do not map > precisely to code points: a single character can often be represented > via a single codepoint or a combination of two (surrogate pa

Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I have encountered a problem searching in my application because of : inconsistent unicode normalization forms in the corpus (and the : queries). I would like to normalize to form NFKD in an analyzer (I : think). I was thinking about creat
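To make the normalization problem concrete: the same visible character can be stored as one precomposed code point or as a base letter followed by a combining mark, and the two forms do not compare equal until both are normalized. The sketch below uses Python's standard `unicodedata` module purely as an illustration; the thread itself concerns doing this inside a Lucene analyzer, which is not shown here.

```python
import unicodedata


def nfkd(text):
    """Normalize text to Unicode normalization form NFKD."""
    return unicodedata.normalize("NFKD", text)


composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
ligature = "\ufb01"      # the "fi" ligature, a compatibility character

same_before = composed == decomposed
same_after = nfkd(composed) == nfkd(decomposed)
```

The same normalization would have to be applied to both the indexed text and the query text, otherwise the mismatch simply moves from the corpus to the query.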

Re: Standard Parser Behavior

2007-04-10 Thread Mike Klaas
On 4/10/07, Walt Stoneburner <[EMAIL PROTECTED]> wrote: Furthermore syntax like +(-A +B) and -(-A +B) appear to be legal to Luke, though I have no clue what this even means in simple English. Let me try: +(-A +B) -> must match (-A +B) -> must contain B and must not contain A -(-A +B) -> must

Re: search-time boosting

2007-04-02 Thread Mike Klaas
On 4/2/07, Ofer Nave <[EMAIL PROTECTED]> wrote: I'd like to be able to boost documents at search-time, and I'm not sure how to do it. Example: I'm building a search engine for products (comparison shopping). Many queries tend to indicate a category (i.e., 'digital cameras') as opposed to a pro

Re: index file size threshold affecting search performance?

2007-03-28 Thread Mike Klaas
On 3/28/07, Scott Oshima <[EMAIL PROTECTED]> wrote: So I assumed a linear decay of performance as an index got bigger. For some reason when going from an index size of 1.89 to 1.95 gigs dramatically increased cpu across all of our servers. I was thinking of splitting the 1.95 index into 2 separ

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas
On 3/8/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: if the issue is that you want to be able to ship an index that people can manipulate as much as they want and you want to guarantee they can never reconstruct the original docs you're pretty much screwed ... even if you eliminate all of the po

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas
On 3/8/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : If you store a hash code of the word rather than the actual word you : should be able to search for stuff but not be able to actually retrieve that's a really great solution ... it could even be implemented as a TokenFilter so none of your c
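The hashed-token idea quoted above can be sketched outside Lucene as a plain function applied to each token at index time and to each query term at search time. This is an illustration under assumptions: SHA-1 and the optional salt are choices made for the example, the function names are hypothetical, and no actual TokenFilter is shown.

```python
import hashlib


def hash_token(token, salt=b""):
    """Replace a token with a hex digest so the index never holds the word."""
    return hashlib.sha1(salt + token.encode("utf-8")).hexdigest()


def hash_tokens(tokens, salt=b""):
    """Apply the same hashing to a token stream (index time or query time)."""
    return [hash_token(t, salt) for t in tokens]


indexed_terms = set(hash_tokens(["index", "a", "source"]))
query_hit = hash_token("source") in indexed_terms
query_miss = hash_token("secret") in indexed_terms
```

Exact-term lookups keep working because equal words hash to equal digests, but prefix and wildcard matching do not survive hashing, and anyone who knows the hashing scheme can still hash a dictionary of candidate words and test them against the index.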

Re: [Fwd: Re: indexing performance]

2007-03-01 Thread Mike Klaas
On 3/1/07, Saravana <[EMAIL PROTECTED]> wrote: Does this still hold good now? Thanks for your reply. Probably most of that still applies to some extent. However, it is unclear whether it will speed up your application. First thing is to find out what your bottleneck is. Looking at the stats

Re: NO_NORMS and TOKENIZED?

2007-02-16 Thread Mike Klaas
On 2/16/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote: Solved... with fixed field definitions. <> Imagine a world with no search-time/index-time analyzer mismatches... I'm sure Yonik can imagine such a world... that's what Solr provides. Configure an analyzer for a field (or even separate

Re: 'a', 's' and 't' don't index properly

2007-02-08 Thread Mike Klaas
On 2/8/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote: Is there a .NET version of Solr? Nope. -Mike

Re: Re: for admins: mailing list like spam

2006-11-03 Thread Mike Klaas
On 11/3/06, Patrick Turcotte <[EMAIL PROTECTED]> wrote: > > It will make mails list more easy to read (I am using gmail and I do > not have client-side filters). That is not true. You can have labels, and, if you look at the top of the page, right beside the "Search the Web" button, you have

Re: Putting some constraints on index optimization

2006-10-27 Thread Mike Klaas
On 10/27/06, Stanislav Jordanov <[EMAIL PROTECTED]> wrote: Have the following problem with (explicitly invoked) index optimization - it seems to always merge all existing index segments into a single huge segment, which is undesirable in my case. Is there a way to force index optimization to hono

Re: "Catalog" backend for document stored fields?

2006-10-20 Thread Mike Klaas
On 10/20/06, Robichaud, Jean-Philippe <[EMAIL PROTECTED]> wrote: 3- Any ideas on how else I could do this? I'm fully open to discussion! How about not storing the fields at all, but storing term vectors, and reconstructing the data from termpositions + terminfo? -Mike -

Re: Looking for a stemmer that can return all inflected forms

2006-10-14 Thread Mike Klaas
On 10/14/06, Jong Kim <[EMAIL PROTECTED]> wrote: Hi, I'm looking for a stemmer that is capable of returning all morphological variants of a query term (to be used for high-recall search). For example, given a query term of 'cares', I would like to be able to generate 'cares', 'care', 'cared', a

Re: Re[2]: strange behavior 4 query term boost

2006-09-28 Thread Mike Klaas
On 9/27/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Found the reason, it is a bug IMHO. The example should be: A: term1^5 term2^6 term3^7 B: term1^5E-4 term2^6E-4 term3^7E-4 C: term1^0.0006 term2^0.0006 term3^0.0007 A & C suppose return the same rank B is different Since B will be parsed