Re: N-dimensional Point Indexing

2018-10-17 Thread Ken Krugler
Solr. Before I go much further, is there anything like this already done, or in the works? Thanks, — Ken > On Feb 26, 2018, at 4:24 PM, Luís Filipe Nassif wrote: > > Thank you, Adrian. > > Em 26 de fev de 2018 21:19, "Adrien Grand" escreveu: > >> Yes it

Best way to plug in alternative range query support

2016-05-19 Thread Ken Krugler
is there a better way to handle this? I’m particularly curious about splicing this into something like Solr. Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: issue with IndexUpgrader

2015-01-29 Thread Ken
Hi Uwe, This is what I expected. I have already begun moving down the path of filesystem magic even though it strikes me as an ugly hack :) Thanks, Ken - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For

issue with IndexUpgrader

2015-01-28 Thread Ken
Hi, I'm trying to run the 3.6.2 and 4.7.2 IndexUpgrader operations on a set of prior version Lucene indexes and I'm running into trouble with some corner case indexes. Some (unknown set) of these indexes are just placeholders, they have been created but no documents have been added to them ye

RE: in-memory terms dictionary/Lucene-3069

2015-01-26 Thread Ken Krugler
Hi Mike, Has anyone tried back-porting the FSTPosting(s)Format to 4.6? Also https://issues.apache.org/jira/browse/LUCENE-3069 has a fixed version of 4.7, but your comment below (and the code) make me think this isn't correct, as it doesn't seem to be in a released version yet. Thank

Re: suppressing FreqProxPostingsArray

2012-03-20 Thread Ken McCracken
Hi Mike, Thanks for the response. We will do some more investigation. We will look to see if there is a clean way to suppress at least the extra 3 array allocations. Cheers, -Ken On Mar 19, 2012, at 5:32 PM, Michael McCandless > wrote: Hmm, I agree we could be more RAM efficient

suppressing FreqProxPostingsArray

2012-03-19 Thread Ken McCracken
positions etc suppressed? It seems that the reason I get an OutOfMemoryError is that 7 int[] of size proportional to number of unique fields are being constructed; however, at least some of them are probably wasteful given my indexing configurations. Any help is appreciated. Thanks, -Ken

Re: Lucene Challenge - sum, count, avg, etc.

2010-03-31 Thread Ken Krugler
and then turn the query into a set of affiliate_id x date range queries. Something like: affiliate_id: and (day:59 or day:60 or day:61 or week:10 or week:11 or week:12 or day:86 or day:87...) -- Ken On Mar 31, 2010, at 6:17pm, Michel Nadeau wrote: Hi, We're currently in the proces
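The day/week rollup suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: days are numbered from 0, week w covers days 7w through 7w+6, and every document is indexed with both a day term and a week term. The class name DayWeekRollup and its methods are hypothetical, and the numbering need not match the one used in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: covers an inclusive day range with as few terms as possible.
public class DayWeekRollup {

    // Emits a week term wherever a whole week fits, single day terms otherwise.
    // Assumption: week w covers days 7*w .. 7*w+6 (zero-based day numbers).
    public static List<String> cover(int firstDay, int lastDay) {
        List<String> terms = new ArrayList<>();
        int day = firstDay;
        while (day <= lastDay) {
            if (day % 7 == 0 && day + 6 <= lastDay) {
                terms.add("week:" + (day / 7));
                day += 7;
            } else {
                terms.add("day:" + day);
                day++;
            }
        }
        return terms;
    }

    // Joins the terms into one parenthesized OR clause.
    public static String toClause(List<String> terms) {
        return "(" + String.join(" OR ", terms) + ")";
    }
}
```

For days 59 through 87 this yields four day terms, three week terms, and four more day terms, instead of 29 separate day terms.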

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Ken Krugler
h. We wound up using ANTLR for this. -- Ken On Aug 20, 2009, at 8:09am, Valery wrote: Hi Robert, thanks for the hint. Indeed, a natural way to go. Especially if one builds a Tokenizer of the level of quality like StandardTokenizer's. OTOH, you mean that the out-of-the-box stuff is

Re: Analyzing performance and memory consumption for boolean queries

2009-06-23 Thread Ken Krugler
the matched entries. 6. Having most of the index loaded into the OS cache was the biggest single performance win. So if you've got 3 GB of unused memory on a server, limiting the size of the index to some low multiple of 3GB would be a good target. -- Ken Our query performance is sur

Re: Synchronizing Lucene indexes across 2 application servers

2009-06-20 Thread Ken Krugler
e Katta has added an index to both systems, then you can switch to it (and eventually remove the old index). The fact that you'd need two Katta "masters" makes things a bit more interesting, as you'd have to coordinate when they both decide to switch to using the new index(es).

Re: Distributed Lucene Questions

2009-06-01 Thread Ken Krugler
buted search support inside of Nutch. And Solr has distributed search support, though it's still pretty new. -- Ken -- Ken Krugler +1 530-210-6378

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 4:23 PM, "Ken Williams" wrote: > On 3/2/09 1:58 PM, "Erik Hatcher" wrote: > >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> In the output, I get explanations like "0.88922405 = (MATCH) product >>> of:"

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 1:58 PM, "Erik Hatcher" wrote: > > On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >> In the output, I get explanations like "0.88922405 = (MATCH) product >> of:" >> with no details. Perhaps I need to do something different in >>

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 4:19 PM, "Steven A Rowe" wrote: > On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> Also, while perusing the threads you refer to below, I saw a >>> reference to the following link, which see

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
" with no details. Perhaps I need to do something different in indexing? Thanks, -Ken On 2/26/09 10:36 AM, "Grant Ingersoll" wrote: > I don't know of anyone doing work on it in the Lucene community. My > understanding to date is that it is not really worth trying, b

Re: Restricting the result set with hierarchical ACL

2009-03-02 Thread Ken Krugler
't use the typical approach of having a doc field with every group in it, then adding a required subclause to your query with every group as a boolean OR term. -- Ken -- Ken Krugler +1 530-210-6378
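For reference, the "typical approach" described here (one doc field holding every group, plus a required OR clause over the user's groups) might look like the sketch below. It only builds a query string in the classic Lucene query syntax, where a leading + marks a required clause. The field name acl, the group names, and the class AclClause are hypothetical, and no escaping of special characters is attempted.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: restricts a user query to documents visible to any of the given groups.
public class AclClause {

    // Returns "+(userQuery) +(field:g1 OR field:g2 ...)".
    // Assumption: group names contain no query-syntax special characters.
    public static String restrict(String userQuery, String aclField, List<String> groups) {
        if (groups.isEmpty()) {
            throw new IllegalArgumentException("at least one group is required");
        }
        String anyGroup = groups.stream()
                .map(group -> aclField + ":" + group)
                .collect(Collectors.joining(" OR "));
        return "+(" + userQuery + ") +(" + anyGroup + ")";
    }
}
```

The OR clause grows with the number of groups the user belongs to, which is presumably why the approach becomes awkward for deep hierarchies.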

Re: How to compute the simlarity of a web page?

2009-02-25 Thread Ken Krugler
use to generate the target term vector, etc. Something we didn't do, which seemed valuable, would be to use phrases vs. single terms, along the lines of Amazon's SIPs (statistically improbable phrases). -- Ken On 2009-02-16 at 22:08 -0500, Grant Ingersoll wrote: Hmmm, you

Re: Confidence scores at search time

2009-02-25 Thread Ken Williams
Hi all, I didn't get a response to this - not sure whether the question was ill-posed, or too-frequently-asked, or just not interesting. But if anyone could take a stab at it or let me know a different place to look, I'd really appreciate it. Thanks, -Ken On 2/20/09 12:00 PM, "

Confidence scores at search time

2009-02-20 Thread Ken Williams
.html Thanks. -- Ken Williams Research Scientist The Thomson Reuters Corporation Eagan, MN

Re: Implement a relaxed PhraseQuery?

2008-03-23 Thread Ken Krugler
h on subject == "alternative scoring algorithm for PhraseQuery". I believe Paul Elschot gave him some useful input, but then Philipp seemed to have dropped off the list...and he didn't respond to my email asking him if he was able to co

Re: Indexing source code files

2008-02-28 Thread Ken Krugler
essentially synonym processing, where you turn a single term into multiple terms based on the automatic splitting of the term using '_', '-', camelCasing, letter/digit transitions, etc. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378
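A minimal sketch of the term splitting described above, assuming the simplest possible rules: '_' and '-' are separators, and a new part starts at a lowercase-to-uppercase change or at a letter/digit change. The class IdentifierSplitter is hypothetical and is not the code used at Krugle. Runs of capitals such as "HTTPRequest" are left unsplit by these rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a source-code identifier into sub-terms.
public class IdentifierSplitter {

    public static List<String> split(String term) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '_' || c == '-') {
                flush(current, parts);          // separators end the current part and are dropped
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean camelCase = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean letterDigit = (Character.isLetter(prev) && Character.isDigit(c))
                        || (Character.isDigit(prev) && Character.isLetter(c));
                if (camelCase || letterDigit) {
                    flush(current, parts);
                }
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    // Moves the accumulated characters into the result list, if there are any.
    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }
}
```

In an analyzer, the parts would typically be emitted alongside the original term, which is what makes this behave like synonym processing.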

Re: alternative scoring algorithm for PhraseQuery

2007-10-17 Thread Ken Krugler
here helped you finish your FuzzyPhraseQuery (or FuzzySpanQuery) addition to Lucene. Thanks, -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it"

Re: Serving remote lucene client - RMI vs HTTP

2007-07-15 Thread Ken Krugler
] Nutch already supports distributed Lucene searchers, using Hadoop RPC. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it"

Re: boosting different parts of the same field

2007-05-31 Thread Ken Krugler
in Solr (where you can easily specify this type of combo field) is to add the field I want to boost multiple times. It's very coarse granularity, but it works. See a discussion of this recently on the Solr mailing list. -- Ken wojtek On 5/31/07, Donna L Gresh <[EMAIL PROTECTED]> wr

Re: UTF8 accents & umlauts filter?

2006-09-12 Thread Ken Krugler
sort key. This is pretty complex, especially when you start considering locale-specific details - we used ICU support for this in the past, which is where I'd probably start. ICU needs a lot of data to handle this properly across most locales, so it's not lightweight, but it would gi

Plus factor in returned results

2006-08-02 Thread Ken Kinder
I'd like to start with a standard parsed query, then combine it with another that requires a field's untokenized value to be inside of a set. The catch is, I want the document's position in that set to be included in the scoring. So I want to search for "chinese restaurant", but only for these

Re: Where to find drill-down examples (source code)

2006-07-21 Thread Ken Krugler
/search/lucene/query/DateIntervalQuery.java -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"

Out-of-order distinct

2006-06-14 Thread Ken Kinder
I've poked around on Google and the archives quite a bit, but I can't find exactly what I need. Say I have a query that would normally return a set of documents: 1 002 (text...) 2 001 (text...) 3 001 (text...) 4 002 (text...) 5 004 (text...) I'd like that modified to be: 1 002 (text...) 2 001

Re: Multisearcher Lucene IOException

2006-06-04 Thread Ken Krugler
I don't think it's a bad index. After seeing a few postings about this same general problem, I'm guessing there's a bug hiding someplace. Sorry to not have a better answer... -- Ken -- Ken Krugler Krugle, Inc. +1 53

Re: BufferedIndexInput.readByte performance

2006-05-26 Thread Ken Krugler
g required to pick the right cut-off value for searches. Thanks, -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"

Re: Checking for duplicates inside index

2006-05-22 Thread Ken Krugler
ill need a big sum though. MD5? Just as a reference, Nutch uses an MD5 digest to detect duplicate web pages. It works fine, except of course when two docs differ by only an insignificant text delta. There's some recent work in this area - check out TextProfileSignature. -- Ken -- Ken K
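The MD5-based duplicate check mentioned above can be sketched with the JDK's MessageDigest. This is an assumption-laden illustration, not Nutch's code: the class Md5Dedup is hypothetical, the digest is taken over the raw UTF-8 text, and seen digests are simply kept in an in-memory set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: flags documents whose exact text has been seen before.
public class Md5Dedup {

    private final Set<String> seen = new HashSet<>();

    // Returns the MD5 digest of the text as 32 lowercase hex characters.
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Records the digest and reports whether it had already been recorded.
    public boolean isDuplicate(String text) {
        return !seen.add(md5Hex(text));
    }
}
```

As the message notes, any difference in the text, even a trailing space, produces a different digest, so near-duplicates are not caught by this kind of exact signature.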

Re: How are results merged from a multisearcher?

2006-05-18 Thread Ken Krugler
On Donnerstag 18 Mai 2006 18:36, Ken Krugler wrote: > >Could someone describe how the results from multiple indices are merged > when using a MultiSearcher? My naive intuition is that the scores for > documents found in each index could be wildly different, so what > crit

Re: How are results merged from a multisearcher?

2006-05-18 Thread Ken Krugler
selection of indices that get merged to form the N final indices. This randomization helps avoid the IDF skew problem. There's a Jira issue on the Nutch side (see NUTCH-92) around this same problem. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Fi

Re: Scoring without floating point calculations

2006-04-28 Thread Ken Krugler
t scoring algorithm. You can always add the log of the score versus doing a multiplication, but that would still involve a lot of source code changes. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"
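The log trick works because log(a * b) = log(a) + log(b) and the logarithm is monotonic, so ranking by a sum of logs gives the same order as ranking by the product. The sketch below goes one step further and stores the logs as scaled integers so that search-time scoring needs only integer addition. The fixed-point scale of 1024 and the class LogScore are assumptions made for illustration, not something proposed in the thread.

```java
// Hypothetical helper: replaces multiplication of score factors with addition of integer logs.
public class LogScore {

    // Assumption: 10 fractional bits of precision are enough for ranking.
    private static final int SCALE = 1024;

    // Done ahead of time (floating point allowed): natural log of a positive factor, scaled and rounded.
    public static long fixedLog(double factor) {
        return Math.round(Math.log(factor) * SCALE);
    }

    // Done at search time: integer additions only.
    public static long sum(long... fixedLogs) {
        long total = 0;
        for (long value : fixedLogs) {
            total += value;
        }
        return total;
    }
}
```

Rounding means two nearly equal products can end up tied or swapped, so this preserves order only up to the chosen precision.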

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Ken Krugler
against it. And yes, with a bunch of servers that all have 4GB of RAM, I'd be interested in the patch :) Thanks for creating it. -- Ken Doug Cutting <[EMAIL PROTECTED]> wrote: RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a

Re: Multiple terms with the same position in PhraseQuery

2005-11-06 Thread Ken Krugler
project files - and I don't put them into the Eclipse Workspace directory. b. Then launch Eclipse and create a new Java project, importing the files from the external (SVN-controlled) location. -- Ken -- Ken Krugler Krugle,

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Ken Krugler
andard Java serialization support. So I doubt this would be a slam-dunk in the Lucene community. -- Ken # #!/usr/bin/perl use strict; use warnings; # illegal_null.plx -- Perl complains about non-shortest-form null. my $data = "foo\xC0\x80\n
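The Perl fragment above is about the two-byte sequence C0 80, a non-shortest-form encoding of U+0000 that strict UTF-8 decoders reject. The JDK produces exactly that sequence in its "modified UTF-8", as written by DataOutputStream.writeUTF, which the sketch below compares with standard UTF-8. Whether Lucene's on-disk format of the time matched the JDK's byte for byte is not shown in this excerpt; the class ModifiedUtf8Null is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows how the JDK's modified UTF-8 differs from standard UTF-8 for NUL.
public class ModifiedUtf8Null {

    // writeUTF emits a 2-byte big-endian length followed by the modified UTF-8 bytes.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Standard UTF-8 encodes U+0000 as the single byte 00.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For the one-character string containing U+0000, modifiedUtf8 returns the length prefix 00 02 followed by C0 80, while standardUtf8 returns the single byte 00.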

Re: i18n query normalization

2005-08-23 Thread Ken Krugler
are tokenizers already built for lucene. Search the archives for a discussion about this, back in June I believe. I'd suggested using ICU to generate sort keys, and indexing those. -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 5

Re: NGram Language Categorization Source

2005-08-20 Thread Ken Krugler
M product(s) to get it) so what you've done is great for the open source community - thanks! Also I could post to the Unicode list re training data in multiple languages, as that's a good place to find out about multilingual corpora. -- Ken -- Ken Krugler TransPac Software, Inc.

Re: Indexing puncutation

2005-06-29 Thread Ken Krugler
e for the conversion, but in general that shouldn't matter. Two other issues are code/data size (ICU can be big) and the performance hit while indexing documents. -- Ken Aigner, Thomas wrote: Hello all, I am VERY new to Lucene and we are trying out Lucene to see if it will acco

Re: Looking for someone to develop Thai Lucene Analyzer

2005-06-22 Thread Ken Krugler
in a Java implementation, so this shouldn't be all that hard. See <http://www-306.ibm.com/software/globalization/topics/thaiusabilities/text.jsp> -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com>

Re: Question for Wildcard Search:

2005-06-22 Thread Ken Krugler
aining the tokens in reversed character order. Won't help for *foo* though. You can also index ngrams - say 3-grams. Every word gets tokenized & indexed as a sequence of three letter sub-strings. E.g. "tokenized" would be indexed as "tok" "oke" "ken"
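Both ideas from this message can be sketched in a few lines of plain Java. The class WildcardTerms and its methods are hypothetical. A real setup would do this inside an analyzer and index the results in separate fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: produces the extra terms used to support wildcard searches.
public class WildcardTerms {

    // A leading-wildcard query like *ized becomes a prefix query on the reversed token.
    public static String reversed(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // All contiguous substrings of length n, in order; empty if the token is shorter than n.
    public static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }
}
```

With these, "tokenized" reversed is "dezinekot", so *ized can be answered as the prefix dezi* on the reversed field, and the 3-grams of "tokenized" are tok, oke, ken, eni, niz, ize, zed.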