Solr.
Before I go much further, is there anything like this already done, or in the
works?
Thanks,
— Ken
> On Feb 26, 2018, at 4:24 PM, Luís Filipe Nassif wrote:
>
> Thank you, Adrian.
>
> On Feb 26, 2018, at 21:19, "Adrien Grand" wrote:
>
>> Yes it
is there a better way to handle this? I’m particularly curious about
splicing this into something like Solr.
Thanks,
— Ken
--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Hi Uwe,
This is what I expected. I have already begun moving down the path of
filesystem magic even though it strikes me as an ugly hack :)
Thanks,
Ken
Hi,
I'm trying to run the 3.6.2 and 4.7.2 IndexUpgrader operations on a set of
prior version Lucene indexes and I'm running into trouble with some corner
case indexes.
Some (unknown set) of these indexes are just placeholders, they have been
created but no documents have been added to them yet.
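A minimal sketch of one way to guard against those placeholder indexes, assuming a Lucene 4.x classpath; the index path is hypothetical and skipping (rather than recreating) an empty index is just one option:

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIfNonEmpty {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location; replace with the real path.
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // Placeholder indexes were created but never had documents added,
        // so check the doc count before handing the directory to IndexUpgrader.
        DirectoryReader reader = DirectoryReader.open(dir);
        int docCount = reader.maxDoc();
        reader.close();

        if (docCount == 0) {
            System.out.println("Empty placeholder index, skipping upgrade.");
        } else {
            new IndexUpgrader(dir, Version.LUCENE_47).upgrade();
        }
        dir.close();
    }
}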
Hi Mike,
Has anyone tried back-porting the FSTPosting(s)Format to 4.6?
Also https://issues.apache.org/jira/browse/LUCENE-3069 has a fixed version of
4.7, but your comment below (and the code) make me think this isn't correct, as
it doesn't seem to be in a released version yet.
Thank
Hi Mike,
Thanks for the response. We will do some more investigation. We will
look to see if there is a clean way to suppress at least the extra 3
array allocations.
Cheers,
-Ken
On Mar 19, 2012, at 5:32 PM, Michael McCandless wrote:
Hmm, I agree we could be more RAM efficient
positions etc suppressed? It
seems that the reason I get an OutOfMemoryError is that 7 int[] of size
proportional to number of unique fields are being constructed; however, at
least some of them are probably wasteful given my indexing configurations.
Any help is appreciated.
Thanks,
-Ken
and then turn the query into a set of affiliate_id x
date range queries. Something like:
affiliate_id: and (day:59 or day:60 or day:61 or week:10 or
week:11 or week:12 or day:86 or day:87...)
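A rough sketch of that query built programmatically, using the pre-5.0 BooleanQuery API that was current at the time; the field names come from the example above, the affiliate value is a placeholder, and the day/week terms are just the ones shown:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AffiliateDateQuery {
    public static BooleanQuery build(String affiliateId) {
        // Required clause on the affiliate.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("affiliate_id", affiliateId)), Occur.MUST);

        // OR together the coarse (week) and fine (day) buckets covering the range.
        BooleanQuery range = new BooleanQuery();
        for (String day : new String[] {"59", "60", "61", "86", "87"}) {
            range.add(new TermQuery(new Term("day", day)), Occur.SHOULD);
        }
        for (String week : new String[] {"10", "11", "12"}) {
            range.add(new TermQuery(new Term("week", week)), Occur.SHOULD);
        }
        query.add(range, Occur.MUST);
        return query;
    }
}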
-- Ken
On Mar 31, 2010, at 6:17pm, Michel Nadeau wrote:
Hi,
We're currently in the proces
h. We
wound up using ANTLR for this.
-- Ken
On Aug 20, 2009, at 8:09am, Valery wrote:
Hi Robert,
thanks for the hint.
Indeed, a natural way to go. Especially if one builds a Tokenizer of
the
level of quality like StandardTokenizer's.
OTOH, you mean that the out-of-the-box stuff is
the matched entries.
6. Having most of the index loaded into the OS cache was the biggest
single performance win. So if you've got 3 GB of unused memory on a
server, limiting the size of the index to some low multiple of 3GB
would be a good target.
-- Ken
Our query performance is sur
e Katta has added an index to both systems, then
you can switch to it (and eventually remove the old index).
The fact that you'd need two Katta "masters" makes things a bit more
interesting, as you'd have to coordinate when they both decide to
switch to using the new index(es).
buted search support inside of Nutch.
And Solr has distributed search support, though it's still pretty new.
-- Ken
--
Ken Krugler
+1 530-210-6378
On 3/2/09 4:23 PM, "Ken Williams" wrote:
> On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
>
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> In the output, I get explanations like "0.88922405 = (MATCH) product
>>> of:"
>&
On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
>
> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>> In the output, I get explanations like "0.88922405 = (MATCH) product
>> of:"
>> with no details. Perhaps I need to do something different in
>>
On 3/2/09 4:19 PM, "Steven A Rowe" wrote:
> On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote:
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> Also, while perusing the threads you refer to below, I saw a
>>> reference to the following link, which see
" with no details. Perhaps I need to do something different in indexing?
Thanks,
-Ken
On 2/26/09 10:36 AM, "Grant Ingersoll" wrote:
> I don't know of anyone doing work on it in the Lucene community. My
> understanding to date is that it is not really worth trying, b
;t use the typical approach of having a doc
field with every group in it, then adding a required subclause to
your query with every group as a boolean OR term.
-- Ken
--
Ken Krugler
+1 530-210-6378
use to generate the target term vector, etc.
Something we didn't do, which seemed valuable,
would be to use phrases vs. single terms, along
the lines of Amazon's SIPs (statistically
improbable phrases).
-- Ken
On Mon, 2009-02-16 at 22:08 -0500, Grant Ingersoll wrote:
Hmmm, you
Hi all,
I didn't get a response to this - not sure whether the question was
ill-posed, or too-frequently-asked, or just not interesting. But if anyone
could take a stab at it or let me know a different place to look, I'd really
appreciate it.
Thanks,
-Ken
On 2/20/09 12:00 PM, "
.html
Thanks.
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
h on subject ==
"alternative scoring algorithm for PhraseQuery".
I believe Paul Elschot gave him some useful input, but then Philipp
seemed to have dropped off the list...and he didn't respond to my
email asking him if he was able to co
essentially synonym
processing, where you turn a single term into multiple terms based on
the automatic splitting of the term using '_', '-', camelCasing,
letter/digit transitions, etc.
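As a plain-Java illustration of the splitting described above (in a real analyzer this would sit inside a TokenFilter, but the boundary rules are the same); the exact rules here are an assumption, not a reference implementation:

import java.util.ArrayList;
import java.util.List;

public class TermSplitter {
    // Split a term on '_', '-', camelCase humps, and letter/digit transitions.
    public static List<String> split(String term) {
        List<String> parts = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (char c : term.toCharArray()) {
            boolean boundary =
                c == '_' || c == '-'
                || (Character.isUpperCase(c) && Character.isLowerCase(prev))
                || (prev != 0 && Character.isDigit(c) != Character.isDigit(prev)
                    && Character.isLetterOrDigit(c) && Character.isLetterOrDigit(prev));
            if (boundary && current.length() > 0) {
                parts.add(current.toString().toLowerCase());
                current.setLength(0);
            }
            if (c != '_' && c != '-') {
                current.append(c);
            }
            prev = c;
        }
        if (current.length() > 0) {
            parts.add(current.toString().toLowerCase());
        }
        return parts;
    }
}

For example, split("fooBar2Baz") would yield [foo, bar, 2, baz], each of which can then be emitted as an additional token at the same position.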
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
here helped you finish your
FuzzyPhraseQuery (or FuzzySpanQuery) addition to Lucene.
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
]
Nutch already supports distributed Lucene searchers, using Hadoop RPC.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
in Solr (where you can easily specify this type of
combo field) is to add the field I want to boost multiple times. It's
very coarse granularity, but it works.
See a discussion of this recently on the Solr mailing list.
-- Ken
wojtek
On 5/31/07, Donna L Gresh <[EMAIL PROTECTED]> wr
sort key.
This is pretty complex, especially when you start considering
locale-specific details - we used ICU support for this in the past,
which is where I'd probably start. ICU needs a lot of data to handle
this properly across most locales, so it's not lightweight, but it
would gi
I'd like to start with a standard parsed query, then combine it with
another that says requires a field's untokenized value be inside of a
set. The catch is, I want the document's position in that set to be
included in the scoring.
So I want to search for "chinese restaurant", but only for these
/search/lucene/query/DateIntervalQuery.java
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
I've poked around on Google and the archives quite a bit, but I can't
find exactly what I need. Say I have a query that would normally
return a set of documents:
1 002 (text...)
2 001 (text...)
3 001 (text...)
4 002 (text...)
5 004 (text...)
I'd like that modified to be:
1 002 (text...)
2 001
I don't
think it's a bad index. After seeing a few postings about this same
general problem, I'm guessing there's a bug hiding someplace.
Sorry to not have a better answer...
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 53
g required
to pick the right cut-off value for searches.
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
ill need a big sum though. MD5?
Just as a reference, Nutch uses an MD5 digest to detect duplicate web
pages. It works fine, except of course when two docs differ by only
an insignificant text delta. There's some recent work in this area -
check out TextProfileSignature.
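For reference, a bare-bones version of the digest idea using only the JDK; the normalization step is an assumption to keep trivially different copies hashing the same, and Nutch's TextProfileSignature is considerably smarter about near-duplicates:

import java.security.MessageDigest;

public class SimpleTextDigest {
    // MD5 hex digest of lightly normalized text; identical pages collapse to one key.
    public static String digest(String text) throws Exception {
        // Assumed normalization: lowercase and collapse whitespace so trivial
        // formatting differences don't change the hash.
        String normalized = text.toLowerCase().replaceAll("\\s+", " ").trim();
        byte[] hash = MessageDigest.getInstance("MD5").digest(normalized.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}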
-- Ken
--
Ken K
On Donnerstag 18 Mai 2006 18:36, Ken Krugler wrote:
> >Could someone describe how the results from multiple indices are merged
> when using a MultiSearcher? My naive intuition is that the scores for
> documents found in each index could be wildly different, so what
> crit
selection of indices that get merged to
form the N final indices. This randomization helps avoid the IDF skew
problem.
There's a Jira issue on the Nutch side (see NUTCH-92) around this
same problem.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Fi
t scoring algorithm.
You can always add the log of the score versus doing a
multiplication, but that would still involve a lot of source code
changes.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
against it.
And yes, with a bunch of servers that all have 4GB of RAM, I'd be
interested in the patch :)
Thanks for creating it.
-- Ken
Doug Cutting <[EMAIL PROTECTED]> wrote:
RAMDirectory is indeed currently limited to 2GB. This would not be too
hard to fix. Please file a
project
files - and I don't put them into the Eclipse Workspace directory.
b. Then launch Eclipse and create a new Java project, importing the
files from the external (SVN-controlled) location.
-- Ken
--
Ken Krugler
Krugle,
andard Java serialization support. So I doubt
this would be a slam-dunk in the Lucene community.
-- Ken
#
#!/usr/bin/perl
use strict;
use warnings;
# illegal_null.plx -- Perl complains about non-shortest-form null.
my $data = "foo\xC0\x80\n";
are tokenizers already built for lucene.
Search the archives for a discussion about this,
back in June I believe. I'd suggested using ICU
to generate sort keys, and indexing those.
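A small sketch of the sort-key approach with ICU4J; the locale tag and the choice to index the raw key bytes are placeholders, not a recommendation for any particular field layout:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class SortKeyExample {
    // Build a locale-sensitive sort key with ICU4J; index the bytes (e.g. base64
    // or as an untokenized binary value) and sort on that field instead of raw text.
    public static byte[] sortKey(String text, String languageTag) {
        Collator collator = Collator.getInstance(new ULocale(languageTag));
        return collator.getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        // Under a French collator, "étude" sorts near "etude", which a plain
        // byte-by-byte comparison would not give you.
        byte[] key = sortKey("étude", "fr");
        System.out.println(key.length + " byte sort key");
    }
}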
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 5
M product(s) to get it) so what you've done is great
for the open source community - thanks!
Also I could post to the Unicode list re training data in multiple
languages, as that's a good place to find out about multilingual
corpora.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
e
for the conversion, but in general that shouldn't matter.
Two other issues are code/data size (ICU can be big) and the
performance hit while indexing documents.
-- Ken
Aigner, Thomas wrote:
Hello all,
I am VERY new to Lucene and we are trying out Lucene to see if
it will acco
in
a Java implementation, so this shouldn't be all that hard. See
<http://www-306.ibm.com/software/globalization/topics/thaiusabilities/text.jsp>
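For example, with ICU4J on the classpath the dictionary-based word BreakIterator already handles Thai; the sample sentence below is only an illustration:

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class ThaiWordBreak {
    public static void main(String[] args) {
        // Thai has no spaces between words; ICU's word BreakIterator uses a
        // dictionary to find the boundaries.
        String text = "ภาษาไทยไม่มีช่องว่างระหว่างคำ";
        BreakIterator words = BreakIterator.getWordInstance(new ULocale("th"));
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            System.out.println(text.substring(start, end));
        }
    }
}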
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
aining the
tokens in reversed character order.
Won't help for *foo* though.
You can also index ngrams - say 3-grams. Every word gets tokenized &
indexed as a sequence of three letter sub-strings. E.g. "tokenized"
would be indexed as "tok" "oke" "ken" "eni" "niz" "ize" "zed".
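A tiny sketch of both tricks in plain Java; the reversed field turns a leading wildcard like *foo into a prefix query oof* against that field, and the 3-gram helper produces exactly the sub-strings listed above:

import java.util.ArrayList;
import java.util.List;

public class WildcardHelpers {
    // Index this alongside the normal token; "*foo" becomes the prefix query
    // "oof*" on the reversed field.
    public static String reversed(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // 3-grams of "tokenized": tok, oke, ken, eni, niz, ize, zed.
    public static List<String> trigrams(String token) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= token.length(); i++) {
            grams.add(token.substring(i, i + 3));
        }
        return grams;
    }
}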