On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:
The single server is important because I think it will take a lot of
work to scale it to multiple servers. The index must allow for close to
real-time updates and additions. It must also remain searchable at all
times (other than during the
Why is a single server so important? I can scale horizontally much more
cheaply than I can scale vertically.
On 8/11/06, Mark Miller <[EMAIL PROTECTED]> wrote:
I've made a nice little archive application with lucene. I made it to
handle our largest need: 2.5 million docs or so on a single server. Now
Hi Mark -
Having gone down this path for the past year, I echo comments from others
that scalability/availability/failover is a lot of work. We migrated away
from a custom system based on Lucene running on Windows to Solr running on
Linux. It took us 6 months to get our system to a solid five-n
The Keyword analyzer does no stemming or input modification of any sort:
think of it as WYSIWYG for index population. The Whitespace analyzer simply
splits your input on whitespace (still no stemming), so the tokens are the
individual words. I don't have the code in front of me, so I'm not sure
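
To make the difference concrete, here is a minimal sketch in plain Java. It only imitates the two behaviours described above; the class and method names are hypothetical and are not Lucene's analyzer classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java imitation of the two tokenization behaviours.
// Hypothetical helpers for illustration, not Lucene classes.
final class TokenizeSketch {

    // Keyword-style: the whole input becomes exactly one token, unchanged.
    static List<String> keywordTokens(String input) {
        return Collections.singletonList(input);
    }

    // Whitespace-style: split on runs of whitespace; no stemming, no lowercasing.
    static List<String> whitespaceTokens(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // a blank input yields one empty piece; skip it
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(keywordTokens("New York"));    // one token
        System.out.println(whitespaceTokens("New York")); // two tokens
    }
}
```

With keyword-style handling, a search has to supply the exact original string to match; with whitespace-style handling, each word is matchable on its own.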
Marc -
We built our index maintenance operation to assume a breakdown would occur
in process (because it happened several times.) We exist in an environment
where "always on, always available" is a business requirement. We also do a
lot of updates on a cyclical basis (every 10 minutes), so malf
y can
sometimes cause problems when both types of queries need to execute
simultaneously.
-- j
On 4/15/06, Paul Elschot <[EMAIL PROTECTED]> wrote:
>
> On Saturday 15 April 2006 18:20, Jeff Rodenburg wrote:
> > What was the thinking behind making the BooleanQuery maxClauseCount a
> &
What was the thinking behind making the BooleanQuery maxClauseCount a
static? Or, I guess more to the point, why not an instance setting as well?
Not trying to point out a flaw, just curious about the original thinking
behind the setting. I have a situation where I have a set of BooleanQueries
t
I run Lucene.Net as well, and your indexing performance depends on more
factors than whether you're using the Java or C# version. As a basic
suggestion, learn what you can about minMergeDocs and mergeFactor as well as
the compound file format. Try different combinations to understand w
Does anyone have a lead on "business" stop words? Things like "inc", "llc",
"md", etc.
I'd rather not reinvent this wheel. :-)
cheers,
jeff
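
For illustration only, a minimal plain-Java sketch of what filtering such "business" stop words could look like. The stop list contains just the three words named above; the class name, the punctuation handling, and the lowercasing are assumptions, and a real list would have to come from elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Lucene class.
final class BusinessStopWordsSketch {

    // Only the three words named in the post; a real list would be longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("inc", "llc", "md"));

    // Lowercase each word, strip trailing periods/commas, drop stop words.
    static List<String> filter(String companyName) {
        List<String> kept = new ArrayList<>();
        for (String raw : companyName.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[.,]+$", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("Acme Widgets, Inc."));
    }
}
```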
I'm working on a utility program to help me verify/validate my Lucene
indexes. By that, I mean checking for data conformance, minimum fields,
etc. Using XML Schema Definition (http://www.w3.org/XML/Schema), my goal is
to ensure that indexes that I create are compliant for requisite fields,
data t
We've done this, and it's not that complex. (Sorry, client won't allow me
to release the code.)
It's AJAX on the front end, so that background call is simply executing a
search against an index that consists of the aggregated search terms. We do
wildcard queries to get the results we want. For u
Raul -
You'll want to look at the MultiSearcher and ParallelMultiSearcher classes
for this.
On 3/3/06, Raul Raja Martinez <[EMAIL PROTECTED]> wrote:
>
> Is it possible to search many indexes in one query and get back the Hits
> ordered by relevance?
>
> Can someone point me out to some document o
Very good note, I missed that. I need the development environment in front
of me to remember all the different class names correctly. ;-)
-- j
On 3/1/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Jeff Rodenburg wrote:
> > Following on the Range Query approach, how is per
Thanks to everyone for the replies. I'm going to try several of these
approaches with equivalent data sets and run some side-by-side tests.
No timeframe guarantees here, but I'll report back with the different
approaches and the test results.
cheers,
-- j
On 2/28/06, Chris Hostetter <[EMAI
Very good points, I hadn't considered the term frequency of the digits
affecting scoring. As an aside, can that aspect of the score be ignored for
these fields?
I need to spend more time with FunctionQuery, I haven't given it the
attention it deserves.
Great feedback, thanks for the notes.
-- j
ch the user is searching.
>
> On our data set, we can still end up with 1000s of matching documents
> after boxing. Thus, we still see a bottleneck computing the score for
> even this smaller set of documents which we are still working through.
>
> -Mike
>
>
> -Original
I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I mean constraining
search results to a given location range. Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search. This i
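
As a rough sketch of the kind of constraint described above, the plain-Java example below keeps only items that fall inside a user-adjustable latitude/longitude box. Everything here is an assumption for illustration: the Place type, the box representation, and doing the check in memory rather than inside the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code.
final class GeoBoxSketch {

    // Stand-in for an indexed item that carries a position.
    static final class Place {
        final String name;
        final double lat;
        final double lon;

        Place(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Keep places inside [minLat, maxLat] x [minLon, maxLon], bounds inclusive.
    // Assumes the box does not cross the 180-degree meridian.
    static List<Place> withinBox(List<Place> candidates,
                                 double minLat, double maxLat,
                                 double minLon, double maxLon) {
        List<Place> kept = new ArrayList<>();
        for (Place p : candidates) {
            boolean latOk = p.lat >= minLat && p.lat <= maxLat;
            boolean lonOk = p.lon >= minLon && p.lon <= maxLon;
            if (latOk && lonOk) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Place> all = new ArrayList<>();
        all.add(new Place("Seattle", 47.61, -122.33));
        all.add(new Place("Boise", 43.62, -116.20));
        for (Place p : withinBox(all, 45.0, 48.0, -123.0, -122.0)) {
            System.out.println(p.name);
        }
    }
}
```

Because the box is passed in per call, the user can widen or narrow the boundaries without anything being re-indexed.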
You can generate a token stream for a block of text without having to index
it. Take a look at the highlighter code, it does this very thing.
On 2/5/06, Jeff Thorne <[EMAIL PROTECTED]> wrote:
>
> I am trying to figure out whether or not Lucene is an appropriate solution
> for a problem that our
Vikas -
Start with the RemoteSearchable class. Technology will be RMI.
Hope this helps.
On 2/2/06, Vikas Khengare <[EMAIL PROTECTED]> wrote:
>
> Hi Friends
>
> How do I send one search query to multiple search Indexes which are
> on remote machines ?
>
> Which Technology will help me (A
Have you considered evaluating doc-score thresholds for limiting your
results? Since the perfect answers to these situations lie in the constant
tweaking and twiddling of analysis and tokenization, one way I've found to
help is to evaluate result scores. In your "Ontario CA" example, limiting
res
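
One possible reading of a score threshold, sketched in plain Java: keep only hits whose score is at least some fraction of the best score. The Hit type, the relative cutoff, and the ratio value are all assumptions for illustration; the post does not say which kind of threshold was used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code.
final class ScoreThresholdSketch {

    // Stand-in for a search hit: a document id plus its score.
    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    // Keep hits scoring at least ratio * (best score in the list).
    static List<Hit> aboveRelativeThreshold(List<Hit> hits, float ratio) {
        List<Hit> kept = new ArrayList<>();
        float best = 0f;
        for (Hit h : hits) {
            best = Math.max(best, h.score);
        }
        float cutoff = best * ratio;
        for (Hit h : hits) {
            if (h.score >= cutoff) {
                kept.add(h);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 1.0f));
        hits.add(new Hit("b", 0.3f));
        System.out.println(aboveRelativeThreshold(hits, 0.5f).size());
    }
}
```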
One way to do this (depending on your system and index size) is to remove
and add every url you find. This would ensure that every document in the
index is unique. No need to worry about sorting and iteration and doc_ids
and the like.
It rebuilds your entire index, but if you have a duplication
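
The remove-then-add idea can be sketched in plain Java with a map standing in for the index. This is only an illustration under that assumption: a real Lucene index is not a map, so the delete step would be an explicit delete on the url field before the add.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index keyed by URL; not Lucene code.
final class ReplaceByUrlSketch {

    private final Map<String, String> docsByUrl = new LinkedHashMap<>();

    // Always remove any existing document for the URL, then add the new one,
    // so the index can never hold two documents for the same URL.
    void removeAndAdd(String url, String content) {
        docsByUrl.remove(url);
        docsByUrl.put(url, content);
    }

    int size() {
        return docsByUrl.size();
    }

    String get(String url) {
        return docsByUrl.get(url);
    }

    public static void main(String[] args) {
        ReplaceByUrlSketch index = new ReplaceByUrlSketch();
        index.removeAndAdd("http://example.com/a", "first crawl");
        index.removeAndAdd("http://example.com/a", "second crawl");
        System.out.println(index.size());
    }
}
```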
I'm very interested in incorporating smart geographic querying capabilities
(distance calcs are just scratching the surface) into Lucene and came across
this whitepaper:
http://www.clef-campaign.org/2005/working_notes/workingnotes2005/leidner05.pdf
Just curious, has anyone ventured down this path
Well done, Grant. Very informative.
Question on Term Vectors: with their inclusion in an index, have you noticed
any degradation in performance, either from a search efficiency or
maintenance point-of-view? Given the power of term vectors, if the perf
impact is negligible, I'm curious to the re
Check out Chris Hostetter's methodology for doing this at cnet.
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200508.mbox/[EMAIL
PROTECTED]
This sounds like it matches your requirements.
cheers,
j
On 12/7/05, Ching-Pei Hsing <[EMAIL PROTECTED]> wrote:
>
> Has anyone solved the foll
thanks Erik
On 12/3/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On Dec 3, 2005, at 1:26 PM, Jeff Rodenburg wrote:
>
> > In one of the Google Labs whitepapers (
> > http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming
> > construct
> >
In one of the Google Labs whitepapers (
http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming construct
known as MapReduce is used in a variety of jobs/tasks within Google's
operation. As an example of the application of MapReduce, the whitepaper
refers to Distributed Sorting.
Essent
George -
There are a number of SQL Server specific ways you can do this. Email me
off-list as the solution is not relevant to Lucene.
-- j
On 12/2/05, George Abraham <[EMAIL PROTECTED]> wrote:
>
> All,
> I have created a Lucene index from data in a SQL Server db. When I conduct
> a
> Lucene sea
On 11/30/05, Daniel Pfeifer <[EMAIL PROTECTED]> wrote:
>
>
> 1.) Does Lucene's MultiSearcher implement some kind of automatic failover
> and/or load-balancing mechanism if both Searchables which I supply in
> MultiSearcher's constructor go to two different servers but to the very same
> index-files?
(especially for numeric fields).
>
> If you haven't already, you should compare the query times of a
> "warmed" searcher. Sorted queries will still take longer, but I
> haven't measured how much longer.
>
> -Yonik
> Now hiring -- http://forms.cnet.com/slink?
I've read many comments from users on the list indicating that sorting
may/will be performance-heavy. Is high CPU utilization with a sorted search
one of the expected performance hits?
In tests for our implementation (25 concurrent connections generating
search/sort requests), we've seen performan
Hi John -
It sounds like you're thinking of your index in terms of sql constructs --
multiple rows for the same record. We do this very same thing with
categories; if you have a record that lives in multiple categories, just add
additional category field/value pairs for your original record. It's
Kevin -
Maybe I'm misunderstanding, but how is this not a BooleanQuery with two
clauses?
- j
On 10/26/05, Kevin L. Cobb <[EMAIL PROTECTED]> wrote:
>
> I've been using Lucene happily for a couple of years now. But, this new
> search functionality I'm trying to add is somewhat different than what
thanks Erik
On 10/26/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On 26 Oct 2005, at 02:50, Jeff Rodenburg wrote:
> > I'm considering building out an index that will flatten a data
> > structure,
> > such that some Document "A" will have
I'm considering building out an index that will flatten a data structure,
such that some Document "A" will have Fields 1,2 and 3.
Fields 1 and 2 are indexed/tokenized fields. Field 3 is indexed, and will
contain many discrete values (up to possibly 5000).
Couple of questions:
1. Does the DEFAULT_MA
I don't mean to take the thread off-topic, but is this the recommended
approach for any of the Query objects, i.e. SpanQuery or PhraseQuery?
On 10/25/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On 25 Oct 2005, at 07:00, Rob Young wrote:
> > I am using TermQuery s (and FuzzyQuery s) on the s
Sounds like you might have to consider both, if the first one doesn't solve
your issue. A company field sounds like it's a single entry, i.e. one that
can't be "spammed up" with multiple terms, i.e. "Oralce Oracle Oracle". It
also sounds as if you're searching multiple fields, and that some fields
s of the
> query to 0.
>
> So, (MyQuery, sorted by MyFunkySort), becomes
> ((+MyQuery^0 MyFunctionQuery), sorted by score)
>
> -Yonik
> Now hiring -- http://forms.cnet.com/slink?231706
>
> On 10/22/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> >
> > This
type of score you are trying to do, but maybe
> FunctionQuery would help.
> http://issues.apache.org/jira/browse/LUCENE-446
>
> -Yonik
> Now hiring -- http://forms.cnet.com/slink?231706
>
> On 10/22/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> >
> > I have
I have a custom sort that completes calculations on-the-fly, similar to the
LIA distance sort. SortField type is Float. It works, but I need better
performance. I'm wondering if there's a better way to do this.
As a rule, the number of results returned in a given search will most often
be a fracti
I'll take the no-response as a "no". :-)
On 10/11/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
>
> Anyone running RemoteSearchable? I'm on v1.4.3 and am using it just fine,
> until I need to:
>
> 1) use a custom sort, or
> 2) use something that ext
Anyone running RemoteSearchable? I'm on v1.4.3 and am using it just fine,
until I need to:
1) use a custom sort, or
2) use something that extends HitCollector
I've got an idea as to the reasons why (serialization and remoteness), but
how do I get around these? Anyone run into issues like these an
Doug Cutting once said, back in 2003:
" The *HitCollector*-based search API is not meant to work remotely. To do
so would involve an RPC-callback for every non-zero score, which would be
extremely expensive. Also, just making *HitCollector* serializable would not
be sufficient. You'd also need to
In following the LIA custom sort example, the calculated sort value is based
on a field that contains all necessary values, i.e. "x,y" which is split
into two values for use in a distance algorithm.
Suppose I want a custom sort basis that performs a similar calculation, but
is based on a multiple
, not all objects related to Terms are
> Serializable. IMHO, it would be NICE to have a RemoteReader and a
> ParallelMultiReader to round out the API like:
>
> ParallelMultiReader, RemoteReader, MultiReader, Reader
>
> AND
>
> ParallelMultiSearcher,RemoteSearcher, MultiSearch
lelMultiReader to round out the API like:
>
> ParallelMultiReader, RemoteReader, MultiReader, Reader
>
> AND
>
> ParallelMultiSearcher,RemoteSearcher, MultiSearcher, Searcher
>
> Regards,
> Rus
>
>
>
>
> On 10/5/05, Jeff Rodenburg <[EMAIL PROTECTED]> w
Are there known limitations or issues with sorting and RemoteSearchable? I'm
encountering problems attempting to sort through a MultiSearcher
(ParallelMultiSearcher, actually). I'm using an array of RemoteSearchable
objects as the Searchable[] source. If I change the source indexes to be
local Inde
I'm looking for some suggestions on an analyzer decision. I've got my own
thoughts to this already, but would like some initial feedback on it first.
The scenario:
- An index of geographic information: cities, towns, states,
neighborhoods, zipcodes, generic names, etc. Examples are "New Yor
This is interesting, one I had not considered.
Mark - are there any code samples that implement this approach? Or maybe
something similar in approach?
thanks,
jeff
On 9/19/05, mark harwood <[EMAIL PROTECTED]> wrote:
>
> I think the HitCollector approach was fine but needed
> a couple of changes
I like Erik's suggestion here as a starting point. I would guess you might
find some direction in the Scorer class, but I haven't gone through this in
detail.
Conceptually a sliding weight based on proximity sounds correct...
-- jeff
On Sep 18, 2005, at 3:39 PM, James Huang wrote:
> > So the
Kevin -
You've come to the right list to get information to help you make a
decision. That said, the responsible answer to your question will be "it
depends". The supporter in me says Lucene is your best choice, hands down.
Your questions aren't as straightforward as you might expect. Lucene is
trimming the post further:
On 9/18/05, James Huang <[EMAIL PROTECTED]> wrote:
>
> >The problem is quite generic, I believe. What I like to do is similar to
> LIA-ch6, i.e. to find a "good Chinese Hunan-style restaurant near me." I
> prefer Hunan-style; however, if a good Hunan-style one is 12 m
Ben -
I can think of two ways to achieve this.
1) While adding your information to the index, query the index for an
existing record. If you get no match, add the record.
2) Control the exclusivity requirement from your data source, so that no
duplicate records ever have the opportunity to be i
Good call, Chris. I followed the BitSet comparison route and found that
the custom filter was working exactly as it should, but *I* wasn't passing
it correct data. Rookie mistake.
Doh! I hate it when that happens.
-- j
On 9/13/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
>
uals() then there's your problem.
Will do the step-through following this manner and post the results.
-- j
: Date: Tue, 13 Sep 2005 17:22:49 -0700
> : From: Jeff Rodenburg <[EMAIL PROTECTED]>
> : Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
> : To: Chris Hoste
Might be the same issue, haven't been able to determine during a
step-through on the code exec.
You're right, no need to add a new FilteredQuery to the statement, just a
search on combinedQuery with a new myCustomFilter.
Unfortunately, no joy; same response.
-- j
On 9/13/05, Chris Hostetter <[E
I'm encountering some unexpected behavior teeing up multiple Hits objects
from a searcher, and I think I'm missing something obvious. Hoping a second
pair of eyes might see what I'm missing.
Here's my code sequence:
// Some liberties taken in the code regarding names, etc.
// v1.4.3 codebase
Bo
Mayday, mayday
Has anyone had recent contact with George Aroush? He's presently managing
the C# port of Lucene.
Thanks,
Jeff Rodenburg
Is there a consensus or estimate on when v1.9 will be considered a stable
release? I'm prepping a deployment on v1.4.3 but would like an idea of when
1.9 might be considered stable in the eyes of the community.
-- Jeff Rodenburg
I know this question has been asked before, but I'm not certain of what
would work best for my scenario. So here goes...
I have an index with documents that carry a broad number of keyword fields,
usually containing numeric Ids (no sorting, so no leading zeros). From a set
of search results, I
Question to those who've deployed and maintained Lucene: any recommendations
or observations about practical decisions regarding analyzer choice in
indexing & searching? What have you found in operation to work well, become
difficult, yield better/worse results, affect performance, etc.? What wo
On Aug 30, 2005, at 9:53 PM, Friedland, Zachary (EDS - Strategy) wrote:
> > * I'm interested in implementing a "dynamic filter" component
> > that will walk through the hits[] object and pull out distinct
> > values for certain fields to display as search-within-a-search
> > options (all of them w