Re: 30 million+ docs on a single server

2006-08-13 Thread Jeff Rodenburg
On 8/12/06, Mark Miller <[EMAIL PROTECTED]> wrote: The single server is important because I think it will take a lot of work to scale it to multiple servers. The index must allow for close to real-time updates and additions. It must also remain searchable at all times (other than during the

Re: 30 million+ docs on a single server

2006-08-12 Thread Jeff Rodenburg
Why is a single server so important? I can scale horizontally much cheaper than I scale vertically. On 8/11/06, Mark Miller <[EMAIL PROTECTED]> wrote: I've made a nice little archive application with lucene. I made it to handle our largest need: 2.5 million docs or so on a single server. Now

Re: Distributed Search

2006-07-27 Thread Jeff Rodenburg
Hi Mark - Having gone down this path for the past year, I echo comments from others that scalability/availability/failover is a lot of work. We migrated away from a custom system based on Lucene running on Windows to Solr running on Linux. It took us 6 months to get our system to a solid five-n

Re: Analyzer question

2006-05-19 Thread Jeff Rodenburg
The Keyword analyzer does no stemming or input modification of any sort: think of it as WYSIWYG for index population. The Whitespace analyzer simply removes spaces from your input (still no stemming), but the tokens are the individual words. I don't have the code in front of me, so I'm not sure
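
The difference between the two analyzers can be sketched in plain Java. This is only an illustration of the behavior described in the post, not Lucene's own analyzer code; the class and method names (`AnalyzerSketch`, `keywordTokens`, `whitespaceTokens`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Lucene classes.
class AnalyzerSketch {

    // Keyword-style analysis: the entire input is kept as one unmodified token.
    static List<String> keywordTokens(String input) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input);
        return tokens;
    }

    // Whitespace-style analysis: split on runs of whitespace.
    // No stemming, lowercasing, or other modification is applied.
    static List<String> whitespaceTokens(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

With the input "New York City", the keyword-style method yields a single token holding the whole string, while the whitespace-style method yields three tokens, one per word.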

Re: Backing up indexes, reliability and robustness

2006-05-12 Thread Jeff Rodenburg
Marc - We built our index maintenance operation to assume a breakdown would occur in process (because it happened several times.) We exist in an environment where "always on, always available" is a business requirement. We also do a lot of updates on a cyclical basis (every 10 minutes), so malf

Re: Why is BooleanQuery.maxClauseCount static?

2006-04-15 Thread Jeff Rodenburg
y can sometimes cause problems when both types of queries need to execute simultaneously. -- j On 4/15/06, Paul Elschot <[EMAIL PROTECTED]> wrote: > > On Saturday 15 April 2006 18:20, Jeff Rodenburg wrote: > > What was the thinking behind making the BooleanQuery maxClauseCount a > &

Why is BooleanQuery.maxClauseCount static?

2006-04-15 Thread Jeff Rodenburg
What was the thinking behind making the BooleanQuery maxClauseCount a static? Or, I guess more to the point, why not an instance setting as well? Not trying to point out a flaw, just curious about the original thinking behind the setting. I have a situation where I have a set of BooleanQueries t

Re: Speed up Indexing

2006-03-23 Thread Jeff Rodenburg
I run Lucene.Net as well, and your indexing performance is dependent on more factors aside from whether you're using the Java or C# version. As a basic suggestion, learn what you can about minMergeDocs and mergeFactor as well as the compound file format. Try different combinations to understand w

Business stop words?

2006-03-16 Thread Jeff Rodenburg
Does anyone have a lead on "business" stop words? Things like "inc", "llc", "md", etc. I'd rather not reinvent this wheel. :-) cheers, jeff
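
A minimal sketch of how such a list might be applied once it exists, in plain Java rather than a Lucene stop filter. The set below contains only the three examples named in the post; the class name `BusinessStopWords` and the punctuation-stripping rule are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Lucene.
class BusinessStopWords {

    // Only the examples from the post. A real list would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("inc", "llc", "md"));

    // Lowercases each word, strips non-alphanumeric ASCII characters
    // (so "Inc." and "LLC," are recognized), and drops stop words.
    static List<String> filter(String name) {
        List<String> kept = new ArrayList<>();
        for (String raw : name.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Note that the ASCII-only normalization is a simplification and would discard accented characters in names.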

Index validation utility

2006-03-11 Thread Jeff Rodenburg
I'm working on a utility program to help me verify/validate my Lucene indexes. By that, I mean checking for data conformance, minimum fields, etc. Using XML Schema Definition (http://www.w3.org/XML/Schema), my goal is to ensure that indexes that I create are compliant for requisite fields, data t

Re: Question

2006-03-07 Thread Jeff Rodenburg
We've done this, and it's not that complex. (Sorry, client won't allow me to release the code.) It's AJAX on the front end, so that background call is simply executing a search against an index that consists of the aggregated search terms. We do wildcard queries to get the results we want. For u

Re: Search on many indexes at once

2006-03-03 Thread Jeff Rodenburg
Raul - You'll want to look at the MultiSearcher and ParallelMultiSearcher classes for this. On 3/3/06, Raul Raja Martinez <[EMAIL PROTECTED]> wrote: > > Is it possible to search many indexes in one query and get back the Hits > ordered by relevance? > > Can someone point me out to some document o
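
Conceptually, searching several indexes and ordering by relevance means collecting the hits from each index and merging them by score. The sketch below shows only that merge step in plain Java; it is not how MultiSearcher is implemented, and the names `MergedHits` and `Hit` are hypothetical. It also assumes scores from different indexes are comparable, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of merging per-index results by score.
class MergedHits {

    static class Hit {
        final String index; // which index the hit came from
        final int doc;      // document number within that index
        final float score;  // relevance score

        Hit(String index, int doc, float score) {
            this.index = index;
            this.doc = doc;
            this.score = score;
        }
    }

    // Combines the hit lists from every index and sorts by descending score.
    static List<Hit> merge(List<List<Hit>> perIndexHits) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return all;
    }
}
```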

Re: Hacking proximity search: looking for feedback

2006-03-01 Thread Jeff Rodenburg
Very good note, I missed that. I need the development environment in front of me to remember all the different class names correctly. ;-) -- j On 3/1/06, Doug Cutting <[EMAIL PROTECTED]> wrote: > > Jeff Rodenburg wrote: > > Following on the Range Query approach, how is per

Re: Hacking proximity search: looking for feedback

2006-03-01 Thread Jeff Rodenburg
Thanks to everyone for the replies. I'm going to try several of these approaches with equivalent data sets and run some side-by-side tests. No timeframe guarantees here, but I'll report back with the different approaches and the test results. cheers, -- j On 2/28/06, Chris Hostetter <[EMAI

Re: Hacking proximity search: looking for feedback

2006-02-28 Thread Jeff Rodenburg
Very good points, I hadn't considered the term frequency of the digits affecting scoring. As an aside, can that aspect of the score be ignored for these fields? I need to spend more time with FunctionQuery, I haven't given it the attention it deserves. Great feedback, thanks for the notes. -- j

Re: Hacking proximity search: looking for feedback

2006-02-28 Thread Jeff Rodenburg
ch the user is searching. > > On our data set, we can still end up with 1000s of matching documents > after boxing. Thus, we still see a bottleneck computing the score for > even this smaller set of documents which we are still working through. > > -Mike > > > -Original

Re: Hacking proximity search: looking for feedback

2006-02-28 Thread Jeff Rodenburg
ching. > > On our data set, we can still end up with 1000s of matching documents > after boxing. Thus, we still see a bottleneck computing the score for > even this smaller set of documents which we are still working through. > > -Mike > > > -Original Message- >

Hacking proximity search: looking for feedback

2006-02-28 Thread Jeff Rodenburg
I've been wrestling with a way to index and search data with a geo-positional aspect. By a geo-positional search, I want to constrain search results within a given location range. Furthermore, I want to allow the user to set/change the geo-positional boundaries as needed for their search. This i
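
One common way to constrain results to a location range is a bounding box around a center point. The sketch below is an assumption about how that could look, not necessarily the approach taken in the post (the snippet is cut off before the details). `GeoBox` is a hypothetical name, the 69 miles-per-degree figure is an approximation, and the code ignores the poles and the 180-degree meridian.

```java
// Hypothetical bounding-box helper; approximate and not Lucene-specific.
class GeoBox {
    final double minLat;
    final double maxLat;
    final double minLon;
    final double maxLon;

    GeoBox(double minLat, double maxLat, double minLon, double maxLon) {
        this.minLat = minLat;
        this.maxLat = maxLat;
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    // Builds a box extending roughly radiusMiles in each direction.
    // One degree of latitude is about 69 miles; a degree of longitude
    // shrinks by cos(latitude).
    static GeoBox around(double lat, double lon, double radiusMiles) {
        double milesPerDegree = 69.0;
        double latDelta = radiusMiles / milesPerDegree;
        double lonDelta = radiusMiles / (milesPerDegree * Math.cos(Math.toRadians(lat)));
        return new GeoBox(lat - latDelta, lat + latDelta, lon - lonDelta, lon + lonDelta);
    }

    // True when the point lies inside the box (edges included).
    boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }
}
```

Because the box is rebuilt from the center and radius on each call, the user can change the boundaries per search without touching the indexed data.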

Re: Inappropriate content detection

2006-02-05 Thread Jeff Rodenburg
You can generate a token stream for a block of text without having to index it. Take a look at the highlighter code, it does this very thing. On 2/5/06, Jeff Thorne <[EMAIL PROTECTED]> wrote: > > I am trying to figure out whether or not Lucene is an appropriate solution > for a problem that our

Re: How do I send search query to Multiple search Indexes ?

2006-02-02 Thread Jeff Rodenburg
Vikas - Start with the RemoteSearchable class. Technology will be RMI. Hope this helps. On 2/2/06, Vikas Khengare <[EMAIL PROTECTED]> wrote: > > Hi Friends > > How do I send one search query to multiple search Indexes which are > on remote machines ? > > Which Technology will help me (A

Re: Help with indexing and query strategy

2006-01-30 Thread Jeff Rodenburg
Have you considered evaluating doc-score thresholds for limiting your results? Since the perfect answers to these situations lie in the constant tweaking and twiddling of analysis and tokenization, one way I've found to help is to evaluate result scores. In your "Ontario CA" example, limiting res
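
A minimal sketch of a score threshold, assuming the cutoff is expressed as a fraction of the best score in the result set. The post does not say how the threshold was chosen, so the relative cutoff and the names `ScoreThreshold`, `Hit`, and `keepAbove` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of limiting results by score.
class ScoreThreshold {

    static class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    // Keeps hits whose score is at least cutoffFraction of the top score.
    static List<Hit> keepAbove(List<Hit> hits, float cutoffFraction) {
        List<Hit> kept = new ArrayList<>();
        float top = 0f;
        for (Hit h : hits) {
            top = Math.max(top, h.score);
        }
        float cutoff = top * cutoffFraction;
        for (Hit h : hits) {
            if (h.score >= cutoff) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```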

Re: deleting duplicate documents from my index

2006-01-29 Thread Jeff Rodenburg
One way to do this (depending on your system and index size) is to remove and add every url you find. This would ensure that every document in the index is unique. No need to worry about sorting and iteration and doc_ids and the like. It rebuilds your entire index, but if you have a duplication
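
The remove-then-add idea can be sketched with an in-memory map keyed by URL standing in for the index. In Lucene this would roughly correspond to deleting any documents matching a url term and then adding the new document; the class `UrlKeyedIndex` below is hypothetical and only shows why the result is duplicate-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index where the URL is the unique key.
class UrlKeyedIndex {
    private final Map<String, String> docsByUrl = new LinkedHashMap<>();

    // Removes any existing document for this URL, then adds the new one,
    // so at most one document per URL can ever exist.
    void removeThenAdd(String url, String content) {
        docsByUrl.remove(url);
        docsByUrl.put(url, content);
    }

    int size() {
        return docsByUrl.size();
    }

    String get(String url) {
        return docsByUrl.get(url);
    }
}
```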

Lucene and geo queries

2006-01-04 Thread Jeff Rodenburg
I'm very interested in incorporating smart geographic querying capabilities (distance calcs are just scratching the surface) into Lucene and came across this whitepaper: http://www.clef-campaign.org/2005/working_notes/workingnotes2005/leidner05.pdf Just curious, has anyone ventured down this path

Re: ApacheCon next week

2005-12-12 Thread Jeff Rodenburg
Well done, Grant. Very informative. Question on Term Vectors: with their inclusion in an index, have you noticed any degradation in performance, either from a search efficiency or maintenance point-of-view? Given the power of term vectors, if the perf impact is negligible, I'm curious to the re

Re: How to do refined search based on attributes and never return zero results

2005-12-07 Thread Jeff Rodenburg
Check out Chris Hostetter's methodology for doing this at cnet. http://mail-archives.apache.org/mod_mbox/lucene-java-user/200508.mbox/[EMAIL PROTECTED] This sounds like it matches your requirements. cheers, j On 12/7/05, Ching-Pei Hsing <[EMAIL PROTECTED]> wrote: > > Has anyone solved the foll

Re: Distributed sort

2005-12-04 Thread Jeff Rodenburg
thanks Erik On 12/3/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > > On Dec 3, 2005, at 1:26 PM, Jeff Rodenburg wrote: > > > In one of the Google Labs whitepapers ( > > http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming > > construct > >

Distributed sort

2005-12-03 Thread Jeff Rodenburg
In one of the Google Labs whitepapers ( http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming construct known as MapReduce is used in a variety of jobs/tasks within Google's operation. As an example of the application of MapReduce, the whitepaper refers to Distributed Sorting. Essent
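
The general idea behind a distributed sort can be sketched on a single machine: partition keys by range, sort each partition independently (in a real system, on separate workers), and concatenate the partitions in range order. The class `RangePartitionSort` and its boundary array are hypothetical and simplified; they are not taken from the whitepaper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical single-process illustration of a range-partitioned sort.
class RangePartitionSort {

    // boundaries must be ascending. Partition i receives keys below
    // boundaries[i] (and not below boundaries[i - 1]); the last partition
    // receives everything else.
    static List<Integer> sort(List<Integer> keys, int[] boundaries) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            int p = 0;
            while (p < boundaries.length && key >= boundaries[p]) {
                p++;
            }
            partitions.get(p).add(key);
        }
        List<Integer> result = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part); // each partition could be sorted on its own worker
            result.addAll(part);    // concatenating in range order gives a total order
        }
        return result;
    }
}
```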

Re: lucene and database searching, keeping score

2005-12-02 Thread Jeff Rodenburg
George - There are a number of SQL Server specific ways you can do this. Email me off-list as the solution is not relevant to Lucene. -- j On 12/2/05, George Abraham <[EMAIL PROTECTED]> wrote: > > All, > I have created a Lucene index from data in a SQL Server db. When I conduct > a > Lucene sea

Re: A couple of questions regarding load balancing and failover

2005-11-30 Thread Jeff Rodenburg
On 11/30/05, Daniel Pfeifer <[EMAIL PROTECTED]> wrote: > > > 1.) Does Lucene's MultiSearcher implement some kind of automatic failover and/or load-balancing mechanism if both Searchables which I supply in MultiSearcher's constructor go to two different servers but to the very same index-files?

Re: High CPU utilization with sort

2005-11-20 Thread Jeff Rodenburg
(especially for numeric fields). > > If you haven't already, you should compare the query times of a > "warmed" searcher. Sorted queries will still take longer, but I > haven't measured how much longer. > > -Yonik > Now hiring -- http://forms.cnet.com/slink?

High CPU utilization with sort

2005-11-20 Thread Jeff Rodenburg
I've read many comments from users on the list indicating that sorting may/will be performance-heavy. Is high CPU utilization with a sorted search one of the expected performance hits? In tests for our implementation (25 concurrent connections generating search/sort requests), we've seen performan

Re: Items in multiple category: distinct search?

2005-11-15 Thread Jeff Rodenburg
Hi John - It sounds like you're thinking of your index in terms of sql constructs -- multiple rows for the same record. We do this very same thing with categories; if you have a record that lives in multiple categories, just add additional category field/value pairs for your original record. It's
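
The one-record, many-category-values idea can be sketched with a small in-memory document model. This is an illustration in plain Java, not Lucene's Document class; the names `MultiValueDoc`, `add`, and `matches` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document model where a field name can carry several values.
class MultiValueDoc {
    final String id;
    private final Map<String, List<String>> fields = new HashMap<>();

    MultiValueDoc(String id) {
        this.id = id;
    }

    // Adds another value under the same field name instead of
    // creating a second copy of the record.
    MultiValueDoc add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        return this;
    }

    // True when any value stored under the field equals the requested value.
    boolean matches(String name, String value) {
        return fields.getOrDefault(name, Collections.<String>emptyList()).contains(value);
    }
}
```

A record added once with two category values matches a search on either category, so no distinct step is needed on the result set.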

Re: Help with Search Java Code set up

2005-10-26 Thread Jeff Rodenburg
Kevin - Maybe I'm misunderstanding, but how is this not a BooleanQuery with two clauses? - j On 10/26/05, Kevin L. Cobb <[EMAIL PROTECTED]> wrote: > > I've been using Lucene happily for a couple of years now. But, this new > search functionality I'm trying to add is somewhat different than what

Re: MaxFieldLength or MaxFields?

2005-10-26 Thread Jeff Rodenburg
thanks Erik On 10/26/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > > On 26 Oct 2005, at 02:50, Jeff Rodenburg wrote: > > I'm considering building out an index that will flatten a data > > structure, > > such that some Document "A" will have

MaxFieldLength or MaxFields?

2005-10-25 Thread Jeff Rodenburg
I'm considering building out an index that will flatten a data structure, such that some Document "A" will have Fields 1,2 and 3. Fields 1 and 2 are indexed/tokenized fields. Field 3 is indexed, and will contain many discrete values (up to possibly 5000). Couple of questions: 1. Does the DEFAULT_MA

Re: Using analyzers with term queries

2005-10-25 Thread Jeff Rodenburg
I don't mean to take the thread off-topic, but is this the recommended approach for any of the Query objects, i.e. SpanQuery or PhraseQuery? On 10/25/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > > On 25 Oct 2005, at 07:00, Rob Young wrote: > > I am using TermQuery s (and FuzzyQuery s) on the s

Re: Classifier4J and Lucene

2005-10-23 Thread Jeff Rodenburg
Sounds like you might have to consider both, if the first one doesn't solve your issue. A company field sounds like it's a single entry, i.e. one that can't be "spammed up" with multiple terms, i.e. "Oracle Oracle Oracle". It also sounds as if you're searching multiple fields, and that some fields

Re: Improving sort performance

2005-10-22 Thread Jeff Rodenburg
s of the > query to 0. > > So, (MyQuery, sorted by MyFunkySort), becomes > ((+MyQuery^0 MyFunctionQuery), sorted by score) > > -Yonik > Now hiring -- http://forms.cnet.com/slink?231706 > > On 10/22/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote: > > > > This

Re: Improving sort performance

2005-10-22 Thread Jeff Rodenburg
type of score you are trying to do, but maybe > FunctionQuery would help. > http://issues.apache.org/jira/browse/LUCENE-446 > > -Yonik > Now hiring -- http://forms.cnet.com/slink?231706 > > On 10/22/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote: > > > > I have

Improving sort performance

2005-10-22 Thread Jeff Rodenburg
I have a custom sort that completes calculations on-the-fly, similar to the LIA distance sort. SortField type is Float. It works, but I need better performance. I'm wondering if there's a better way to do this. As a rule, the number of results returned in a given search will most often be a fracti

Re: RemoteSearchable woes

2005-10-12 Thread Jeff Rodenburg
I'll take the no-response as a "no". :-) On 10/11/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote: > > Anyone running RemoteSearchable? I'm on v1.4.3 and am using it just fine, > until I need to: > > 1) use a custom sort, or > 2) use something that ext

RemoteSearchable woes

2005-10-12 Thread Jeff Rodenburg
Anyone running RemoteSearchable? I'm on v1.4.3 and am using it just fine, until I need to: 1) use a custom sort, or 2) use something that extends HitCollector I've got an idea as to the reasons why (serialization and remoteness), but how do I get around these? Anyone run into issues like these an

Hitcollectors and remotesearchables

2005-10-10 Thread Jeff Rodenburg
Doug Cutting once said, back in 2003: " The *HitCollector*-based search API is not meant to work remotely. To do so would involve an RPC-callback for every non-zero score, which would be extremely expensive. Also, just making *HitCollector* serializable would not be sufficient. You'd also need to

Custom sort with multiple fields?

2005-10-09 Thread Jeff Rodenburg
In following the LIA custom sort example, the calculated sort value is based on a field that contains all necessary values, i.e. "x,y" which is split into two values for use in a distance algorithm. Suppose I want a custom sort basis that performs a similar calculation, but is based on a multiple
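
The two input shapes can be compared in a short plain-Java sketch: one method parses a combined "x,y" value as in the single-field case described above, the other takes the coordinates from two separate field values. The class `XYDistance` is hypothetical, and straight-line distance is an assumption standing in for whatever distance algorithm is actually used.

```java
// Hypothetical illustration; not the LIA example code.
class XYDistance {

    // Single-field case: the value holds both coordinates as "x,y".
    static double fromCombinedField(String xy, double originX, double originY) {
        String[] parts = xy.split(",");
        double x = Double.parseDouble(parts[0].trim());
        double y = Double.parseDouble(parts[1].trim());
        return Math.hypot(x - originX, y - originY);
    }

    // Multi-field case: each coordinate is stored in its own field.
    static double fromSeparateFields(String xValue, String yValue,
                                     double originX, double originY) {
        double x = Double.parseDouble(xValue.trim());
        double y = Double.parseDouble(yValue.trim());
        return Math.hypot(x - originX, y - originY);
    }
}
```

Both methods yield the same sort value for the same point, so the open question in the post is about how to get the second field's value into the comparator, not about the calculation itself.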

Re: RemoteSearchable and sorting

2005-10-08 Thread Jeff Rodenburg
, not all objects related to Terms are > Serializable. IMHO, it would be NICE to have a RemoteReader and a > ParallelMultiReader to round out the API like: > > ParallelMultiReader, RemoteReader, MultiReader, Reader > > AND > > ParallelMultiSearcher,RemoteSearcher, MultiSearch

Re: RemoteSearchable and sorting

2005-10-05 Thread Jeff Rodenburg
lelMultiReader to round out the API like: > > ParallelMultiReader, RemoteReader, MultiReader, Reader > > AND > > ParallelMultiSearcher,RemoteSearcher, MultiSearcher, Searcher > > Regards, > Rus > > > > > On 10/5/05, Jeff Rodenburg <[EMAIL PROTECTED]> w

RemoteSearchable and sorting

2005-10-05 Thread Jeff Rodenburg
Are there known limitations or issues with sorting and RemoteSearchable? I'm encountering problems attempting to sort through a MultiSearcher (ParallelMultiSearcher, actually). I'm using an array of RemoteSearchable objects as the Searchable[] source. If I change the source indexes to be local Inde

Suggestions for analysis

2005-09-21 Thread Jeff Rodenburg
I'm looking for some suggestions on an analyzer decision. I've got my own thoughts to this already, but would like some initial feedback on it first. The scenario: - An index of geographic information: cities, towns, states, neighborhoods, zipcodes, generic names, etc. Examples are "New Yor

Re: Sort by relevance+distance

2005-09-19 Thread Jeff Rodenburg
This is interesting, one I had not considered. Mark - are there any code samples that implement this approach? Or maybe something similar in approach? thanks, jeff On 9/19/05, mark harwood <[EMAIL PROTECTED]> wrote: > > I think the HitCollector approach was fine but needed > a couple of changes

Re: Sort by relevance+distance

2005-09-18 Thread Jeff Rodenburg
I like Erik's suggestion here as a starting point. I would guess you might find some direction in the Scorer class, but I haven't gone through this in detail. Conceptually a sliding weight based on proximity sounds correct... -- jeff On Sep 18, 2005, at 3:39 PM, James Huang wrote: > > So the

Re: Is Lucene right for my app?

2005-09-18 Thread Jeff Rodenburg
Kevin - You've come to the right list to get information to help you make a decision. That said, the responsible answer to your question will be "it depends". The supporter in me says Lucene is your best choice, hands down. Your questions aren't as straightforward as you might expect. Lucene is

Re: Sort by relevance+distance

2005-09-18 Thread Jeff Rodenburg
trimming the post further: On 9/18/05, James Huang <[EMAIL PROTECTED]> wrote: > > >The problem is quite generic, I believe. What I like to do is similar to > LIA-ch6, i.e. to find a "good Chinese Hunan-style restaurant near me." I > prefer Hunan-style; however, if a good Hunan-style one is 12 m

Re: Stopping Duplicates

2005-09-17 Thread Jeff Rodenburg
Ben - I can think of two ways to achieve this. 1) While adding your information to the index, query the index for an existing record. If you get no match, add the record. 2) Control the exclusivity requirement from your data source, so that no duplicate records ever have the opportunity to be i
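
The first option, checking for an existing record before adding, can be sketched with a set of unique keys standing in for the index lookup. The class `CheckBeforeAdd` and the idea of a single unique key per record are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for "query the index, add only if no match".
class CheckBeforeAdd {
    private final Set<String> keys = new HashSet<>();

    // Returns true if the record was added, false if a record
    // with the same key was already present.
    boolean addIfAbsent(String uniqueKey) {
        if (keys.contains(uniqueKey)) {
            return false; // existing match found, skip the add
        }
        keys.add(uniqueKey);
        return true;
    }

    int size() {
        return keys.size();
    }
}
```

Against a real index, the lookup only sees records that are visible to the searcher, so records added but not yet visible could still be duplicated.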

Re: Hits issue or custom filter issue?

2005-09-14 Thread Jeff Rodenburg
Good call, Chris. I followed the BitSet comparison route and found that the custom filter was working exactly as it should, but *I* wasn't passing it correct data. Rookie mistake. Doh! I hate it when that happens. -- j On 9/13/05, Jeff Rodenburg <[EMAIL PROTECTED]> wrote: >

Re: Hits issue or custom filter issue?

2005-09-13 Thread Jeff Rodenburg
uals() then there's your problem. Will do the step-through following this manner and post the results. -- j : Date: Tue, 13 Sep 2005 17:22:49 -0700 > : From: Jeff Rodenburg <[EMAIL PROTECTED]> > : Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED] > : To: Chris Hoste

Re: Hits issue or custom filter issue?

2005-09-13 Thread Jeff Rodenburg
Might be the same issue, haven't been able to determine during a step-through on the code exec. You're right, no need to add a new FilteredQuery to the statement, just a search on combinedQuery with a new myCustomFilter. Unfortunately, no joy; same response. -- j On 9/13/05, Chris Hostetter <[E

Hits issue or custom filter issue?

2005-09-13 Thread Jeff Rodenburg
I'm encountering some unexpected behavior teeing up multiple Hits objects from a searcher, and I think I'm missing something obvious. Hoping a second pair of eyes might see what I'm missing. Here's my code sequence: // Some liberties taken in the code regarding names, etc. // v1.4.3 codebase Bo

Is George Aroush still around?

2005-09-13 Thread Jeff Rodenburg
Mayday, mayday Has anyone had recent contact with George Aroush? He's presently managing the C# port of Lucene. Thanks, Jeff Rodenburg

Version 1.9

2005-09-11 Thread Jeff Rodenburg
Is there a consensus or estimate on when v1.9 will be considered a stable release? I'm prepping a deployment on v1.4.3 but would like an idea of when 1.9 might be considered stable in the eyes of the community. -- Jeff Rodenburg

BooleanQuery or QueryFilter?

2005-09-09 Thread Jeff Rodenburg
I know this question has been asked before, but I'm not certain of what would work best for my scenario. So here goes... I have an index with documents that carry a broad number of keyword fields, usually containing numeric Ids (no sorting, so no leading zeros). From a set of search results, I

"Right" combination of analyzers for indexing and searching

2005-09-04 Thread Jeff Rodenburg
Question to those who've deployed and maintained Lucene: any recommendations or observations about practical decisions regarding analyzer choice in indexing & searching? What have you found in operation to work well, become difficult, yield better/worse results, affect performance, etc.? What wo

Re: Ideal Index Fragmentation

2005-09-01 Thread Jeff Rodenburg
On Aug 30, 2005, at 9:53 PM, Friedland, Zachary (EDS - Strategy) wrote: > > * I'm interested in implementing a "dynamic filter" component > > that will walk through the hits[] object and pull out distinct > > values for certain fields to display as search-within-a-search > > options (all of them w