Re: Are there any Lucene optimizations applicable to SSD?

2008-08-20 Thread Cedric Ho
> [Cedric: Yes] > >> However I can't figure out why some of these queries are slower. Some >> are complicated queries, yet others are just simple single term >> queries and don't seem to score lots of hits. There's no >> correlation between the number of terms or number of hits and the >> respo

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-20 Thread Cedric Ho
ing to get >90% of queries to return under 1 sec. Of >> course the more the better =) > > That's about the same as our original goal, but we've gotten greedier in > the meantime. Thanks for the help =) We'll also keep trying different methods until our goal is met. Regards, Cedric Ho

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-20 Thread Cedric Ho
-1340 This seems great! We got 5-6 fields that could get indexed this way. I'll definitely check it out. Thanks for the great tips =) Regards, Cedric Ho

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-19 Thread Cedric Ho
Hi eks, My index is fully optimized, but I wasn't aware that I can sort it by fields in Lucene. Could you elaborate on how to do that? By omitTf(), do you mean Fieldable.setOmitNorms(true)? I'll try that. Thanks, Cedric Ho > > if you have possibility to sort your index on
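For reference, a minimal sketch of what turning norms off looks like with the Lucene 2.x field API; the field name and text value are placeholders, and whether this is what eks meant by omitTf() is exactly the open question above:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical example: disable length normalisation for one field.
// The field name "content" is a placeholder.
public class OmitNormsExample {
    public static void addDoc(IndexWriter writer, String text) throws Exception {
        Document doc = new Document();
        Field content = new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED);
        content.setOmitNorms(true);  // skips the one-byte-per-document norm for this field
        doc.add(content);
        writer.addDocument(doc);
    }
}
```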

Re: Are there any Lucene optimizations applicable to SSD?

2008-08-19 Thread Cedric Ho
o sort by score. There are 3 returned fields, docId, date and publication, all of which we retrieve through fieldCaches. And we use this method to do the search: TopFieldDocs Searcher.search(Query query, Filter filter, int n, Sort sort) where for the test run n=100 We are targetin
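A minimal sketch of the call described above against the Lucene 2.x API; the "date" and "publication" field names come from the message, but the int-encoded date and the use of FieldCache.DEFAULT are assumptions:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;

// Sketch: top-100 search sorted by score, with per-hit values pulled from
// field caches instead of loading stored documents.
public class SortedSearchSketch {
    public static void run(IndexSearcher searcher, IndexReader reader,
                           Query query, Filter filter) throws Exception {
        TopFieldDocs top = searcher.search(query, filter, 100, Sort.RELEVANCE);
        int[] dates = FieldCache.DEFAULT.getInts(reader, "date");           // assumed int-encoded
        String[] pubs = FieldCache.DEFAULT.getStrings(reader, "publication");
        for (int i = 0; i < top.scoreDocs.length; i++) {
            int doc = top.scoreDocs[i].doc;
            System.out.println(doc + " " + dates[doc] + " " + pubs[doc]);
        }
    }
}
```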

Are there any Lucene optimizations applicable to SSD?

2008-08-19 Thread Cedric Ho
earches we receive are vastly different. So it's not likely we can depend on the system's file cache to speed things up for us. Any input is appreciated. Thanks, Cedric Ho

Re: Need additional info for Field (hoping friends who can read Chinese will give me some advice)

2008-04-22 Thread Cedric Ho
In that case you may want to index each Field("Sub","下午去开会" [= "going to a meeting in the afternoon"],"01:02:02"); as a separate document. So your document contains 3 fields: 1. title, 2. time, 3. sub. Then you can get both title and time by searching the "sub" field. Cedric 2008/4/22 王建新 <[EMAIL PROTECTED]>: > > Thanks, I only search sub, not the time; when searching s
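A rough sketch of that layout with the Lucene 2.x document API; the field names come from the message, while the storage and tokenization choices are assumptions:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: one document per "sub" entry, carrying its own time field, so a
// hit on "sub" also returns the title and the time.
public class SubDocumentSketch {
    public static void addSub(IndexWriter writer, String title,
                              String time, String sub) throws Exception {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("time", time, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("sub", sub, Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}
```

For the example in the message, the call would look like addSub(writer, title, "01:02:02", "下午去开会").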

Re: Lucene 2.3.0 and NFS

2008-04-09 Thread Cedric Ho
I think cygwin will do the trick. On Thu, Apr 10, 2008 at 9:10 AM, Rajesh parab <[EMAIL PROTECTED]> wrote: > Hi All, > > Has anyone used rsync or similar utilities on Windows > OS to replicate Lucene index across multiple machines? > > Any pointers on it will be very useful? > > Regards, > Ra

Re: SpanQuery scoring seems different

2008-04-02 Thread Cedric Ho
And I just found an old JIRA issue which might explain this behavior: LUCENE-533 http://www.archivum.info/[EMAIL PROTECTED]/2006-03/msg00265.html Cedric On Wed, Apr 2, 2008 at 3:15 PM, Cedric Ho <[EMAIL PROTECTED]> wrote: > Hi all, > > It seems that SpanNearQuery doesn't c

SpanQuery scoring seems different

2008-04-02 Thread Cedric Ho
Hi all, It seems that SpanNearQuery doesn't consider the boosting of the nested terms: 1.334 = (MATCH) weight(spanNear([content2MBM:morgan^4.0, content2MBM:stanley^4.0], 2, true) in 11976), product of: 2.0 = queryWeight(spanNear([content2MBM:morgan^4.0, content2MBM:stanley^4.0], 2, true
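For context, a small sketch that rebuilds the query from the explain output above (field name, terms, slop and boosts are taken from that output):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Sketch: two boosted SpanTermQuery clauses inside a SpanNearQuery
// (slop 2, in-order), matching the query shown in the explanation.
public class BoostedSpanNearSketch {
    public static SpanNearQuery build() {
        SpanTermQuery morgan = new SpanTermQuery(new Term("content2MBM", "morgan"));
        morgan.setBoost(4.0f);
        SpanTermQuery stanley = new SpanTermQuery(new Term("content2MBM", "stanley"));
        stanley.setBoost(4.0f);
        return new SpanNearQuery(new SpanQuery[] { morgan, stanley }, 2, true);
    }
}
```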

Re: Searching multiple indexes

2008-02-19 Thread Cedric Ho
> > I have some questions about searching multiple indexes. > > > > 1. IndexSearcher with a MultiReader will search the indexes > > sequentially? I think you need to use either MultiSearcher or ParallelMultiSearcher > > > > 2. ParallelMultiSearcher searches in parallel. How is this > > done? One thre
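A minimal sketch of the two wrappers in Lucene 2.x; the index paths are placeholders:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

// Sketch: the same per-index searchers wrapped either sequentially or in
// parallel (one thread per underlying index).
public class MultiIndexSearchSketch {
    public static void open() throws Exception {
        Searchable[] parts = new Searchable[] {
            new IndexSearcher("/path/to/index1"),   // placeholder paths
            new IndexSearcher("/path/to/index2")
        };
        MultiSearcher sequential = new MultiSearcher(parts);
        ParallelMultiSearcher parallel = new ParallelMultiSearcher(parts);
    }
}
```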

Re: How to pass additional information into Similarity.scorePayload(...)

2008-02-15 Thread Cedric Ho
what they think of matches that cross paragraph borders. > Do you already have a firm requirement for that case? > > SpanNotQuery can be used to prevent matches over paragraph > borders when these are indexed as such, but I would not expect > that you would need those, given the fu

Re: How to pass additional information into Similarity.scorePayload(...)

2008-02-15 Thread Cedric Ho
Hi Paul, Do you mean the following? e.g. to index this: "first second third fourth fifth sixth" originally it would be indexed as: (first,0) (second,1) (third,2) (fourth,3) (fifth,4) (sixth,5) now it will be: (first,0) (second,0) (third,0) (fourth,1) (fifth,1) (sixth,1) Then those Query classes that d
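A hypothetical sketch of that scheme as a TokenFilter; the group size of three mirrors the example above, while real code would more likely advance the position once per sentence or paragraph:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch: every group of GROUP_SIZE tokens shares one position, so
// "first second third" all land on position 0, "fourth fifth sixth" on 1.
public class GroupPositionFilter extends TokenFilter {
    private static final int GROUP_SIZE = 3;  // placeholder, mirrors the example
    private int seen = 0;

    public GroupPositionFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) return null;
        // advance the position only at the start of each group
        token.setPositionIncrement(seen % GROUP_SIZE == 0 ? 1 : 0);
        seen++;
        return token;
    }
}
```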

Some more questions on Payloads

2008-02-14 Thread Cedric Ho
Hi all, This is the same problem I am trying to solve as in the thread: "How to pass additional information into Similarity.scorePayload(...)" However since these questions are somewhat different, I figure I'd start a new thread. After diving into the Lucene Source codes for a while now I have

Re: How to pass additional information into Similarity.scorePayload(...)

2008-02-14 Thread Cedric Ho
;s possible to do that, and it may be good for performance in some cases, > but one can revert to using another field for different position info. > > Regards, > Paul Elschot > > > Op Thursday 14 February 2008 09:44:40 schreef Cedric Ho: > > > > Hi Paul, > > &

Re: How to pass additional information into Similarity.scorePayload(...)

2008-02-14 Thread Cedric Ho
Cheers, Cedric On Thu, Feb 14, 2008 at 2:58 PM, Paul Elschot <[EMAIL PROTECTED]> wrote: > Op Thursday 14 February 2008 02:11:24 schreef Cedric Ho: > > > I am using Lucene's built-in query classes: TermQuery, PhraseQuery, > > WildcardQuery, BooleanQuery and many of the S

Re: How to pass additional information into Similarity.scorePayload(...)

2008-02-13 Thread Cedric Ho
of info did you have in > mind? scorePayload is called from the query scoring class, so I am > not sure how you would pass in info to it unless you were writing your > own Query class. > > -Grant > > > On Feb 13, 2008, at 4:31 AM, Cedric Ho wrote: > > > Hi all, > &g

How to pass additional information into Similarity.scorePayload(...)

2008-02-13 Thread Cedric Ho
Hi all, My problem is I have some additional weighting info that comes with each search. And I need to take both the weighting info and the payload to calculate scores. So how do I access the weighting info in Similarity.scorePayload(String,byte[],int,int)? I've thought about using a ThreadLocal,
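One hedged sketch of the ThreadLocal idea: the searching code stores its per-query weights before running the search, and scorePayload reads them back. The method signature follows the one quoted above; the payload layout (a single byte used as a weight index) is purely an assumption:

```java
import org.apache.lucene.search.DefaultSimilarity;

// Sketch: per-search weighting info handed to scorePayload via a ThreadLocal.
public class WeightedPayloadSimilarity extends DefaultSimilarity {
    private static final ThreadLocal<float[]> WEIGHTS = new ThreadLocal<float[]>();

    // call this from the searching thread before running the query
    public static void setWeights(float[] weights) {
        WEIGHTS.set(weights);
    }

    public float scorePayload(String fieldName, byte[] payload, int offset, int length) {
        float[] weights = WEIGHTS.get();
        if (weights == null || payload == null || length == 0) {
            return 1.0f;
        }
        int category = payload[offset] & 0xFF;  // assumed one-byte payload layout
        return category < weights.length ? weights[category] : 1.0f;
    }
}
```

The searcher would then be handed this similarity via Searcher.setSimilarity(...) before the query is run.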

Re: large term vectors

2008-02-10 Thread Cedric Ho
itting indexes'. I see people say > this all over, but how have people been handling this. I'm going to > start a new thread, and there probably was one back in the day, but I > am going to fire it up again. But, how did you do it? > > > On Feb 10, 2008 9:18 PM, Cedric

Re: large term vectors

2008-02-10 Thread Cedric Ho
Is it a single index? My index is also in the 200G range, but I never managed to get a single index of size > 20G and still get acceptable performance (in both searching and updating), so I split my indexes into chunks of < 10G. I am curious as to how you manage such a single large index. Cedric

Re: Distributed Indexes

2008-02-10 Thread Cedric Ho
On Feb 9, 2008 12:07 AM, Ruslan Sivak <[EMAIL PROTECTED]> wrote: > The app does other things then search the index. I'm basically using > ColdFusion for the website and have four instances running on two > servers for load balancing. Each app does the searches, and the search > times are small, t

Re: Distributed Lucene Directory

2008-02-01 Thread Cedric Ho
On Feb 1, 2008 9:47 AM, Mark Miller <[EMAIL PROTECTED]> wrote: > > Cedric Ho wrote: > > > > But managing such a set of indexes is not trivial. Especially when > > need to add redundancies for reliability and update frequently. > > > Agreed. Apparentl

Re: Distributed Lucene Directory

2008-01-31 Thread Cedric Ho
won't take too long, etc.) Cedric On Jan 31, 2008 6:59 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > On 31 Jan 2008, at 09.42, Cedric Ho wrote: > > > I am wondering if there exists any implementation of > > org.apache.lucene.store.Directory which can be distributed across >

Distributed Lucene Directory

2008-01-31 Thread Cedric Ho
distributed across 10 machines should achieve the same performance as a 10G index on a local FSDirectory. I know that optimization would be a problem for such a big index, but would the partial optimization introduced in Lucene 2.3 help? Any thoughts? Regards, Cedric Ho

Re: Chinese Segmentation with Phrase Query

2007-11-10 Thread Cedric Ho
h this document. > > I am not sure if this works in your case because we index product information > and their descriptions which are not language friendly anyway because of the > abbreviations) > > Regards > > Uwe Goetzke > > > -Original Message- >

Re: Chinese Segmentation with Phrase Query

2007-11-09 Thread Cedric Ho
On Nov 10, 2007 2:08 AM, Steven A Rowe <[EMAIL PROTECTED]> wrote: > Hi Cedric, > > On 11/08/2007, Cedric Ho wrote: > > a sentence containing characters ABC, it may be segmented into AB, C or A, > > BC. > [snip] > > In this cases we would like to index both seg

Re: Chinese Segmentation with Phrase Query

2007-11-09 Thread Cedric Ho
inese word segmentation, but will solve the > problem in your case. > > > On Nov 9, 2007 10:59 AM, Cedric Ho <[EMAIL PROTECTED]> wrote: > > Hi, > > > > We are having an issue while indexing Chinese Documents in Lucene. > > > > Some background first: >

Chinese Segmentation with Phrase Query

2007-11-08 Thread Cedric Ho
Hi, We are having an issue while indexing Chinese documents in Lucene. Some background first: since CJK languages don't have spaces between words, we first have to determine the words from sentences. e.g. a sentence containing the characters ABC may be segmented into AB, C or into A, BC. the proble
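A hypothetical sketch of how both segmentations can be indexed at overlapping positions (position increment 0), so a phrase query can match either split; the literal AB/C/A/BC tokens mirror the example above:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Sketch: emit the tokens of both segmentations of "ABC", stacking the
// alternative split on the same positions as the first one.
public class DualSegmentationStream extends TokenStream {
    private final Iterator<Token> tokens;

    public DualSegmentationStream() {
        Token ab = new Token("AB", 0, 2);   // first split: AB | C
        Token a = new Token("A", 0, 1);     // second split: A | BC
        a.setPositionIncrement(0);          // stacked on "AB"
        Token c = new Token("C", 2, 3);
        Token bc = new Token("BC", 1, 3);
        bc.setPositionIncrement(0);         // stacked on "C"
        tokens = Arrays.asList(ab, a, c, bc).iterator();
    }

    public Token next() throws IOException {
        return tokens.hasNext() ? tokens.next() : null;
    }
}
```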

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
wrote: > Hi Cedric, > > Cedric Ho wrote: > > On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > >> Are you iterating through a Hits object that has more than > >> 100 (maybe it's 200 now) entries? Are you loading each document that > >> satis

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
> > Some options: > 1) Try to minimise leaping around the disk - maybe sorting your selected terms > will help. Look at methods in TermEnum and TermDocs which you can use to > build your own bitset from your (sorted) list of terms. Thanks, I'll try this method. > 2) Can you add higher-level terms
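A minimal sketch of suggestion 1) against the Lucene 2.x Filter API; the "publication" field name is an assumption, and sorting the ids up front is what reduces the disk seeking:

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Sketch: OR together the postings of a sorted list of publication ids.
public class PublicationFilter extends Filter {
    private final String[] sortedPublicationIds;

    public PublicationFilter(String[] sortedPublicationIds) {
        this.sortedPublicationIds = sortedPublicationIds;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (String id : sortedPublicationIds) {
                termDocs.seek(new Term("publication", id));  // assumed field name
                while (termDocs.next()) {
                    result.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return result;
    }
}
```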

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
s of publications). So we are thinking we may want to use a cache of TermsFilter, where each TermsFilter filters for a set of publications, and maybe use some LRU policy to manage the cache of filters. This may eventually work, but we are also looking for other, better alternatives. Thanks, Cedric > &g
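A minimal sketch of the LRU idea with a plain LinkedHashMap, keyed by the client's publication selection; the capacity and key encoding are placeholders, and each cached filter could additionally be wrapped in a CachingWrapperFilter so its bits are reused per reader:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.search.Filter;

// Sketch: LRU cache of publication filters, evicting the least recently
// used entry once MAX_ENTRIES is exceeded.
public class PublicationFilterCache extends LinkedHashMap<String, Filter> {
    private static final int MAX_ENTRIES = 100;  // placeholder capacity

    public PublicationFilterCache() {
        super(16, 0.75f, true);  // access-order so get() refreshes an entry
    }

    protected boolean removeEldestEntry(Map.Entry<String, Filter> eldest) {
        return size() > MAX_ENTRIES;
    }
}
```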

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
s. But I would still like to eliminate any inefficiencies in our search implementation first. > > > Best > Erick > > On 8/13/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > Hi all, > > > > My problem is as follows: > > > > Our docume

performance on filtering against thousands of different publications

2007-08-12 Thread Cedric Ho
Hi all, My problem is as follows: Our documents each come from a different publication, and we currently have > 5000 different publication sources. Our clients can arbitrarily choose a subset of the publications when performing a search. It is not uncommon that a search will have to match hundr

Re: Can I do boosting based on term positions?

2007-08-05 Thread Cedric Ho
u > > > also need to extend VSimilarity class - which would require > > implementation > > > of method scoreSpan(..). > > > > > > Let me know how it went. Though I did a testing for it, but before > > > submitting to contrib, I need to do extensive tes

Re: Can I do boosting based on term positions?

2007-08-02 Thread Cedric Ho
Paul Elschot <[EMAIL PROTECTED]> wrote: > Cedric, > > SpanFirstQuery could be a solution without payloads. > You may want to give it your own Similarity.sloppyFreq() . > > Regards, > Paul Elschot > > On Thursday 02 August 2007 04:07, Cedric Ho wrote: > > Thanks
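For reference, a small sketch of the SpanFirstQuery suggestion; the field name, term and 100-position cutoff are placeholders:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Sketch: only match the term when it occurs within the first 100 positions.
public class SpanFirstSketch {
    public static SpanFirstQuery build() {
        SpanTermQuery term = new SpanTermQuery(new Term("body", "lucene"));  // placeholders
        return new SpanFirstQuery(term, 100);
    }
}
```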

Re: Can I do boosting based on term positions?

2007-08-01 Thread Cedric Ho
time. > > Thanks, > Shailendra Sharma, > CTO, Ver se' Innovation Pvt. Ltd. > Bangalore, India > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > Hi all, > > > > I was wondering if it is possible to do boosting by search terms' > >

Can I do boosting based on term positions?

2007-07-31 Thread Cedric Ho
Hi all, I was wondering if it is possible to do boosting by search terms' position in the document. For example: search terms appearing in the first 100 words, the first 10% of words, or the first two paragraphs would be given a higher score. Is it achievable using the new Payload function in luce
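A hypothetical sketch of the payload route: a TokenFilter marks each token with a one-byte payload saying whether it falls inside the first 100 tokens, and a payload-aware query whose Similarity.scorePayload reads that flag can then boost early matches. The threshold and the one-byte encoding are assumptions:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Sketch: flag tokens that appear within the first EARLY_LIMIT positions.
public class EarlyPositionPayloadFilter extends TokenFilter {
    private static final int EARLY_LIMIT = 100;  // placeholder threshold
    private int position = 0;

    public EarlyPositionPayloadFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) return null;
        byte flag = (byte) (position < EARLY_LIMIT ? 1 : 0);
        token.setPayload(new Payload(new byte[] { flag }));
        position++;
        return token;
    }
}
```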

Re: WildcardQuery and SpanQuery

2007-07-19 Thread Cedric Ho
Thanks so much for helping ~ I will try it out tomorrow. Regards, Cedric On 7/19/07, Paul Elschot <[EMAIL PROTECTED]> wrote: On Wednesday 18 July 2007 12:30, Cedric Ho wrote: > Thanks for the quick response Paul =) > > However I am lost while looking at the surround packag

Re: WildcardQuery and SpanQuery

2007-07-18 Thread Cedric Ho
Thanks for the quick response Paul =) However I am lost while looking at the surround package. Are you suggesting I can solve my problem at hand using the surround package? On 7/18/07, Paul Elschot <[EMAIL PROTECTED]> wrote: On Wednesday 18 July 2007 05:58, Cedric Ho wrote: > Hi

WildcardQuery and SpanQuery

2007-07-17 Thread Cedric Ho
Hi everybody, We recently needed to support the wildcard search terms "*" and "?" together with SpanQuery. It seems that there's no SpanWildcardQuery available. After looking into the Lucene source code for a while, I guess we can either: 1. Use SpanRegexQuery, or 2. Write our own SpanWildcardQuery, and
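A rough sketch of option 2: expand the wildcard term against the index with WildcardTermEnum and wrap the matching terms in a SpanOrQuery, which can then be nested inside other span queries. Broad patterns can expand to a very large clause list, so this is a starting point rather than a drop-in replacement:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardTermEnum;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Sketch: rewrite a wildcard term (e.g. content:mor*) into a SpanOrQuery
// over all terms in the index that match the pattern.
public class SpanWildcardExpander {
    public static SpanQuery expand(IndexReader reader, Term wildcard) throws IOException {
        List<SpanQuery> clauses = new ArrayList<SpanQuery>();
        WildcardTermEnum terms = new WildcardTermEnum(reader, wildcard);
        try {
            do {
                Term term = terms.term();
                if (term == null) break;
                clauses.add(new SpanTermQuery(term));
            } while (terms.next());
        } finally {
            terms.close();
        }
        return new SpanOrQuery(clauses.toArray(new SpanQuery[clauses.size()]));
    }
}
```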

exception during optimize

2007-05-31 Thread Cedric Ho
Hi, When I tried to build an index last night, the following exception occurred during a call to IndexWriter.optimize(): java.lang.NullPointerException at org.apache.lucene.index.IndexFileDeleter.findDeletableFiles(IndexFileDeleter.java:88) at org.apache.lucene.index.IndexWriter.mer