possible latency increase from Lucene versions 4.1 to 4.4?

2013-09-13 Thread John Wang
Has anyone experienced a latency increase between the above versions? Mainly in conjunction queries. Thanks -John

command line lucene

2013-05-17 Thread John Wang
Hi folks: Sorry about the cross-post. Luke is awesome, but sometimes you only have command line access to your production boxes. So I wrote CLue, a command line lucene interface in the likes of Luke: Please take a look and collaborators wanted :) https://github.com/javasoze/clu

TermDocs.close

2009-12-27 Thread John Wang
Hi: I see TermDocs.close not being called when created with TermQuery: TermQuery creates it and passes to TermScorer, and is never closed. I see TermDocs.close actually closes the input stream. Is it safe not closing TermDocs? Thanks -John

Fwd: 3.0 api change

2009-12-21 Thread John Wang
Any comments? Did we just unintentionally remove getFieldComparatorSource in 3.0.0? -John -- Forwarded message -- From: John Wang Date: Mon, Dec 21, 2009 at 11:21 AM Subject: 3.0 api change To: Lucene Users List , lucene-...@jakarta.apache.org Hi guys: I noticed

share some numbers for range queries

2009-11-15 Thread John Wang
Hi: I did some performance analysis for different ways of doing numeric ranging with lucene. Thought I'd share: http://invertedindex.blogspot.com/2009/11/numeric-range-queries-comparison.html -John

Re: IndexWriter.close() no longer seems to close everything

2009-11-12 Thread John Wang
If you run the zoie test turned to nrt, you can replicate it rather easily: While the test is running, do lsof on your process, e.g. lsof -p | | wc -John On Thu, Nov 12, 2009 at 8:24 AM, John Wang wrote: > Well, I have code in the finally block to call IndexReader.close for every >

Re: IndexWriter.close() no longer seems to close everything

2009-11-12 Thread John Wang
> reader you get back from getReader? > > Mike > > On Sun, Nov 8, 2009 at 10:41 PM, John Wang wrote: > > I am seeing the same thing, but only when IndexWriter.getReader is called > at > > a high rate. > > > > from lsof, I see file handles growing. > > >

Re: IndexWriter.close() no longer seems to close everything

2009-11-08 Thread John Wang
I am seeing the same thing, but only when IndexWriter.getReader is called at a high rate. From lsof, I see file handles growing. -John On Sun, Nov 8, 2009 at 7:29 PM, Daniel Noll wrote: > Hi all. > > We updated to Lucene 2.9, and now we find that after closing our text > index, it is not possib

lucene 2.9+ numeric indexing

2009-11-08 Thread John Wang
Hi guys: Running into a strange problem: I am indexing into a field a numeric string: int n = Math.abs(rand.nextInt(100)); Field myField = new Field(MY_FIELD,String.valueOf(n),Store.NO,Index.NOT_ANALYZED_NO_NORMS); myField.setOmitTermFreqAndPositions(true); doc.add(myFi

Re: 2.9 per segment searching/caching

2009-10-22 Thread John Wang
n cost - > in some cases it does not. > > But we are talking degradation as you add more segments, not pure speed. > Degradation is worse now in the sort case. > > John Wang wrote: > > With many other coding that happened in 2.9, e.g. the PQ api etc., > sorting >

Re: 2.9 per segment searching/caching

2009-10-22 Thread John Wang
With many other coding that happened in 2.9, e.g. the PQ api etc., sorting is actually faster than 2.4. -John On Thu, Oct 22, 2009 at 5:07 AM, Mark Miller wrote: > Bill Au wrote: > > Since Lucene 2.9 has per segment searching/caching, does query > performance > > degrade less than before (2.9) a

Re: Lucene 2.9.0 leaves too many .cfs files open, causing too many files open java error.

2009-10-18 Thread John Wang
Hi Glen: I think it is in your application code: The indexReader returned is not closed if the underlying index has changed. If your update rate is high, you will run into this issue because GC may not have caught up with the FH leak. The code should instead be: if (indexReader!=null){ I

Re: Realtime search best practices

2009-10-12 Thread John Wang
I think it was my email Yonik responded to and he is right, I was being lazy and didn't read the javadoc very carefully. My bad. Thanks for the javadoc change. -John On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote: > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > It may be surpri

Re: Realtime search best practices

2009-10-12 Thread John Wang
Oh, that is really good to know! Is this deterministic? e.g. as long as writer.addDocument() is called, next getReader reflects the change? Does it work with deletes? e.g. writer.deleteDocuments()? Thanks Mike for clarifying! -John On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless < luc...@mik

Re: faceted search performance

2009-10-12 Thread John Wang
Given you have 1M docs and about 1M terms, do you see very few docs per term? If your DocSet per term is very sparse, BitSet is probably not a good representation. A simple int array may be better for memory, and faster for iterating. -John On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot wrote: > On
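
The memory argument above can be sketched in plain Java. This is a rough illustration only: the class and method names are hypothetical, it counts payload bytes and ignores object headers, and it does not use any Lucene class.

```java
public class SparseDocSetSketch {
    // Approximate payload bytes for a bitset with one bit per document.
    static long bitSetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    // Approximate payload bytes for a sorted int[] holding docCount doc ids.
    static long intArrayBytes(int docCount) {
        return 4L * docCount;
    }

    // True when the int[] is the smaller of the two representations.
    static boolean preferIntArray(int maxDoc, int docCount) {
        return intArrayBytes(docCount) < bitSetBytes(maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        for (int docCount : new int[] {10, 1_000, 100_000, 500_000}) {
            System.out.println(docCount + " docs: bitset=" + bitSetBytes(maxDoc)
                    + " bytes, int[]=" + intArrayBytes(docCount)
                    + " bytes, prefer int[]=" + preferIntArray(maxDoc, docCount));
        }
    }
}
```

With 1M docs the bitset costs about 125 KB per term no matter how few docs match, so the int array is smaller until a term matches roughly 1 doc in 32.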

new sorting api and some perf numbers

2009-10-11 Thread John Wang
Hi guys: The new FieldComparator api looks really scary :) But after some perf testing with numbers I'd like to share, I guess it is worth it: HW: Mac Pro with 16G memory jvm: 1.6.0_13 jvm arg: -Xms1g -Xmx1g -server setup index: 1M docs evenly split into 8 segments (to make sure the test

Re: Realtime & distributed

2009-10-11 Thread John Wang
Eric: For more specific Zoie questions, let's move it to the zoie discussion group instead. Thanks -John On Sun, Oct 11, 2009 at 2:31 PM, John Wang wrote: > Hi Eric: > > I regret the direction the thread has taken and partly take responsibility > for it... > > As t

Re: Realtime & distributed

2009-10-11 Thread John Wang
globe. Sometimes there are differences of > opinion, however those are easily ironed out over time (and quite > frankly in this case benchmarks). > > However I am very concerned about your ignorant disregard of some of the > most basic human rights in existence. > > -J

Re: Realtime & distributed

2009-10-09 Thread John Wang
I can provide some preliminary numbers (we will need to do some detailed analysis and post it somewhere): Dataset: medline starting index: empty. add only, no update, for 30 min. maximum indexing load, 1000 docs/ sec Under stress, we take indexing events (add only) and stream into both systems: Z

Re: Realtime & distributed

2009-10-08 Thread John Wang
Jason: I would really appreciate it if you would stop making false statements and misinformation. Everyone is entitled to his/her opinions on technologies, but deliberately making misleading and false information on such a distribution is just unethical, and you'll end up just discrediting

2.9 NRT w.r.t. sorting and field cache

2009-09-21 Thread John Wang
Looking at the code, seems there is a disconnect between how/when field cache is loaded when IndexWriter.getReader() is called. Is FieldCache updated? Otherwise, are we reloading FieldCache for each reader instance? Seems for operations that lazy loads field cache, e.g. sorting, this has a signif

Re: searching for c++, c#, etc...

2009-07-16 Thread John Wang
If you escape the character + or #, the sentence: "I know java + c++" would not skip +, furthermore, it breaks query parsing, where + is reserved. -John On Thu, Jul 16, 2009 at 9:04 AM, John Wang wrote: > This runs into problems when you have such following sentence: > "I

Re: searching for c++, c#, etc...

2009-07-16 Thread John Wang
This runs into problems when you have such following sentence: "I dislike c++." If you use WSA, then last token is "c++.", not "c++", the query would not find this document. -John On Thu, Jul 16, 2009 at 8:29 AM, Chris Salem wrote: > That seems to be working. you don't have to escape the plus
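
The tokenization problem described above can be reproduced without Lucene. The sketch below is a stand-in that assumes a whitespace analyzer behaves like a plain split on whitespace; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenSketch {
    // Splits on runs of whitespace only, so punctuation stays attached to the token.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("I dislike c++.");
        System.out.println(tokens);                 // [I, dislike, c++.]
        System.out.println(tokens.contains("c++")); // false: the indexed token is "c++."
    }
}
```

A query for the term c++ would therefore miss this sentence unless trailing punctuation is stripped at index time.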

addIndexesNoOptimize

2009-07-03 Thread John Wang
Hi guys: Running into a question with IndexWriter.addIndexesNoOptimize: I am trying to expand a smaller index by replicating it into a larger index. So I am adding the same directory N times. I get an exception because noDupDirs(dirs) fails. For this call, is this check necessary?

Re: kamikaze

2009-04-30 Thread John Wang
You are right, Grant. Michael, Anmol, let's move this to the kamikaze mailing list: http://groups.google.com/group/kamikaze-users Michael, I have added you by default. -John On Thu, Apr 30, 2009 at 4:37 PM, Grant Ingersoll wrote: > Does Kamikaze have a mailing list? It seems like, to me anyway,

Re: Query did not return results

2009-04-24 Thread John Wang
What analyzers are you using for both query and indexing? Can you also post some code showing how you indexed? -John On Fri, Apr 24, 2009 at 8:02 PM, blazingwolf7 wrote: > > Hi, > > I created a query that will find a match inside documents. Example of text > match "terror india" > And documents with this

Re: kamikaze

2009-04-24 Thread John Wang
Hi Michael: We are using it internally here at LinkedIn for both our search engine as well as our social graph engine. And we have a team developing actively on it. Let us know how we can help you. -John On Fri, Apr 24, 2009 at 1:56 PM, Michael Mastroianni < mmastroia...@glgroup.com> wrote:

Re: Faceting, Sort and DocIDSet

2009-04-22 Thread John Wang
Karsten: Yes, you kinda need that for faceting to work. Take a look at FacetDataCache class. -John On Wed, Apr 22, 2009 at 3:06 AM, Karsten F. wrote: > > Hi Dave, > > facets: > in you case a solution with one > int[IndexReader.maxDoc()] > fits. For each document number you can store an inte

Re: Faceting, Sort and DocIDSet

2009-04-20 Thread John Wang
Hi David: We built bobo-browse specifically for these types of usecases: http://code.google.com/p/bobo-browse Let me know if you need any help getting it going. -John On Mon, Apr 20, 2009 at 12:59 PM, Karsten F. wrote: > > Hi David, > > correct: you should avoid reading the content o

Re: LocalLucene/Lucene Spatial

2009-04-19 Thread John Wang
Is there a reason the Query is built from a bitset via a ConstantScoreQuery instead of a RangeQuery? Seems we would be paying a penalty for loading the bitset, esp. since the bitset would be rather sparse. Furthermore, is TrieRangeQuery planning to be somehow used in the spatial package? Thanks -John On

Re: Google's search Appliance relevance ranking

2009-04-17 Thread John Wang
Little I know about GSA, there isn't a distributed solution (old information, not sure if it is still the case), so it is not very easy to scale your search system. Something you can achieve rather easily with a Lucene/Solr implementation. There are other benefits of using an open source solution s

Re: Autonomy search technology

2009-04-06 Thread John Wang
> John mentions. > > -Grant > > > On Apr 3, 2009, at 7:24 PM, John Wang wrote: > > Not quite. For example, # of fields is static throughout the corpus. # zones >> is per document. E.g. let's say you have 1 million docs, some docs have 2 >> paragraphs,

Re: Autonomy search technology

2009-04-03 Thread John Wang
rch, but came up empty handed. > > Thanks for your time! > > Matthew Runo > Software Engineer, Zappos.com > mr...@zappos.com - 702-943-7833 > > On Apr 3, 2009, at 10:08 AM, John Wang wrote: > > > Verity VDK, which was bought by autonomy, has zone search. S

Re: Autonomy search technology

2009-04-03 Thread John Wang
> > Thanks for your time! > > Matthew Runo > Software Engineer, Zappos.com > mr...@zappos.com - 702-943-7833 > > > On Apr 3, 2009, at 10:08 AM, John Wang wrote: > > Verity VDK, which was bought by autonomy, has zone search. Something >> lucene >> currently does not

Re: Autonomy search technology

2009-04-03 Thread John Wang
Verity VDK, which was bought by Autonomy, has zone search, something Lucene currently does not support. We have implemented it on top of Lucene and are thinking about contributing it. -John On Fri, Apr 3, 2009 at 8:56 AM, Lukáš Vlček wrote: > Hi, > anybody has experience with Autonomy search technolog

Re: IndexWriter.deleteDocuments(Query query)

2009-04-02 Thread John Wang
m doing? BTW, can you shine some light on why IndexWriter would move docids around when it is opened and no docs have been added to it? Thanks -John On Thu, Apr 2, 2009 at 2:20 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Apr 1, 2009 at 6:37 PM, John Wang wrot

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Apr 1, 2009 at 5:22 PM, John Wang wrote: > > Hi Michael: > > > >1) Yes, we use TermDocs, exactly what > IndexWriter.deleteDocuments(Term) > > is doing under the cover. > > This

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
ess < luc...@mikemccandless.com> wrote: > On Wed, Apr 1, 2009 at 2:04 PM, John Wang wrote: > > > My test essentially this. I took out the reader.deleteDocuments call from > > both scenarios. I took a index of 5m docs. a batch of 1 randomly > > generated uids. > >

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Thanks Michael for the info. I do guarantee there are no modifications between when "MySpecialIndexReader" is loaded and when I iterate and find the deleted docids. I was, however, not aware that docids move when IndexWriter is opened. I thought they moved only when docs are added and when it is committed.

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
how > would you produce that docIdSet? > > We could consider delete by Filter instead, since that exposes the > necessary getDocIdSet(IndexReader) method. > > Or, with near real-time search, we could enhance it to allow deletions > via the obtained reader (the first approach doesn't

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
So do you think it is a good addition/change to the current api now? -John On Tue, Mar 31, 2009 at 2:18 PM, Yonik Seeley wrote: > On Tue, Mar 31, 2009 at 4:58 PM, John Wang wrote: > > I fail to see the difference of exposing the api to allow for a Query > > instance to be

Re: API to get index info

2009-03-31 Thread John Wang
Excellent! Thanks -John On Tue, Mar 31, 2009 at 2:25 PM, Yonik Seeley wrote: > On Tue, Mar 31, 2009 at 4:55 PM, John Wang wrote: > > Maybe I am missing something. I don't see any calls that would gimme the > > number of segments. Are you suggesting: > IndexCom

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
eeley wrote: > On Tue, Mar 31, 2009 at 3:41 PM, John Wang wrote: > > Also, can we expose IndexWriter.deleteDocuments(int[] docids)? > > Exposing internal ids from the IndexWriter may not be a good idea > given that they are transient. > > > -Yonik

Re: API to get index info

2009-03-31 Thread John Wang
Maybe I am missing something. I don't see any calls that would gimme the number of segments. Are you suggesting: IndexCommit.getFileNames().size()? Thanks -John On Tue, Mar 31, 2009 at 1:04 PM, Yonik Seeley wrote: > On Tue, Mar 31, 2009 at 3:43 PM, John Wang wrote: > > Can we ha

API to get index info

2009-03-31 Thread John Wang
Can we have an API that exposes index information, e.g. number of segments etc.? (or simply make SegmentInfo(s) public classes) We currently do this by working around package-level protection by sneaking a subclass into the org.apache.index package. We are moving towards OSGI, and split-packages

IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
Hi guys: IndexWriter.deleteDocuments(Query query) api is not really making sense to me. Wouldn't IndexWriter.deleteDocuments(DocIdSet set) be better? Since we don't really care about scoring for this call. Also, can we expose IndexWriter.deleteDocuments(int[] docids)? Using the current api is

Re: Faceted search with OpenBitSet/SortedVIntList

2009-02-08 Thread John Wang
Elschot wrote: > John, > > On Sunday 08 February 2009 00:35:10 John Wang wrote: > > Our implementation of facet search can handle this. Using bitsets for > > intersection is not scalable performance wise when index is large. > > > > We are using a compact forwarded i

Re: Faceted search with OpenBitSet/SortedVIntList

2009-02-07 Thread John Wang
Our implementation of facet search can handle this. Using bitsets for intersection is not scalable performance wise when index is large. We are using a compact forwarded index representation in memory for the counting. Similar to FieldCache idea but more compact. Check it out at: http://sourcefor

Re: Lucene Index Monitor

2009-01-27 Thread John Wang
Luke is great, but sometimes you don't have a windowing system installed on the target machine. A webapp like LIMO is very useful. It is unfortunate that it is not being maintained. -John On Mon, Jan 26, 2009 at 3:44 PM, Chris Hostetter wrote: > > : I need to monitor my searches and index. i kn

Re: IndexReader.isDeleted

2009-01-24 Thread John Wang
Mike: "We are considering replacing the current random-access IndexReader.isDeleted(int docID) method with an iterator & skipTo (DocIdSet) access that would let you iterate through the deleted docIDs, instead." This is exactly what we are doing. We do have to however, build the intern

Re: TermScorer default buffer size

2009-01-08 Thread John Wang
> > On Wednesday 07 January 2009 07:25:17 John Wang wrote: > > > Hi: > > > > > >The default buffer size (for docid,score etc) is 32 in TermScorer. > > > > > > We have a large index with some terms to have very dense doc sets. > By >

TermScorer default buffer size

2009-01-06 Thread John Wang
Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer

Re: Field.omitTF

2008-12-18 Thread John Wang
ieldable youll find: > > /** Expert: > * > * If set, omit term freq, positions and payloads from postings for this > field. > */ > void setOmitTf(boolean omitTf); > > - Mark > > > John Wang wrote: > >> Thanks Mark!I don't think it is documented (a

Re: Field.omitTF

2008-12-18 Thread John Wang
Thanks Mark! I don't think it is documented (at least the ones I've read), should this be considered as a bug or ... ? Thanks -John On Thu, Dec 18, 2008 at 2:05 PM, Mark Miller wrote: > Drops positions as well. > > - Mark > > > > On Dec 18, 2008, at 4:57 PM,

Field.omitTF

2008-12-18 Thread John Wang
Hi: In lucene 2.4, when Field.omitTF() is called, payload is disabled as well. Is this intentional? My understanding is payload is independent from the term frequencies. Thanks -John

Re: Taxonomy in Lucene

2008-12-12 Thread John Wang
between solr and browseengine ? > > Thanks for mention browseengine. I really like the car demo! > > Best regards > Karsten > > > John Wang wrote: > > > > We are doing lotsa internal changes for performance. Also upgrading the > > api > > to support

Re: Taxonomy in Lucene

2008-12-12 Thread John Wang
wsing: > starting point is > org.cdlib.xtf.textEngine.facet.GroupCounts#addDoc > ? > (It works with millions of facet values on millions of hits) > > What is the starting point in browseengine? > > How is the connection between solr and browseengine ? > > Thanks for mention browseengine. I really like t

Re: Taxonomy in Lucene

2008-12-11 Thread John Wang
We are doing lotsa internal changes for performance. Also upgrading the api to support more features. So my suggestion is to wait for 2.0. (should release this month, at the latest mid jan) We can take this offline if you want to have a deeper discussion on browse engine. Thanks -John On Thu

Re: Taxonomy in Lucene

2008-12-10 Thread John Wang
We are doing a release shortly which contains an API change. Let us know if you need help. -John On Wed, Dec 10, 2008 at 11:27 AM, John Wang <[EMAIL PROTECTED]> wrote: > www.browseengine.com > -John > > > On Wed, Dec 10, 2008 at 10:55 AM, Glen Newton <[EMAIL PROTECTED]

Re: Taxonomy in Lucene

2008-12-10 Thread John Wang
www.browseengine.com -John On Wed, Dec 10, 2008 at 10:55 AM, Glen Newton <[EMAIL PROTECTED]> wrote: > From what I understand: > faceted browse is a taxonomy of depth =1 > > A taxonomy in general has an arbitrary depth: > > Example: Biological taxonomy: > > Kingdom Animalia > Phylum Acanthocepha

Re: Chinese Analyzer evaluation

2008-12-09 Thread John Wang
Hi Cooper: Where are these classes? Thanks -John On Tue, Dec 9, 2008 at 2:27 AM, Cooper Geng <[EMAIL PROTECTED]> wrote: > Hi all, > > My application will provide Chinese search engine. I got some analyzer on > Chinese language. > Any suggestion about these: > > IK_CAnalyzer > IKAnalyzer > >

Re: Sorting documents without a query

2008-12-05 Thread John Wang
The obvious way is to use MatchAllDocsQuery with Sort parameters on the searcher, e.g. searcher.search(new MatchAllDocsQuery(),sort); If you only care about 1 sort spec (e.g. no secondary sort to break ties) it may be faster just traversing the term table since that is already sorted. -John

Re: TopDocs

2008-12-04 Thread John Wang
searcher.doc(scoreDoc.doc); On Thu, Dec 4, 2008 at 6:59 PM, Ian Vink <[EMAIL PROTECTED]> wrote: > I have this search which returns TopDocs > TopDocs topDocs = searcher.Search(query, bookFilter, maxDocsToFind); > > > How do I get the document object for the ScoreDoc? > > foreach (ScoreDoc scoreDo

Re: Suggestions for drill downs

2008-12-04 Thread John Wang
On Thu, Dec 4, 2008 at 5:46 PM, Muralidharan V <[EMAIL PROTECTED]>wrote: > John, > > Using the FieldCache worked well. Thanks! > > -Murali > > On Thu, Dec 4, 2008 at 3:10 PM, John Wang <[EMAIL PROTECTED]> wrote: > > > Easiest way to do thi

Re: Suggestions for drill downs

2008-12-04 Thread John Wang
Easiest way to do this is using the FieldCache. It constructs a StringIndex object which gives you very fast lookup to the field value (index) given a docid. Create a parallel count array to the lookup array for the StringIndex. Run your HitCollector thru should be fast. Loading FieldCache maybe ex
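
The parallel count array described above can be sketched without Lucene. Here `order` and `lookup` are plain arrays standing in for the per-document term index and the sorted term values that a StringIndex exposes; the class name, the data, and the hit list are made up for illustration.

```java
import java.util.Arrays;

public class FacetCountSketch {
    // order[doc] is the index into lookup[] of that document's field value.
    // Returns counts[i] = number of hits whose value is lookup[i].
    static int[] countFacets(int[] order, int numValues, int[] hits) {
        int[] counts = new int[numValues];
        for (int doc : hits) {
            counts[order[doc]]++; // one array read and one increment per hit
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lookup = {"blue", "green", "red"}; // sorted distinct field values
        int[] order = {2, 0, 0, 1, 2, 2};           // value index for docs 0..5
        int[] hits = {0, 2, 3, 5};                  // doc ids a collector would see
        int[] counts = countFacets(order, lookup.length, hits);
        for (int i = 0; i < lookup.length; i++) {
            System.out.println(lookup[i] + ": " + counts[i]);
        }
        System.out.println(Arrays.toString(counts)); // [1, 1, 2]
    }
}
```

In a real collector the increment would happen inside the per-hit callback, so no hit list needs to be materialized.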

Re: NIOFSDirectory

2008-12-04 Thread John Wang
, could someone explain? > > thanks, > -glen > > > 2008/12/4 John Wang <[EMAIL PROTECTED]>: > > Thanks! > > -John > > > > On Thu, Dec 4, 2008 at 2:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > > >> Details in the bug: >

Re: NIOFSDirectory

2008-12-04 Thread John Wang
ockFactory); > } > > -Yonik > > > On Thu, Dec 4, 2008 at 5:08 PM, John Wang <[EMAIL PROTECTED]> wrote: > > That does not help. The File/path is not stored with the instance. It is > in > > a map FSDirectory keeps statically. Should subclasses of FSDirectory

Re: Slow queries with lots of hits

2008-12-04 Thread John Wang
Tim: How about implementing your own HitCollector and stop when you have collected 100 docs with score above certain threshold? BTW, are there lotsa concurrent searches? -John On Thu, Dec 4, 2008 at 12:52 PM, Tim Sturge <[EMAIL PROTECTED]> wrote: > That makes sense. I should be more p
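
One way to picture the early-stopping collector suggested above, as a plain-Java sketch: the collector throws a lightweight exception once it has enough hits, and the caller catches it. The classes below are hypothetical and only mimic the shape of a collect(doc, score) callback; they are not Lucene's HitCollector.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyStopCollectorSketch {
    // Thrown to abort the scan; stack trace disabled to keep it cheap.
    static final class CollectionDone extends RuntimeException {
        CollectionDone() {
            super(null, null, false, false);
        }
    }

    static final class ThresholdCollector {
        private final float minScore;
        private final int limit;
        final List<Integer> docs = new ArrayList<>();

        ThresholdCollector(float minScore, int limit) {
            this.minScore = minScore;
            this.limit = limit;
        }

        // Keeps docs scoring above minScore; aborts once limit docs are kept.
        void collect(int doc, float score) {
            if (score > minScore) {
                docs.add(doc);
                if (docs.size() >= limit) {
                    throw new CollectionDone();
                }
            }
        }
    }

    // Simulates a search loop feeding (doc, score) pairs to the collector.
    static List<Integer> run(float[] scores, float minScore, int limit) {
        ThresholdCollector collector = new ThresholdCollector(minScore, limit);
        try {
            for (int doc = 0; doc < scores.length; doc++) {
                collector.collect(doc, scores[doc]);
            }
        } catch (CollectionDone done) {
            // Expected: the collector has gathered enough hits.
        }
        return collector.docs;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.1f, 0.8f, 0.7f, 0.95f};
        System.out.println(run(scores, 0.5f, 2)); // [0, 2]
    }
}
```

The trade-off is that the result is the first N docs above the threshold in docid order, not the N best.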

Re: NIOFSDirectory

2008-12-04 Thread John Wang
..what version are we talking about? :-) > > The current development version of Lucene allows you to directly > instantiate FSDirectory subclasses. > > -Yonik > > > > thanks, > > > > Glen > > > > 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>: >

Re: NIOFSDirectory

2008-12-04 Thread John Wang
Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote: > > Hi guys: > >We did some profiling and benchmarking: > > > >The thread contention on FSDirectory is gone, and for the set of > queries >

NIOFSDirectory

2008-12-04 Thread John Wang
Hi guys: We did some profiling and benchmarking: The thread contention on FSDirectory is gone, and for the set of queries we are running, performance improved by a factor of 5 (to be conservative). Great job, this is awesome, a simple change that made a huge difference. To get NIO

Re: Pooling indexReader

2008-06-30 Thread John Wang
eader (call > its incRef()) and then decRef() it when you're done. That would probably be > cleanest... > > Mike > > > On Jun 29, 2008, at 11:51 AM, John Wang wrote: > > Hi: >> I had some code to do indexReader pooling to avoid open and close on a >> large

Pooling indexReader

2008-06-29 Thread John Wang
Hi: I had some code to do indexReader pooling to avoid open and close on a large index when doing lotsa searches. So I had a FilteredIndexReader proxy that overrides the doClose method to do nothing, and when I really want to close it, I call super.doClose(). This pattern worked well for me prior

Re: Indexing the spider content

2008-06-24 Thread John Wang
Maybe building a Lucene gateway to hook in with VSpider. Are you using VSpider or K2Spider? -John On Tue, Jun 24, 2008 at 8:35 PM, yugana <[EMAIL PROTECTED]> wrote: > > Hi Otis, > > Thanks for the reply. So you mean it is not possible to use Lucene to index > the fetched (Verity Spider Content)

changing index format

2008-06-24 Thread John Wang
Hi: I am trying to add couple more values to the TermInfo file and want to keep the index backward compatible. But I see values such as docFreq etc. are stored as a VInt, so I couldn't do things like using the signed bit to determine whether to read/write the extra values. Any suggestions? (
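
To see why a VInt leaves no spare sign bit, here is a small stand-alone sketch of the variable-length encoding: seven payload bits per byte, with the high bit of each byte marking that more bytes follow. The class and method names are hypothetical and the code does not touch Lucene's own I/O classes.

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Encodes i as 1 to 5 bytes, low-order 7 bits first.
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // high bit set: another byte follows
            i >>>= 7;
        }
        out.write(i); // last byte, high bit clear
        return out.toByteArray();
    }

    // Decodes a value produced by writeVInt.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {5, 127, 128, 16_384, -1}) {
            byte[] encoded = writeVInt(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded " + readVInt(encoded));
        }
    }
}
```

Every bit in the stream is either payload or a continuation flag, and a negative value always costs the full five bytes, so flagging extra data through the sign of docFreq would inflate every entry that uses it.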

Re: IndexReader.reopen memory leak

2008-05-29 Thread John Wang
How big is your index? Thanks -John On Thu, May 29, 2008 at 10:29 AM, Michael Busch <[EMAIL PROTECTED]> wrote: > Does your FilteredIndexReader.reopen() return a new instance of > FilteredIndexReader in case the inner reader was updated (i. e. > in!=newInner)? > > > -

Re: IndexReader.reopen memory leak

2008-05-29 Thread John Wang
fig); } fixes my leak. -John On Thu, May 29, 2008 at 12:35 AM, Michael Busch <[EMAIL PROTECTED]> wrote: > Could you share some details about how you implemented reopen() in your > reader? > > -Michael > > > John Wang wrote: > >> Yes, I do close the old reader. >

Re: IndexReader.reopen memory leak

2008-05-28 Thread John Wang
with the reference >> counting. Are you doing anything special? E. g. do you have own reader >> implementations that you call reopen() on? What kinds of readers are you >> using? >> >> Are you maybe able to provide a heapdump? >> >> -Michael

IndexReader.reopen memory leak

2008-05-27 Thread John Wang
Hi: We are experiencing memory leak with calling IndexReader.reopen(). From eyeballing the lucene source code, I am seeing normCache is not cleared. Anyone else experiencing this? Thanks -John

Re: distributed lucene progress

2008-05-21 Thread John Wang
I see. So is it then the bailey project? -John On Tue, May 20, 2008 at 9:04 PM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: > Oh, it very much did. Check Hadoop Wiki's "Recent Changes", it's there. > > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > - Original

distributed lucene progress

2008-05-14 Thread John Wang
Hi: What is the current status on the distributed lucene project proposed at: http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html Thanks -John

Re: confused about an entry in the FAQ

2008-05-10 Thread John Wang
If your indexed field is not used for further filtering or scoring of the doc, you should use some sort of priority queueing mechanism to gather the top N documents. You can then call reader.document() on those docs if necc. -John On Sat, May 10, 2008 at 6:35 AM, Stephane Nicoll <[EMAIL
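
The priority-queue idea above, sketched with java.util.PriorityQueue rather than Lucene's own queue classes: a min-heap of size N keeps the best N doc ids seen so far, so only those N documents ever need to be loaded. The class name and the scores are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    // Returns the doc ids of the n highest scores, best first.
    static List<Integer> topN(float[] scores, int n) {
        // Min-heap ordered by score: the weakest kept doc sits at the head.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        for (int doc = 0; doc < scores.length; doc++) {
            if (heap.size() < n) {
                heap.add(doc);
            } else if (n > 0 && scores[doc] > scores[heap.peek()]) {
                heap.poll(); // evict the current weakest
                heap.add(doc);
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll()); // ascending by score
        }
        Collections.reverse(result); // best first
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        System.out.println(topN(scores, 2)); // [1, 3]
    }
}
```

Stored fields would then be fetched only for the returned ids, which is the point of deferring the document load.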

Re: Does Lucene Supports Billions of data

2008-05-01 Thread John Wang
TED]> wrote: > On Thursday 01 May 2008 00:01:48 John Wang wrote: > > I am not sure how well lucene would perform with > 2 Billion docs in a > > single index anyway. > > Even if they're in multiple indexes, the doc IDs being ints will still > prevent > it going pa

Re: Does Lucene Supports Billions of data

2008-04-30 Thread John Wang
> > That said, Lucene needs to support >2B, so docids (and all associated > internals) need to become 'long' fairly soon > > -Glen > > 2008/4/30 John Wang <[EMAIL PROTECTED]>: > > lucene docids are represented in a java int, so max signed int would be > the

Re: Does Lucene Supports Billions of data

2008-04-30 Thread John Wang
lucene docids are represented in a java int, so max signed int would be the limit, a little over 2 billion. -John On Wed, Apr 30, 2008 at 11:54 AM, Sebastin <[EMAIL PROTECTED]> wrote: > > Hi All, > Does Lucene supports Billions of data in a single index store of size 14 > GB > for every search.I
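
For reference, the bound mentioned above is simply the largest Java int. A tiny sketch (the class and helper are hypothetical, and the check treats Integer.MAX_VALUE as an upper bound rather than Lucene's exact limit):

```java
public class DocIdLimitSketch {
    // Doc ids are ints, so a single index cannot address more than this many docs.
    static final long MAX_DOCS_UPPER_BOUND = Integer.MAX_VALUE;

    static boolean fitsInOneIndex(long docCount) {
        return docCount <= MAX_DOCS_UPPER_BOUND;
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE);              // 2147483647
        System.out.println(fitsInOneIndex(2_000_000_000L)); // true
        System.out.println(fitsInOneIndex(3_000_000_000L)); // false
    }
}
```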

Re: Why Lucene has to rewrite queries prior to actual searching?

2008-04-07 Thread John Wang
Other use is for custom Query objects to reboost or expand the user query from information gathered from the indexreader at search time. -John On Mon, Apr 7, 2008 at 2:56 PM, Paul Elschot <[EMAIL PROTECTED]> wrote: > Itamar, > > Query rewrite replaces wildcards with terms available from > the ind

Re: Problems about using Lucene to generate tag cloud..

2008-04-04 Thread John Wang
check out http://www.browseengine.com tag cloud impl on lucene is avail. -John On Wed, Apr 2, 2008 at 4:12 PM, Daniel Noll <[EMAIL PROTECTED]> wrote: > On Thursday 03 April 2008 08:08:09 Dominique Béjean wrote: > > Hum, it looks like it is not true. > > Use a do-while loop make the first terms.t

Re: payload performance wrt fieldcache

2008-04-03 Thread John Wang
Apparently tp.nextPosition() is needed :( Any ideas? -John On Thu, Apr 3, 2008 at 8:20 AM, John Wang <[EMAIL PROTECTED]> wrote: > I am loading both from disk. > But I found the culprit: > > My code: > > while (tp.next()) > > { > >

Re: payload performance wrt fieldcache

2008-04-03 Thread John Wang
Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) > got 2.6 Million Euro funding! > > > On Thu, Apr 3, 2008 at 7:36 AM, John Wang <[EMAIL PROTECTED]> wrote: > > Sorry, gmail was screwy and accidentally sent the msg. >

Re: payload performance wrt fieldcache

2008-04-03 Thread John Wang
d cache load, and it took much longer than when it had 1000. I did some profiling and the profiler is pointing to TermPositions.next and TermPositions.nextPosition and TermPositions.getPayload as the culprit. I would think payload would always be faster. Any ideas? Thanks -John On Thu, Apr 3, 2008 a

payload performance wrt fieldcache

2008-04-03 Thread John Wang
Hi:

Re: is it possible to change the way score from different field combine to give final lucene score

2008-03-26 Thread John Wang
Hi Grant: I don't see FunctionQuery in the javadocs. Can you post a link? Thanks -john On Mon, Mar 24, 2008 at 3:32 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > See the FunctionQuery and the org.apache.lucene.search.function > package. You can also implement your own query, as it's n

Re: random accessing term value

2008-03-25 Thread John Wang
Tue, Mar 25, 2008 at 11:16 AM, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > On Mar 25, 2008, at 1:32 PM, John Wang wrote: > >Is there a way to random accessing term value in a field? e.g. > > > >in my field, content, the terms are: lucene, is, cool >

random accessing term value

2008-03-25 Thread John Wang
Hi: Is there a way to random accessing term value in a field? e.g. in my field, content, the terms are: lucene, is, cool Is there a way to access content[2] -> cool? Thanks -John

Re: Biggest index

2008-03-14 Thread John Wang
We are running on one box in prod with 20 million docs in one index. -John On Fri, Mar 14, 2008 at 8:01 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > How big is your machine and how big are your docs? (unique terms, > etc.) Even if it would fit, it sounds like you are going to have to > go d

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
ene.document.Document,%20org.apache.lucene.analysis.Analyzer%29> > . > > > > On Mar 13, 2008, at 4:12 PM, John Wang wrote: > > > Hi Grant: > > > >For our corpus, we don't rely on idf in scoring calculation that > > much, > > so I don't see that being

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
8 at 11:37 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Mar 13, 2008, at 11:03 AM, John Wang wrote: > > > Yes, but usually it's a good idea to add documents in batch and not > > having > > to reinstantiate the writer for every document and then closing

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
, > thus your application can identify the language, choose the analyzer > for the given doc, and then add the document > > See > public void addDocument(Document doc, Analyzer analyzer) > > > On Mar 12, 2008, at 8:40 PM, John Wang wrote: > > > Hi all: > > > &

indexing api wrt Analyzer

2008-03-12 Thread John Wang
Hi all: Maybe this has been asked before: I am building an index consisting of documents in multiple languages (the language is stored as a field), and I have different analyzers depending on the language of the document to be indexed. But the IndexWriter takes only an Analyzer. I was hoping to have IndexWriter t

Re: changing scoring formula

2008-03-08 Thread John Wang
you can always modify the raw lucene score in the hitCollector. -John On Wed, Mar 5, 2008 at 1:16 PM, sumittyagi <[EMAIL PROTECTED]> wrote: > > is there any way to change the score of the documents. > Actually i want to modify the scores of the documents dynamically, > everytime > for a given que
