Sort merge strategy?

2016-11-16 Thread Kevin Burton
What's the current status of the sort merge strategy? I want to sort an index by a given field and keep it in that order on disk. It seems to have evolved over the years and I can't easily figure out the current status via the Javadoc in 6.x.

Re: Possible to cause documents to be contiguous after forceMerge?

2016-11-15 Thread Kevin Burton
On Tue, Nov 15, 2016 at 6:16 PM, Erick Erickson wrote:
> You can make no assumptions about locality in terms of where separate documents land on disk. I suppose if you have the whole corpus at index time you could index these "similar" documents contiguously. T
Wow.. that's shockingly fr

Possible to cause documents to be contiguous after forceMerge?

2016-11-15 Thread Kevin Burton
I have a large index (say 500GB) with a large percentage of near-duplicate documents. I have to keep the documents there (can't delete them) as the metadata is important. Is it possible to get the documents to be contiguous somehow? Once they are contiguous then they will compress very well
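The compression argument in the question above can be illustrated with a small sketch. This uses only java.util.zip.Deflater from the JDK, not Lucene's stored-field compression, so it shows the general effect rather than what Lucene would actually do. The class name, the 20 KB document size, and the way near-duplicates are modeled (same pseudo-random payload, one differing byte) are all assumptions made for the illustration. DEFLATE can only refer back to data within a 32 KB window, so a near-duplicate shrinks only when its twin sits close by.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

public class ContiguityDemo {
    static final int DOC_SIZE = 20_000; // assumed document size in bytes

    // Hypothetical near-duplicate: every copy in a family shares the same
    // pseudo-random payload; only the first byte differs between copies.
    static byte[] makeDoc(int family, int copy) {
        byte[] doc = new byte[DOC_SIZE];
        new Random(family).nextBytes(doc);
        doc[0] = (byte) copy;
        return doc;
    }

    // Lays out families * copies documents either grouped by family
    // (near-duplicates adjacent) or interleaved (near-duplicates far apart).
    static byte[] layout(int families, int copies, boolean grouped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int outer = grouped ? families : copies;
        int inner = grouped ? copies : families;
        for (int i = 0; i < outer; i++) {
            for (int j = 0; j < inner; j++) {
                byte[] doc = grouped ? makeDoc(i, j) : makeDoc(j, i);
                out.write(doc, 0, doc.length);
            }
        }
        return out.toByteArray();
    }

    // Returns the number of bytes DEFLATE produces for the input.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        System.out.println("grouped:     " + deflatedSize(layout(3, 2, true)));
        System.out.println("interleaved: " + deflatedSize(layout(3, 2, false)));
    }
}
```

With three families of two copies each, the grouped layout places each twin 20 KB after its original, inside the window, while the interleaved layout places it 60 KB away, outside it. The same reasoning applies to any block-based compressor with a bounded window or chunk size.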

Re: Lucene and Xanga.com

2005-08-25 Thread Kevin Burton
On 8/24/05, Monsur Hossain <[EMAIL PROTECTED]> wrote:
> Otis, we've been continually impressed with the performance of Lucene. We've been ever increasing the load we are putting on it (from our small help section, to our slightly larger metros, to our big groups, and then our gigantic webl

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> * A Nutch implementation: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/
> * A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763
A step in the right direction. It doesn't have other language categories created though.
> * JTextCat (http

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives a lot of hits. http://sourceforge.net/projects/ngramj and also "languageidentifier" in Nutch are two examples of Open Source Java implementations. Each can be used with Lucene.
I think I've played with ngramj and found

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
>ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf > > > > Linguini: Language Identification for Multilingual Documents > > John M. Prager > > Prager also uses an n-gram approach, so you might

NGram Language Categorization Source

2005-08-20 Thread Kevin Burton
Hey Lucene guys. I know for a fact that a bunch of you have been curious about language categorization for a long time now and Java has lacked a solid way to solve this problem. Anyway. This new library that I just released should be easy to tie into your Lucene indexers. Just use the library o

Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Kevin Burton
Peter A. Friend wrote: I changed that value to 8k, and based on the truss output from an index run, it is working. Haven't gotten much beyond that to see if it causes problems elsewhere. The value also needs to be altered on the read end of things. Ideally, this will be made settable via

Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Kevin Burton
Chris Collins wrote: Well I am currently looking at merging too. In my application merging will occur against a filer (read as higher latency device). I am currently working on how to stage indices on local disk before moving to a filer. Assume I must move to a filer eventually for whatever c

Re: Optimizing indexes with multiple processors?

2005-06-09 Thread Kevin Burton
Chris Collins wrote: To follow up. I was surprised by what I found from the experiment of indexing 4k documents to local disk (Dell PE with onboard RAID with 256MB cache). I got the following data from my profile: 70% of the time was spent inverting the document, 30% in merge. Oh.. yeah.. that's i

Re: Optimizing indexes with multiple processors?

2005-06-09 Thread Kevin Burton
Bill Au wrote: Optimize is disk I/O bound. So I am not sure what multiple CPUs will buy you. Now on my system with large indexes... I often have the CPU at 100%... Kevin

Optimizing indexes with multiple processors?

2005-06-09 Thread Kevin Burton
Is it possible to get Lucene to do an index optimize on multiple processors? It's a single-threaded algorithm currently, right? It's a shame since I have a quad machine but I'm only using 1/4th of the capacity. That's a heck of a performance hit. Kevin

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-09 Thread Kevin Burton
Andrew Boyd wrote: Kevin, those results are awesome. Could you please give those of us that were following but not quite understanding everything some pseudo code or some more explanation? Ug.. I hate to say this but ignore these numbers. Turns out that I was hitting a cache ... I thou

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton
Paul Elschot wrote: For a large number of indexes, it may be necessary to do this over multiple indexes by first getting the doc numbers for all indexes, then sorting these per index, then retrieving them from all indexes, and repeating the whole thing using terms determined from the retrieved d
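The first steps of the approach Paul Elschot describes (collect doc numbers across indexes, sort them per index, then retrieve) can be sketched without any Lucene calls. The class below is hypothetical and only builds the fetch order; the actual retrieval and the follow-up term lookup from the truncated part of the message are left out. Hits are assumed to arrive as (index number, doc number) pairs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class FetchPlan {
    // Groups hits by index and sorts the doc numbers within each index, so
    // that each index can be read in ascending doc order instead of jumping
    // back and forth between indexes and positions.
    // Each hit is a two-element array: {indexNumber, docNumber}.
    public static SortedMap<Integer, List<Integer>> plan(int[][] hits) {
        SortedMap<Integer, List<Integer>> byIndex = new TreeMap<>();
        for (int[] hit : hits) {
            byIndex.computeIfAbsent(hit[0], k -> new ArrayList<>()).add(hit[1]);
        }
        for (List<Integer> docs : byIndex.values()) {
            Collections.sort(docs);
        }
        return byIndex;
    }

    public static void main(String[] args) {
        int[][] hits = { {1, 500}, {0, 42}, {1, 7}, {0, 3}, {1, 19} };
        // Iterate the plan index by index; a real caller would fetch each
        // document here with whatever reader API it uses.
        plan(hits).forEach((index, docs) ->
                System.out.println("index " + index + " -> docs " + docs));
    }
}
```

Sorting per index is a cheap in-memory step; whether it pays off depends on how much the underlying storage benefits from sequential access.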

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton
Chris Hostetter wrote: : was computing the score. This was a big performance gain. About 2x and : since it's the slowest part of our app it was a nice one. :) : : We were using a TermQuery though. I believe that one search on one BooleanQuery containing 20 TermQueries should be faster than 20

use of LinkedList in ConjunctionScorer hurting performance?

2005-06-07 Thread Kevin Burton
This is a strange anomaly I wanted to point out: http://www.flickr.com/photos/burtonator/18030919/ This is a jprofiler screenshot. I can give you a jprofiler "snapshot" if you want but it requires the clientside app. I'm not sure why this should be hot... in a linked list this should be fas

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton
Matt Quail wrote: We have a system where I'll be given 10 or 20 unique keys. I assume you mean you have one unique-key field, and you are given 10-20 values to find for this one field? Internally I'm creating a new Term and then calling IndexReader.termDocs() on this term. Then if te

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton
Chris Hostetter wrote: I haven't profiled either of these suggestions but: 1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the total time to execute the search, and access each Hit slower than your termDocs approach? Actually using any type of query was very slow. Th

Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton
Hey. I'm trying to figure out the FASTEST way to solve this problem. We have a system where I'll be given 10 or 20 unique keys, which are stored as non-tokenized fields within Lucene. Each key represents a unique document. Internally I'm creating a new Term and then calling IndexReader.te

Performance tuning and org.apache.lucene.store.InputStream.BUFFER_SIZE

2005-06-01 Thread Kevin Burton
I was doing a JProfiler install of our webapp/lucene last week and of course a large part of our app's time is spent in RandomAccessFile.readBytes ... This is called by InputStream.readByte which internally uses a BUFFER_SIZE of 1024 (which is the default). This value seems too small for a default

Re: Ability to load a document with ONLY a few fields for performance?

2005-06-01 Thread Kevin Burton
Andrew Boyd wrote: The numbers look impressive. If I build from the 1.9 trunk will I get the patch? Funny... I went ahead and implemented this myself and it didn't work. Of course I may have implemented it incorrectly. I'll look at the patch source and try it out! Something fun to

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
Andrew Boyd wrote: How about using range query?
    private Term begin, end;
    begin = new Term("dateField", DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">)));
    end = new Term("dateField", DateTools.dateToString(Date.valueOf(<"farFutureStringDate">)));
Ha.. crap. That won't wor

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
Andrew Boyd wrote: How about using range query?
    private Term begin, end;
    begin = new Term("dateField", DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">)));
    end = new Term("dateField", DateTools.dateToString(Date.valueOf(<"farFutureStringDate">)));
    RangeQuery query = new RangeQ

Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
I have an index with a date field. I want to quickly find the minimum and maximum values in the index. Is there a quick way to do this? I looked at using TermInfos and finding the first one but how do I find the last? I also tried the new sort API and the performance was horrible :-/ Any i

Possible to find min and max values for a Date field?

2005-05-30 Thread Kevin Burton
Is it possible to find the minimum and maximum values for a date field with a given reader? I guess I could use TermEnum to do a binary search until I get a hit but this seems a bit kludgy. Thoughts? I don't see any APIs for doing this and a google/grep of the source doesn't help. Kevin
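The idea behind the question can be shown with a small model. Lucene keeps its terms sorted by field name and then by term text, so the minimum of a field is the first term at or after the start of that field, and the maximum is the term just before the start of whatever follows it. The sketch below models the term dictionary with a java.util.TreeSet and is not the Lucene API: the class, its methods, and the NUL separator are assumptions, and it presumes field names never contain the NUL character and that values sort correctly as strings (as zero-padded dates do).

```java
import java.util.TreeSet;

public class FieldMinMax {
    private static final char SEP = '\u0000';

    // Stand-in for a sorted term dictionary: keys are field + SEP + text,
    // so all terms of one field form a contiguous sorted run.
    private final TreeSet<String> terms = new TreeSet<>();

    public void add(String field, String text) {
        terms.add(field + SEP + text);
    }

    // Smallest term text for the field, or null if the field has no terms.
    public String min(String field) {
        String key = terms.ceiling(field + SEP);
        return textOf(key, field);
    }

    // Largest term text for the field, or null if the field has no terms.
    // field + '\u0001' sorts after every key of this field, so the entry
    // just below it is the field's last term (if the field has any).
    public String max(String field) {
        String key = terms.lower(field + (char) (SEP + 1));
        return textOf(key, field);
    }

    private static String textOf(String key, String field) {
        if (key == null || !key.startsWith(field + SEP)) {
            return null;
        }
        return key.substring(field.length() + 1);
    }

    public static void main(String[] args) {
        FieldMinMax dict = new FieldMinMax();
        dict.add("date", "20050530");
        dict.add("date", "20040101");
        dict.add("date", "20051231");
        dict.add("title", "lucene");
        System.out.println(dict.min("date") + " .. " + dict.max("date"));
    }
}
```

Both lookups are single seeks in the sorted structure, which is why no binary search over candidate values is needed in this model. Whether a given Lucene version exposes an equally direct way to seek to the last term of a field is a separate question that the sketch does not answer.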

Ability to load a document with ONLY a few fields for performance?

2005-05-28 Thread Kevin Burton
I have a Document with about 15 fields. I only need two of them. How much faster would Lucene be if I only fetched the two fields? Each field is a separate file and this would almost certainly slow down just the basic IO. I think I looked at this a long time ago and there was no high level