Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
han > 3.x's terms index... if you run CheckIndex with -verbose it will print > additional details about the block structure of your terms indices... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West > wrot

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
Thanks Mike, > OK. It would be good to know where all your RAM is being consumed, > and how much of that is really the terms index: it ought to be a very > small part of it. > > I made a bunch of heap dumps. I just watched with jconsole and ran jmap -histo when memory use got high. I've appende

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-10 Thread Tom Burton-West
tional details about the block structure of your terms indices... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West > wrote: > > Hello all, > > > > We have over 3 billion unique terms in our indexes

Details on setting block parameters for Lucene41PostingsFormat

2015-01-09 Thread Tom Burton-West
mat.html#Lucene41PostingsFormat%28int,%20int%29> " Is there documentation or discussion somewhere about how to determine appropriate parameters or some detail about what setting the maxBlockSize and minBlockSize does? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

index writer closes due to OOM/heap space issue but no recovery after GC

2015-01-09 Thread Tom Burton-West
ce (see attached) but I continue getting this error. Can someone please explain why after the GC frees memory, I continue to get the error? p.s. My documents average about 800KB and at completion each shard has over 3 billion unique terms.

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-09 Thread Tom Burton-West
docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET] No problems were detected with this index. On Thu, Aug 8, 2013 at 11:24 AM, Robert Muir wrote: > On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West > wrote: > > Sure I should be able to build a lucene core and give

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Tom Burton-West
le for you to build a lucene-core.jar from > branch_4x and run checkindex with that jar file to confirm it really > addresses the issue: if this is possible in any way it would be > fantastic. > > There is nothing wrong with your index: its just a code thing :) > > On Thu, Aug 8, 2

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Tom Burton-West
Hi Robert, I've been running CheckIndex for over a week and it is still working through seekCeil() (See below.) I'm going to kill the CheckIndex. Admittedly, this index is an unusual one, but at one point we were considering using MLT in our regular index which would result in a large termvecto

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-02 Thread Tom Burton-West
Thanks Robert, Looks like it switches between seekCeil and seekExact: "main" prio=10 tid=0x0e79a000 nid=0x5fe5 runnable [0x2b32de0cc000] jstack.out3- java.lang.Thread.State: RUNNABLE jstack.out3-at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.see

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-01 Thread Tom Burton-West
> > On Tue, Jul 30, 2013 at 1:06 PM, Tom Burton-West > wrote: > > Thanks Mike, Robert and Adrien, > > > > Unfortunately, I killed the processes, so its too late to get a stack > > trace. On thing that was suspicious was that top was reporting memory > use

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
ccandless.com> wrote: > You should also upgrade your Java! > > 1.6.0_16 is really ancient and has exciting bugs ... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Jul 30, 2013 at 1:06 PM, Tom Burton-West > wrote: > > Thanks Mike,

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
Thanks Mike, Robert and Adrien, Unfortunately, I killed the processes, so it's too late to get a stack trace. One thing that was suspicious was that top was reporting memory use as 20GB res even though I invoked the JVM with java -Xmx10g -Xms10g. I'm going to double the memory, turn on GC logging,

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
lues..." after that. > > > Mike McCandless > > http://blog.mikemccandless.com > > > On Mon, Jul 29, 2013 at 4:30 PM, Tom Burton-West > wrote: > > We have very large indexes, almost a terabyte for a single index, and > > normally it takes overnight to run a che

Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-29 Thread Tom Burton-West
We have very large indexes, almost a terabyte for a single index, and normally it takes overnight to run a checkindex. I started a CheckIndex on Friday and today (Monday) it seems to be stuck testing vectors although we haven't got vectors turned on. (See below) The output file was last written J

Re: TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-20 Thread Tom Burton-West
om > > > On Tue, Jun 18, 2013 at 12:48 PM, Tom Burton-West > wrote: > > Hello, > > > > I'm trying to understand BlockGroupingCollector. I thought I would > start > > by running the tests in the debugger. However the only test I can find > is > >

build of trunk hangs

2013-06-20 Thread Tom Burton-West
I'm trying to build trunk and when I run "ant compile" the build hangs right after "Building replicator" at the line "common.resolve:". (see below for more context) I'm not familiar with Ivy so I'm not too sure where to look for the problem. Can someone point me to the FAQ or the appropriate reso

Re: TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-18 Thread Tom Burton-West
to make it more understandable! > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Jun 18, 2013 at 12:48 PM, Tom Burton-West > wrote: > > Hello, > > > > I'm trying to understand BlockGroupingCollector. I thought I would > start >

TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-18 Thread Tom Burton-West
Hello, I'm trying to understand BlockGroupingCollector. I thought I would start by running the tests in the debugger. However the only test I can find is lucene/grouping/src/test/org/apache/lucene/search/grouping/TestGrouping.java In TestGrouping.java, in the second test, "testRandom" it see

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Tom Burton-West
Please add tburtonw to contributors Tom Burton-West tburtonw at umich dot edu Tom On Mon, Mar 25, 2013 at 9:05 AM, Steve Rowe wrote: > > On Mar 25, 2013, at 8:49 AM, Rafał Kuć wrote: > > Could you add RafalKuc to contributors ? Thanks :) > > Added to ContributorsGroup. >

Re: 答复: About the Sorting of Groups during Grouping by

2013-03-04 Thread Tom Burton-West
Hello Oliver, We are very interested in group sorting based on some aggregation function also. Would you consider contributing your code to Lucene, or posting your results? Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-13 Thread Tom Burton-West
n segments (containing 865870 documents) detected WARNING: would write new segments file, and 865870 documents would be lost, if -fix were specified On Wed, Dec 5, 2012 at 5:29 PM, Robert Muir wrote: > On Wed, Dec 5, 2012 at 2:27 PM, Tom Burton-West > wrote: > > > Thanks Robert, > &g

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Tom Burton-West
s.sun.com/bugdatabase/view_bug.do?bug_id=5091921 > > We tried to add workarounds to lucene to dodge problems from this, but > really a newer unaffected version would be safer. > > On Wed, Dec 5, 2012 at 1:47 PM, Robert Muir wrote: > > > > > On Wed, Dec 5, 2012 at 1:30

CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Tom Burton-West
Hello, I'm trying to merge 12 indexes into one big index using the Lucene IndexMergeTool (command line used appended below). The merge seemed to finish successfully, but when I ran CheckIndex on the merged index, I got an array out of bounds error "java.lang.ArrayIndexOutOfBoundsException: 13315

Re: Which stemmer?

2012-11-16 Thread Tom Burton-West
Hi Mike, >>Honestly I've never heard of anyone using "dogs" to mean feet either, but hey nobody's perfect. This is really off topic but I couldn't resist. This usage of "dogs" to mean feet occurs in old blues lyrics such as Blind Lemon Jefferson's "Hot Dogs" http://www.youtube.com/watch?v=v670qV

Re: Superset Similarity?

2012-11-16 Thread Tom Burton-West
Hi Otis, I hope this is not off-topic. Apparently in Lucene, similarity does not have to be set at index time. See http://lucene.apache.org/core/4_0_0/changes/Changes.html under Lucene 2959 "All models default to the same index-time norm encoding as DefaultSimilarity, so you can easily try these

Re: Which stemmer?

2012-11-15 Thread Tom Burton-West
I agree with Erick that you probably need to give your client a list of concrete examples, and perhaps to explain the trade-offs. All stemmers both overstem and understem. Understemming means that some forms of a word won’t get searched. For example, without stemming, searching for “dogs” would

Re: Can you use reduced sized test indexes to predict performance gains for a larger index?

2010-02-15 Thread Tom Burton-West
the other hand, once we started building our test indexes so they were significantly larger than the amount of memory available for OS disk caching, we could see results that extrapolated out to the large index. Tom Burton-West www.hathitrust.org ryguasu wrote: > > I'd like