Re: Running query against a single document

2018-09-21 Thread Tom Mortimer
Hi, Have you considered using MemoryIndex <https://lucene.apache.org/core/6_5_1/memory/org/apache/lucene/index/memory/MemoryIndex.html> ? cheers, Tom tel +44 8700 118334 : mobile +44 7876 741014 : skype tommortimer On Fri, 21 Sep 2018 at 13:58, Aurélien MAZOYER wrote: > Hi, >

Re: Matching a single instance of a multivalued field

2018-06-08 Thread Tom Mortimer
Ah, that's an interesting idea - thanks Adrien! On Fri, Jun 8, 2018 at 3:54 PM Adrien Grand wrote: > Hi Tom, > > One way to solve this could be to use block joins by indexing each value in > its own document and joining the parent document using > ToParentBlockJoinQuery. >

Matching a single instance of a multivalued field

2018-06-08 Thread Tom Mortimer
quite long (I don't know if there's a limit). Is there a neater way? cheers, Tom

Re: EOF exception from ramDirectory search in spark

2018-05-11 Thread Tom Hirschfeld
org.apache.lucene.util.packed.DirectReader$DirectPackedReader48.get(DirectReader.java:305) ... 35 more On Fri, May 11, 2018 at 1:15 AM, Adrien Grand wrote: > Can you share the full stack trace? > > Le ven. 11 mai 2018 à 04:19, Tom Hirschfeld a > écrit : > > > Hey All, > > I h

EOF exception from ramDirectory search in spark

2018-05-10 Thread Tom Hirschfeld
to address this issue but I have been unable to find out what's going on. Any hint as to what might be happening here? Best, Tom Hirschfeld

Lucene, Spark, HDFS question

2018-03-13 Thread Tom Hirschfeld
s this compatible? Are we able to store our index in HDFS and read from a spark job? Best, Tom Hirschfeld

NumericDocValues vs SortedNumericDocValues

2018-02-05 Thread Tom Hirschfeld
sort about 200 results. My specific questions are, for our use case, how do these two fields differ in: 1) total index size 2) query time performance/impact on sorting 3) any other "gotchas" I may not have thought of yet Thanks for your time & assistance! Best, Tom Hirschfeld

Spatial Indexing of Polygons

2017-08-14 Thread Tom Hirschfeld
ed if it exists. Is there a recommended way to support indexing and searching of polygons (building footprint sized polygons, not huge ones)? If so what is the currently recommended API to use? We are currently thinking about using the s2cell library from google. Best, Tom Hirschfeld

Optimizing number of segments in lucene index (no writes/deletes, only reads)

2017-06-13 Thread Tom Hirschfeld
segment per cpu in prod? 1 segment per core in prod? Something else? Best, Tom Hirschfeld

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-13 Thread Tom Hirschfeld
Once again, thanks for your help. Best, Tom Hirschfeld On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler wrote: > Hi, > Are you sure that the term index is the problem? Even with huge indexes > you never need 65 GB of heap! That's impossible. > Are you sure that your problem is not

Term Dictionary taking up lots of heap memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Tom Hirschfeld
is issue? If so, how do I go about loading an alternative codec and configuring it to my needs? I'm having trouble finding docs/examples of how this is used in the real world so even if you point me to a repo or docs somewhere I'd appreciate it. Thanks! Best, Tom Hirschfeld

Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Tom Hirschfeld
? If so, how do I go about loading an alternative codec and configuring it to my needs? I'm having trouble finding docs/examples of how this is used in the real world so even if you point me to a repo or docs somewhere I'd appreciate it. Thanks! Best, Tom Hirschfeld

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
Thanks Mike, Do you know how I can configure Solr to use the min=200 and max=398 block sizes you suggested? Or should I ask on the Solr list? Tom On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > The first int to Lucene41PostingsFormat is the

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
got high. I've appended a bit more from the error trace and the top memory users from one of the heap dumps below. I tried to send a bunch of heap dumps to the mailing list but the message got rejected. I'll send them directly to you. Tom java.lang.OutOfMemoryError: J

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-10 Thread Tom Burton-West
hat is the trade-off when increasing the block size? Tom On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > The first int to Lucene41PostingsFormat is the min block size (default > 25) and the second is the max (default 48) for the block tree ter

Details on setting block parameters for Lucene41PostingsFormat

2015-01-09 Thread Tom Burton-West
mat.html#Lucene41PostingsFormat%28int,%20int%29> " Is there documentation or discussion somewhere about how to determine appropriate parameters or some detail about what setting the maxBlockSize and minBlockSize does? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

index writer closes due to OOM/heap space issue but no recovery after GC

2015-01-09 Thread Tom Burton-West
ce(see attached) but I continue getting this error. Can someone please explain why after the GC frees memory, I continue to get the error? p.s. My documents average about 800KB and at completion each shard has over 3 billion unique terms.

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-09 Thread Tom Burton-West
Hi Robert, Thanks for the fix. CheckIndex finished within 24 hours, which is not terrible, given the size of this index (about a terabyte). Tom Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index Segments file=segments_e numSegments=2 version=4.2.1 format= userData={commitTimeMSec

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Tom Burton-West
check out revision 1511014 from branch 4x and build it? Tom On Thu, Aug 8, 2013 at 10:51 AM, Robert Muir wrote: > Hi Tom, I committed a fix for the root cause > (https://issues.apache.org/jira/browse/LUCENE-5156). > > Thanks for reporting this! > > I don't know if it's feasib

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Tom Burton-West
in a large termvectors file, although only about 800,000 docs per index. Should we expect to see something similar or with the two orders of magnitude decrease in the number of docs, might CheckIndex work a bit faster? Tom --- Started CheckIndex on Tuesday July 30 and it

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-02 Thread Tom Burton-West
tors(CheckIndex.java:1503) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613) at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854) I don't think highlighting is too slow (at least for our small indexes), but will take a look at the postingshighl

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-01 Thread Tom Burton-West
taling a few hundred K. Tom The top 10 processes in pmap are: total804,745,732K 2baaf526c000 300,897,888K r--s- /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch.tvd 2b3b4bf1b000 155,250,472K r--s- /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch_Lucene41_0.doc 2b88aa

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
Thanks Mike, Got my sysadmins to upgrade our test machine to "1.7.0_09" Will ask them to upgrade production which is currently 1.6.0_45-b06 on the indexing machines and 1.6.0_16-b01 on the serving machines. Tom On Tue, Jul 30, 2013 at 1:47 PM, Michael McCandless < luc...@mikem

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
gging, and remember to echo STDERR to a log and run it again on one of the indexes. I'll report back as soon as something interesting shows up. (Probably tomorrow sometime.) Tom On Tue, Jul 30, 2013 at 11:22 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Can you ge

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Tom Burton-West
I didn't start it up with any GC logging or hooks to attach jconsole. I'm going to kill it and maybe try again and give it more memory and maybe turn on GC logging. Tom On Tue, Jul 30, 2013 at 8:41 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > I think that'

Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-29 Thread Tom Burton-West
itten Jul 27 02:28, Note that in this 750 GB segment we have about 83 million docs with about 2.4 billion unique terms and about 110 trillion tokens. Have we hit a new CheckIndex limit? Tom --- Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index Segments file=segme

Re: TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-20 Thread Tom Burton-West
some feature related to Doc Values and the Collectors? Tom On Tue, Jun 18, 2013 at 1:14 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > +1 to somehow refactor this scary test to make it more understandable! > > Mike McCandless > > http://blog.mikemccandless.c

build of trunk hangs

2013-06-20 Thread Tom Burton-West
omeone point me to the FAQ or the appropriate resource to figure out what is going on? Tom - resolve: [echo] Building replicator... ivy-availability-check: [echo] Building replicator... ivy-fail: ivy-configure: [ivy:configure] :: lo

Re: TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-18 Thread Tom Burton-West
a while, since I need to walk through the code a lot more before I understand it enough to feel confident that I can pull out an appropriate test. Tom On Tue, Jun 18, 2013 at 1:14 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > +1 to somehow refactor this scary test

TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-18 Thread Tom Burton-West
llector could have a separate test? Tom

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Tom Burton-West
Please add tburtonw to contributors Tom Burton-West tburtonw at umich dot edu Tom On Mon, Mar 25, 2013 at 9:05 AM, Steve Rowe wrote: > > On Mar 25, 2013, at 8:49 AM, Rafał Kuć wrote: > > Could you add RafalKuc to contributors ? Thanks :) > > Added to ContributorsGroup. >

Re: 答复: About the Sorting of Groups during Grouping by

2013-03-04 Thread Tom Burton-West
Hello Oliver, We are very interested in group sorting based on some aggregation function also. Would you consider contributing your code to Lucene, or posting your results? Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library

Re: Which token filter can combine 2 terms into 1?

2012-12-26 Thread Tom
erms. For example, let's say one snippet in your SnippetFilter is: "word2 word3" you will get Term 0: field=body text=word1 Term 1: field=body text=word2 word3 In this case, word2 and word3 will NOT be split. > > -- Jack Krupansky > > -Original Mess

Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Tom
On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky wrote: > And to be more specific, most query parsers will have already separated > the terms and will call the analyzer with only one term at a time, so no > term recombination is possible for those parsed terms, at query time. > Most analyzers will

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-13 Thread Tom Burton-West
problem occurs, but these are 45GB indexes so merging all 12 takes about a day and running CheckIndex feels like it takes a day, although it's probably only a few hours. Any hints on an easier way to troubleshoot or ideas about what might be causing the problem? Tom java version "1.7.0_09"

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Tom Burton-West
Thanks Robert, I've asked our sysadmins to install a more recent Java version for testing. I'll report back if it fails with the newer Java version. Tom On Wed, Dec 5, 2012 at 4:53 PM, Robert Muir wrote: > I'm particularly thinking its something like > http://bug

CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Tom Burton-West
ndle to 274 billion. Appended below is the message from CheckIndex, the command line used to merge the indexes, and the term count line from CheckIndex run on each of the 12 indexes that were later merged. Tom CheckIndex error: Opening index @ bigramsRetest Segments file=segments_1 numSegments=

Re: Which stemmer?

2012-11-16 Thread Tom Burton-West
's "Hot Dogs" http://www.youtube.com/watch?v=v670qVwzm9c (Hard to make out what he's singing on the old 78, but he says his "dogs" is red hot, meaning he can run really fast.) http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/ Tom

Re: Superset Similarity?

2012-11-16 Thread Tom Burton-West
these out/switch back and forth/run experiments and comparisons without re-indexing." Does Solr expose this ability to change similarities without re-indexing? i.e could you just change your schema? Tom http://www.hathitrust.org/blogs/large-scale-search On Thu, Nov 15, 2012 at 11:3

Re: Which stemmer?

2012-11-15 Thread Tom Burton-West
Inference Process" (*Krovetz*, R., Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191-203, 1993). *http://ciir.cs.umass.edu/pubfiles/ir-35.pdf*<http://ciir.cs.umass.edu/pubfiles/ir-35.pdf> " Tom http://www

Re: international stop set?

2012-10-27 Thread Tom
On Fri, Oct 26, 2012 at 8:34 PM, Trejkaz wrote: > On Sat, Oct 27, 2012 at 1:53 PM, Tom wrote: > > Hello, > > > > using Lucene 4.0.0b, I am trying to get a superset of all stop words (for > > an international app). > > I have looked around, and not found anythi

RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Burton-West, Tom
logs/large-scale-search/slow-queries-and-common-words-part-2 You can also look at the standard stop word sets at http://snowball.tartarus.org/ (look under the entries for each stemmer) or http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/ or http://members.unine.ch/jacques.savoy/clef/index.html

RE: Does change to ICU in Lucene/Solr 3.3 require re-indexing?

2011-07-14 Thread Burton-West, Tom
? Sounds like we will need to re-index the whole 9 million books with the Solr/Lucene 3.3 (4.8 jar) to be on the safe side. Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Thursday, July 14, 2011 2:29 PM To: java-user@lucene.apache.org Subject: Re: Does change to ICU in

Does change to ICU in Lucene/Solr 3.3 require re-indexing?

2011-07-14 Thread Burton-West, Tom
nizing or folding Do the changes to the ICU filters/tokenizers in Solr/Lucene 3.3 change how tokenizing and the folding filter work in terms of queries run through the 3.3 filters possibly not matching documents indexed with the 3.1dev filters? Tom Burton-West

RE: Non-English Languages Search

2011-05-13 Thread Burton-West, Tom
Hi Ivan and Robert, >> sounds like you should talk to Tom Burton-West! Ok, I'll bite. A few questions: Are you planning to have separate fields for each language or the same fields with contents in different languages? If #2 are you planning to have a field to indicate the language

RE: Sharding Techniques

2011-05-12 Thread Burton-West, Tom
Have you considered running cache warming queries of your most frequent terms/phrases so that the data is in the OS disk cache? Tom >> When queries (without two fields mentioned above) have a lot of >>words/phrases search time is high. E.g I took a query with around 80 unique >>t

Re: SpanNearQuery - inOrder parameter

2011-05-10 Thread Tom Hill
that I'm not familiar with the span query code, so this is just a quick deduction. Not sure how easy it would be to add this duplicate term detection, if that's the problem. Tom On Tue, May 10, 2011 at 5:58 AM, Gregory Tarr wrote: > Anyone able to help me with the problem below?

RE: Sharding Techniques

2011-05-10 Thread Burton-West, Tom
ld help depends on just what your bottleneck is. It's not clear that your index is large enough that the size of the index is causing your bottleneck. We run indexes of about 350GB with average response times under 200ms and 99th percentile response times of under 2 s

RE: Link to nightly build test reports on main Lucene site needs updating

2011-05-02 Thread Burton-West, Tom
Thanks for fixing++ Tom -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Sunday, May 01, 2011 6:05 AM To: d...@lucene.apache.org; simon.willna...@gmail.com; java-user@lucene.apache.org Subject: RE: Link to nightly build test reports on main Lucene site needs

Link to nightly build test reports on main Lucene site needs updating

2011-04-29 Thread Burton-West, Tom
apache.org? Tom

RE: TermDoc to TermDocsEnum

2011-03-23 Thread Burton-West, Tom
org/viewvc/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=markup 3.x version here: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=markup Tom http://www.hathitrust.org/b

termIndexInterval, CheckIndex, size of tis file and Lucene index compression

2011-03-21 Thread Burton-West, Tom
f the term takes a VInt that only occupies 1 byte, we have 6 bytes for that data, which leaves only 3 bytes for the String that holds the Suffix. What am I missing here? Tom Burton-West ---
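The byte counting in the snippet above depends on how many bytes a VInt occupies. As a rough illustration, here is a minimal stdlib-only sketch of the variable-length integer scheme described in the Lucene file format documentation (7 data bits per byte, low-order group first, high bit set on every byte except the last). The class and method names are hypothetical and this is not Lucene's own code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of Lucene: illustrates VInt sizing only.
final class VIntSketch {

    // Encode an int using 7 data bits per byte; the high bit of each byte
    // signals that more bytes follow.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Decode a single VInt occupying the whole array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Under this scheme, values up to 127 fit in a single byte, which is the case the snippet is reasoning about.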

Understanding the IndexWriter-Infostream log

2011-03-17 Thread Burton-West, Tom
wing? "DW: ramUsed=33.467 MB newFlushedSize=6764406 docs/MB=0.155 new/old=19.276%" What do "docs/MB" and "new/old" mean? Tom
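One reading that is consistent with the numbers in the quoted log line: "new/old" is the flushed segment size as a percentage of the RAM that was in use, and "docs/MB" is the number of flushed documents per megabyte of flushed segment. The sketch below only checks that arithmetic. The document count of 1 is an assumption inferred from the figures (the log line does not state it), 1 MB is taken as 1024 * 1024 bytes, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical arithmetic check; does not call any Lucene API.
final class FlushStatsSketch {
    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    // Flushed segment size as a percentage of the RAM used before the flush.
    static double newOldPercent(double ramUsedMb, long newFlushedSizeBytes) {
        return 100.0 * newFlushedSizeBytes / (ramUsedMb * BYTES_PER_MB);
    }

    // Documents per megabyte of flushed segment.
    static double docsPerMb(int docCount, long newFlushedSizeBytes) {
        return docCount / (newFlushedSizeBytes / BYTES_PER_MB);
    }

    public static void main(String[] args) {
        double ramUsedMb = 33.467;      // from the log line
        long newFlushedSize = 6764406L; // from the log line, in bytes
        int assumedDocCount = 1;        // ASSUMPTION: not present in the log line

        System.out.println(String.format(Locale.ROOT, "new/old=%.3f%%",
                newOldPercent(ramUsedMb, newFlushedSize)));
        System.out.println(String.format(Locale.ROOT, "docs/MB=%.3f",
                docsPerMb(assumedDocCount, newFlushedSize)));
    }
}
```

With a single very large document, a low docs/MB figure like the one logged would be expected; with many small documents the same formula gives a much larger value.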

RE: Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
Thanks Robert, I opened up LUCENE-2906. But I just realized in the effort to keep the description short, I forgot to include your option of producing both unigrams and bigrams, which is a nice option. Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday

RE: Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
izer could be modified to output bigrams or that a filter could be designed that would take the output of the ICUTokenizer and create shingles on tokens with the attribute for Han? Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, February 04, 20

Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
ter in the filter chain after the ICUTokenizer that would produce overlapping bigrams for CJK? Tom Burton-West

Re: RE: maybe I hit a bug of Term ?

2010-12-10 Thread Tom Hill
So, if you write your own Map implementation, you may have to care about this, but in the general case, you can just use any Collection and it will work. Tom 2010/12/10 Sariny : > Object.hashCode() is "implemented by converting the internal address of the > object into an integer"

ICUTokenizer and CJK

2010-11-22 Thread Burton-West, Tom
Hi all, I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it does with CJK, which for C and J appears to be breaking into unigrams. Is this correct? Tom

API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Burton-West, Tom
structure in memory rather than having to force Lucene to actually read the entire tis file by using termEnum.next() ? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: High frequency term for the searched query

2010-11-04 Thread Burton-West, Tom
will look up the total number of documents containing the term and the total number of occurrences of the term in the index. http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/GetTermInfo.java?revision=957522&view=markup Tom

RE: scalability limit in terms of numbers of large documents

2010-08-16 Thread Burton-West, Tom
bottlenecks with disk I/O or other bottlenecks, long before you reach this limit. BTW: we index whole books as Solr documents, not chapters or pages. Tom www.hathitrust.org/blogs ---

RE: Question to the writer of MultiPassIndexSplitter

2010-08-05 Thread Burton-West, Tom
say "SinglePassSplitter work started, to be contributed soon." You might try asking him directly or posting to the java-dev list. Tom www.hathitrust.org/blogs -Original Message- From: Christopher Condit [mailto:con...@sdsc.edu] Sent: Thursday, August 05, 2010 12:08 PM To: Yatir Ben S

RE: on-the-fly "filters" from docID lists

2010-07-23 Thread Burton-West, Tom
unt == 1) { bits.set(docs[0]); } >>That could involve a lot of disk seeks unless you cache a pk->docid lookup in >>ram. That sounds interesting. How would the pk->docid lookup get populated? Wouldn't a pk->docid cache be invalidated with each commit or merg
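The fragment above sets bits for document IDs taken from a list. The thread's actual filter code is truncated, so the following is only a minimal sketch of the general idea using `java.util.BitSet` from the standard library rather than any Lucene class; the class name, method name, and the `maxDoc` bound are assumptions for illustration.

```java
import java.util.BitSet;

// Hypothetical helper: turns a list of docIDs into a bit set, one bit per document.
final class DocIdBitsSketch {

    static BitSet fromDocIds(int[] docs, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docs) {
            if (doc < 0 || doc >= maxDoc) {
                throw new IllegalArgumentException("docID out of range: " + doc);
            }
            bits.set(doc); // duplicates are harmless, the bit is simply set again
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = fromDocIds(new int[] {3, 17, 17, 42}, 100);
        System.out.println("matching docs: " + bits.cardinality());
    }
}
```

Setting a bit is cheap, so the cost raised in the thread is more likely to come from mapping external keys to docIDs than from populating the bit set itself.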

RE: on-the-fly "filters" from docID lists

2010-07-22 Thread Burton-West, Tom
Hi Mike and Martin, We have a similar use-case. Is there a scalability/performance issue with the getDocIdSet having to iterate through hundreds of thousands of docIDs? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Michael McCandless

RE: Relevancy Practices

2010-04-29 Thread Fornoville, Tom
and the scoring and relevancy in the search engine itself. Cheers, Tom -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: donderdag 29 april 2010 16:15 To: java-user@lucene.apache.org Subject: Relevancy Practices I'm putting on a talk

RE: Right memory for search application

2010-04-27 Thread Fornoville, Tom
nfo in the excellent article here: http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/ Regards, Tom -Original Message- From: Samarendra Pratap [mailto:samarz...@gmail.com] Sent: dinsdag 27 april 2010 15:50 To: java-user@lucene.apache.org Subject: Re: Ri

RE: Understanding lucene indexes and disk I/O

2010-04-13 Thread Burton-West, Tom
Sometime in the next month or so we will get our new test server and after I get the backup of testing jobs under control, I'd love to do some testing with flex and our data. Tom -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, April 13

Understanding lucene indexes and disk I/O

2010-04-12 Thread Burton-West, Tom
d and added together. If that is true then we are talking a sequential read of the entire tis file up to the current term. Is this correct? Can someone point me to the area of the code base where this is implemented ? Am I missing something here? Tom Burton-West

Re: search on documents which DO NOT have field defined

2010-03-10 Thread Tom Hill
ause.Occur.MUST); bq.add(trq, BooleanClause.Occur.MUST_NOT); Tom On Wed, Mar 10, 2010 at 2:11 PM, Tom Hill wrote: > Try > > -fieldname:[* TO *] > > as in > > > http://localhost:8983/solr/select/?q=-weight%3A[*+TO+*]&version=2.2&start=0&rows=10&indent=on
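The fragment above pairs a MUST clause with a MUST_NOT clause. A Lucene BooleanQuery containing only prohibited clauses matches nothing, so a negative condition needs a positive clause to subtract from. The MUST clause itself is cut off in the snippet; the sketch below assumes it matches every document and models the combination with plain `java.util.BitSet` set arithmetic. It is a conceptual model with hypothetical names, not Lucene code.

```java
import java.util.BitSet;

// Hypothetical model of "all documents MUST match, documents with the field MUST_NOT".
final class MissingFieldSketch {

    static BitSet docsWithoutField(int maxDoc, BitSet docsWithField) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);        // MUST clause (assumed): every document
        result.andNot(docsWithField); // MUST_NOT clause: drop docs that have a value
        return result;
    }

    public static void main(String[] args) {
        BitSet withField = new BitSet(5);
        withField.set(1);
        withField.set(3);
        System.out.println(docsWithoutField(5, withField)); // docs lacking the field
    }
}
```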

Re: search on documents which DO NOT have field defined

2010-03-10 Thread Tom Hill
Try -fieldname:[* TO *] as in http://localhost:8983/solr/select/?q=-weight%3A[*+TO+*]&version=2.2&start=0&rows=10&indent=on Tom On Wed, Mar 10, 2010 at 1:48 PM, bgd wrote: > Hi, > I have a bunch of documents which do not have a particular field defined. > How can d

Re: Can you use reduced sized test indexes to predict performance gains for a larger index?

2010-02-15 Thread Tom Burton-West
the other hand, once we started building our test indexes so they were significantly larger than the amount of memory available for OS disk caching, we could see results that extrapolated out to the large index. Tom Burton-West www.hathitrust.org ryguasu wrote: > > I'd like

Re: Question about many fields within a single index

2009-12-30 Thread Tom Hill
alFields x nDocs. So, an index with 10,000 documents, with one field each, same field for all docs: -rw-r--r-- 1 tom wheel 10004 Dec 30 18:54 _0.nrm a 10,000 doc index, where each doc has one of 100 different field names, but still only one field per doc: -rw-r--r-- 1 tom wheel 10

Re: document with different index time boost returns same score

2009-12-18 Thread Tom Hill
your scoring. You can run luke (http://code.google.com/p/luke/), and look at the values for fieldNorm. It's on the documents tab. Does the ordering look like it is based on these numbers? Then length differences are probably what's happening to you. Tom On Fri, Dec 18, 2009 at 10:26 A

RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

2009-12-12 Thread Rob Staveley (Tom)
ier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Anshum [mailto:ansh...@gmail.com] > Sent: Friday, December 11, 2009 7:31 PM > To: java-user@lucene.apache.org > Subject: Re: Lucene 3.0.0 writer with a Lucene 2.3.1 index >

Lucene 3.0.0 writer with a Lucene 2.3.1 index

2009-12-11 Thread Rob Staveley (Tom)
I'm upgrading from 2.3.1 to 3.0.0. I have 3.0.0 index readers ready to go into production and writers in the process of upgrading to 3.0.0. I think I understand the implications of http://wiki.apache.org/lucene-java/BackwardsCompatibility#File_Formats for the upgrade, but I'd love it if someone coul

RE: IndexWriter.MaxFieldLength.UNLIMITED at what price?

2009-12-10 Thread Rob Staveley (Tom)
risks of getting massive docs. And even then I'd first try to create other mechanisms to try to not index such documents... Mike On Thu, Dec 10, 2009 at 3:15 AM, Rob Staveley (Tom) wrote: > I was wondering where I might read about the cost of using > IndexWr

IndexWriter.MaxFieldLength.UNLIMITED at what price?

2009-12-10 Thread Rob Staveley (Tom)
I was wondering where I might read about the cost of using IndexWriter.MaxFieldLength.UNLIMITED versus IndexWriter.MaxFieldLength.LIMITED. Are there any consequences over and above the obvious one that you are going to analyse more content in your IndexWriter when you have more than 10,000 chara

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
rly test your changes before deploying to production ? On Wed, Dec 9, 2009 at 17:55, Rob Staveley (Tom) wrote: > COMPRESS is supported (only deprecated) in 2.9.1, so I'm expecting them to be > supported > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/document/Fiel

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
d, Dec 9, 2009 at 16:50, Rob Staveley (Tom) wrote: > Thanks, Danil. I think you've saved me a lot of time. Weiwei too - converting > rather than reindexing everything, which will save a lot of time. > > So, I should do this: > > 1. Convert readers to 2.9.1, which should be

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
not have full access to the data center, you can read(readonly > mode is preferred) from the data center(through nfs or something like that) > and write to your local disk. > > When all converting is done, you can copy the new index to the data center > with the help of the a

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
read the old version index and then use a 3.0.0 IndexWriter to write all the documents into a new index 3. Update QueryParser to 3.0.0 I've redeployed my system and it works fine now. On Wed, Dec 9, 2009 at 8:13 PM, Rob Staveley (Tom) wrote: > I have Lucene 2.3.1 code and indexes deployed in

Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
I have Lucene 2.3.1 code and indexes deployed in production in a distributed system and would like to bring everything up to date with 3.0.0 via 2.9.1. Here's my migration plan: 1. Add an index writer which generates a 2.9.1 "test" index 2. Have that "test" index writer push that 2.9.1 "test" ind

Re: question related to Indexing

2009-12-08 Thread Tom Hill
If you tell us WHY you want to do this, rather than HOW you want to do it, the chances are much better that someone can help. What's the business motivation here? What does the end user want to achieve? Tom On Tue, Dec 8, 2009 at 8:16 AM, Phanindra Reva wrote: > Hello, >Tha

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Tom Hill
Is that a policy for Lucene? Thanks, Tom On Mon, Dec 7, 2009 at 4:44 PM, Jason Rutherglen wrote: > I wonder if Google Collections (even though we don't use third party > libraries) concurrent map, which supports weak keys, handles the > removal of weakly referenced keys in a more ele

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Tom Hill
l value to the cache, and doing a GC, and see if your memory drops then. Tom On Mon, Dec 7, 2009 at 1:48 PM, TCK wrote: > Thanks for the response. But I'm definitely calling close() on the old > reader and opening a new one (not using reopen). Also, to simplify the > analysis, I

New Technical White Papers on Apache Lucene 2.9 and Solr 1.4 from Lucid Imagination

2009-10-23 Thread Tom Alt
ation, rich document acquisition and more). Download (reg required) at http://www.lucidimagination.com/whitepaper/whats-new-in-solr-1-4?sc=AP Tom www.lucidimagination.com

Re: Re: about TopFieldDocs

2009-01-05 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: about TopFieldDocs

2009-01-05 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Re: Search Test file

2009-01-03 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Search Test file

2009-01-03 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Re: Search Problem

2009-01-02 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Search Problem

2009-01-02 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Re: Search Problem

2009-01-01 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Search Problem

2009-01-01 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: Showing highlighted results

2009-01-01 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Showing highlighted results

2009-01-01 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: How to index pdf, html, doc and other MIME types in lucene

2008-12-31 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: How to index pdf, html, doc and other MIME types in lucene

2008-12-31 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Re: IndexCommit#getFileNames() returning duplicates?

2008-12-29 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: IndexCommit#getFileNames() returning duplicates?

2008-12-29 Thread tom
AUTOMATIC REPLY LUX is closed until 5th January 2009 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
