RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

2009-12-12 Thread Rob Staveley (Tom)
e release notes that on doing so, a growth in the index > size should be anticipated and handled. > > -- > Anshum Gupta > Naukri Labs! > http://ai-cafe.blogspot.com > > The facts expressed here belong to everybody, the opinions to me. The > distinction is yours to draw.

Lucene 3.0.0 writer with a Lucene 2.3.1 index

2009-12-11 Thread Rob Staveley (Tom)
I'm upgrading from 2.3.1 to 3.0.0. I have 3.0.0 index readers ready to go into production and writers in the process of upgrading to 3.0.0. I think I understand the implications of http://wiki.apache.org/lucene-java/BackwardsCompatibility#File_Formats for the upgrade, but I'd love it if someone coul

RE: IndexWriter.MaxFieldLength.UNLIMITED at what price?

2009-12-10 Thread Rob Staveley (Tom)
risks of getting massive docs. And even then I'd first try to create other mechanisms to try to not index such documents... Mike On Thu, Dec 10, 2009 at 3:15 AM, Rob Staveley (Tom) wrote: > I was wondering where I might read about the cost of using > IndexWr

IndexWriter.MaxFieldLength.UNLIMITED at what price?

2009-12-10 Thread Rob Staveley (Tom)
I was wondering where I might read about the cost of using IndexWriter.MaxFieldLength.UNLIMITED versus IndexWriter.MaxFieldLength.LIMITED. Are there any consequences over and above the obvious one that you are going to analyse more content in your IndexWriter when you have more than 10,000 chara

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
rly test your changes before deploying to production ? On Wed, Dec 9, 2009 at 17:55, Rob Staveley (Tom) wrote: > COMPRESS is supported (only deprecated) in 2.9.1, so I'm expecting them to be > supported > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/document/Fiel

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
d, Dec 9, 2009 at 16:50, Rob Staveley (Tom) wrote: > Thanks, Danil. I think you've saved me a lot of time. Weiwei too - converting > rather than reindexing everything, which will save a lot of time. > > So, I should do this: > > 1. Convert readers to 2.9.1, which should be

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
not have full access to the data center, you can read(readonly > mode is preferred) from the data center(through nfs or something like that) > and write to your local disk. > > When all converting is done, you can copy the new index to the data center > with the help of the a

RE: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
read the old version index and then use a 3.0.0 IndexWriter to write all the documents into a new index 3. Update QueryParser to 3.0.0 I've redeployed my system and it works fine now. On Wed, Dec 9, 2009 at 8:13 PM, Rob Staveley (Tom) wrote: > I have Lucene 2.3.1 code and indexes deployed in

Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Rob Staveley (Tom)
I have Lucene 2.3.1 code and indexes deployed in production in a distributed system and would like to bring everything up to date with 3.0.0 via 2.9.1. Here's my migration plan: 1. Add an index writer which generates a 2.9.1 "test" index 2. Have that "test" index writer push that 2.9.1 "test" ind

RE: WilcardQuery and memory

2007-03-09 Thread Rob Staveley (Tom)
For indexing e-mail, I recommend that you tokenise the e-mail addresses into fragments and query on the fragments as whole terms rather than using wildcards. Rather than looking for fischauto333* in (say) smtp-from, look for fischauto333 in (say) an additional field called smtp-from-fragments to
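The fragment idea above can be sketched in plain Java (the splitting rule here is illustrative, not the poster's actual tokenizer): break an address on non-alphanumeric boundaries so each piece is indexed and queried as a whole term, avoiding the wildcard expansion entirely.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the "smtp-from-fragments" approach: each fragment
// becomes a whole indexed term, so no PrefixQuery/wildcard is needed.
public class EmailFragments {
    public static List<String> fragments(String address) {
        // Split on anything that is not a letter or digit.
        return Arrays.asList(address.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        // "fischauto333@example.com" yields three whole-term fragments.
        System.out.println(fragments("fischauto333@example.com"));
    }
}
```

A query for fischauto333 then matches the fragment as an exact term lookup rather than expanding a wildcard over the term dictionary.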

Re: All readers must have same maxDoc: 16651064!=16507074

2007-01-05 Thread Rob Staveley (Tom)
Oh I get it now. That was a great explanation. Thanks Doron. Doron Cohen wrote: Rob, "Rob Staveley (Tom)" <[EMAIL PROTECTED]> wrote on 05/01/2007 06:18:10: I'm attempting to delete documents matching a term on a ParallelReader and got the error message ab

All readers must have same maxDoc: 16651064!=16507074

2007-01-05 Thread Rob Staveley (Tom)
I'm attempting to delete documents matching a term on a ParallelReader and got the error message above, presumably while adding directories to the ParallelReader. I'm puzzled, because I don't need to have the same maxDoc (and numDoc) in index directories for a ParallelMultiSearcher, so what's the

RE: Merge Index Filling up Disk Space

2006-12-21 Thread Rob Staveley (Tom)
I've found that merging a 20G directory into another 20G directory on another disk required the target disk to have > 50G available during the merge. I ran out of space on my ~70G disk for the merge and had to do it on another system with ~170G available, but I'm not sure how much was used transien

RE: Merging "orphaned" segments into a composite index

2006-09-16 Thread Rob Staveley (Tom)
; segments into a composite index Rob Staveley (Tom) wrote: > It looks like my segments file only contains information for the .cfs > segments. So this approach doesn't work. I was wondering if I could > use > IndexWriter.addIndexes(IndexReader[]) instead. Can I open

RE: Merging "orphaned" segments into a composite index

2006-09-16 Thread Rob Staveley (Tom)
300G each and regenerating them is something I'd like to avoid. -Original Message- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 15 September 2006 18:18 To: java-user@lucene.apache.org Subject: Merging "orphaned" segments into a composite index I have had some b

Merging "orphaned" segments into a composite index

2006-09-15 Thread Rob Staveley (Tom)
I have had some badly behaved Lucene indexing software crash on me several times and have been left with an index directory with lots of non-composite files in, when all I ought to be getting is the compound .cfs files plus deletable and segments. Re-indexing everything doesn't bear think

RE: Indexing existing email archives

2006-08-15 Thread Rob Staveley (Tom)
ith Tropo. -----Original Message- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 15 August 2006 13:26 To: java-user@lucene.apache.org Subject: RE: Indexing existing email archives OK, we're well off topic now. If you have follow up on this can I recommend that you don't

RE: Indexing existing email archives

2006-08-15 Thread Rob Staveley (Tom)
le to index them as it is? The mail client is Thunderbird Mailbox. If I have to use third party software is there anything you can suggest? suba suresh. Rob Staveley (Tom) wrote: >> I have to have this working in next couple of days. > > I had a similar requirement and it took me sev

RE: 7GB index taking forever to return hits

2006-08-15 Thread Rob Staveley (Tom)
Sounds like you want to tokenise CONTENTS, if you are not already doing so. Then you could simply have: +CONTENTS:white +CONTENTS:hard +CONTENTS:hat -Original Message- From: Van Nguyen [mailto:[EMAIL PROTECTED] Sent: 15 August 2006 01:30 To: java-user@lucene.apache.org Subject: RE:

RE: Indexing existing email archives

2006-08-15 Thread Rob Staveley (Tom)
> I have to have this working in next couple of days. I had a similar requirement and it took me several weeks to get something working. I think you'll need to make use of as much off the shelf software as you can, given your timescale. If you don't want to get your hands dirty with Lucene and li

RE: Sorting

2006-08-02 Thread Rob Staveley (Tom)
> Scorers are by contract expected to score docs in docId order This was my missing link. Now it makes sense to me to use a buffered RandomAccessFile and not bother with the presort. Many thanks, Chris, that was very well explained. I'll have a crack at a lean-memory SortComparatorSource implem

RE: Sorting

2006-08-01 Thread Rob Staveley (Tom)
> file seeks instead of array lookups I'm with you now. So you do seeks in your comparator. For a large index you might as well use java.io.RandomAccessFile for the "array", because there would be little value in buffering when the comparator is liable to jump all around the file. This sounds ver
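The file-backed "array" being discussed can be sketched like this (field layout and names are illustrative assumptions, not the actual SortComparatorSource code): store one fixed-width value per document so the sort value for doc d lives at a computable offset, and fetch it with a seek instead of holding an int[maxDoc] on the heap.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

public class DiskSortValues {
    // Fixed-width entries: the sort value for doc d lives at offset 4*d,
    // so a comparator can seek for it rather than index into a heap array.
    public static int valueAt(RandomAccessFile raf, int docId) throws IOException {
        raf.seek(4L * docId);
        return raf.readInt();
    }

    // Write sort values for docs 0..2, then fetch doc 1's value by seeking.
    public static int demo() {
        try {
            File f = File.createTempFile("sortvals", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                for (int v : new int[]{42, 7, 99}) raf.writeInt(v);
                return valueAt(raf, 1);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // value for doc 1
    }
}
```

As the thread notes, buffering buys little here because a comparator driven by a sort jumps all over the file.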

RE: Sorting

2006-07-31 Thread Rob Staveley (Tom)
Ref 1: I was just about to show you a link at Sun but I realise that it was my misread! OK, so the maximum heap is 2G on a 32-bit Linux platform, which doubles the numbers, and yes indeed 64 bits seems like a good idea, if having sort indexes in RAM is a good use of resources. But there must be a b

Re: Sorting

2006-07-30 Thread Rob Staveley (Tom)
The limit is much less than Integer.MAX_VALUE (2,147,483,647), unless you have a VM which can run in more than 1G of heap. 1G limits you to a theoretical number of 256M (268,435,456) documents with 4 bytes per array element. In practice it will be something less, because there are other things wh
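The arithmetic in that estimate can be checked directly (the 1G heap and 4-bytes-per-entry figures are the poster's assumptions):

```java
public class SortHeapMath {
    // An in-RAM sort index costs bytesPerEntry per document, so the
    // theoretical document ceiling is simply heap size / entry size.
    public static long maxDocs(long heapBytes, int bytesPerEntry) {
        return heapBytes / bytesPerEntry;
    }

    public static void main(String[] args) {
        long oneGig = 1L << 30; // 1 GB = 1,073,741,824 bytes
        // 1 GB / 4 bytes per entry = 268,435,456 documents (256M)
        System.out.println(maxDocs(oneGig, 4));
    }
}
```

Doubling the heap to 2G on 32-bit Linux doubles the ceiling, as the follow-up message says.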

RE: Date ranges - getting the approach right

2006-07-20 Thread Rob Staveley (Tom)
Wow. Looking at the implementation of http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#open(org.apache.lucene.store.Directory) I've now realised that when you create an IndexReader (clue it is abstract), you actually instantiate a MultiReader, with an IndexReader for

RE: Date ranges - getting the approach right

2006-07-20 Thread Rob Staveley (Tom)
Sorry for the delayed response. It takes me a while to get my head around Lucene. I've got parallel indexes, which means that chronological ordering by doc ID would need to be a bit more sophisticated. It strikes me that there must be some performance advantage doing it though. I'll see if I can

RE: Date ranges - getting the approach right

2006-07-16 Thread Rob Staveley (Tom)
The second approach requires three hits, doesn't it? (1) TermQuery on start date + sort on document ID (2) TermQuery on end date + reverse sort on document ID (3) The actual query with a filter on the above Would that really be a saving? -Original Message- From: Erick Erickson [mailto:[E

RE: MissingStringLastComparatorSource and MultiSearcher

2006-07-15 Thread Rob Staveley (Tom)
--- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 15 July 2006 16:52 To: java-user@lucene.apache.org Subject: Re: MissingStringLastComparatorSource and MultiSearcher On 7/15/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote: > Incidentally, Yonik, is the logic for > org.apache.solr

RE: MissingStringLastComparatorSource and MultiSearcher

2006-07-15 Thread Rob Staveley (Tom)
t;, "1", result.doc(7).get("id")); } 8<---- -----Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 14 July 2006 21:59 To: java-user@lucene.apache.org Subject: Re: MissingStringLastComparatorSource and MultiSearcher On 7/14/06, Rob Staveley (Tom) <[EMAIL PROTECTE

RE: Date ranges - getting the approach right

2006-07-15 Thread Rob Staveley (Tom)
> It's not always faster ... it really depends on how many matching terms there are in your range. Does the cached RangeFilter's performance drop off relative to RangeQuery with a large number of matches then? > Whether you should cache all RangeFilters depends largely on how often you plan on r

RE: MissingStringLastComparatorSource and MultiSearcher

2006-07-15 Thread Rob Staveley (Tom)
m not sure if it wouldn't be better for it to be SortField.FLOAT in the above. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 14 July 2006 21:59 To: java-user@lucene.apache.org Subject: Re: MissingStringLastComparatorSource and MultiSearcher On 7/14/06, Rob

RE: MissingStringLastComparatorSource and MultiSearcher

2006-07-15 Thread Rob Staveley (Tom)
:-) You're right! It remains the case that INT and FLOAT equivalents of MissingStringLastComparatorSource would be useful for the reverse reverse (i.e. not reverse) case :-) -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 15 July 2006 00:24 To: java-user@lucene.ap

RE: MissingStringLastComparatorSource and MultiSearcher

2006-07-14 Thread Rob Staveley (Tom)
ltiSearcher On 7/14/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote: > Chris Hostetter and Yonik's MissingStringLastComparator looks like a > neat way to specify where to put null values when you want them to > appear at the end of reverse sorts rather than at the beginning, b

MissingStringLastComparatorSource and MultiSearcher

2006-07-14 Thread Rob Staveley (Tom)
Chris Hostetter and Yonik's MissingStringLastComparator looks like a neat way to specify where to put null values when you want them to appear at the end of reverse sorts rather than at the beginning, but I spotted the note... // Note: basing lastStringValue on the StringIndex won't work /

RE: What are norms?

2006-07-14 Thread Rob Staveley (Tom)
Got it. Thanks Yonik. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 14 July 2006 15:42 To: java-user@lucene.apache.org Subject: Re: What are norms? On 7/14/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote: > I'm trying to reduce the memory req

RE: What are norms?

2006-07-14 Thread Rob Staveley (Tom)
I'm trying to reduce the memory requirement of my application that has ~40 indexed fields. Would I be wasting my time omitting norms in this application? What would I lose by omitting norms? The ability to boost individual fields as they are added to the index? Anything else? [I want to check tha

Mixing compressed and uncompressed values

2006-07-14 Thread Rob Staveley (Tom)
Is this a bad idea? String synopsis = /* may be any length between 0 and 400 characters */ // Store, but don't index the synopsis // If the synopsis is > 150 characters, we should compress it Field field = new Field( "synopsis",synopsis

Date ranges - getting the approach right

2006-07-14 Thread Rob Staveley (Tom)
For the sake of date ranges, I'm storing dates as YYYYMMDD in my e-mail indexing application. My users typically want to limit their queries to ranges of dates, which include today. The application is indexing in real time. I gather I should prefer RangeQuery to ConstantScoreQuery+RangeFilter, b
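What makes fixed-width numeric date strings (YYYYMMDD-style) work for range queries is that zero-padded keys compare lexicographically in chronological order, so a plain string range selects a date range. A minimal check of that property (dates here are illustrative):

```java
public class DateKeys {
    // Zero-padded YYYYMMDD keys: lexicographic order == chronological order,
    // so a string range [from, to] selects exactly the dates in that range.
    public static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("20060714", "20060701", "20060720")); // inside
        System.out.println(inRange("20051231", "20060701", "20060720")); // before
    }
}
```

Without the fixed width and zero padding the trick breaks ("201" would sort before "21"), which is why the field is stored as digits of constant length.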

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
(PDDocument) method of the PDFTextStripper. I will try the other suggestion. suba suresh. Rob Staveley (Tom) wrote: > If you are using > http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#get > Text(o rg.pdfbox.pdmodel.PDDocument), you are going to get a large > String an

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap. If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdf

RE: Missing fields used for a sort

2006-07-11 Thread Rob Staveley (Tom)
I can't thank you enough, Yonik :-) -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 11 July 2006 18:05 To: java-user@lucene.apache.org Subject: Re: Missing fields used for a sort On 7/11/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote: > Thanks for

RE: Missing fields used for a sort

2006-07-11 Thread Rob Staveley (Tom)
Thanks for the info both of you. Of course Lucene obeys Murphy's law that the missing ones appear first when you reverse sort, which is what Murphy's law says you want to do. Does solr have a custom build of Lucene in it, or is the functionality required to get the missing ones to the

Missing fields used for a sort

2006-07-11 Thread Rob Staveley (Tom)
If I want to sort on a field that doesn't exist in all documents in my index, can I have a default value for documents which lack that field (e.g. MAXINT or 0)? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands,
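The default-value idea asked about above can be pictured with a plain comparator (a sketch, not Lucene's sort API): substitute a sentinel such as Integer.MAX_VALUE for documents lacking the field, so they land at the end of an ascending sort.

```java
import java.util.Arrays;
import java.util.Comparator;

public class MissingLast {
    // null stands in for a document that lacks the sort field; mapping it
    // to Integer.MAX_VALUE pushes such documents to the end of the sort.
    public static Integer[] sortMissingLast(Integer[] values) {
        Integer[] copy = values.clone();
        Arrays.sort(copy, Comparator.comparingInt(
                (Integer v) -> v == null ? Integer.MAX_VALUE : v));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                sortMissingLast(new Integer[]{3, null, 1})));
    }
}
```

Using 0 as the sentinel instead would put the missing documents first, which is the Murphy's-law behaviour complained about later in the thread.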

Compressed fields

2006-07-11 Thread Rob Staveley (Tom)
What's a sensible guideline for length of an un-indexed field and whether to store it compressed or not? I have a 300 character document synopsis, which I store. Would there be any saving having it compressed? Can you have an index with a stored un-indexed field which is sometimes compressed and s
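Whether compressing a ~300-character synopsis actually saves space can be measured directly with java.util.zip (a standalone check of the trade-off, independent of Lucene's compressed-field support; the sample text is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class SynopsisSize {
    // Deflate the text and report the compressed byte count, including
    // the deflate stream's fixed overhead (which is why tiny strings
    // can come out larger than they went in).
    public static int compressedSize(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        String synopsis = "the quick brown fox jumps over the lazy dog ".repeat(7);
        System.out.println(synopsis.length() + " chars -> "
                + compressedSize(synopsis) + " bytes deflated");
    }
}
```

Running this over a representative sample of real synopses is a quick way to decide a length threshold below which compression isn't worth the header overhead.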

RE: Managing a large archival (and constantly changing) database

2006-07-07 Thread Rob Staveley (Tom)
Aha, OK that makes sense. Likewise James Pine's explanation. Thanks both of you. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 07 July 2006 20:40 To: java-user@lucene.apache.org Subject: RE: Managing a large archival (and constantly changing) database : How ca

RE: Managing a large archival (and constantly changing) database

2006-07-07 Thread Rob Staveley (Tom)
I should probably direct this to Doug Cutting, but following that thread I come to Doug's post at http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12709.html . Doug says: > 1. On the index master, periodically checkpoint the index. Every minute or so the IndexWriter is closed and a

RE: IndexSearcher memory leak?

2006-07-06 Thread Rob Staveley (Tom)
sly assumed (wrongly) that RAMDirectory.close() would free up its memory buffers.. but i guess I needed to RTFC... RAMDirectory.close() is just an empty method. On 7/5/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote: > My two bits... > > If you consider (say) the 2n

RE: IndexSearcher memory leak?

2006-07-05 Thread Rob Staveley (Tom)
My two bits... If you consider (say) the 2nd pass of the loop... Searcher searcher = null; for(int i = 0; i < 5; ++i){ RAMDirectory ramdir = new RAMDirectory( db ); // <- Consider this moment searcher = n

RE: HTML text extraction

2006-06-21 Thread Rob Staveley (Tom)
I found that CyberNeko left style and script in the text and JTidy produced better output, but both of them use DOM and were therefore subject to OutOfMemory errors (JTidy being worse than CyberNeko). I've since then moved over to TagSoup, which I needed to customise to strip style and script (a simple

RE: indexing emails

2006-06-19 Thread Rob Staveley (Tom)
and pick up all emails in the thread. Mike -Original Message- From: Michael Wechner [mailto:[EMAIL PROTECTED] Sent: 19 June 2006 08:21 To: java-user@lucene.apache.org Subject: Re: indexing emails Rob Staveley (Tom) wrote: > Having spent a lot of time getting this wrong myself in an

RE: indexing emails

2006-06-17 Thread Rob Staveley (Tom)
Having spent a lot of time getting this wrong myself in an e-mail indexer(!), I urge you to consider whether in your query interface you will need to look for mail to "john*" rather than [EMAIL PROTECTED], because "john*" may have been addressed to [EMAIL PROTECTED] or [EMAIL PROTECTED] If you inde

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
The penny drops. Thank you so much for your time, Chris :-) -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 15 June 2006 18:43 To: java-user@lucene.apache.org Subject: RE: BooleanQuery.TooManyClauses on MultiSearcher : Incidentally, I'm getting BooleanQuery.Too

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
you could drop it into your : Lucene installation. I'm not entirely sure how well the : ConstantScoreQueries work with a MultiSearcher (mainly because I don't know : how well Filters work with MultiSearchers) but you could give it a try -- : it certainly won't have a TooManyC

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
and "neil" is relatively common and yet "fred" is getting the BooleanQuery.TooManyClauses and "neil" isn't. Does that make sense? Should the actual term used in a PrefixQuery affect the number of clauses? -Original Message- From: Rob Staveley (Tom) [ma

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
I'm still trying to get my head around ConstantScorePrefixQuery. Could I simply use this as a drop-in replacement for PrefixQuery? -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 15 June 2006 18:22 To: java-user@lucene.apache.org; eks dev Subject: Re: BooleanQuery

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
hers) but you could give it a try -- it certainly won't have a TooManyClauses problem. : : -Original Message- : From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] : Sent: 15 June 2006 14:51 : To: java-user@lucene.apache.org : Subject: BooleanQuery.TooManyClauses on MultiSearcher : : I've

RE: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
with BooleanQuery.TooManyClauses on my MultiSearcher, is there a smarter approach that I should be adopting? -Original Message- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 15 June 2006 14:51 To: java-user@lucene.apache.org Subject: BooleanQuery.TooManyClauses on MultiSearcher I've just a

BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Rob Staveley (Tom)
I've just added a 3rd index directory (i.e. 3rd IndexSearcher) to my MultiSearcher and I'm getting BooleanQuery.TooManyClauses errors on queries which were working happily on 2 indexes. Here's an example query, which hopefully you'll find self-explanatory from the XML structure. 8<
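What trips the clause limit is simply how many distinct indexed terms share the prefix: a PrefixQuery rewrites to one clause per matching term. Counting matches against a toy stand-in for the index's term dictionary (the terms below are illustrative) shows why one prefix can exceed the limit while another doesn't, and why adding a third index, with its additional distinct terms, can push a previously working query over the edge:

```java
import java.util.Arrays;
import java.util.List;

public class PrefixExpansion {
    // A PrefixQuery expands to one BooleanQuery clause per term that
    // starts with the prefix, so the clause count is just this count.
    public static long clauseCount(List<String> sortedTerms, String prefix) {
        return sortedTerms.stream().filter(t -> t.startsWith(prefix)).count();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "fred", "freda", "frederick", "fredrik", "neil", "neill");
        System.out.println(clauseCount(terms, "fred")); // expands to 4 clauses
        System.out.println(clauseCount(terms, "neil")); // expands to 2 clauses
    }
}
```

When the count exceeds BooleanQuery's maxClauseCount (1024 by default), the rewrite throws TooManyClauses; a filter-based constant-score approach avoids the expansion altogether.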

RE: Problems indexing large documents

2006-06-10 Thread Rob Staveley (Tom)
ce. maxFieldLength = -1 could perhaps denote what's needed?? -----Original Message- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 10 June 2006 07:22 To: java-user@lucene.apache.org Subject: RE: Problems indexing large documents I'm trying to come to terms with http://lucene.a

RE: Problems indexing large documents

2006-06-09 Thread Rob Staveley (Tom)
I'm trying to come to terms with http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int) too. I've been attempting to index large text files as single Lucene documents, passing them as java.io.Reader to cope with RAM. I was assuming (like - I suspect

RE: Compound / non-compound index files and SIGKILL

2006-06-09 Thread Rob Staveley (Tom)
I am no longer a Jira virgin. http://issues.apache.org/jira/browse/LUCENE-594 Thanks again. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 09 June 2006 07:13 To: java-user@lucene.apache.org Subject: RE: Compound / non-compound index files and SIGKILL : Whom sh

RE: Compound / non-compound index files and SIGKILL

2006-06-08 Thread Rob Staveley (Tom)
would have fulfilled its purpose when the call to the constructor returns. -Original Message----- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 08 June 2006 00:34 To: java-user@lucene.apache.org Subject: RE: Compound / non-compound index files and SIGKILL > I'm not sure what e

RE: Compound / non-compound index files and SIGKILL

2006-06-07 Thread Rob Staveley (Tom)
und / non-compound index files and SIGKILL : : : : If your content handlers should respond quickly then you should move : : indexing process to separate thread and maintain items in queue. : : : : Rob Staveley (Tom) wrote: : : > This is a real eye-opener, Volodymyr. Many thanks. I guess that means :

RE: Compound / non-compound index files and SIGKILL

2006-06-07 Thread Rob Staveley (Tom)
: To: java-user@lucene.apache.org : Subject: Re: Compound / non-compound index files and SIGKILL : : If your content handlers should respond quickly then you should move : indexing process to separate thread and maintain items in queue. : : Rob Staveley (Tom) wrote: : > This is a real eye-opener,

RE: PHP and Lucene integration

2006-06-06 Thread Rob Staveley (Tom)
For querying, we have PHP talking to our Java application through sockets and XML. Queries are set up in PHP, creating an XML document which corresponds to a subset of the subclasses of http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Query.html. If we'd had the PHP skill set at the

RE: Lucene in Action

2006-06-06 Thread Rob Staveley (Tom)
It is better value than the tee shirt http://www.cafepress.com/lucene/

RE: Avoiding java.lang.OutOfMemoryError in an unstored field

2006-06-06 Thread Rob Staveley (Tom)
06-06 at 10:43 +0100, Rob Staveley (Tom) wrote: > You are right there are going to be a lot of tokens. The entire body > of a text document is getting indexed in an unstored field, but I > don't see how I can flush a partially loaded field. Check these out: http://lucene.apache

RE: Compound / non-compound index files and SIGKILL

2006-06-06 Thread Rob Staveley (Tom)
If your content handlers should respond quickly then you should move indexing process to separate thread and maintain items in queue. Rob Staveley (Tom) wrote: > This is a real eye-opener, Volodymyr. Many thanks. I guess that means > that my orphan-producing hangs must be addDocument() c

RE: Avoiding java.lang.OutOfMemoryError in an unstored field

2006-06-06 Thread Rob Staveley (Tom)
java-user@lucene.apache.org Subject: RE: Avoiding java.lang.OutOfMemoryError in an unstored field On Tue, 2006-06-06 at 10:22 +0100, Rob Staveley (Tom) wrote: > > Thanks for the response, Karl. I am using FSDirectory. > -X:AggressiveHeap might reduce the number of times I get bitten by the

RE: Avoiding java.lang.OutOfMemoryError in an unstored field

2006-06-06 Thread Rob Staveley (Tom)
? -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: 06 June 2006 10:16 To: java-user@lucene.apache.org Subject: Re: Avoiding java.lang.OutOfMemoryError in an unstored field On Tue, 2006-06-06 at 10:11 +0100, Rob Staveley (Tom) wrote: > Sometimes I need to index large docume

RE: Avoiding java.lang.OutOfMemoryError in an unstored field

2006-06-06 Thread Rob Staveley (Tom)
ive in RAM, because that puts a limit on document size. -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: 06 June 2006 10:13 To: java-user@lucene.apache.org Subject: Re: Avoiding java.lang.OutOfMemoryError in an unstored field On Tue, 2006-06-06 at 10:11 +0100, Rob Sta

Avoiding java.lang.OutOfMemoryError in an unstored field

2006-06-06 Thread Rob Staveley (Tom)
Sometimes I need to index large documents. I've got just about as much heap as my application is allowed (-Xmx512m) and I'm using the unstored org.apache.lucene.document.Field constructed with a java.io.Reader, but I'm still suffering from java.lang.OutOfMemoryError when I index some large document
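A Reader-based field lets the consumer pull the text through a small fixed buffer instead of materialising one giant String, so peak memory is the buffer size rather than the document size. A minimal illustration of that pattern (not Lucene's internals):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ChunkedRead {
    // Stream a Reader through a fixed-size char buffer: only bufSize
    // characters are ever resident at once, however large the input is.
    public static long countChars(Reader reader, int bufSize) {
        char[] buf = new char[bufSize];
        long total = 0;
        try {
            int n;
            while ((n = reader.read(buf)) != -1) total += n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        String bigDoc = "x".repeat(1_000_000); // stands in for a large document
        System.out.println(countChars(new StringReader(bigDoc), 8192));
    }
}
```

The catch discussed later in the thread is that even with a Reader, the analyzer's token stream and the in-memory segment being built still grow with the indexed content, which is where maxFieldLength comes in.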

RE: Compound / non-compound index files and SIGKILL

2006-06-05 Thread Rob Staveley (Tom)
letable, .cfs; you can look up the name of the segment in the 'segments' file. Everything else is 'garbage' - you can delete it. Rob Staveley (Tom) wrote: > I've been indexing live data into a compound index from an MTA. I'm > resolving a bunch of problems unrelated to

RE: Compound / non-compound index files and SIGKILL

2006-06-05 Thread Rob Staveley (Tom)
GKILL I'm afraid shutdown hooks don't get to run (double-check that, I'm not certain). Otis - Original Message From: Rob Staveley (Tom) <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, June 5, 2006 6:17:11 AM Subject: Compound / non-compound index files

RE: Compound / non-compound index files and SIGKILL

2006-06-05 Thread Rob Staveley (Tom)
is built as compound and I no longer have "orphaned" index files. Hope this helps someone. Charles -----Original Message- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: Monday, June 05, 2006 6:17 AM To: java-user@lucene.apache.org Subject: Compound / non-compound inde

Compound / non-compound index files and SIGKILL

2006-06-05 Thread Rob Staveley (Tom)
I've been indexing live data into a compound index from an MTA. I'm resolving a bunch of problems unrelated to Lucene (disparate hangs in my content handlers). When I get a hang, I typically need to kill my daemon, alas more often than not using kill -9 (SIGKILL). However, these SIGKILLs are leavi

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
see via a file listing on the command line? Also, you may want to see if you have any stale locks or the like that is preventing you from doing an optimize. Rob Staveley (Tom) wrote: > Indexing 55648 documents in a new clean directory, I see only .cfs > files (+ deletable + segments).

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Indexing 55648 documents in a new clean directory, I see only .cfs files (+ deletable + segments). Disk usage is 65M for all of these, which means that each message takes ~1K of index space rather than > 10K as it does in my 99GB index. Bearing in mind that the large index has > 5 million Lucene

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> Note that IndexReader has a main() that will list the contents of compound index files. It looks like some of my index is compound and some isn't. My not very well informed guess is that an optimize() got interrupted somewhere along the line. If I try to optimize the index now, it throws except

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
I just tried to optimise my index, using the lucli command line client, and got: 8< lucli> optimize Starting to optimize index. java.io.IOException: Cannot overwrite: /mnt/sdb1/lucene-index/index-1/_2lhqi.fnm at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.j

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Interesting. I am explicitly turning on the compound file format when I start my application, but I am suspicious about my optimizing thread. It *ought* to be optimising every 30 minutes, using thread synchronisation to prevent the writer from trying to write while optimisation takes place, but it

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
That's a really good idea, but I've got a total of 38 fields only. It is true that some of them are empty, but that can't account for the bulk. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 26 May 2006 17:50 To: java-user@lucene.apache.org Subject: RE: Seeing wh

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> Is there anything I can learn from the index directory's file listing? Running this nasty little BASH one-liner... $ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print $1;}' | sort | uniq`;do ls -l *$i | awk '{SUM = SUM + $5} END {if (SUM > 1e10) {print "'$i': ", SUM}}'; done ... I see

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> I can't see how Luke is going to show me what's occupying most of my index. I do however notice that none of my stored fields are stored compressed. Presumably Field.Store.COMPRESS is something that is new in Lucene 1.9 and wasn't available in 1.4.3?? However, it is still hard to see what's c

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Luke is working nicely with an XWin32 demo server I just downloaded from StarNet, with a bit of SSH tunnelling :-) [I couldn't immediately figure out how to do it with Cygwin/X.] However, I can't see how Luke is going to show me what's occupying most of my index. -Original Message- From

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
The server is headless (i.e. no X-Windows). I've tried lucli, but that doesn't have Luke's whistles and bells. Does Luke have a non-GUI equivalent, Grant? -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: 26 May 2006 12:41 To: java-user@lucene.apache.org Subject: Re

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
In my index of e-mail message parts, it looks like 23K is being used up for each indexed message part, which is way more than I'd expect. I have a total of 37 fields per message part. I tokenize, index and do not store message part bodies. I store a <= 300 character synopsis of each message part.

Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
I am indexing e-mail in a compound index and for e-mail which is stored in ~60G (in Bzip2 compressed form), I have an index which is now 80G. Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? PS: I am a newbie to the mailing list - I hope I'