Re: Can you create a RAM index from a file index

2009-03-24 Thread Anshum
Hi Ganesh, What you are talking about is loading a partial index (as per requirement) into RAM. This is exactly what any other decently designed application would do. On the other hand, the RAMDirectory implementation just copies all of the index into RAM. Also, tmpfs is nothing but an explicit copy o

Re: Can you create a RAM index from a file index

2009-03-24 Thread Ganesh
The file system index reader loads the data into RAM. I have tried with more than 6 GB of index (sharded into 20 indexes) and the response is pretty fast. What significant gain would there be in using RAMDirectory? How will modifications made in the RAMDirectory sync with the file system? Regards Ganesh - Ori

Re: Can you create a RAM index from a file index

2009-03-24 Thread Otis Gospodnetic
That's indeed an alternative. Moreover, I have heard (not measured/compared myself) from people who tried both the MM and tmpfs approaches that the former has some overhead. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Anshum > To: java

Re: Index Partitioning

2009-03-24 Thread Chris Hostetter
: This is perfect, exactly what I was looking for. Thanks much Andrzej! if you code that up and it works out well, contributing your code as a Jira attachment could help it become a re-usable tool for others in the future. (a simple command line that takes the directory of the index, a value

Re: MergePolicy public but SegmentInfos package protected?

2009-03-24 Thread Chris Hostetter
: I'd rather not make SegmentInfos public; it's a large API and we do : make changes to it as we change the index format. It's also quite : internal to Lucene. : : Making your own MergePolicy/Scheduler is very much an "advanced" use : case... so I think it's acceptable to have to put it into o.a

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Michael McCandless
It looks like you are reusing a Field (the f.setValue(...) calls); are you sure you're not changing a Document/Field while another thread is adding it to the index? If you can post the full code, then I can try to run it on my wikipedia dump locally. Mike Jason Rutherglen wrote: > Mike, > > It

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
Mike, It only happens when at least 1 million documents are indexed in a multithreaded fashion. Maybe I should post the code? I will try indexing without the payload field, I assume it won't fail because I indexed wikipedia before with no issues. Thanks! Jason On Tue, Mar 24, 2009 at 12:25 PM

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
Using StandardAnalyzer. It's probably the payload field? This is the code that creates the payload field: private static class SinglePayloadTokenStream extends TokenStream { private Token token = new Token(UID_TERM.text(), 0, 0); private byte[] buffer = new byte[4];

Re: Memory Leak?

2009-03-24 Thread Paul Smith
No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap behaves perfectly. I completely agree with you, the thread count goes haywire the moment I call the HTMLParser.getTitle(). I have seen a thread count of like 600 before I hit OOME (with the getTitle() call on) and

Re: Corrupt index (IndexOutOfBoundsException)

2009-03-24 Thread René Zöpnek
Thank you for your help Michael. I've solved the problem by recreating the index. The OutOfErrorException killed the thread that was responsible for index maintenance, so the index recreation failed without an error message. After recreating the index, the problem is solved. Sorry for

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Michael McCandless
I was just able to index all of wikipedia, using StandardAnalyzer, with assertions enabled, without hitting that exception. Which analyzer are you using (besides your payload field)? Mike Michael McCandless wrote: > H. > > Jason is this easily/compactly repeated?  EG, try to index the N doc

Re: MergePolicy public but SegmentInfos package protected?

2009-03-24 Thread Michael McCandless
I'd rather not make SegmentInfos public; it's a large API and we do make changes to it as we change the index format. It's also quite internal to Lucene. Making your own MergePolicy/Scheduler is very much an "advanced" use case... so I think it's acceptable to have to put it into o.a.l.index pack

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Michael McCandless
H. Jason is this easily/compactly repeated? EG, try to index the N docs before that one. If you remove the SinglePayloadTokenStream field, does the exception still happen? Mike Jason Rutherglen wrote: > While indexing using > contrib/org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker

MergePolicy public but SegmentInfos package protected?

2009-03-24 Thread Jason Rutherglen
I'm overriding MergePolicy which is public, however SegmentInfos is package protected which means the MergePolicy subclass must be in the org.apache.lucene.index package. Can we make SegmentInfos public?

Re: Memory Leak?

2009-03-24 Thread Michael McCandless
Actually, I was hoping you could try leaving the getHTML calls in, but increase the heap size of your Tomcat instance. Ie, to be sure there really is a leak vs you're just not giving the JRE enough memory. I do like your hypothesis, but looking at HTMLParser it seems like the thread should exit a

Re: Memory Leak?

2009-03-24 Thread Chetan Shah
Highly appreciate your replies Michael. No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap behaves perfectly. I completely agree with you, the thread count goes haywire the moment I call the HTMLParser.getTitle(). I have seen a thread count of like 600 before I hit OOME

Re: Memory Leak?

2009-03-24 Thread Michael McCandless
Odd. I don't know of any memory leaks w/ the demo HTMLParser, hmm though it's doing some fairly scary stuff in its getReader() method. EG it spawns a new thread every time you run it. And, it's parsing the entire HTML document even though you only want the title. You may want to switch to better

Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
While indexing using contrib/org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker. The assertion error is from TermsHashPerField.comparePostings(RawPostingList p1, RawPostingList p2). A Payload is added to the document representing a UID. Only 1-2 out of 1 million documents indexed generates th

Re: Memory Leak?

2009-03-24 Thread Chetan Shah
After some more research I discovered that the following code snippet seems to be the culprit. I have to call this to get the "title" of the indexed HTML page. And this is called 10 times, as I display 10 results on a page. Any suggestions on how to achieve this without the OOME issue?

Streaming results of analysis to shards ... possible?

2009-03-24 Thread Cass Costello
Hello all, Our application involves a high index write rate - anywhere from a few dozen to many thousands of docs per sec. The write rate is frequently higher than the read rate (though not always), and our index must be as fresh as possible (we'd like search results to be no more than a couple o

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
Seid Mohammed wrote: ok, but I need to know how to proceed with it. I mean how to include to my application many thanks Seid M You may want to look at the following articles: http://lucene.jugem.jp/?eid=133 http://lucene.jugem.jp/?eid=134 articles are in Japanese, but ignore them. :) Pro

question about grouping text

2009-03-24 Thread MFM
I have been able to successfully index and search text from structured documents like PDF and MS Word. I am having a real hard time trying to figure out how to group the index strings together e.g. if my document had a question and answer in a table, the search will produce the text with the quest

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-24 Thread Karl Wettin
There is even an old thread about this on the Mahout-users list: http://markmail.org/message/ludu5hjfczuvgk3n On 17 Mar 2009, at 15:17, Grant Ingersoll wrote: Have a look at the Lucene sister project: Mahout: http://lucene.apache.org/mahout . In there is the Taste collaborative filtering project

Re: Term level boosting

2009-03-24 Thread Seid Mohammed
ok, but I need to know how to proceed with it. I mean how to include to my application many thanks Seid M On 3/24/09, Koji Sekiguchi wrote: > Seid Mohammed wrote: >> Hi All >> I want my lucene to index documents and making some terms to have more >> boost value. >> so, if I index the document "

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
Seid Mohammed wrote: Hi All I want my lucene to index documents and making some terms to have more boost value. so, if I index the document "The quick fox jumps over the lazy dog" and I want the term fox and dog to have greater boost value. How can I do that Thanks a lot seid M How about

Term level boosting

2009-03-24 Thread Seid Mohammed
Hi All I want my lucene to index documents and making some terms to have more boost value. so, if I index the document "The quick fox jumps over the lazy dog" and I want the term fox and dog to have greater boost value. How can I do that Thanks a lot seid M -- "RABI ZIDNI ILMA" ---

Re: Corrupt index (IndexOutOfBoundsException)

2009-03-24 Thread Michael McCandless
When I run checkIndex on your index, I see a new exception: org.apache.lucene.index.CorruptIndexException: Incompatible format version: 119865344 expected 1 or lower at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:116) at org.apache.lucene.index.SegmentReader.initialize

Re: How to know the matched field?

2009-03-24 Thread Paul Libbrecht
Here's my first approach but I note that, typically, I have fields (which are not stored) which may be the matching field but still not be the one I want to return. Typically, I have a field "names in all languages along the standard-analyzer" which is not the one I want to "see as matched".

Re: Corrupt index (IndexOutOfBoundsException)

2009-03-24 Thread Michael McCandless
Instead of ignoring the exceptions in your finally clause, can you log them? It could be something interesting is happening in there... I'll have a look at the index. Mike "René Zöpnek" wrote: > Thanks for your answer, Mike. > > Unfortunately I have no direct access to the server with the corr
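
A minimal sketch of the suggestion above (log what happens in the finally clause rather than dropping it), using only the JDK. The class `LoggedClose`, its `closeAndLog` helper, and the `failures` list are hypothetical names introduced for illustration; they are not taken from René's code, which is not shown in full here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: log exceptions raised during cleanup instead of silently dropping them. */
final class LoggedClose {
    private static final Logger LOG = Logger.getLogger(LoggedClose.class.getName());

    /** Cleanup failures seen so far, kept so callers can inspect them (hypothetical helper state). */
    static final List<String> failures = new ArrayList<>();

    /** Closes the resource; an exception is logged and recorded rather than ignored. */
    static void closeAndLog(String name, Closeable resource) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (IOException | RuntimeException e) {
            LOG.log(Level.WARNING, "Failed to close " + name, e);
            failures.add(name + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Stand-in for an index writer whose close() fails.
        Closeable broken = () -> { throw new IOException("disk full"); };
        try {
            // ... index maintenance work would go here ...
        } finally {
            closeAndLog("writer", broken);
        }
        System.out.println(failures);
    }
}
```

This only covers exceptions thrown by `close()`; an `Error` such as an out-of-memory condition would still propagate and would need to be logged at the top of the maintenance thread.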

Re: Can you create a RAM index from a file index

2009-03-24 Thread Anshum
Hi Paul, Going by what you've conveyed here, I'd assume that you have more than some data. You could either go ahead with Ian's way, which is the suggested one (as far as the Lucene implementation is concerned), but it'd not be possible if your index is greater than 2 Gigs and you are not running the 6

Re: Can you create a RAM index from a file index

2009-03-24 Thread Paul Taylor
Ian Lea wrote: Hi You can load an existing index into a RAMDirectory using one of the constructors that takes an existing index. I believe that a RAM index will be the same size as a file based index. Of course I was looking at IndexSearcher but the constructor is for RAMDirectory MMapDir

Re: Can you create a RAM index from a file index

2009-03-24 Thread Ian Lea
Hi You can load an existing index into a RAMDirectory using one of the constructors that takes an existing index. I believe that a RAM index will be the same size as a file based index. MMapDirectory is another possibility. -- Ian. On Tue, Mar 24, 2009 at 8:42 AM, Paul Taylor wrote: > Hi
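
The two options named above differ in where the bytes live: a RAM copy holds the whole index on the Java heap, while memory mapping leaves paging to the operating system. The sketch below illustrates that difference with plain JDK file APIs only; it is an analogy, not Lucene code, and the class `HeapCopyVsMmap` and its methods are hypothetical names. It assumes a small file; a single `FileChannel.map` call cannot cover more than `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Analogy only: heap copy (RAM-style) versus memory mapping (mmap-style) of one file. */
final class HeapCopyVsMmap {

    /** Reads the whole file into a heap array, so heap use grows with file size. */
    static byte[] loadIntoHeap(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sums the unsigned bytes of an in-heap copy. */
    static long checksum(byte[] data) {
        long sum = 0;
        for (byte b : data) {
            sum += b & 0xFF;
        }
        return sum;
    }

    /** Maps the file read-only and sums its bytes through the mapping, without a heap copy. */
    static long checksumViaMmap(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a tiny temp file and reports whether both access paths see the same bytes. */
    static boolean demo() {
        try {
            Path tmp = Files.createTempFile("segment", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[] {1, 2, 3, (byte) 250});
            return checksum(loadIntoHeap(tmp)) == checksumViaMmap(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same bytes via both paths: " + demo());
    }
}
```

Which approach is faster for a given index is something to measure rather than assume, as the replies in this thread suggest.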

Can you create a RAM index from a file index

2009-03-24 Thread Paul Taylor
Hi, I've built some file-based indexes based on data in a database, and it took quite some time. I am interested in trying to use RAM-based indexes instead of file-based indexes to compare search performance, but it's going to take some time to rebuild the index from the original database, isn't it

Re: Scores between words. Boosting?

2009-03-24 Thread Grant Ingersoll
Do you have any info that helps you narrow down how many to choose, like some type of ranking of the synonyms? I guess I would start smaller, say maybe 3, and then evaluate your results with different numbers. On Mar 22, 2009, at 2:40 PM, liat oren wrote: Ok, thanks. I will look how to u

Corrupt index (IndexOutOfBoundsException)

2009-03-24 Thread René Zöpnek
Thanks for your answer, Mike. Unfortunately I have no direct access to the server with the corrupt index. So changing the creation process of the index is not possible. I've uploaded the index to http://drop.io/hlu53sl (9 MB). Here is the code for creating the index: public static void crea