Optimized searching

2009-06-29 Thread m.harig
Hello all, I've gone through most of the posts on this forum. I need a code snippet for searching a large index. Currently I am iterating: hits = searcher.search(query); for (int inc = 0; inc < hits.length(); inc++) { Document doc = hits.doc(inc);
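
A common fix suggested on this list is to bound the result set with TopDocs instead of walking a Hits object document by document. A minimal sketch (Lucene 2.4-era API; the index path, field names, and page size are assumptions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class PagedSearch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/tmp/testindex/");
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("contents", "lucene"));

            // Ask for one page of hits instead of materializing all of them.
            int pageSize = 10;
            TopDocs topDocs = searcher.search(query, null, pageSize);
            System.out.println(topDocs.totalHits + " total hits");
            for (ScoreDoc sd : topDocs.scoreDocs) {
                Document doc = searcher.doc(sd.doc); // load one document at a time
                System.out.println(doc.get("title")); // "title" is an assumed field
            }
            searcher.close();
            reader.close();
        }
    }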

Re: Read large size index

2009-06-29 Thread m.harig
Thanks Simon. Example: IndexReader open = IndexReader.open("/tmp/testindex/"); IndexSearcher searcher = new IndexSearcher(open); final String fName = "test"; Is fName a field, like summary or contents? TopDocs topDocs = searcher.search(new TermQuery(new Term(fName, "lucene")),

Re: Doc-Doc Similarity Matrix Construction

2009-06-29 Thread Amir Hossein Jadidinejad
It's exactly my question: http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg04915.html

Re: Doc-Doc Similarity Matrix Construction

2009-06-29 Thread Mark Harwood
See MoreLikeThis in the contrib/queries folder. It speeds up similarity comparisons by taking only the most significant words from a document as search terms. On 29 Jun 2009, at 20:14, Amir Hossein Jadidinejad wrote: Hi, It's my first experiment with Lucene. Please help me.
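
For reference, a minimal MoreLikeThis sketch (contrib/queries, Lucene 2.4-era API; the index path, field name, doc id, and tuning values are assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;

    public class MoreLikeThisDemo {
        public static void main(String[] args) throws IOException {
            IndexReader reader = IndexReader.open("/tmp/testindex/");
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] { "contents" }); // assumed field name
            mlt.setMinTermFreq(2); // ignore terms that are rare within the source doc
            mlt.setMinDocFreq(5);  // ignore terms that are rare across the index

            int sourceDocId = 42;  // any existing doc id
            Query like = mlt.like(sourceDocId); // query built from the doc's top terms
            TopDocs similar = searcher.search(like, null, 10);
            System.out.println(similar.totalHits + " similar documents");
            searcher.close();
            reader.close();
        }
    }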

Doc-Doc Similarity Matrix Construction

2009-06-29 Thread Amir Hossein Jadidinejad
Hi, it's my first experiment with Lucene. Please help me. I'm going to index a set of documents and create a feature vector for each of them. This vector contains all terms belonging to the document, weighted using TF-IDF. After that I want to compute the cosine similarity between all documents and
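
One way to get such vectors out of Lucene is to store term vectors at index time and read them back per document. A rough sketch, assuming the "contents" field was indexed with Field.TermVector.YES and using a simple log-based IDF rather than Lucene's exact Similarity formula:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;

    public class CosineSketch {
        // Build a TF-IDF weighted vector for one document.
        static Map<String, Double> tfidfVector(IndexReader reader, int docId) throws IOException {
            Map<String, Double> vec = new HashMap<String, Double>();
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            if (tfv == null) return vec; // no term vector stored for this doc
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            int numDocs = reader.numDocs();
            for (int i = 0; i < terms.length; i++) {
                int df = reader.docFreq(new Term("contents", terms[i]));
                double idf = Math.log((double) numDocs / (df + 1)); // simple IDF variant
                vec.put(terms[i], freqs[i] * idf);
            }
            return vec;
        }

        // Cosine similarity between two sparse vectors.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
                normA += e.getValue() * e.getValue();
            }
            for (double w : b.values()) normB += w * w;
            return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }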

A simple Vector Space Model and TFIDF usage

2009-06-29 Thread Amir Hossein Jadidinejad
Hi, it's my first experiment with Lucene. Please help me. I'm going to index a set of documents and create a feature vector for each of them. This vector contains all terms belonging to the document, weighted using TF-IDF. After that I want to compute the cosine similarity between all documents and

Re: Lucene Term Encoder

2009-06-29 Thread Erick Erickson
You probably need to make sure you understand analyzers before you think about escaping/encoding. For instance, if you use StandardAnalyzer when indexing, the text "Las Vegas-Food Dining Place" would index the tokens las, vegas, food, dining, place; nary a hyphen to be seen. If you used StandardAnalyzer
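
To see this for yourself, you can run the analyzer by hand. A small sketch using the Lucene 2.4-era TokenStream API (later versions switch to the attribute-based API):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        public static void main(String[] args) throws IOException {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.tokenStream("contents",
                    new StringReader("Las Vegas-Food Dining Place"));
            Token token = new Token();
            while ((token = ts.next(token)) != null) {
                System.out.println(token.term()); // prints: las, vegas, food, dining, place
            }
            ts.close();
        }
    }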

Re: Lucene Term Encoder

2009-06-29 Thread John Seer
Hello Simon, I am looking for some class which will automatically take care of text and convert it into text which can be used in a query, the same way URLEncoder encodes a string for a URL. For example: Term: Las Vegas-Food AND Dining place After encoding term: Las Vegas(escapedDash)Food and Dining
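
QueryParser.escape() may be the class being asked for here: it backslash-escapes the query-syntax characters, though note it does not touch operator words such as AND. A small sketch ("contents" is an assumed field name):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class EscapeDemo {
        public static void main(String[] args) throws ParseException {
            String raw = "Las Vegas-Food AND Dining place";
            // Escapes characters like - ( ) : * ? but leaves AND/OR/NOT alone.
            String escaped = QueryParser.escape(raw);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query q = parser.parse(escaped);
            System.out.println(q);
        }
    }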

Re: Read large size index

2009-06-29 Thread Simon Willnauer
On Mon, Jun 29, 2009 at 6:36 PM, m.harig wrote: > Thanks Simon. Hey there, that makes things easier. :) OK, here are some questions: Do you iterate over all docs calling hits.doc(i)? If so, do you have to load all fields to render your results? If not, you should not retrieve all

Re: Read large size index

2009-06-29 Thread m.harig
Thanks Simon. Hey there, that makes things easier. :) OK, here are some questions: >>> Do you iterate over all docs calling hits.doc(i)? If so, do you have to load all fields to render your results? If not, you should not retrieve all of them? Yes, I am iterating over all docs by calling hits.doc

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
Hi, thanks for your answer; comments inline. On Mon, Jun 29, 2009 at 10:06 AM, eks dev wrote: > Depends on your architecture, will you partition your index? What is the max expected size of your index (you said 128G and growing..)? What do you mean by growing? You have in both options enough memory

Re: Read large size index

2009-06-29 Thread Simon Willnauer
Hey there, that makes things easier. :) OK, here are some questions: Do you iterate over all docs calling hits.doc(i)? If so, do you have to load all fields to render your results? If not, you should not retrieve all of them. You use IndexSearcher.search(Query q, ...) which returns a Hits object; have
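
A sketch of both suggestions combined: bound the hit count, and use a FieldSelector so only the fields you actually display get loaded per hit ("title" and "url" are assumed field names; Lucene 2.4-era API):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class RenderHits {
        // Fetch only the top 10 hits and, per hit, load only the displayed fields.
        static void render(IndexSearcher searcher, Query query) throws IOException {
            FieldSelector displayOnly = new MapFieldSelector(new String[] { "title", "url" });
            TopDocs topDocs = searcher.search(query, null, 10);
            for (ScoreDoc sd : topDocs.scoreDocs) {
                Document doc = searcher.doc(sd.doc, displayOnly); // skips other stored fields
                System.out.println(doc.get("title") + " -> " + doc.get("url"));
            }
        }
    }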

Re: Optimizing unordered queries

2009-06-29 Thread Nigel
On Mon, Jun 29, 2009 at 6:28 AM, Michael McCandless <luc...@mikemccandless.com> wrote: > On Sun, Jun 28, 2009 at 9:08 PM, Nigel wrote: > >> Unfortunately the TermInfos must still be hit to look up the freq/proxOffset in the postings files. > > But for that data you only have to hit the T

Re: Delete by docId in IndexWriter

2009-06-29 Thread Ganesh
This issue has been raised earlier. IndexWriter does not provide any functionality to delete by doc id. IndexReader does, but it requires write permission. In my case, IndexReader and IndexWriter will always be open; the IndexReader will be reopened frequently. To delete, I have no option other than to write a
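
The workaround usually suggested is to give every document an untokenized unique-id field and delete by Term through the IndexWriter. A minimal sketch (Lucene 2.4-era API; the "uid" field and index path are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class DeleteByUid {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/testindex/",
                    new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

            // At index time, give each document an untokenized unique-id field.
            Document doc = new Document();
            doc.add(new Field("uid", "doc-12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.commit();

            // Later: delete through the writer by Term, no IndexReader needed.
            writer.deleteDocuments(new Term("uid", "doc-12345"));
            writer.commit();
            writer.close();
        }
    }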

Re: Read large size index

2009-06-29 Thread m.harig
Thanks again. Did I index my files correctly? Please, I need some tips. The following is the error when I run my keyword. I typed pdf, that's it, because I've got around 30,000 files named pdf: HTTP Status 500 - type Exception report - description: The server encountered a

Re: Read large size index

2009-06-29 Thread Simon Willnauer
On Mon, Jun 29, 2009 at 3:07 PM, m.harig wrote: > Thanks Simon, this is how I am indexing my documents: indexWriter.addDocument(doc, new StopAnalyzer()); indexWriter.setMergeFactor(10); indexWriter.setMaxBufferedDocs(100);

Re: Read large size index

2009-06-29 Thread m.harig
Thanks Simon. This is how I am indexing my documents: indexWriter.addDocument(doc, new StopAnalyzer()); indexWriter.setMergeFactor(10); indexWriter.setMaxBufferedDocs(100); indexWriter.setMaxMergeDocs(Integer.MAX_VA
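
For comparison, a sketch that creates the writer once, applies the settings once, and reuses a single analyzer for all documents, rather than passing a new StopAnalyzer() per addDocument (Lucene 2.4-era API; the path and document source are assumptions):

    import java.util.List;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BuildIndex {
        // docs stands in for whatever produces your Documents.
        static void build(List<Document> docs) throws Exception {
            IndexWriter indexWriter = new IndexWriter("/tmp/testindex/",
                    new StopAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
            indexWriter.setMergeFactor(10);       // settings applied once, up front
            indexWriter.setMaxBufferedDocs(100);
            indexWriter.setMaxMergeDocs(Integer.MAX_VALUE);
            for (Document doc : docs) {
                indexWriter.addDocument(doc);     // reuses the writer's analyzer
            }
            indexWriter.optimize();               // optional: merge segments at the end
            indexWriter.close();
        }
    }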

Re: Read large size index

2009-06-29 Thread Simon Willnauer
On Mon, Jun 29, 2009 at 2:55 PM, m.harig wrote: > Thanks Simon. I don't run any application on the Tomcat; moreover, I restarted it. I am not doing any jobs except searching. We've a 500GB drive; we've indexed around 100,000 documents, which gives me around a 1GB index. When I tr

Re: Read large size index

2009-06-29 Thread m.harig
Thanks Simon. I don't run any application on the Tomcat; moreover, I restarted it. I am not doing any jobs except searching. We've a 500GB drive; we've indexed around 100,000 documents, which gives me around a 1GB index. When I tried to search pdf I got the heap space error.

Re: Lucene Term Encoder

2009-06-29 Thread Simon Willnauer
Hi John, what do you mean by encoding? If you can be more clear about what you are looking for, you might get help more easily. simon On Sat, Jun 27, 2009 at 12:27 AM, John Seer wrote: > Hello, is there any class in Lucene which will do encoding for a term? > Thanks

Re: Read large size index

2009-06-29 Thread Simon Willnauer
Well, with this information I can hardly tell what the cause of the OOM is. It would be really, really helpful if you could figure out where it happens. Do you get the OOM on the first try? I guess you do not do any indexing in the background?! What is your index "layout", I mean, what kind of fields

Re: Read large size index

2009-06-29 Thread m.harig
Simon Willnauer wrote: > > On Mon, Jun 29, 2009 at 1:48 PM, m.harig wrote: >> >> >> >> Simon Willnauer wrote: >>> >>> Hey there, >>> before going out to use hadoop (hadoop mailing list would help you >>> better I guess) you could provide more information about you >>> situation. For instance: >

Re: Read large size index

2009-06-29 Thread Simon Willnauer
On Mon, Jun 29, 2009 at 1:48 PM, m.harig wrote: > Simon Willnauer wrote: >> Hey there, before going out to use hadoop (the hadoop mailing list would help you better, I guess) you could provide more information about your situation. For instance: - how big is your index - version of

Re: Read large size index

2009-06-29 Thread m.harig
Simon Willnauer wrote: > > Hey there, > before going out to use hadoop (hadoop mailing list would help you > better I guess) you could provide more information about you > situation. For instance: > - how big is you index > - version of lucene > - which java vm > - how much heap space > - where

Re: Read large size index

2009-06-29 Thread Simon Willnauer
Hey there, before going out to use hadoop (the hadoop mailing list would help you better, I guess) you could provide more information about your situation. For instance: - how big is your index - version of lucene - which java vm - how much heap space - where does the OOM occur, or maybe there is already

Re: special Type of indexing

2009-06-29 Thread Simon Willnauer
Hey there, currently Lucene offers several possibilities to do what you want. 1. Use payloads: you can encode arbitrary values per term you index and affect scoring by overriding Similarity#scorePayload. 2. Use IndexReader#termPositions(term): inside a query you can access the actual term position du
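
A minimal sketch of the indexing side of option 1 (Lucene 2.4-era token API; the one-byte boost payload is made up). At search time a payload-aware query such as BoostingTermQuery, together with an overridden Similarity#scorePayload, can read these bytes back:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // A filter that attaches a one-byte payload to every token it passes through.
    public class BoostPayloadFilter extends TokenFilter {
        private final byte boost;

        public BoostPayloadFilter(TokenStream input, byte boost) {
            super(input);
            this.boost = boost;
        }

        public Token next(Token reusableToken) throws IOException {
            Token token = input.next(reusableToken);
            if (token != null) {
                token.setPayload(new Payload(new byte[] { boost }));
            }
            return token;
        }
    }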

Read large size index

2009-06-29 Thread m.harig
Hello all, I'm doing a search application on Lucene. It's working fine when my index size is small; I'm getting a java heap space error when I'm using a large index. I came to know about hadoop with Lucene to solve this problem, but I don't have any idea about hadoop. I've searched through the

Re: Optimizing unordered queries

2009-06-29 Thread Michael McCandless
On Sun, Jun 28, 2009 at 9:08 PM, Nigel wrote: >> Unfortunately the TermInfos must still be hit to look up the freq/proxOffset in the postings files. > But for that data you only have to hit the TermInfos for the terms you're searching, correct? So, assuming that there are vastly more terms
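
If the in-RAM TermInfos index itself is the concern, a hedged sketch of the index-divisor knob (available on IndexReader in recent 2.x releases; check your version, and set it before the first term lookup since the index loads lazily):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class TermDivisorDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/tmp/testindex/"); // hypothetical path
            // 1 is the default (every indexed term in RAM); larger values load
            // only every Nth entry, trading term-lookup speed for memory.
            reader.setTermInfosIndexDivisor(4);
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... search as usual ...
            searcher.close();
            reader.close();
        }
    }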

Re: MultiSegmentReader problems - current is null

2009-06-29 Thread Simon Willnauer
Quick question: which version of lucene do you use?! simon On Mon, Jun 29, 2009 at 9:55 AM, liat oren wrote: > The full error is: Exception in thread "main" java.lang.NullPointerException at Priorart.Lucene.Expert.index.MultiSegmentReader$MultiTermDocs.freq(MultiSegmentReader.java

Re: Scaling out/up or a mix

2009-06-29 Thread Toke Eskildsen
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote: > We currently have about 90M documents and it is increasing rapidly, so getting into the G+ document range is not going to be too far away. We've performed fairly extensive tests regarding hardware for searches and some minor tests on hardwa

Re: Scaling out/up or a mix

2009-06-29 Thread eks dev
Depends on your architecture: will you partition your index? What is the max expected size of your index (you said 128G and growing..)? What do you mean by growing? You have in both options enough memory to load it into RAM... I would definitely try to have fewer machines and a lot of memory, so that
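
If the index truly fits in RAM, one concrete option is to copy it into a RAMDirectory at startup. A sketch (the path is hypothetical, and the JVM heap must be comfortably larger than the index):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamIndex {
        public static void main(String[] args) throws Exception {
            Directory onDisk = FSDirectory.getDirectory("/index/shard1");
            Directory inRam = new RAMDirectory(onDisk); // copies the whole index into heap
            IndexSearcher searcher = new IndexSearcher(inRam);
            // ... search against the RAM copy; reopen/recopy to pick up index changes ...
            searcher.close();
        }
    }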

Re: MultiSegmentReader problems - current is null

2009-06-29 Thread liat oren
The full error is: Exception in thread "main" java.lang.NullPointerException at Priorart.Lucene.Expert.index.MultiSegmentReader$MultiTermDocs.freq(MultiSegmentReader.java:709) I looked at issue LUCENE-781; it might relate to this one? Tho

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
Thanks for the answer. Don't you think that part 1 of the email would give you a hint of the nature of the index? Index size (and growing): 16G x 8 = 128G. Doc size (data): 20k. Num docs: 90M. Num users: a few hundred, but most critical is the admin staff, which is using the index all day long. Query typ

highlighting the result within a file

2009-06-29 Thread Ritu choudhary
Hi all, I am trying a program that highlights a searched term and writes the result into a demo.html file. As of now this demo.html can show only a few pages of the book. Is there any way I can use it to show the whole book? (Can increasing the fragment size up to the file size help?) I have a
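
One way to highlight the whole book rather than a few fragments is NullFragmenter from contrib/highlighter, which treats the entire text as a single fragment; the default analysis cap must also be raised. A sketch (Lucene 2.4-era API; the field name and query are assumptions):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.NullFragmenter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    public class WholeBookHighlight {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("contents", analyzer).parse("lucene");

            Highlighter highlighter = new Highlighter(
                    new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
            highlighter.setTextFragmenter(new NullFragmenter());    // whole text = one fragment
            highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE); // default stops at ~50KB

            String bookText = "...";  // the full book text, loaded elsewhere
            TokenStream tokens = analyzer.tokenStream("contents", new StringReader(bookText));
            String html = highlighter.getBestFragment(tokens, bookText);
            // write html into demo.html
        }
    }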

Re: Delete by docId in IndexWriter

2009-06-29 Thread Shay Banon
Agreed, that's the tricky part. I will open a Jira issue. Really hoping to get some time and maybe also provide a patch... Thanks, Shay Jason Rutherglen-2 wrote: > This requires tracking the genealogy of docids as they are merged inside IndexWriter. It's doable, so if you're particularly in