Re: Distributed Search

2006-07-27 Thread Jeff Rodenburg
Hi Mark - Having gone down this path for the past year, I echo comments from others that scalability/availability/failover is a lot of work. We migrated away from a custom system based on Lucene running on Windows to Solr running on Linux. It took us 6 months to get our system to a solid five-n

Re: How to get TermFreq only in some query results

2006-07-27 Thread Jia Mi
Thank you, Grant, that really helps me :P On 7/27/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote: You could store Term Vectors for your documents, and then look up the individual document vectors based on the query results. If you need help w/ Term Vectors, check out Lucene in Action, search this li

Re: Consult some information about adding index while searching

2006-07-27 Thread hu andy
Yes, I have closed IndexWriter. But it doesn't work. 2006/7/27, Michael McCandless <[EMAIL PROTECTED]>: > I met this problem: when searching, I add documents to the index. Although I > instantiate a new IndexSearcher, I can't retrieve the newly added > documents. I have to close the program an

Re: Distributed Search

2006-07-27 Thread Yonik Seeley
On 7/27/06, Mark Miller <[EMAIL PROTECTED]> wrote: I thought I read that solr requires an OS that supports hard links and thought that Windows only supports soft links. For the default index distribution method from master to searcher, yes, hard-links are currently needed. The distribution mec

Re: Distributed Search

2006-07-27 Thread Mark Miller
Otis Gospodnetic wrote: I think we have an RMI example in Lucene in Action. You could also look at how Nutch does it. I think the code is in org.apache.nutch.ipc package. I'm not sure why the cross-platform requirement rules out Solr; I would think it would be exactly the opposite. As for 10m limit,

Re: Indexing large sets of documents?

2006-07-27 Thread Otis Gospodnetic
Rossini, I think what you read might have been that searching a Lucene index that lives in HDFS would be slow. As far as I understand things, the thing to do is to copy the index to a local disk, out of HDFS, and then search it with Lucene from there. Otis - Original Mes

Re: Distributed Search

2006-07-27 Thread Otis Gospodnetic
I think we have an RMI example in Lucene in Action. You could also look at how Nutch does it. I think the code is in org.apache.nutch.ipc package. I'm not sure why the cross-platform requirement rules out Solr; I would think it would be exactly the opposite. As for 10m limit, it depends. It depends on

Re: Output of index

2006-07-27 Thread Otis Gospodnetic
I think: - Get the number of documents from IndexReader. - Go from 0 to that number. - If reader.isDeleted(docId) == false, get the doc and output the doc fields' content. Otis - Original Message From: MALCOLM CLARK <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, July 27, 200
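A minimal sketch of the loop Otis outlines, using Lucene 1.9/2.0-era APIs. The class and the `csvEscape` helper are my own names for illustration; only stored fields come back from `reader.document()`:

```java
import java.io.PrintWriter;
import java.util.Enumeration;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;

public class IndexToCsv {
    // Walk every live document and print its stored fields as one CSV line each.
    public static void dump(IndexReader reader, PrintWriter out) throws Exception {
        int max = reader.maxDoc();
        for (int docId = 0; docId < max; docId++) {
            if (reader.isDeleted(docId)) continue;   // skip deleted docs
            Document doc = reader.document(docId);   // only stored fields are returned
            StringBuffer line = new StringBuffer();
            Enumeration fields = doc.fields();
            while (fields.hasMoreElements()) {
                Field f = (Field) fields.nextElement();
                if (line.length() > 0) line.append(',');
                line.append(csvEscape(f.stringValue()));
            }
            out.println(line.toString());
        }
    }

    // Quote a value for CSV, doubling any embedded quotes.
    static String csvEscape(String s) {
        return "\"" + s.replaceAll("\"", "\"\"") + "\"";
    }
}
```

For Malcolm's several-thousand-out-of-millions case, filtering on an id field inside the loop (or driving the loop from a Hits object instead) avoids touching every document's stored fields.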

Distributed Search

2006-07-27 Thread Mark Miller
I know there has been a lot of discussion on distributed search...I am looking for a cross platform solution, which seems to kill solr's approach...Everyone seems to have implemented this, but only as proprietary code...it would seem that just using the RMI searcher would allow a simple solutio
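The RMI searcher Mark mentions ships with Lucene of this era as RemoteSearchable; a hedged sketch of one server and a client that federates several of them (registry port, binding name, and the `rmiUrl` helper are invented for illustration):

```java
import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

public class RmiSearch {
    // Server side: export a local index over RMI.
    public static void serve(String indexDir, int port, String name) throws Exception {
        LocateRegistry.createRegistry(port);
        Searchable local = new IndexSearcher(indexDir);
        RemoteSearchable remote = new RemoteSearchable(local);
        Naming.rebind(rmiUrl("localhost:" + port, name), remote);
    }

    // Client side: combine several remote indexes into one logical searcher.
    public static MultiSearcher connect(String[] urls) throws Exception {
        Searchable[] searchables = new Searchable[urls.length];
        for (int i = 0; i < urls.length; i++) {
            searchables[i] = (Searchable) Naming.lookup(urls[i]);
        }
        return new MultiSearcher(searchables);
    }

    // Build an RMI lookup URL like "//host:port/name".
    static String rmiUrl(String hostPort, String name) {
        return "//" + hostPort + "/" + name;
    }
}
```

Since RMI is pure Java, this side-steps the hard-link requirement that makes Solr's index distribution awkward on Windows.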

Output of index

2006-07-27 Thread MALCOLM CLARK
Hi, I'm going to attempt to output several thousand documents from a 3+ million document collection into a csv file. What is the most efficient method of retrieving all the text from the fields of each document one by one? Please help! Thanks, Malcolm

Re: Indexing large sets of documents?

2006-07-27 Thread Rafael Rossini
Otis, You mentioned the hadoop project. I checked it out not long ago and I read something about it not supporting the Lucene index. Is it possible to index and then search in HDFS? []s Rossini On 7/27/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Michael, Certainly paralleli

Re: Indexing large sets of documents?

2006-07-27 Thread Otis Gospodnetic
Michael, Certainly parallelizing on a set of servers would work (hmm... hadoop?), but if you want to do this on a single machine you should tune some of the IndexWriter params. You didn't mention them, so I assume you didn't tune anything yet. If you have Lucene in Action, check out 2.7.1
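The IndexWriter knobs Otis alludes to look roughly like this in Lucene 1.9/2.0; the values below are illustrative starting points, not recommendations:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
    public static IndexWriter open(String dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.setMergeFactor(50);         // merge segments less often
        writer.setMaxBufferedDocs(1000);   // buffer more docs in RAM before flushing
        writer.setMaxFieldLength(Integer.MAX_VALUE); // don't silently truncate long bodies
        return writer;
    }
}
```

Higher values trade RAM and open file handles for fewer disk flushes, which is usually where per-document indexing time goes.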

RE: Indexing large sets of documents?

2006-07-27 Thread Dejan Nenov
Yes - parallelizing works great - we built a shared-nothing JavaSpaces-based system at X1 and on an 11-way cluster we were able to index 350 office documents per second - this included the binary-to-text conversion, using Stellent INSO libraries. The trick is to create separate indexes and, if you do no

Re: Indexing large sets of documents?

2006-07-27 Thread MALCOLM CLARK
Is this the W3 Ent collection you are indexing? MC

Indexing large sets of documents?

2006-07-27 Thread Michael J. Prichard
I built an indexer that runs through email and its attachments, rips out content and what not and then creates a Document and adds it to an index. It works w/ no problem. The issue is that it takes around 3-5 seconds per email and I have seen up to 10-15 seconds for email w/ attachments. I n

Scoring a document (count?)

2006-07-27 Thread Russell M. Allen
I am curious about the potential use of document scoring as a means to extract additional data from an index. Specifically, I would like the score to be a count of how many times a particular field matched a set of terms. For example, I am indexing movie-stars (Each document is a movie-star). A

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-27 Thread Paulo Silveira
Ok, I just tested it. So consider: String string = "word -foo"; String[] fields = { "title", "body" }; For the MultiFieldQueryParser I have: MultiFieldQueryParser qp = new MultiFieldQueryParser(fields, SearchEngine.ANALYZER); Query fieldsQuery = qp.parse(string); System.out.

Re: How to get TermFreq only in some query results

2006-07-27 Thread Grant Ingersoll
You could store Term Vectors for your documents, and then look up the individual document vectors based on the query results. If you need help w/ Term Vectors, check out Lucene in Action, search this list, or http://www.cnlp.org/apachecon2005 -Grant On Jul 27, 2006, at 4:52 AM, Jia Mi wr
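A sketch of the lookup Grant describes, assuming the field was indexed with Field.TermVector.YES (Lucene 1.9/2.0-era API; the class and `formatPairs` helper names are mine):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermFreqs {
    // Print the term frequencies recorded for one hit's field.
    public static void print(IndexReader reader, int docId, String field)
            throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) return; // no term vectors were stored for this field
        System.out.println(formatPairs(tfv.getTerms(), tfv.getTermFrequencies()));
    }

    // "term:freq term:freq ..." -- plain-Java helper, testable without an index.
    static String formatPairs(String[] terms, int[] freqs) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(terms[i]).append(':').append(freqs[i]);
        }
        return sb.toString();
    }
}
```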

Re: email libraries

2006-07-27 Thread Martin Braun
Hi John, > Just for the record - I've been using javamail POP and IMAP providers in > the past, and they were prone to hanging with some servers, and resource > intensive. I've been also using Outlook (proper, not Outlook Express - > this is AFAIK impossible to work with) via a Java-COM bridge suc

Re: Consult some information about adding index while searching

2006-07-27 Thread Michael McCandless
I met this problem: when searching, I add documents to the index. Although I instantiate a new IndexSearcher, I can't retrieve the newly added documents. I have to close the program and enter the program, then it will be ok. Did you close your IndexWriter (so it flushes all changes to disk) be

Consult some information about adding index while searching

2006-07-27 Thread hu andy
I met this problem: when searching, I add documents to the index. Although I instantiate a new IndexSearcher, I can't retrieve the newly added documents. I have to close and restart the program, then it will be ok. The platform is Win XP. Is it the fault of XP? Thank you in advance.

RE: SOLVED: Lock obtain timed out

2006-07-27 Thread Björn Ekengren
I didn't describe the context fully. The app is a server that receives updates randomly a couple of hundred times a day and I want the index to be updated at all times. If I received several updates at once I could batch them, but that is quite unlikely. _ Björn Ekengren Bankaktiebol

Re: SOLVED: Lock obtain timed out

2006-07-27 Thread karl wettin
On Thu, 2006-07-27 at 11:06 +0200, Björn Ekengren wrote: > Thanks everybody for the feedback. I now rewrote my app like this: > > synchronized (searcher.getWriteLock()){ > IndexReader reader = searcher.getIndexSearcher().getIndexReader(); > try { >

SOLVED: Lock obtain timed out

2006-07-27 Thread Björn Ekengren
Thanks everybody for the feedback. I now rewrote my app like this: synchronized (searcher.getWriteLock()){ IndexReader reader = searcher.getIndexSearcher().getIndexReader(); try { reader.deleteDocuments(new Term("id",id)); reader.cl
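Björn's snippet cuts off mid-statement; a hypothetical reconstruction of the full delete-then-add cycle it implies (Lucene 1.9/2.0-era API; the lock object here stands in for his application-level getWriteLock(), which is not a Lucene API):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocument {
    private final Object writeLock = new Object();
    private final String dir;

    public UpdateDocument(String dir) { this.dir = dir; }

    // Replace the document whose "id" field matches, one update at a time.
    public void update(String id, Document doc) throws Exception {
        synchronized (writeLock) {
            IndexReader reader = IndexReader.open(dir);
            try {
                // In this era, deletes go through the reader, not the writer.
                reader.deleteDocuments(new Term("id", id));
            } finally {
                reader.close(); // releases the index write lock for the writer
            }
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            try {
                writer.addDocument(doc);
            } finally {
                writer.close();
            }
        }
    }
}
```

Closing the reader before opening the writer (and vice versa) in the finally blocks is what prevents the stale-lock-file "Lock obtain timed out" failures from earlier in the thread.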

How to get TermFreq only in some query results

2006-07-27 Thread Jia Mi
Hi everyone, I am just developing an application using Lucene, and I know how to get the Term Freq via the IndexReader for the whole corpus. But I wonder if I can get the term freq statistics just inside the query results, like I want the hot words in just recent two weeks added into Lucene indic
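One way to get term counts scoped to just the query results, combining Grant's term-vector suggestion with a plain aggregation step (a sketch; it assumes term vectors were stored for the field, and uses the Lucene 2.0-era Hits API):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Hits;

public class HotWords {
    // Sum term frequencies over only the documents that matched the query.
    public static Map countOverHits(IndexReader reader, Hits hits, String field)
            throws Exception {
        Map counts = new HashMap(); // term -> Integer total
        for (int i = 0; i < hits.length(); i++) {
            TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), field);
            if (tfv == null) continue; // no vectors for this doc/field
            accumulate(counts, tfv.getTerms(), tfv.getTermFrequencies());
        }
        return counts;
    }

    // Plain-Java accumulation, testable without an index.
    static void accumulate(Map counts, String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            Integer old = (Integer) counts.get(terms[i]);
            int total = (old == null ? 0 : old.intValue()) + freqs[i];
            counts.put(terms[i], new Integer(total));
        }
    }
}
```

For the "hot words in the last two weeks" case, the query would be a date-range query, and sorting the resulting map by value gives the top terms.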

Re: Timestamps as milliseconds

2006-07-27 Thread Miles Barr
Erick Erickson wrote: As Miles said, use the DateTools (lucene) class with a DAY resolution. That'll give you a yyyyMMdd format, which won't blow your query with a "TooManyClauses" exception... Remember that Lucene deals with strings, so you want to store things in easily-manipulated string
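What the DAY-resolution advice looks like in code: Lucene's DateTools at index time, plus a plain SimpleDateFormat equivalent added here just to make the resulting yyyyMMdd shape concrete (note DateTools itself formats in GMT):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.lucene.document.DateTools;

public class DayResolutionDates {
    // Index-time: store the date as a sortable yyyyMMdd string (GMT-based).
    public static String toIndexForm(Date d) {
        return DateTools.dateToString(d, DateTools.Resolution.DAY);
    }

    // The same shape with plain java.text, in the local time zone.
    static String toIndexFormPlain(Date d) {
        return new SimpleDateFormat("yyyyMMdd").format(d);
    }
}
```

Because the strings sort lexicographically in date order, day-resolution range queries expand to at most a few thousand terms instead of one term per millisecond.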

RE: Method to speed up caching for faceted navigation

2006-07-27 Thread Johan Stuyts
> I don't think it really matters whether you do deletes on the same > IndexReader -- what matters is if there has been any deletes > done to the > index prior to opening the reader since it was last > optimized. The reason > being that deleting a document just causes a record of the > deletion

Re: Method to speed up caching for faceted navigation

2006-07-27 Thread Chris Hostetter
: I looked at the implementation of 'read(int[], int[])' in : 'SegmentTermDocs' and saw that it did the following things: : - check if the document has a frequency higher than 1, and if so read : it; : - check if the document has been deleted, and if so don't add it to the : result; : - store the

RE: Lock obtain timed out

2006-07-27 Thread karl wettin
On Thu, 2006-07-27 at 08:59 +0200, Björn Ekengren wrote: > > > When I close my application containing index writers the > > > lock files are left in the temp directory causing an "Lock obtain > > > timed out" error upon the next restart. > > > > My guess is that you keep a writer open even though

Re: RE : Re: index articles with groups

2006-07-27 Thread Chris Hostetter
: Unfortunately this is not that easy. Because I must be able to retrieve : only one article and if i index all the content in one document then all : the document will be retrieved instead of the single article. I didn't say you had to *only* index the article contents in "group" documents ... y