Re: index articles with groups

2006-07-27 Thread Chris Hostetter
: Unfortunately this is not that easy. Because I must be able to retrieve : only one article, and if I index all the content in one document then the : whole document will be retrieved instead of the single article. I didn't say you had to *only* index the article contents in "group" documents ... y

RE: Lock obtain timed out

2006-07-27 Thread karl wettin
On Thu, 2006-07-27 at 08:59 +0200, Björn Ekengren wrote: > > > When I close my application containing index writers the > > > lock files are left in the temp directory causing a "Lock obtain > > > timed out" error upon the next restart. > > > > My guess is that you keep a writer open even though
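Karl's guess, a writer left open, is the usual cause of stale lock files. A minimal sketch of the close-in-finally pattern against the Lucene 1.9/2.0-era API (index path and field names are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SafeWriter {
    public static void addDoc(String indexDir, String text) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        try {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        } finally {
            writer.close(); // releases the write lock even if addDocument throws
        }
    }
}
```

Closing the writer in a finally block guarantees the lock file is removed even when indexing fails, so the next startup does not hit "Lock obtain timed out".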

Re: Method to speed up caching for faceted navigation

2006-07-27 Thread Chris Hostetter
: I looked at the implementation of 'read(int[], int[])' in : 'SegmentTermDocs' and saw that it did the following things: : - check if the document has a frequency higher than 1, and if so read : it; : - check if the document has been deleted, and if so don't add it to the : result; : - store the

RE: Method to speed up caching for faceted navigation

2006-07-27 Thread Johan Stuyts
> I don't think it really matters wether you do deletes on the same > IndexReader -- what matters is if there has been any deletes > done to the > index prior to opening the reader since it was last > optimized. The reason > being that deleting a document just causes a record of the > deletion

Re: Timestamps as milliseconds

2006-07-27 Thread Miles Barr
Erick Erickson wrote: As Miles said, use the DateTools (Lucene) class with a DAY resolution. That'll give you a yyyyMMdd format, which won't blow up your query with a "TooManyClauses" exception... Remember that Lucene deals with strings, so you want to store things in easily-manipulated string
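The point about string ordering can be seen without Lucene at all: a yyyyMMdd rendering sorts lexicographically in date order, which is what makes string-based range queries on such fields behave correctly. A small pure-Java illustration (SimpleDateFormat stands in for DateTools.dateToString here):

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class DayResolution {
    // Same shape as DateTools.dateToString(date, Resolution.DAY)
    public static String toDay(Calendar cal) {
        return new SimpleDateFormat("yyyyMMdd").format(cal.getTime());
    }

    public static void main(String[] args) {
        Calendar a = Calendar.getInstance();
        a.set(2006, Calendar.JULY, 27);
        Calendar b = Calendar.getInstance();
        b.set(2006, Calendar.AUGUST, 3);
        String da = toDay(a), db = toDay(b);
        // Lexicographic order matches chronological order
        System.out.println(da + " < " + db + " : " + (da.compareTo(db) < 0));
    }
}
```

Because DAY resolution collapses every timestamp in a day to one term, a date range query expands to at most a few hundred terms instead of millions, which is what avoids TooManyClauses.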

How to get TermFreq only in some query results

2006-07-27 Thread Jia Mi
Hi everyone, I am just developing an application using Lucene, and I know how to get the term freq via the IndexReader for the whole corpus. But I wonder if I can get the term freq statistics just inside the query results, like I want the hot words from just the recent two weeks added into Lucene indic

SOLVED: Lock obtain timed out

2006-07-27 Thread Björn Ekengren
Thanks everybody for the feedback. I now rewrote my app like this: synchronized (searcher.getWriteLock()){ IndexReader reader = searcher.getIndexSearcher().getIndexReader(); try { reader.deleteDocuments(new Term("id",id)); reader.cl
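The (truncated) snippet above follows the usual delete-then-reopen pattern of this API generation, where deletes go through an IndexReader. A fuller sketch of what it appears to be doing, with illustrative names:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class Deleter {
    private final Object writeLock = new Object();

    public void deleteById(String indexDir, String id) throws Exception {
        synchronized (writeLock) {
            // Only one writer OR one deleting reader may hold the
            // index write lock at a time, hence the synchronization.
            IndexReader reader = IndexReader.open(indexDir);
            try {
                reader.deleteDocuments(new Term("id", id));
            } finally {
                reader.close(); // flushes the deletes and releases the lock
            }
        }
    }
}
```

The finally block matters for the same reason as with IndexWriter: an exception between open and close would otherwise leave the lock file behind.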

Re: SOLVED: Lock obtain timed out

2006-07-27 Thread karl wettin
On Thu, 2006-07-27 at 11:06 +0200, Björn Ekengren wrote: > Thanks everybody for the feedback. I now rewrote my app like this: > > synchronized (searcher.getWriteLock()){ > IndexReader reader = searcher.getIndexSearcher().getIndexReader(); > try { >

RE: SOLVED: Lock obtain timed out

2006-07-27 Thread Björn Ekengren
I didn't describe the context fully. The app is a server that receives updates randomly a couple of hundred times a day and I want the index to be up to date at all times. If I received several updates at once I could batch them, but that is quite unlikely. _ Björn Ekengren Bankaktiebol

Consult some information about adding index while searching

2006-07-27 Thread hu andy
I met this problem: when searching, I add documents to the index. Although I instantiate a new IndexSearcher, I can't retrieve the newly added documents. I have to close and restart the program, then it will be ok. The platform is Win XP. Is it the fault of XP? Thank you in advance.

Re: Consult some information about adding index while searching

2006-07-27 Thread Michael McCandless
I met this problem: when searching, I add documents to the index. Although I instantiate a new IndexSearcher, I can't retrieve the newly added documents. I have to close and restart the program, then it will be ok. Did you close your IndexWriter (so it flushes all changes to disk) be
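Michael's question points at the standard sequence: flush with writer.close(), then open a fresh searcher, since a reader only ever sees the index as it was when it was opened. A minimal sketch (names and paths illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class Refresh {
    public static IndexSearcher addAndReopen(String dir, Document doc) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close(); // flush new segments to disk before opening a reader
        // A searcher opened before this close will never see the new
        // document; only a searcher opened afterwards will.
        return new IndexSearcher(dir);
    }
}
```

The common bug is creating the new IndexSearcher before (or instead of) closing the writer, which matches the symptom described in the thread.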

Re: email libraries

2006-07-27 Thread Martin Braun
Hi John, > Just for the record - I've been using javamail POP and IMAP providers in > the past, and they were prone to hanging with some servers, and resource > intensive. I've been also using Outlook (proper, not Outlook Express - > this is AFAIK impossible to work with) via a Java-COM bridge suc

Re: How to get TermFreq only in some query results

2006-07-27 Thread Grant Ingersoll
You could store Term Vectors for your documents, and then look up the individual document vectors based on the query results. If you need help w/ Term Vectors, check out Lucene in Action, search this list, or http://www.cnlp.org/apachecon2005 -Grant On Jul 27, 2006, at 4:52 AM, Jia Mi wr
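Grant's suggestion can be sketched as: index the field with term vectors enabled, then after a search pull the vector for each hit and tally frequencies. A sketch with illustrative names against the Lucene 1.9/2.0-era API (the field must have been indexed with Field.TermVector.YES):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Hits;

public class HotWords {
    // Sum term frequencies over just the documents in a result set.
    public static Map countTerms(IndexReader reader, Hits hits, String field)
            throws Exception {
        Map counts = new HashMap();
        for (int i = 0; i < hits.length(); i++) {
            TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), field);
            if (tfv == null) continue; // field was not indexed with vectors
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int j = 0; j < terms.length; j++) {
                Integer prev = (Integer) counts.get(terms[j]);
                int total = (prev == null ? 0 : prev.intValue()) + freqs[j];
                counts.put(terms[j], new Integer(total));
            }
        }
        return counts;
    }
}
```

Sorting the resulting map by value then gives the "hot words" for whatever date-restricted query produced the hits.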

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-27 Thread Paulo Silveira
Ok, I just tested it. So consider: String string = "word -foo"; String[] fields = { "title", "body" }; For the MultiField I have: MultiFieldQueryParser qp = new MultiFieldQueryParser(fields, SearchEngine.ANALYZER); Query fieldsQuery = qp.parse(string); System.out.
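Completed into a compilable form — the analyzer is whatever SearchEngine.ANALYZER is in the original; StandardAnalyzer stands in here:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class MultiFieldDemo {
    public static void main(String[] args) throws Exception {
        String string = "word -foo";
        String[] fields = { "title", "body" };
        // Instance constructor replaces the deprecated static parse methods
        MultiFieldQueryParser qp =
            new MultiFieldQueryParser(fields, new StandardAnalyzer());
        Query fieldsQuery = qp.parse(string);
        // Each clause is expanded across both fields, along the lines of:
        // (title:word body:word) -(title:foo body:foo)
        System.out.println(fieldsQuery.toString());
    }
}
```

Printing the parsed query's toString() is the quickest way to confirm how required/prohibited clauses are distributed across the fields.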

Scoring a document (count?)

2006-07-27 Thread Russell M. Allen
I am curious about the potential use of document scoring as a means to extract additional data from an index. Specifically, I would like the score to be a count of how many times a particular field matched a set of terms. For example, I am indexing movie-stars (Each document is a movie-star). A
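One way to make scores behave like match counts is to flatten every scoring factor so each matching term clause contributes roughly 1. This is an illustrative sketch, not the only approach (a HitCollector plus Searcher.explain, or term-vector counting, are alternatives); install it with searcher.setSimilarity(...):

```java
import org.apache.lucene.search.DefaultSimilarity;

// Neutralize tf, idf, norms and coord so the raw score of a
// BooleanQuery of TermQuerys approximates "number of terms matched".
public class CountingSimilarity extends DefaultSimilarity {
    public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
    public float idf(int docFreq, int numDocs) { return 1.0f; }
    public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }
    public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    public float coord(int overlap, int maxOverlap) { return 1.0f; }
}
```

For the movie-star example, a query of N genre terms against the genre field would then score each document close to the number of those genres it matched.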

Indexing large sets of documents?

2006-07-27 Thread Michael J. Prichard
I built an indexer that runs through email and its attachments, rips out content and what not and then creates a Document and adds it to an index. It works w/ no problem. The issue is that it takes around 3-5 seconds per email and I have seen up to 10-15 seconds for email w/ attachments. I n

Re: Indexing large sets of documents?

2006-07-27 Thread MALCOLM CLARK
Is this the W3 Ent collection you are indexing? MC

RE: Indexing large sets of documents?

2006-07-27 Thread Dejan Nenov
Yes - parallelizing works great - we built a shared-nothing JavaSpaces-based system at X1 and on an 11-way cluster were able to index 350 office documents per second - this included the binary-to-text conversion, using Stellent INSO libraries. The trick is to create separate indexes and, if you do no

Re: Indexing large sets of documents?

2006-07-27 Thread Otis Gospodnetic
Michael, Certainly parallelizing on a set of servers would work (hmm... hadoop?), but if you want to do this on a single machine you should tune some of the IndexWriter params. You didn't mention them, so I assume you didn't tune anything yet. If you have Lucene in Action, check out 2.7.1
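The IndexWriter parameters Otis alludes to (section 2.7.1 of Lucene in Action covers them) can be set like this; the values are starting points to experiment with, not recommendations:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
    public static IndexWriter open(String dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.setMergeFactor(50);        // merge segments less often
        writer.setMaxBufferedDocs(1000);  // buffer more docs in RAM before flushing
        writer.setMaxFieldLength(10000);  // cap tokens indexed per field
        return writer;
    }
}
```

Raising mergeFactor and maxBufferedDocs trades memory and open file handles for fewer disk flushes; with 3-5 seconds per email, though, the bottleneck described here is more likely content extraction than Lucene itself, so profiling before tuning would be worthwhile.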

Re: Indexing large sets of documents?

2006-07-27 Thread Rafael Rossini
Otis, You mentioned the hadoop project. I checked it out not long ago and I read something about it not supporting the Lucene index. Is it possible to index and then search in an HDFS? []s Rossini On 7/27/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Michael, Certainly paralleli

Output of index

2006-07-27 Thread MALCOLM CLARK
Hi, I'm going to attempt to output several thousand documents from a 3+ million document collection into a CSV file. What is the most efficient method of retrieving all the text from the fields of each document one by one? Please help! Thanks, Malcolm

Distributed Search

2006-07-27 Thread Mark Miller
I know there has been a lot of discussion on distributed search...I am looking for a cross-platform solution, which seems to kill Solr's approach...Everyone seems to have implemented this, but only as proprietary code...it would seem that just using the RMI searcher would allow a simple solutio

Re: Output of index

2006-07-27 Thread Otis Gospodnetic
I think: - Get the number of documents from IndexReader (maxDoc()). - Go from 0 to that number. - If reader.isDeleted(docId) == false, get the doc and output the doc fields' content. Otis - Original Message From: MALCOLM CLARK <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, July 27, 200
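Otis's steps, spelled out as a sketch (CSV quoting kept deliberately simple; field names and paths are illustrative, and only fields that were stored at index time can be retrieved this way):

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DumpIndex {
    public static void dump(String indexDir, String[] fields, String outFile)
            throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        PrintWriter out = new PrintWriter(new FileWriter(outFile));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue; // skip deleted slots
                Document doc = reader.document(i);
                StringBuffer line = new StringBuffer();
                for (int f = 0; f < fields.length; f++) {
                    if (f > 0) line.append(',');
                    String v = doc.get(fields[f]); // null if field not stored
                    line.append('"')
                        .append(v == null ? "" : v.replaceAll("\"", "\"\""))
                        .append('"');
                }
                out.println(line.toString());
            }
        } finally {
            out.close();
            reader.close();
        }
    }
}
```

Iterating by document id like this is sequential disk access over the stored fields, which is about as efficient as bulk retrieval gets in Lucene.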

Re: Distributed Search

2006-07-27 Thread Otis Gospodnetic
I think we have an RMI example in Lucene in Action. You could also look at how Nutch does it. I think the code is in the org.apache.nutch.ipc package. I'm not sure why a cross-platform requirement rules out Solr, I would think it would be exactly the opposite. As for the 10m limit, it depends. It depends on
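The RMI route mentioned in Lucene in Action is roughly: export a RemoteSearchable on each index server, then look the stubs up on the client and wrap them. A sketch with an illustrative registry port, index path and binding name:

```java
import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

public class RmiSearchServer {
    public static void main(String[] args) throws Exception {
        LocateRegistry.createRegistry(1099);
        Searchable local = new IndexSearcher("/path/to/index");
        RemoteSearchable remote = new RemoteSearchable(local);
        Naming.rebind("//localhost/Searchable", remote);
        // Client side:
        //   Searchable s = (Searchable) Naming.lookup("//host/Searchable");
        // then wrap one or more stubs in a MultiSearcher to fan a query
        // out across several remote indexes.
    }
}
```

Being pure Java RMI, this works across operating systems, which is the cross-platform property Mark is after; the hard parts the thread warns about (failover, availability) are not addressed by this sketch.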

Re: Indexing large sets of documents?

2006-07-27 Thread Otis Gospodnetic
Rossini, I think what you read may have been that searching a Lucene index that lives in an HDFS would be slow. As far as I understand things, the thing to do is to copy the index to a local disk, out of HDFS, and then search it with Lucene from there. Otis - Original Mes

Re: Distributed Search

2006-07-27 Thread Mark Miller
Otis Gospodnetic wrote: I think we have an RMI example in Lucene in Action. You could also look at how Nutch does it. I think the code is in org.apache.nutch.ipc package. I'm not sure why cross-platform requirement rules out Solr, I would think it would exactly the opposite. As for 10m limit,

Re: Distributed Search

2006-07-27 Thread Yonik Seeley
On 7/27/06, Mark Miller <[EMAIL PROTECTED]> wrote: I thought I read that solr requires an OS that supports hard links and thought that Windows only supports soft links. For the default index distribution method from master to searcher, yes, hard-links are currently needed. The distribution mec

Re: Consult some information about adding index while searching

2006-07-27 Thread hu andy
Yes, I have closed IndexWriter. But it doesn't work. 2006/7/27, Michael McCandless <[EMAIL PROTECTED]>: > I met this problem: when searching, I add documents to index. Although I > instantiates a new IndexSearcher, I can't retrieve the newly added > documents. I have to close the program an

Re: How to get TermFreq only in some query results

2006-07-27 Thread Jia Mi
Thank you, Grant, it really helped me :P On 7/27/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote: You could store Term Vectors for your documents, and then look up the individual document vectors based on the query results. If you need help w/ Term Vectors, check out Lucene in Action, search this li

Re: Distributed Search

2006-07-27 Thread Jeff Rodenburg
Hi Mark - Having gone down this path for the past year, I echo comments from others that scalability/availability/failover is a lot of work. We migrated away from a custom system based on Lucene running on Windows to Solr running on Linux. It took us 6 months to get our system to a solid five-n