Re: Email Indexing

2010-10-27 Thread Lance Norskog
Tika has some mailbox file parsing that includes metadata parsing. For POP/IMAP email servers I don't know of any tools. Hasan Diwan wrote: On 27 October 2010 18:16, Troy Wical wrote: Depends on what you're trying to index, I suppose. Maildir or mbox? For some time now, off and on, I have been
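
For readers unfamiliar with the mbox layout mentioned above, here is a minimal sketch of pulling Subject headers out of mbox-style text before handing them to an indexer. It is not Tika's parser and not anything posted in this thread; the class name MboxSubjects and the method subjects are hypothetical. It assumes the common convention that each message begins with a "From " separator line at the start of the file or after a blank line, and it ignores folded (multi-line) headers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MboxSubjects {

    /** Collects the Subject header of each message in mbox-style lines. */
    static List<String> subjects(Iterable<String> lines) {
        List<String> result = new ArrayList<>();
        boolean inHeaders = false;
        boolean previousBlank = true; // the start of the input counts as a boundary
        for (String line : lines) {
            if (previousBlank && line.startsWith("From ")) {
                // Separator line: a new message and its header block begin here.
                // Body lines of this shape are normally escaped as ">From ".
                inHeaders = true;
            } else if (inHeaders && line.isEmpty()) {
                // The first blank line ends the header block.
                inHeaders = false;
            } else if (inHeaders && line.regionMatches(true, 0, "Subject:", 0, 8)) {
                // Header names are case-insensitive; keep only the value.
                result.add(line.substring(8).trim());
            }
            previousBlank = line.isEmpty();
        }
        return result;
    }

    public static void main(String[] args) {
        String mbox = String.join("\n",
            "From alice@example.org Wed Oct 27 18:16:00 2010",
            "Subject: Email Indexing",
            "From: alice@example.org",
            "",
            "Body text",
            "",
            "From bob@example.org Wed Oct 27 19:00:00 2010",
            "Subject: Re: Email Indexing",
            "",
            "Reply body");
        // prints [Email Indexing, Re: Email Indexing]
        System.out.println(subjects(Arrays.asList(mbox.split("\r?\n"))));
    }
}
```

For a real archive, the lines could come from java.nio.file.Files.readAllLines, and each extracted value would become a field of the document being indexed. A production setup would more likely rely on an existing parser such as the Tika one mentioned above.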

Re: Email Indexing

2010-10-27 Thread Hasan Diwan
On 27 October 2010 18:16, Troy Wical wrote: > Depends on what you're trying to index, I suppose. Maildir or mbox? For some > time now, off and on, I have been working to index an ezmlm mailing list > archive. In the end, I went with Swish-E and have made quite a bit of > progress. I am short of m

Re: Lucene index update

2010-10-27 Thread Nilesh Vijaywargiay
One major reason is to update a field, or rather to shadow a field. I have a field named "testField" in index1 and I want to update that field. When I update, I want only the new value to be reflected, not the value in the old field. Now ParallelReader starts from the latest index, i.e. index2, and searches

Re: Lucene index update

2010-10-27 Thread Pulkit Singhal
But why do you feel the need to have a parallel reader that combines result sets across two indices based on docId? On Thu, Oct 28, 2010 at 12:17 AM, Nilesh Vijaywargiay < nilesh.vi...@gmail.com> wrote: > Pulkit, > Parallel reader takes the union of all fields for a given id. Thus if I > want > t

Re: Lucene index update

2010-10-27 Thread Nilesh Vijaywargiay
Pulkit, Parallel reader takes the union of all fields for a given id. Thus if I want to add a field or modify a field of a document which has id 2 in index1, I need to create a document with id 2 in index2 with the fields I want to add/modify. Thus parallel reader would treat them as fields of a s

Re: Lucene index update

2010-10-27 Thread Pulkit Singhal
Looks interesting. What is the merit in having a second index in order to keep the document id the same? Perhaps I have misunderstood. Just want to understand your motivation here. On Wed, Oct 20, 2010 at 2:57 PM, Nilesh Vijaywargiay wrote: > I've written a blog regarding a workaround for updati

Fuzzy Phrase Search

2010-10-27 Thread Andrew Scott
Hi Guys, I am wondering how I can go about doing a fuzzy phrase search using Lucene.NET 2.9.2. I've tried looking around everywhere, but there don't really seem to be any resources related to this anywhere. I found this Stack Overflow link

Re: Text categorization / classification

2010-10-27 Thread mvazq...@ova.st
Thanks a lot! I was reading about Mahout today. I'll try that out. Thanks again Maria On Oct 27, 2010, at 20:59, Lance Norskog wrote: > There are tools for this in the Mahout project. These are oriented > toward large-scale work. > > http://mahout.apache.org > > There is

Re: Email Indexing

2010-10-27 Thread Troy Wical
On Oct 27, 2010, at 3:57 PM, Hasan Diwan wrote: > I'd like to provide myself with a searchable index of email. I'm > familiar with the Javamail library, so will use this to fetch the > mail. Anyone out there done any indexing of email? On Sourceforge, > there's zoe[1], which hasn't had a release s

Re: Text categorization / classification

2010-10-27 Thread Lance Norskog
There are tools for this in the Mahout project. These are oriented toward large-scale work. http://mahout.apache.org There is a big learning curve and you have to learn Hadoop somewhat. The book 'Collective Intelligence' includes a suite of Python tools for small-scale experiments. On Wed, Oct

Email Indexing

2010-10-27 Thread Hasan Diwan
I'd like to provide myself with a searchable index of email. I'm familiar with the Javamail library, so will use this to fetch the mail. Anyone out there done any indexing of email? On Sourceforge, there's zoe[1], which hasn't had a release since 2004, and a couple of other projects. I'm also seein

Text categorization / classification

2010-10-27 Thread Maria Vazquez
I need to auto-categorize a large number of documents. They are basically news articles from major news sources (nytimes, npr, abcnews, etc). I'd like to categorize them automatically. Any suggestions? Lucene in Action suggests using a set of documents to build category vectors and then comparing

Michigan Information Retrieval Enthusiasts Group Quarterly Meetup - November 13, 2010

2010-10-27 Thread Provalov, Ivan
Cengage Learning is organizing a second quarterly meetup in Michigan (web-conference and dial-in are available) for the IR Enthusiasts. Please RSVP at http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group Presentations: 1. Search Assist Dictionary Based on Corpus Terms Colloca

Re: Does a IndexSearcher call incRef on the underlying reader?

2010-10-27 Thread Michael McCandless
On Wed, Oct 27, 2010 at 1:01 PM, Pulkit Singhal wrote: > 1st of all, great book. Thank you! > @Question3: It sounds like an IndexReader always starts with a count of zero > but that should not be a cause of worry because the value only gets acted > upon in a call to decRef() ... am I right? Ac
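
As background for the incRef/decRef question above, here is a minimal sketch of the reference-counting pattern being discussed. It is not Lucene's code; RefCountDemo and RefCountedResource are hypothetical names. It assumes the count starts at 1 (the creator's own reference), which, as far as I know, is also how Lucene's IndexReader behaves, and that the resource is released when the count reaches 0.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountDemo {

    /** Hypothetical stand-in for a shared reader; not a Lucene class. */
    static class RefCountedResource {
        // Assumption: the creator holds the first reference, so start at 1.
        private final AtomicInteger refCount = new AtomicInteger(1);
        private volatile boolean closed = false;

        /** Takes an additional reference; fails if the resource is already released. */
        void incRef() {
            while (true) {
                int current = refCount.get();
                if (current <= 0) {
                    throw new IllegalStateException("already closed");
                }
                if (refCount.compareAndSet(current, current + 1)) {
                    return;
                }
            }
        }

        /** Drops one reference; the last owner to do so releases the resource. */
        void decRef() {
            int remaining = refCount.decrementAndGet();
            if (remaining == 0) {
                closed = true; // real code would free files and caches here
            } else if (remaining < 0) {
                throw new IllegalStateException("too many decRef calls");
            }
        }

        int getRefCount() {
            return refCount.get();
        }

        boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) {
        RefCountedResource reader = new RefCountedResource();
        reader.incRef();                       // a second owner takes a reference
        reader.decRef();                       // the second owner is done
        System.out.println(reader.isClosed()); // prints false
        reader.decRef();                       // the creator releases its reference
        System.out.println(reader.isClosed()); // prints true
    }
}
```

The point of the pattern is that every incRef must be paired with exactly one decRef, so a component that merely borrows a reader and never takes its own reference should not release it either.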

Re: adding documents to an existing index

2010-10-27 Thread Seth Rosen
Yakob, the boolean in the constructor should be true if you want to create a NEW index in INDEX_DIR and false to append to an existing one, as seen here [1]. As for adding a directory to an index, you will need to validate the directory, then loop through it recursively and add each doc to the writer

Re: adding documents to an existing index

2010-10-27 Thread Yakob
On 10/27/10, Seth Rosen wrote: > Yakob, > Here is a snippet of an example of IndexWriter from the lucene source that > you might find helpful. > > >> IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new >> StandardAnalyzer(Version.LUCENE_CURRENT), true, >> IndexWriter.MaxFieldLeng

Re: adding documents to an existing index

2010-10-27 Thread Seth Rosen
Yakob, Here is a snippet of an example of IndexWriter from the lucene source that you might find helpful. > IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new > StandardAnalyzer(Version.LUCENE_CURRENT), true, > IndexWriter.MaxFieldLength.LIMITED); System.out.println("Indexing

Re: adding documents to an existing index

2010-10-27 Thread Yakob
Well, thanks anyway. On 10/27/10, 蒋明原 wrote: > You are too lazy. Download the Lucene source code, take a glance, and you will > find demos. > > On Wed, Oct 27, 2010 at 8:43 PM, Yakob wrote: > >> I did search for this constructor and found that it has already been >> deprecated. >> >> http://

Re: adding documents to an existing index

2010-10-27 Thread 蒋明原
You are too lazy. Download the Lucene source code, take a glance, and you will find demos. On Wed, Oct 27, 2010 at 8:43 PM, Yakob wrote: > I did search for this constructor and found that it has already been > deprecated. > > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexWri

Re: adding documents to an existing index

2010-10-27 Thread Yakob
I did search for this constructor and found that it has already been deprecated. http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean) I am using Lucene 3.0 now. Can I really use

Re: adding documents to an existing index

2010-10-27 Thread 蒋明原
IndexWriter writer = new IndexWriter(path, analyzer, false); The third parameter is what you want. Then you can call writer.addDocument(doc). Enjoy. On Wed, Oct 27, 2010 at 8:04 PM, Yakob wrote: > hello all, > I would like to ask how to add new documents to an existing Lucene > index. I mean, which class sh

adding documents to an existing index

2010-10-27 Thread Yakob
Hello all, I would like to ask how to add new documents to an existing Lucene index. I mean, which class should I use to achieve this goal? Thanks -- http://jacobian.web.id

RE: Lucene Software/Hardware Setup Question

2010-10-27 Thread Toke Eskildsen
On Tue, 2010-10-26 at 23:17 +0200, Kovnatsky, Eugene wrote: > Thanks Toke. Very descriptive. A few more questions about your SSD > drive(s) > - what is its current size 4 * 64GB Samsung MCCOE64G5MPP-0VA00 drives. They were pretty cool two years ago and still work very well for search-servers (ra