Doc classification / categorization with Lucene?

2006-11-06 Thread Dmitry Goldenberg
Hello, What are the best practices for document classification / categorization using Lucene? Any recommendations as far as manual vs. automatic, which products to use or not to use? Does Lucene offer anything out of the box? Thanks, - Dmitry

RE: Lucene - FileFormat

2006-04-21 Thread Dmitry Goldenberg
Simon, I wonder if using Zoe might do the trick - http://guests.evectors.it/zoe/ Have you tried it? - Dmitry From: Fisheye [mailto:[EMAIL PROTECTED] Sent: Fri 4/21/2006 7:23 AM To: java-user@lucene.apache.org Subject: Lucene - FileFormat I'm trying to const

RE: Distributed Lucene.. - clustering as a requirement

2006-04-11 Thread Dmitry Goldenberg
Agreed, an inverted index cannot be efficiently maintained in a B-tree (hence in an RDBMS). But I think we can (or should) have the option of B-tree-based storage for unindexed fields, whereas for indexed fields we can use Lucene's existing architecture. prasen [EMAIL PROTECTED] wrote: >

RE: Distributed Lucene.. - clustering as a requirement

2006-04-06 Thread Dmitry Goldenberg
r storing the actual "documents"? This way you're using Lucene for what Lucene is best at, and using the database for what it's good at. At least up to a point -- RDBMSs have their limits too. Or maybe if you have a huge dataset, you might want to check out Nutch. On 4/6/06, Dmitry G

RE: Distributed Lucene.. - clustering as a requirement

2006-04-06 Thread Dmitry Goldenberg
I firmly believe that clustering support should be a part of Lucene. We've tried implementing it ourselves and so far have been unsuccessful. We tried storing Lucene indices in a database that is the back-end repository for our app in a clustered environment and could not overcome the indexing

RE: Data structure of a Lucene Index

2006-04-06 Thread Dmitry Goldenberg
Ideally, I'd love to see an article explaining both in detail: the index structure as well as the merge algorithm... From: Prasenjit Mukherjee [mailto:[EMAIL PROTECTED] Sent: Tue 3/28/2006 11:57 PM To: java-user@lucene.apache.org Subject: Data structure of a Luce

RE: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-09 Thread Dmitry Goldenberg
rmance, so maybe we could also make this more common setting the default? Erik On Feb 8, 2006, at 2:17 PM, Dmitry Goldenberg wrote: > Duh! Bingo! Mystery solved. I should have thought of this :) > The discrepancies come in with larger documents, definitely > 10K > terms whi

RE: Word files & Build vs. Buy?

2006-02-09 Thread Dmitry Goldenberg
Chris, Awesome stuff. A few questions: Is your Excel extractor somehow better than POI's? And what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, FrameMaker, etc? Thanx, - Dmitry _

RE: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-08 Thread Dmitry Goldenberg
f the raw term field/text and the freq : counts you get back to see if that helps you spot the problem? : : : : Date: Mon, 6 Feb 2006 14:34:05 -0800 : : From: Dmitry Goldenberg <[EMAIL PROTECTED]> : : Reply-To: java-user@lucene.apache.org : : To: java-user@lucene.apache.org : : Subject: How to get m

RE: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-08 Thread Dmitry Goldenberg
manually, or by QueryParser). The direct equals comparisons you are doing should be fine. Have you tried adding logging of the raw term field/text and the freq counts you get back to see if that helps you spot the problem? : Date: Mon, 6 Feb 2006 14:34:05 -0800 : From: Dmitry Goldenberg <[EMA

How to get mapping of query terms to number of their occurrences in a doc?

2006-02-06 Thread Dmitry Goldenberg
Given a query, I want to get, for each query term, the number of occurrences of that term. I have tried what I'm including below and it does not seem to provide reliable results. It seems to work fine with exact matching, but as soon as stemming kicks in, all bets are off as to value of

RE: How to find "function()" - ?

2006-01-30 Thread Dmitry Goldenberg
d fashion, e.g. function\() -- or is function() ok? Thanks, - Dmitry From: Michael D. Curtin [mailto:[EMAIL PROTECTED] Sent: Fri 1/27/2006 2:14 PM To: java-user@lucene.apache.org Subject: Re: How to find "function()" - ? Dmitry Goldenberg wrote: >

How to find "function()" - ?

2006-01-27 Thread Dmitry Goldenberg
Hi, I'm trying to figure out a way to locate tokens which include special characters. The actual text in the file being indexed is something like "function() { statement1; statement2; }" The query I'm using is "function\()" since I want to locate precisely "function()" - the query succeeds

RE: Keyword fields, Porter stemming, and QueryParser

2006-01-25 Thread Dmitry Goldenberg
Dave, Thanks for the pointer. The Wrapper worked marvellously! This was exactly the situation - wanting to treat the standard fields and keyword fields differently as far as stemming is concerned (no stemming for the latter). - Dmitry From: Dave Kor [mailt

RE: java.io.IOException: read past EOF in BufferedIndexInput.refill

2006-01-24 Thread Dmitry Goldenberg
clues? From: Dmitry Goldenberg [mailto:[EMAIL PROTECTED] Sent: Tue 1/24/2006 3:52 PM To: java-user@lucene.apache.org Cc: java-dev@lucene.apache.org Subject: java.io.IOException: read past EOF in BufferedIndexInput.refill Has anyone seen this exception and been able to resolve the

java.io.IOException: read past EOF in BufferedIndexInput.refill

2006-01-24 Thread Dmitry Goldenberg
Has anyone seen this exception and been able to resolve the cause? I have seen numerous mentions of it in the Lucene list archives but, it seems, no resolutions. Anyone? Thanks. java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java

Keyword fields, Porter stemming, and QueryParser

2006-01-24 Thread Dmitry Goldenberg
I'm having a problem with keyword fields and how they're treated by QueryParser. At indexing time, I index my documents, as follows: Content - tokenized, indexed field (the default field) DocType - not tokenized, indexed, stored field ... - other fields The analyzer I use utilizes Port

Lucene and Regex - ?

2006-01-04 Thread Dmitry Goldenberg
Hi, Can someone provide a quick summary of the Regex capabilities in Lucene? I see there's a RegexQuery and a SpanRegexQuery - what are they intended for and how do I use them? Thanks, - Dmitry

Correlating best fragments back to native documents - ?

2005-12-29 Thread Dmitry Goldenberg
Hello, I was wondering if anyone has seen or implemented the kind of solution where the best fragments generated by Lucene's Highlighter are correlated back to the native documents such as PDF or MS Word. Basically, I want to be able to use native (or any other) APIs to highlight Lucene's

RE: Wildcard and Fuzzy queries - no best fragments generated - ??

2005-12-27 Thread Dmitry Goldenberg
ik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tue 12/27/2005 12:13 PM To: java-user@lucene.apache.org Subject: Re: Wildcard and Fuzzy queries - no best fragments generated - ?? On Dec 27, 2005, at 2:34 PM, Dmitry Goldenberg wrote: > What do you mean by _rewriting_ the query? I checked all the >

RE: Wildcard and Fuzzy queries - no best fragments generated - ??

2005-12-27 Thread Dmitry Goldenberg
ecent postings. Please create a new message rather than reply to one and change the subject. Thanks. Erik On Dec 27, 2005, at 1:55 PM, Dmitry Goldenberg wrote: > Hello, > > While testing my code that integrates the Highlighter class from > org.apache.lucene.search.highl

Field searches and special characters - ??

2005-12-27 Thread Dmitry Goldenberg
Hello, Trying to get my field searches to work with special characters. It appears that Lucene is not able to interpret these searches correctly (but works as expected with generic content searches). For instance, I created a document named item+with+pluses (plus being the special character t

Wildcard and Fuzzy queries - no best fragments generated - ??

2005-12-27 Thread Dmitry Goldenberg
Hello, While testing my code that integrates the Highlighter class from org.apache.lucene.search.highlight, I found out that for wildcard and fuzzy queries, it generates no best fragments. Any particular reason why that is the case? Shouldn't the highlighter be able to work just like with any

RE: Proximity searches and Porter stemming - ??

2005-12-27 Thread Dmitry Goldenberg
Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tue 12/27/2005 10:56 AM To: java-user@lucene.apache.org Subject: Re: Proximity searches and Porter stemming - ?? On Dec 27, 2005, at 1:45 PM, Dmitry Goldenberg wrote: > I tried using Porter stemming in our application and it worked > great exc

Wildcard and Fuzzy queries - no best fragments generated - ??

2005-12-27 Thread Dmitry Goldenberg
Hello, While testing my code that integrates the Highlighter class from org.apache.lucene.search.highlight, I found out that for wildcard and fuzzy queries, it generates no best fragments. Any particular reason why that is the case? Shouldn't the highlighter be able to work just like with a

Proximity searches and Porter stemming - ??

2005-12-27 Thread Dmitry Goldenberg
Hello, I tried using Porter stemming in our application and it worked great except it broke the proximity searches. Is there any way at all that these two pieces of functionality could coexist peacefully? I do not see any reason why they should not. It seems to me that proximity query term

Field searches and special characters - ??

2005-12-27 Thread Dmitry Goldenberg
Hello, Trying to get my field searches to work with special characters. It appears that Lucene is not able to interpret these searches correctly (but works as expected with generic content searches). For instance, I created a document named item+with+pluses (plus being the special character

RE: searching portions of an index

2005-12-25 Thread Dmitry Goldenberg
You can implement a security filter, kind of like what the book Lucene in Action describes. It is a class that extends org.apache.lucene.search.Filter; you're required to implement the following method: public BitSet bits(IndexReader reader) In it, you can decide whether a particular documen
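
The idea behind the bits(IndexReader) method mentioned above can be sketched without any Lucene classes: the filter returns one bit per document number, and a bit is set only when that document may be shown to the current user. The sketch below is a stand-in, not the Lucene API itself. The owners array, the currentUser parameter, and the SecurityFilterSketch class are hypothetical names used only for illustration; a real filter would read such metadata from the index through the IndexReader it is given.

```java
import java.util.BitSet;

// Stand-in for the logic a security filter's bits(...) method might contain.
// No Lucene classes are used; the per-document metadata is a plain array.
class SecurityFilterSketch {

    // Hypothetical metadata: owners[docId] is the owner of document docId.
    // Returns a BitSet with bit docId set when currentUser owns that document.
    static BitSet bits(String[] owners, String currentUser) {
        BitSet allowed = new BitSet(owners.length);
        for (int docId = 0; docId < owners.length; docId++) {
            if (currentUser.equals(owners[docId])) {
                allowed.set(docId); // this document passes the filter
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        String[] owners = {"alice", "bob", "alice", "carol"};
        System.out.println(bits(owners, "alice")); // prints {0, 2}
    }
}
```

Documents whose bit is left clear never reach the result set, so the access decision is made once per document rather than once per hit.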