Hello,
What are the best practices for document classification/categorization using
Lucene? Any recommendations regarding manual vs. automatic approaches, or which
products to use or avoid? Does Lucene offer anything out of the box?
Thanks,
- Dmitry
Simon,
I wonder if using Zoe might do the trick - http://guests.evectors.it/zoe/
Have you tried it?
- Dmitry
From: Fisheye [mailto:[EMAIL PROTECTED]
Sent: Fri 4/21/2006 7:23 AM
To: java-user@lucene.apache.org
Subject: Lucene - FileFormat
I'm trying to const
Agreed, an inverted index cannot be efficiently maintained in a
B-tree (hence not in an RDBMS). But I think we can (or should) have the option
of B-tree-based storage for unindexed fields, whereas for indexed fields
we can keep using Lucene's existing architecture.
prasen
[EMAIL PROTECTED] wrote:
>
r storing the actual "documents"? This way you're using Lucene for what
Lucene is best at, and using the database for what it's good at. At
least up to a point -- RDBMSs have their limits too. Or maybe, if you
have a huge dataset, you might want to check out Nutch.
On 4/6/06, Dmitry G
I firmly believe that clustering support should be a part of Lucene. We've
tried implementing it ourselves and so far have been unsuccessful. We tried
storing Lucene indices in a database that is the back-end repository for our
app in a clustered environment and could not overcome the indexing
Ideally, I'd love to see an article explaining both in detail: the index
structure as well as the merge algorithm...
From: Prasenjit Mukherjee [mailto:[EMAIL PROTECTED]
Sent: Tue 3/28/2006 11:57 PM
To: java-user@lucene.apache.org
Subject: Data structure of a Luce
rmance, so maybe we could also make this more common by setting it as the
default?
Erik
On Feb 8, 2006, at 2:17 PM, Dmitry Goldenberg wrote:
> Duh! Bingo! Mystery solved. I should have thought of this :)
> The discrepancies come in with larger documents, definitely > 10K
> terms whi
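(Editorial sketch, not from the thread: the > 10K-term boundary above suggests IndexWriter's default maxFieldLength of 10,000 terms per field, though the truncated messages don't confirm it. If that is indeed the culprit, raising the limit would look roughly like this, assuming the Lucene 1.9/2.x API.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterSetup {
    public static IndexWriter open(String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        // Default is 10,000 terms per field; anything beyond that is silently dropped.
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        return writer;
    }
}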
Chris,
Awesome stuff. A few questions: is your Excel extractor somehow better than
POI's? And what do you see as the timeframe for adding WordPerfect support?
Are you considering supporting any other sources, such as MS Project,
FrameMaker, etc.?
Thanx,
- Dmitry
f the raw term field/text and the freq
: counts you get back to see if that helps you spot the problem?
:
:
: : Date: Mon, 6 Feb 2006 14:34:05 -0800
: : From: Dmitry Goldenberg <[EMAIL PROTECTED]>
: : Reply-To: java-user@lucene.apache.org
: : To: java-user@lucene.apache.org
: : Subject: How to get m
manually, or
by QueryParser). The direct equals comparisons you are doing should be
fine.
have you tried adding logging of the raw term field/text and the freq
counts you get back to see if that helps you spot the problem?
: Date: Mon, 6 Feb 2006 14:34:05 -0800
: From: Dmitry Goldenberg <[EMA
Given a query, I want to be able to get, for each query term, the number of
occurrences of that term. I have tried what I'm including below and it does not
seem to provide reliable results. It seems to work fine with exact matching, but
as soon as stemming kicks in, all bets are off as to the value of
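(A rough sketch of one way to do this, not taken from the thread: rewrite the query against the reader, pull out the concrete terms -- which are already the analyzed/stemmed forms -- and total their freq() counts via TermDocs. Assumes Lucene 1.9+, where Query.extractTerms is available; class and output format are illustrative.)

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Query;

public class QueryTermCounts {
    // Prints, for each concrete term of the query, its total occurrence count in the index.
    public static void print(IndexReader reader, Query query) throws IOException {
        Set terms = new HashSet();
        query.rewrite(reader).extractTerms(terms);  // rewritten query yields the stemmed terms
        for (Iterator it = terms.iterator(); it.hasNext();) {
            Term term = (Term) it.next();
            int total = 0;
            TermDocs docs = reader.termDocs(term);
            while (docs.next()) {
                total += docs.freq();               // occurrences within this document
            }
            docs.close();
            System.out.println(term + ": " + total);
        }
    }
}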
d fashion, e.g. function\() -- or is function() ok?
Thanks,
- Dmitry
From: Michael D. Curtin [mailto:[EMAIL PROTECTED]
Sent: Fri 1/27/2006 2:14 PM
To: java-user@lucene.apache.org
Subject: Re: How to find "function()" - ?
Dmitry Goldenberg wrote:
>
Hi,
I'm trying to figure out a way to locate tokens which include special
characters. The actual text in the file being indexed is something like
"function() { statement1; statement2; }"
The query I'm using is "function\()" since I want to locate precisely
"function()" - the query succeeds
Dave,
Thanks for the pointer. The Wrapper worked marvellously! This was exactly the
situation - wanting to treat the standard fields and keyword fields differently
as far as stemming is concerned (no stemming for the latter).
- Dmitry
From: Dave Kor [mailt
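(The wrapper isn't named in the snippet, but it sounds like PerFieldAnalyzerWrapper. A minimal sketch of that setup, assuming a Porter-stemming analyzer for the default fields and KeywordAnalyzer for an untokenized field such as DocType; names are illustrative.)

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerSetup {
    // Stand-in for whatever stemming analyzer the application actually uses.
    static class StemmingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(new LowerCaseTokenizer(reader));
        }
    }

    public static Analyzer build() {
        // Stem everything by default, but give the keyword-style field its own
        // analyzer so its values stay as single, unstemmed tokens.
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StemmingAnalyzer());
        wrapper.addAnalyzer("DocType", new KeywordAnalyzer());
        return wrapper;
    }
}

The same wrapper should be passed both to IndexWriter and to QueryParser so index-time and query-time analysis stay in sync.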
clues?
From: Dmitry Goldenberg [mailto:[EMAIL PROTECTED]
Sent: Tue 1/24/2006 3:52 PM
To: java-user@lucene.apache.org
Cc: java-dev@lucene.apache.org
Subject: java.io.IOException: read past EOF in BufferedIndexInput.refill
Has anyone seen this exception and been able to resolve the
Has anyone seen this exception and been able to resolve the cause? I have seen
numerous mentions of it in the Lucene list archives, but no resolutions, it
seems. Anyone? Thanks.
java.io.IOException: read past EOF
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java
I'm having a problem with keyword fields and how they're treated by QueryParser.
At indexing time, I index my documents as follows:
Content - tokenized, indexed field (the default field)
DocType - not tokenized, indexed, stored field
... - other fields
The analyzer I use utilizes Port
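(For reference, a sketch of the indexing side described above, assuming the Lucene 1.9/2.x Field API; older code would use the Field.Text/Field.Keyword factory methods instead.)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocBuilder {
    public static Document build(String content, String docType) {
        Document doc = new Document();
        // Content: tokenized, indexed, not stored (the default search field).
        doc.add(new Field("Content", content, Field.Store.NO, Field.Index.TOKENIZED));
        // DocType: not tokenized, indexed, stored -- a keyword-style field.
        doc.add(new Field("DocType", docType, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}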
Hi,
Can someone provide a quick summary of the Regex capabilities in Lucene? I see
there's a RegexQuery and a SpanRegexQuery - what are they intended for and how
do I use them?
Thanks,
- Dmitry
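(Not an authoritative answer, just a sketch of how the two are typically used, assuming the regex contrib package org.apache.lucene.search.regex is on the classpath: RegexQuery matches terms against a regular expression, and SpanRegexQuery is the span-aware variant, so it can be combined with SpanNearQuery and friends. Field names and patterns are illustrative.)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.regex.RegexQuery;
import org.apache.lucene.search.regex.SpanRegexQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

public class RegexExamples {
    // Matches any term in "content" that starts with "function".
    public static Query plainRegex() {
        return new RegexQuery(new Term("content", "function.*"));
    }

    // The span variant can take part in proximity combinations.
    public static SpanQuery regexNearRegex() {
        SpanQuery a = new SpanRegexQuery(new Term("content", "func.*"));
        SpanQuery b = new SpanRegexQuery(new Term("content", "stat.*"));
        return new SpanNearQuery(new SpanQuery[] { a, b }, 5, false);  // within 5 positions, any order
    }
}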
Hello,
I was wondering if anyone has seen or implemented the kind of solution where
the best fragments generated by Lucene's Highlighter are correlated back to
the native documents, such as PDF or MS Word.
Basically, I want to be able to use native (or any other) APIs to highlight
Lucene's
ik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tue 12/27/2005 12:13 PM
To: java-user@lucene.apache.org
Subject: Re: Wildcard and Fuzzy queries - no best fragments generated - ??
On Dec 27, 2005, at 2:34 PM, Dmitry Goldenberg wrote:
> What do you mean by _rewriting_ the query? I checked all the
>
ecent postings. Please
create a new message rather than reply to one and change the
subject. Thanks.
Erik
On Dec 27, 2005, at 1:55 PM, Dmitry Goldenberg wrote:
> Hello,
>
> While testing my code that integrates the Highlighter class from
> org.apache.lucene.search.highl
Hello,
Trying to get my field searches to work with special characters. It appears
that Lucene is not able to interpret these searches correctly (but works as
expected with generic content searches).
For instance, I created a document named item+with+pluses (plus being the
special character t
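(One hedged suggestion, not from the thread: if the field holding names like item+with+pluses is indexed untokenized, a TermQuery sidesteps the parser entirely; if the string must go through QueryParser, escape it first. Field and value names below are illustrative.)

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SpecialCharQueries {
    public static void main(String[] args) {
        // TermQuery matches the indexed term verbatim; no escaping involved.
        Query exact = new TermQuery(new Term("name", "item+with+pluses"));
        System.out.println(exact);

        // If the value must pass through QueryParser, escape its special characters.
        System.out.println(QueryParser.escape("item+with+pluses"));  // item\+with\+pluses
    }
}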
Hello,
While testing my code that integrates the Highlighter class from
org.apache.lucene.search.highlight, I found out that for wildcard and fuzzy
queries, it generates no best fragments.
Any particular reason why that is the case? Shouldn't the highlighter be able
to work just like with any
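(The usual explanation -- hedged, since the replies above are truncated -- is that wildcard and fuzzy queries contain no concrete terms until they are rewritten against an IndexReader, so QueryScorer finds nothing to highlight. A sketch of that rewrite step with the contrib Highlighter, field name illustrative.)

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightHelper {
    public static String bestFragments(IndexReader reader, Analyzer analyzer,
                                       Query query, String text) throws Exception {
        Query rewritten = query.rewrite(reader);  // expands wildcard/fuzzy into real terms
        Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
        TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
        return highlighter.getBestFragments(tokens, text, 3, "...");
    }
}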
Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tue 12/27/2005 10:56 AM
To: java-user@lucene.apache.org
Subject: Re: Proximity searches and Porter stemming - ??
On Dec 27, 2005, at 1:45 PM, Dmitry Goldenberg wrote:
> I tried using Porter stemming in our application and it worked
> great exc
Hello,
I tried using Porter stemming in our application and it worked great except it
broke the proximity searches. Is there any way at all that these two pieces of
functionality could coexist peacefully?
I do not see any reason why they should not. It seems to me that proximity
query term
You can implement a security filter, kind of like what the book Lucene in
Action describes. It is a class that extends org.apache.lucene.search.Filter;
you're required to implement the following method:
public BitSet bits(IndexReader reader)
In it, you can decide whether a particular documen
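(A minimal sketch of such a filter, assuming documents carry an untokenized "group" field and the pre-2.9 Filter API quoted above; the field name and class are illustrative. Only documents whose group matches get their bit set; everything else is filtered out of the results.)

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class GroupSecurityFilter extends Filter {
    private final String allowedGroup;

    public GroupSecurityFilter(String allowedGroup) {
        this.allowedGroup = allowedGroup;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());  // all bits start off (hidden)
        TermDocs docs = reader.termDocs(new Term("group", allowedGroup));
        try {
            while (docs.next()) {
                bits.set(docs.doc());               // allow docs in the permitted group
            }
        } finally {
            docs.close();
        }
        return bits;
    }
}

The filter is then passed as the second argument to Searcher.search(query, filter).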