Hi,
we have not yet discussed that. At the moment Lucene uses one custom
annotation ("@SuppressForbidden") which is detected by the forbiddenapis
plugin based on the simple class name (not the package). Forbiddenapis
(https://github.com/policeman-tools/forbidden-apis) is a static analysis
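For illustration only, a minimal sketch of such a suppression annotation; the package and
the reason() element are assumptions, the relevant part is that forbiddenapis can match it
by the simple name SuppressForbidden:

package org.example.util; // hypothetical package; only the simple name matters here

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Class-file retention so the bytecode-level checker can see the annotation.
@Retention(RetentionPolicy.CLASS)
@Target({ElementType.CONSTRUCTOR, ElementType.FIELD, ElementType.METHOD, ElementType.TYPE})
public @interface SuppressForbidden {
  String reason() default "";
}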
I'm wondering if the Lucene community would be supportive of adopting
common annotations, such as @Nullable, to enable better static analysis for
downstream projects and within Lucene as well. Lucene makes extensive use
of nulls for performance reasons, but using this code can be prone to
One more idea. It's possible to ask Solr for essential tokenization via the
/analysis/field API (here's a clue: https://stackoverflow.com/a/37785401),
get the token stream from the structured response, and pass it into an NLP
pipeline for enrichment.
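As a rough sketch of that call (the host, core name "mycore" and field name "text" are
assumptions for illustration, not anything from this thread), the request and its JSON
response can be fetched like this:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class FieldAnalysisCall {
  public static void main(String[] args) throws Exception {
    String text = URLEncoder.encode("Some text to tokenize", StandardCharsets.UTF_8);
    // Core, field and host below are hypothetical.
    String url = "http://localhost:8983/solr/mycore/analysis/field"
        + "?analysis.fieldname=text&analysis.fieldvalue=" + text + "&wt=json";
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
              HttpResponse.BodyHandlers.ofString());
    // The response lists every analysis stage with its tokens, types and offsets.
    System.out.println(response.body());
  }
}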
On Wed, Feb 22, 2023 at 5:26 PM Luke Kot-Zaniewski
@lucene.apache.org
Subject: Re: Offset-Based Analysis
Hello Luke.
Using offsets seems really doubtful to me. What comes to my mind is the
pre-analyzed field:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
Thus, an external NLP service can provide ready
g a CharFilter that decodes some special header,
which itself passes along an offset-sorted list of data for enrichment.
This metadata could be referenced during analysis via custom attributes and
ideally could handle a variety of use cases with the same offset-accounting
logic. Some uses that come to mind are stashing values in term/payload
attributes or even offset based tokenization for those wishing to tokenize
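To illustrate just the custom-attribute part, a minimal sketch of a Lucene attribute that
such a filter chain could use to carry enrichment data; every name here is invented for the
sketch, and the two types are shown together only for brevity (each public type would live
in its own file):

// EnrichmentAttribute.java
import org.apache.lucene.util.Attribute;

public interface EnrichmentAttribute extends Attribute {
  void setLabel(String label);
  String getLabel();
}

// EnrichmentAttributeImpl.java
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

public class EnrichmentAttributeImpl extends AttributeImpl implements EnrichmentAttribute {
  private String label;

  @Override public void setLabel(String label) { this.label = label; }
  @Override public String getLabel() { return label; }

  @Override public void clear() { label = null; }

  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(EnrichmentAttribute.class, "label", label);
  }

  @Override public void copyTo(AttributeImpl target) {
    ((EnrichmentAttribute) target).setLabel(label);
  }
}

A filter in the chain would then call addAttribute(EnrichmentAttribute.class) and set the
label for each token it matches against the offset-sorted metadata.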
ways. Looking at the library internals
reveals unsynchronized lazy initialization of shared components.
Unfortunately the lucene integration kind of sweeps this under the rug by
wrapping everything in a pretty big synchronized block, here is an example
https://github.com/apache/lucene/blob/main/lucen
upposed to be shared among threads. They
can be re-used among threads though.
NLPs, stemming for example, on the other hand, are slow. If you have to put NLP
processing inside the analysis chain, you may have to give up certain NLP
capacities...
My 2 cents,
Guan
-----Original Message-----
From: Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
Sent: Saturday, November
Hello, Benoit.
I just came across
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html
It sounds similar to what you're asking for, but it watches TypeAttribute only.
Also, spans are superseded by intervals
https
he open-nlp library itself. It is not
thread-safe in some very unexpected ways. Looking at the library internals
reveals unsynchronized lazy initialization of shared components.
Unfortunately the lucene integration kind of sweeps this under the rug by
wrapping everything in a pretty big synchronized block, here is an example
https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
. This itself is problematic because these functions run in really tight loops
and probably shouldn’t be blocking. Even if one
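As a sketch of one alternative to a shared lock (not a patch, and the class name is
invented): share the immutable POSModel, but give each indexing thread its own POSTaggerME
instance.

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PerThreadPosTagger {
  private final ThreadLocal<POSTaggerME> tagger;

  public PerThreadPosTagger(POSModel model) {
    // The POSModel is safe to share; the POSTaggerME is not, so create one per thread.
    this.tagger = ThreadLocal.withInitial(() -> new POSTaggerME(model));
  }

  public String[] tag(String[] tokens) {
    // No synchronization needed: each thread only ever touches its own tagger.
    return tagger.get().tag(tokens);
  }
}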
This change was intentional to make it consistent with package naming,
Dawid
On Tue, Jul 26, 2022 at 10:34 PM Baris Kazar wrote:
> Dear Folks,-
> I see that Lucene has changed one of the JAR files' name to
> lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0.
> It used
Dear Folks,-
I see that Lucene has changed the name of one of the JAR files to
lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0.
It used to use 'analyzers'. Can someone please confirm?
Best regards
Hello,
I created an issue in elasticsearch:
https://github.com/elastic/elasticsearch/issues/71483
and was redirected to the Lucene project.
I want to ask whether I can create an issue in your Jira about this problem? Or maybe
there is already a solution?
Regards
Dominik Seweryn
Application Developer
, 2017 at 3:12 PM, Nawab Zada Asad Iqbal
wrote:
Hi,
I upgraded to Solr 7 today and I am seeing tonnes of the following errors for
various fields.
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception
writing document id file_3881549 to the index; possible analysis error:
startOffset must be non-negative, and endOffset
On line 114, in the init() method, a ClassicTokenizerImpl object is created,
but the constructor is passed a parameter called input. Where does this
variable come from? It doesn't seem to be declared anywhere in the java file.
>> some terms from analysis be silently dropped when indexing
Then I presume the same terms need to also be exempted/dropped during the search
process,
or else the desired results will not be as expected.
with regards
karthik
On Mon, Aug 25, 2014 at 12:52 PM, Trejkaz wrote:
> It seems li
It seems like nobody knows the answer, so I'm just going to file a bug.
TX
Lucene 4.9 gives much the same result.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import
On Tue, Aug 19, 2014 at 5:27 PM, Uwe Schindler wrote:
> Hi,
>
> You forgot to close (or commit) IndexWriter before opening the reader.
Huh? The code I posted is closing it:
try (IndexWriter writer = new IndexWriter(directory,
new IndexWriterConfig(Version.LUCENE_36, analyser))) {
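For reference, a minimal modern-API sketch of the ordering being suggested (this is not the
3.6 code from this thread): commit, or leave the writer's try block, before any reader is
opened.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class WriteThenRead {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("test-index"))) {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        Document doc = new Document();
        doc.add(new TextField("body", "zaple korfle bnoggle", Field.Store.NO));
        writer.addDocument(doc);
        writer.commit();          // flush changes before any reader is opened
      }                           // the try block also closes the writer here
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        System.out.println("indexed docs: " + reader.numDocs());
      }
    }
  }
}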
9, 2014 6:50 AM
> To: Lucene Users Mailing List
> Subject: Can some terms from analysis be silently dropped when indexing?
> Because I'm pretty sure I'm seeing that happen.
>
> Unrelated to my previous mail to the list, but related to the same
> investigation...
>
Also in case it makes a difference, we're using Lucene v3.6.2.
TX
Unrelated to my previous mail to the list, but related to the same
investigation...
The following test program just indexes a phrase of nonsense words
using and then queries for one of the words using the same analyser.
The same analyser is being used both for indexing and for querying,
yet in th
Thanks Mike.
On Mon, Oct 8, 2012 at 4:30 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Fri, Oct 5, 2012 at 10:24 AM, selvakumar netaji
> wrote:
> > Hi All,
> >
> >
> > In the TokenStreamAPI section of the analysis documentation for lucene
On Fri, Oct 5, 2012 at 10:24 AM, selvakumar netaji
wrote:
> Hi All,
>
>
> In the TokenStreamAPI section of the analysis documentation for lucene 4.0
> beta, MyAnalyzer class is defined.
>
> They've added the lengthFilter in the create components method. The length
>
f Apache Lucene.
>>
>> I just read through the docs of the analyser
>> docs/core/org/apache/lucene/analysis/package-summary.html.
>>
>>
>> Here they have given a code snippet,I've ambiguities in the add attribute
>> method. Should it be added to t
Can you please help me to sort this out.
On Fri, Oct 5, 2012 at 7:54 PM, selvakumar netaji wrote:
> Hi All,
>
>
> In the TokenStreamAPI section of the analysis documentation for lucene
> 4.0 beta, MyAnalyzer class is defined.
>
> They've added the lengthFilter in th
Hi All,
In the TokenStreamAPI section of the analysis documentation for lucene 4.0
beta, MyAnalyzer class is defined.
They've added the LengthFilter in the createComponents method. The length
filter doesn't accept three arguments in 4.0. Should I create a
length filter
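One workaround, shown only as a sketch, is to sidestep the changed constructor entirely
and write a tiny filter against the stable TokenFilter API; the class name and bounds are
arbitrary.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Drops tokens whose length is outside [min, max]; a hand-rolled stand-in for LengthFilter.
public final class SimpleLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int min;
  private final int max;

  public SimpleLengthFilter(TokenStream in, int min, int max) {
    super(in);
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      int len = termAtt.length();
      if (len >= min && len <= max) {
        return true;   // keep this token
      }
      // otherwise skip it and look at the next one
    }
    return false;
  }
}

Note this sketch does not adjust position increments for the dropped tokens, which Lucene's
own filtering filters take care of.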
docs of the analyser
> docs/core/org/apache/lucene/analysis/package-summary.html.
>
>
> Here they have given a code snippet; I've ambiguities about the addAttribute
> method. Should it be added to the token stream instance?
>
> Version matchVersion = Version.LUC
ywordAnalyzer was applied, while at
indexing time additional logic of removing spaces was (first) applied,
hence the different results at indexing and search time.
Doron
Hi, sort of. I had an error in the reusableTokenStream() method of my
analyzer, so it wasn't doing the full analysis at que
>
> In my particular case I add album catalog numbers to my index as a keyword
> field, but of course if the catalog number contains a space, as they often
> do (i.e. cad 6), there is a mismatch. I've now changed my indexing to index
> the value as 'cad6', removing spaces. Now if the query sent to the quer
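One common way to keep index-time and query-time analysis consistent for a field like
this, shown as a sketch with the modern API and a hypothetical field name, is a
PerFieldAnalyzerWrapper that gives the catalog-number field keyword treatment. Note it
does not change the query parser's whitespace handling discussed below.

import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CatalogAnalyzers {
  // "catalogno" is a made-up field name; it gets keyword treatment, everything else
  // goes through standard analysis. Use the same wrapper for indexing and query parsing.
  public static Analyzer build() {
    return new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("catalogno", new KeywordAnalyzer()));
  }
}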
On 01/02/2012 22:03, Robert Muir wrote:
On Wed, Feb 1, 2012 at 4:32 PM, Paul Taylor wrote:
So it seems like it just broke the text up at spaces, and does text analysis
within getFieldQuery(), but how can it make the assumption that text should
only be broken at whitespace ?
you are right, see
: So it seems like it just broke the text up at spaces, and does text analysis
: within getFieldQuery(), but how can it make the assumption that text should
: only be broken at whitespace ?
whitespace is a significant metacharacter to the QueryParser - it is used
to distinguish multiple clauses
On Wed, Feb 1, 2012 at 4:32 PM, Paul Taylor wrote:
>
> So it seems like it just broke the text up at spaces, and does text analysis
> within getFieldQuery(), but how can it make the assumption that text should
> only be broken at whitespace ?
you are right, see this bug r
the analyser I use removes accents
So it seems like it just broke the text up at spaces, and does text
analysis within getFieldQuery(), but how can it make the assumption that
text should only be broken at whitespace?
This seemed to be confirmed when I passed it the query 'dug/up'; it just
Caveat to the below is that I am very new to Lucene. (That said though,
following the below strategy, after a couple of days' work I have a set of
per-field analyzers for various languages, using various custom filters,
caching of initial analysis; and capable of outputting stemmed, reversed
http://snowball.tartarus.org/ for stemming
2011/8/22 Saar Carmi
> Hi
> Where can I find a guide for building analyzers, filters and tokenizers?
>
> Saar
>
Hi
Where can I find a guide for building analyzers, filters and tokenizers?
Saar
Thanks,
will try the code and get back if I have any problems.
Rohit Banga
On Fri, Feb 12, 2010 at 10:38 PM, Ahmet Arslan wrote:
>
> > i want to consider the current word
> > & the next as a single term.
> >
> > when analyzing "Arun Kumar"
> >
> > i want my analyzer to consider "Arun", "Arun
> i want to consider the current word
> & the next as a single term.
>
> when analyzing "Arun Kumar"
>
> i want my analyzer to consider "Arun", "Arun Kumar"
> as synonyms.
>
> in the tokenstream method, how do we read the next token
> "Kumar"
> i am going through the setPositionIncrements meth
On Feb 10, 2010, at 8:33 AM, Rohit Banga wrote:
> basically i want to use my own filter wrapping around a standard analyzer.
>
> the kind explained on page 166 of Lucene in Action, uses input.next() which
> is perhaps not available in lucene 3.0
>
> what is the substitute method.
captureState(
Basically I want to use my own filter wrapping around a standard analyzer,
the kind explained on page 166 of Lucene in Action, which uses input.next();
that is perhaps not available in Lucene 3.0.
What is the substitute method?
Rohit Banga
On Wed, Feb 10, 2010 at 6:46 PM, Rohit Banga wrote:
> i want
I want to consider the current word and the next one as a single term.
When analyzing "Arun Kumar",
I want my analyzer to consider "Arun" and "Arun Kumar" as synonyms.
In the tokenStream method, how do we read the next token, "Kumar"?
I am going through the setPositionIncrements method for considering them
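For what it's worth, a sketch (modern API, hypothetical analyzer name) of a ready-made
route: ShingleFilter can emit the single words plus two-word shingles at the same
positions, which yields "Arun", "Arun Kumar" and "Kumar".

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public final class WordPairAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    ShingleFilter shingles = new ShingleFilter(source, 2, 2); // two-word shingles
    shingles.setOutputUnigrams(true);                         // also keep the single words
    return new TokenStreamComponents(source, shingles);
  }
}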
My colleague at Lucid Imagination, Tom Hill, will be presenting a free
webinar focused on analysis in Lucene/Solr. If you're interested, please
sign up and join us.
Here is the official notice:
We'd like to invite you to a free webinar our company is offering next
Thursday, 28 Janua
Hi all,
I am giving a free talk/ workshop next week on how to analyze and
improve Lucene search performance for native lucene apps. If you've ever
been challenged to get your Java Lucene search apps running faster, I
think you might find the talk of interest.
Free online workshop:
Thursday,
Original Message-
> From: Bradford Stephens [mailto:bradfordsteph...@gmail.com]
> Sent: Thursday, August 06, 2009 12:46 PM
> To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
> Subject: Language Detection for Analysis?
>
> Hey there,
>
> We're trying to
Google Translate just released (last week) its language API with translation
and LANGUAGE DETECTION.
:)
It's very simple to use, and you can query it with some text to detect which
language it is.
Here is a simple example using Groovy, but all you need is the URL to
query: http://groovyconsole.ap
There are several free Language Detection libraries out there, as well
as a few commercial ones. I think Karl Wettin has even written one as
a plugin for Lucene. Nutch also has one, AIUI. I would just Google
"language detection".
Also see http://www.lucidimagination.com/search/?q=languag
You could write your own analyzer that worked out a boost as it
analyzed the document fields and had a getBoost() method that you
would call to get the value to add to the document as a separate
field. If you write your own you can pass it what you like and it can
do whatever you want.
--
Ian.
, NER, IR
- Original Message
> From: Bradford Stephens
> To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
> Sent: Thursday, August 6, 2009 3:46:21 PM
> Subject: Language Detection for Analysis?
>
> Hey there,
>
> We're trying to add foreign
Thanks Robert for the explanation. I thought that you meant something
different, like doing stemming in some sophisticated manner by somehow
detecting the language. Doing these normalizations makes sense of course,
especially if the letters look similar.
Thanks again,
Shai
On Thu, Aug 6, 2009 at
Shai, I mean doing language-agnostic things that apply to all of these
since they are based on the same writing system, like normalizing all
yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form,
removing harakat, the kinds of things in ArabicNormalizationFilter and
PersianNormaliza
Robert - can you elaborate on what you mean by "just treat it at the script
level"?
On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote:
> Bradford, there is an arabic analyzer in trunk. for farsi there is
> currently a patch available:
> http://issues.apache.org/jira/browse/LUCENE-1628
>
> one o
Bradford, there is an Arabic analyzer in trunk. For Farsi there is
currently a patch available:
http://issues.apache.org/jira/browse/LUCENE-1628
One option is not to detect languages at all.
It could be hard for short queries due to the languages you mentioned
borrowing from each other.
But you do
Hey there,
We're trying to add foreign language support into our new search
engine -- languages like Arabic, Farsi, and Urdu (that don't work with
standard analyzers). But our data source doesn't tell us which
languages we're actually collecting -- we just get blocks of text. Has
anyone here worke
Hi Anshum-
> You might want to look at writing a custom analyzer or something and
> add a
> document boost (while indexing) for documents containing those terms.
Do you know how to access the document from an analyzer? It seems to only have
access to the field...
Thanks,
-Chris
---
rms or phrases. What's the best way
> to accomplish this?
> Thanks,
> -Chris
>
> > -Original Message-
> > From: Christopher Condit [mailto:con...@sdsc.edu]
> > Sent: Tuesday, July 21, 2009 2:48 PM
> > To: java-user@lucene.apache.org
> > Subject:
e-
> From: Christopher Condit [mailto:con...@sdsc.edu]
> Sent: Tuesday, July 21, 2009 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Analysis Question
>
> I'm trying to implement an analyzer that will compute a score based on
> vocabulary terms in the indexed content
I'm trying to implement an analyzer that will compute a score based on
vocabulary terms in the indexed content (i.e. a document field with more terms
in the vocabulary will score higher). Although I can see the tokens I can't
seem to access the document from the analyzer to set a new field on it a
while ((ts.next(token)) != null) {
  String t = new String(token.termBuffer()).substring(0, token.termLength());
  System.out.println("Got token " + t);
}
-Original Message-
From: Marek Rei
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: ana
Hi,
I'm rather new to Lucene and could use some help.
My Analyzer uses a set of filters (PorterStemFilter, LowerCaseFilter,
WhitespaceTokenizer). I need to replicate the effect of these filters
outside of the normal Lucene pipeline. Basically I would like to input a
String from one end and get a
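Roughly what that looks like with the current TokenStream consumer API (the original
thread predates it); the analyzer argument is assumed to be whatever chain of those
filters you already have:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeText {
  // Runs the given analyzer over a string and returns the produced terms.
  public static List<String> tokens(Analyzer analyzer, String field, String text)
      throws IOException {
    List<String> result = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream(field, text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                      // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        result.add(term.toString());   // e.g. stemmed, lower-cased term
      }
      ts.end();
    }
    return result;
  }
}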
response times decent, but to maintain performance during peak write
rates, we've had to make N a much larger number than we'd like.
One idea we're floating would be to do all the analysis centrally,
perhaps on N/4 machines, and then stream the raw tokens and data
directly t
Aaron Schon wrote:
...I was wondering if taking a bag of words approach might work. For example chunking the
sentences to be analyzed and running a Lucene query against an index storing sentiment
polarity. Has anyone had success with this approach? I do not need a super accurate
system, someth
Yes, this makes sense to me. I think I'll just keep all words, including
stop words, and if performance ever becomes an issue, I'll look at
bigrams again. But I think there's a good chance that I'll never see
significant impact either way.
Thanks guys!
Grant Ingersoll wrote:
Yep, still good r
Yep, still good reasons like I said, but becoming less important as
the hardware, etc. gets faster and cheaper, IMO, especially in the
context of more advanced search capabilities.
On Mar 3, 2008, at 10:49 AM, Mathieu Lecarme wrote:
Not sure, you might want to ask on Nutch. From a strict
Not sure, you might want to ask on Nutch. From a strict language
standpoint, the notion of a stopword in my mind is a bit dubious. If
the word really has no meaning, then why does the language have it to
begin with? In a search context, it has been treated as of minimal
use in the early da
On Mar 3, 2008, at 5:40 AM, John Byrne wrote:
Hi,
I need to use stop-word bigrams, like the Nutch analyzer, as
described in LIA 4.8 (Nutch Analysis). What I don't understand is,
why does it keep the original stop word intact? I can see great
advantage to being able to search
Hi,
I need to use stop-word bigrams, like the Nutch analyzer, as described
in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it
keep the original stop word intact? I can see great advantage to being
able to search for a combination of stop word + real word, but I don
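For reference, a sketch of the same stopword-bigram idea using the CommonGramsFilter from
analyzers-common (modern API; the tiny stopword set is just an example). It keeps the
original stopwords too, so single-word queries still work, which is essentially the
trade-off being discussed here.

import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public final class StopBigramAnalyzer extends Analyzer {
  // Example stopword list; a real one would be much larger.
  private static final CharArraySet STOPWORDS =
      new CharArraySet(List.of("the", "of", "a"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // For "rise of the machines" this emits "rise", "rise_of", "of", "of_the", "the",
    // "the_machines", "machines".
    TokenStream filtered = new CommonGramsFilter(source, STOPWORDS);
    return new TokenStreamComponents(source, filtered);
  }
}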
an <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, March 1, 2008 6:35:04 AM
Subject: Re: Lucene for Sentiment Analysis
We've been working on Sentiment Analysis. We use GATE and Wordnet for the
lexical / semantic analysis and J Free Charts for the visualization. The
domain
We've been working on Sentiment Analysis. We use GATE and Wordnet for the
lexical / semantic analysis and J Free Charts for the visualization. The
domain is reviews on retail banking and in general our accuracy is around
75% and recall around 25%.
We tried out LingPipe as well, which also gave
some TF-IDF statistical information from Lucene Index) and it
worked well.
Maybe you can do the same for sentiment analysis, i.e. use LingPipe
capabilities but enhance them with the corpus statistics that the Lucene
index provides - and which are more powerful now.
HTH,
Srikant
Aaron Schon wrote
Hello, I was interested to learn about using Lucene for text analytics work
such as for Sentiment Analysis. Has anyone done work along these lines? If so,
could you share your approach, experiences, accuracy levels obtained, etc.
Thanks,
AS
eks dev wrote:
Depends what you need to do with it; if you need this to be only used as "kind of
stemming" for searching documents, the solution is not all that complex. If you need
linguistically correct splitting then it gets complicated.
This is a very good point. Stemming for
high recall is mu
Thanks for the pointers, Pasquale!
Otis
- Original Message
From: Pasquale Imbemba <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 4:24:16 AM
Subject: Re: Analysis/tokenization of compound words
Otis,
I forgot to mention that I make use of
ation or just n-gram the input). Guess who their
biggest customer is? Hint: starts with the letter G.
Otis
- Original Message
From: Marvin Humphrey <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 11:14:49 AM
Subject: Re: Analysis/toke
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:
Writing a decomposer is difficult as you need both a large dictionary
*without* compounds and a set of rules to avoid splitting at too many
positions.
Conceptually, how different is the problem of decompounding German
from tokenizing languag
have used the one Maarten de Rijke and
Christof Monz have published in /Shallow Morphological Analysis in
Monolingual
Information Retrieval for Dutch, German and Italian /(website here
<http://www.dcs.qmul.ac.uk/%7Echristof/>, document here
<http://www.dcs.qmul.ac.uk/%7Echristof/publicat
suggested, I have used
the lexicon of German nouns extracted from Morphy
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the
splitting algorithm, I have used the one Maarten de Rijke and Christof
Monz have published in /Shallow Morphological Analysis in Monolingual
Information
Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 19, 2006 10:22 AM
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote:
>
> How do people typically analyze/tokenize text with compounds (e.g.
> German)? I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at
On Tuesday 19 September 2006 22:15, eks dev wrote:
> Daniel Naber did some work with German dictionaries as well; if I
> recall correctly, maybe he has something that helps
The company I work for offers a commercial Java component for decomposing
and lemmatizing German words, see http://demo.intrafi
On Tuesday 19 September 2006 22:41, eks dev wrote:
> ahh, another one: when you strip a suffix, check if the last char of the remaining
> "stem" is "s" (a magic thing in German), and delete it if it's not the only
> letter. Do not ask why; a long unexplained mystery of the German language
This is called "Fugenelement" a
OTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words
Hi Otis,
Depends what you need to do with it; if you need this to be only used as "kind
of stemming" for searching documents, the solution is not all that comple
mething similar ages ago ("stemming like" splitting of word
in German)
Have fun, e.
- Original Message
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words
Hi,
Ho
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:
How do people typically analyze/tokenize text with compounds (e.g.
German)? I took a look at GermanAnalyzer hoping to see how one can
deal with that, but it turns out GermanAnalyzer doesn't treat
compounds in any special way at all.
O
netic <[EMAIL PROTECTED]>
19/09/2006 17:21
Please respond to
java-user@lucene.apache.org
To
java-user@lucene.apache.org
cc
Subject
Analysis/tokenization of compound words
Hi,
How do people typically analyze/tokenize text with compounds (e.g. Germ
Hi,
How do people typically analyze/tokenize text with compounds (e.g. German)? I
took a look at GermanAnalyzer hoping to see how one can deal with that, but it
turns out GermanAnalyzer doesn't treat compounds in any special way at all.
One way to go about this is to have a word dictionary and
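A rough sketch of that dictionary-based idea (greedy, longest-match splitting against a
noun lexicon); this is illustrative only, not the Morphy-based algorithm mentioned in this
thread, and analyzers-common also ships a DictionaryCompoundWordTokenFilter for the same
purpose.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy greedy decompounder: split a compound into dictionary words, longest match first.
public class SimpleDecompounder {
  private final Set<String> dictionary; // e.g. a set of German nouns, lower-cased

  public SimpleDecompounder(Set<String> dictionary) {
    this.dictionary = dictionary;
  }

  public List<String> split(String word) {
    List<String> parts = new ArrayList<>();
    String rest = word.toLowerCase();
    while (!rest.isEmpty()) {
      int end = rest.length();
      // find the longest dictionary word that prefixes the remainder
      while (end > 0 && !dictionary.contains(rest.substring(0, end))) {
        end--;
      }
      if (end == 0) {      // no dictionary prefix: keep the whole remainder and stop
        parts.add(rest);
        break;
      }
      parts.add(rest.substring(0, end));
      rest = rest.substring(end);
      // a real splitter would also handle linking elements such as the German "s" (Fugenelement)
    }
    return parts;
  }
}

With a dictionary containing "staub" and "sauger", split("Staubsauger") returns [staub, sauger].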
Adding to this growing thread, there's really no reason to
index all the term bigrams, trigrams, etc. It's not
only slow, it's very memory/disk intensive. All you need
to do is two passes over the collection.
Pass One
Collect counts of bigrams (or trigrams, or whatever -- if
size is an
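A toy sketch of that first counting pass (plain Java; all names are made up for the
example):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramCounter {
  // Pass one: count bigrams across an already-tokenized collection.
  public static Map<String, Long> countBigrams(Iterable<List<String>> tokenizedDocs) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> tokens : tokenizedDocs) {
      for (int i = 0; i + 1 < tokens.size(); i++) {
        String bigram = tokens.get(i) + " " + tokens.get(i + 1);
        counts.merge(bigram, 1L, Long::sum);
      }
    }
    // Pass two would score each document's phrases against these collection-wide counts.
    return counts;
  }
}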
Nader Akhnoukh wrote:
Yes, Chris is correct, the goal is to determine the most frequently
occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irre
little more detail. Are you
> suggesting manually traversing each document and doing a search on each
> phrase? That seems very intensive as I have tens of thousands of documents.
>
> Thanks.
>
package analysis;
import stem.*;
import java.util.Vector;
import java.io.IOExceptio
Yes, Chris is correct, the goal is to determine the most frequently occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irregular basis and could ru
Chris Hostetter wrote:
I think either you misunderstood Nader's question or I did: I believe the
goal is to determine what the most frequently occurring phrases are -- not
determine how frequently a particular input phrase appears.
Isn't the latter a pre-requisite for the former ? ;)
Regardi
: > I am trying to get the most frequently occurring phrases in a document and
: > in the index as a whole. The goal is to compare the two to get something like
: > Amazon's SIPs.
: Other than indexing the phrases directly, you could use a SpanNearQuery
: over the words, use getSpans() on its SpanS
high ratio indicates that the term appears in this doc much more
> than the other docs on average.
>
> Does anyone have an idea of how to do this with phrases of say 1 to 3 words?
>
> Just to be clear, in this case I am only using Lucene for its built-in
> frequency
ne have an idea of how to do this with phrases of say 1 to 3 words?
Just to be clear, in this case I am only using Lucene for its built-in
frequency analysis; I'm not actually using it to search for anything that is
indexed.
Thanks,
NSA
Thanks very much Erik. The QueryParser method was pretty useful in
writing my own one.
-Venu
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 18, 2006 7:09 PM
To: java-user@lucene.apache.org
Subject: Re: How to do analysis when creating a query