Re: Support for static analysis annotations

2025-01-03 Thread Uwe Schindler
Hi, we have not yet discussed that. At the moment Lucene uses one custom annotation ("@SuppressForbidden") which is detected by the forbiddenapis plugin based on the pure class name (not the package). Forbiddenapis (https://github.com/policeman-tools/forbidden-apis) is a static analysis

Support for static analysis annotations

2024-12-05 Thread Evan Darke
I'm wondering if the Lucene community would be supportive of adopting common annotations, such as @Nullable, to enable better static analysis for downstream projects and within Lucene as well. Lucene makes extensive use of nulls for performance reasons, but using this code can be prone to
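The proposal can be illustrated with a plain-Java sketch. `NullableDemo` and `firstMatch` are hypothetical names, and the hand-rolled marker annotation is for illustration only; a real adoption would pull in an established annotations artifact (e.g. JSpecify or JSR-305) rather than declaring its own:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hand-rolled marker for illustration only; a real project would use an
// established annotations library instead of declaring its own.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.PARAMETER, ElementType.FIELD})
@interface Nullable {}

public class NullableDemo {
    // A static-analysis tool can now warn callers who dereference the
    // result without a null check.
    @Nullable
    static String firstMatch(String[] terms, String prefix) {
        for (String t : terms) {
            if (t.startsWith(prefix)) return t;
        }
        return null; // explicitly permitted by the annotated contract
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(new String[]{"lucene", "solr"}, "so"));
    }
}
```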

Re: Offset-Based Analysis

2023-02-22 Thread Mikhail Khludnev
One more idea. It's possible to ask Solr for essential tokenization via the /analysis/field API (here's a clue https://stackoverflow.com/a/37785401), get the token stream in a structured response, and pass it into an NLP pipeline for enrichment. On Wed, Feb 22, 2023 at 5:26 PM Luke Kot-Zaniewski

Re: Offset-Based Analysis

2023-02-22 Thread Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
@lucene.apache.org Subject: Re: Offset-Based Analysis Hello Luke. Using offsets seems really doubtful to me. What comes to my mind is pre-analyzed field https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type. Thus, external NLP service can provide ready

Re: Offset-Based Analysis

2023-02-21 Thread Mikhail Khludnev
g a CharFilter that decodes some special header, > which itself passes along an offset-sorted list of data for enrichment. > This metadata could be referenced during analysis via custom attributes and > ideally could handle a variety of use cases with the same offset-accounting > lo

Offset-Based Analysis

2023-02-21 Thread Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
metadata could be referenced during analysis via custom attributes and ideally could handle a variety of use cases with the same offset-accounting logic. Some uses that come to mind are stashing values in term/payload attributes or even offset based tokenization for those wishing to tokenize

RE: Re: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
ways. Looking at the library internals reveals unsynchronized lazy initialization of shared components. Unfortunately the lucene integration kind of sweeps this under the rug by wrapping everything in a pretty big synchronized block, here is an example https://github.com/apache/lucene/blob/main/lucen
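The contention being described can be sketched without OpenNLP. `Tagger` here is a hypothetical stand-in for the non-thread-safe tagger, and the two methods contrast the coarse synchronized wrapper the thread criticizes with a per-thread alternative:

```java
// Sketch of the trade-off described above. `Tagger` is a hypothetical
// stand-in for a non-thread-safe NLP component.
class Tagger {
    String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) tags[i] = "NN"; // dummy tag
        return tags;
    }
}

public class TaggerAccess {
    private final Tagger shared = new Tagger();

    // Coarse-grained approach (what the linked wrapper does): correct,
    // but serializes every analysis thread through one lock.
    public synchronized String[] tagSynchronized(String[] tokens) {
        return shared.tag(tokens);
    }

    // Alternative: one tagger instance per thread, so the hot loop needs
    // no lock (at the cost of per-thread model state).
    private final ThreadLocal<Tagger> perThread = ThreadLocal.withInitial(Tagger::new);

    public String[] tagThreadLocal(String[] tokens) {
        return perThread.get().tag(tokens);
    }

    public static void main(String[] args) {
        TaggerAccess access = new TaggerAccess();
        System.out.println(String.join(" ", access.tagThreadLocal(new String[]{"quick", "fox"})));
    }
}
```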

RE: RE: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
example, on the other hand, are slow. If you have to put NLP processing inside the analysis chain, you may have to give up certain NLP capacities... > > My 2cents, > > Guan > > -Original Message- > From: Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) > Sent: Saturday, November

RE: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Wang, Guan
upposed to be shared among threads. They can be re-used among threads though. NLPs, stemming for example, on the other hand, are slow. If you have to put NLP processing inside the analysis chain, you may have to give up certain NLP capacities... My 2cents, Guan -Original Message- From

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Mikhail Khludnev
Hello, Benoit. I just came across https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html It sounds similar to what you asking, but it watches TypeAttribute only. Also, spans are superseded with intervals https

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Benoit Mercier
f sweeps this under the rug by wrapping everything in a pretty big synchronized block, here is an example https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36 . This itself is problematic because these funct

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
he open-nlp library itself. It is not > > thread-safe in some very unexpected ways. Looking at the library internals > > reveals unsynchronized lazy initialization of shared components. > > Unfortunately the lucene integration kind of sweeps this under the rug by > &g

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
rug by > wrapping everything in a pretty big synchronized block, here is an example > https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36 > . This itself is problematic because these functions run in real

Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
zed block, here is an example https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36 . This itself is problematic because these functions run in really tight loops and probably shouldn’t be blocking. Even if one

Re: Lucene 9.1.0 has changed name of lucene-analysis-common-9.1.0.jar

2022-07-27 Thread Dawid Weiss
This change was intentional to make it consistent with package naming, Dawid On Tue, Jul 26, 2022 at 10:34 PM Baris Kazar wrote: > Dear Folks,- > I see that Lucene has changed one of the JAR files' name to > lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0. > It used

Lucene 9.1.0 has changed name of lucene-analysis-common-9.1.0.jar

2022-07-26 Thread Baris Kazar
Dear Folks, I see that Lucene has changed the name of one of the JAR files to lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0. It used to use analyzers. Can someone please confirm? Best regards

Analysis-stempel incorrect tokens generation for numbers

2021-06-04 Thread Seweryn Dominik
Hello, I created an issue in elasticsearch: https://github.com/elastic/elasticsearch/issues/71483 and was redirected to the Lucene project. I want to ask if I can create an issue on your Jira about this problem? Or maybe there is a solution? Regards, Dominik Seweryn

Re: solr 7.0: possible analysis error: startOffset must be non-negative

2017-09-27 Thread Nawab Zada Asad Iqbal
, 2017 at 3:12 PM, Nawab Zada Asad Iqbal wrote: > Hi, > > I upgraded to solr 7 today and i am seeing tonnes of following errors for > various fields. > > o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: > Exception writing document id file_3881549 to the index

solr 7.0: possible analysis error: startOffset must be non-negative

2017-09-27 Thread Nawab Zada Asad Iqbal
Hi, I upgraded to Solr 7 today and I am seeing tonnes of the following errors for various fields. o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id file_3881549 to the index; possible analysis error: startOffset must be non-negative, and endOffset

"input" parameter in src\lucene\analysis\common\src\java\org\apache\lucene\analysis\standard\ClassicTokenizer.java

2016-08-09 Thread Christopher
On line 114, in the init() method, a ClassicTokenizerImpl object is created, but the constructor is passed a parameter called input. Where does this variable come from? It doesn't seem to be declared anywhere in the java file.
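A likely answer: in Lucene, `input` is a protected field inherited from the `Tokenizer` base class, so it needs no local declaration in ClassicTokenizer.java. A minimal sketch of the pattern, with `BaseTokenizer` and `MyTokenizer` as hypothetical stand-ins:

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical base class mirroring Lucene's Tokenizer, which declares
// `protected Reader input`; subclasses refer to it without declaring it.
abstract class BaseTokenizer {
    protected Reader input; // inherited by every subclass

    BaseTokenizer(Reader input) {
        this.input = input;
    }
}

public class MyTokenizer extends BaseTokenizer {
    public MyTokenizer(Reader reader) {
        super(reader);
    }

    // `input` here resolves to the inherited protected field.
    public boolean hasInput() {
        return input != null;
    }

    public static void main(String[] args) {
        System.out.println(new MyTokenizer(new StringReader("text")).hasInput());
    }
}
```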

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-25 Thread KARTHIK SHIVAKUMAR
>> some terms from analysis be silently dropped when indexing Then I presume the same terms need to also be dropped during the search process, else the desired results are not as expected. with regards karthik On Mon, Aug 25, 2014 at 12:52 PM, Trejkaz wrote: > It seems li

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-25 Thread Trejkaz
It seems like nobody knows the answer, so I'm just going to file a bug. TX - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
Lucene 4.9 gives much the same result. import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.ja.JapaneseAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.TextField; import

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
On Tue, Aug 19, 2014 at 5:27 PM, Uwe Schindler wrote: > Hi, > > You forgot to close (or commit) IndexWriter before opening the reader. Huh? The code I posted is closing it: try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyser))) {

RE: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Uwe Schindler
9, 2014 6:50 AM > To: Lucene Users Mailing List > Subject: Can some terms from analysis be silently dropped when indexing? > Because I'm pretty sure I'm seeing that happen. > > Unrelated to my previous mail to the list, but related to the same > investigation... > &

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Also in case it makes a difference, we're using Lucene v3.6.2. TX

Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Unrelated to my previous mail to the list, but related to the same investigation... The following test program just indexes a phrase of nonsense words using and then queries for one of the words using the same analyser. The same analyser is being used both for indexing and for querying, yet in th

Re: Apache Lucene Analysis

2012-10-08 Thread selvakumar netaji
Thanks Mike. On Mon, Oct 8, 2012 at 4:30 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Fri, Oct 5, 2012 at 10:24 AM, selvakumar netaji > wrote: > > Hi All, > > > > > > In the TokenStreamAPI section of the analysis documentation for lucene

Re: Apache Lucene Analysis

2012-10-08 Thread Michael McCandless
On Fri, Oct 5, 2012 at 10:24 AM, selvakumar netaji wrote: > Hi All, > > > In the TokenStreamAPI section of the analysis documentation for lucene 4.0 > beta, MyAnalyzer class is defined. > > They've added the lengthFilter in the create components method. The length >

Re: Apache Lucene Analysis

2012-10-08 Thread Michael McCandless
f Apache Lucene. >> >> I just read through the docs of the analyser >> docs/core/org/apache/lucene/analysis/package-summary.html. >> >> >> Here they have given a code snippet,I've ambiguities in the add attribute >> method. Should it be added to t

Re: Apache Lucene Analysis

2012-10-08 Thread selvakumar netaji
Can you please help me to sort this out. On Fri, Oct 5, 2012 at 7:54 PM, selvakumar netaji wrote: > Hi All, > > > In the TokenStreamAPI section of the analysis documentation for lucene > 4.0 beta, MyAnalyzer class is defined. > > They've added the lengthFilter in th

Re: Apache Lucene Analysis

2012-10-05 Thread selvakumar netaji
Hi All, In the TokenStreamAPI section of the analysis documentation for lucene 4.0 beta, MyAnalyzer class is defined. They've added the lengthFilter in the create components method. The length filter doesn't accept method with three arguments in 4.0. Should I create a length filter

Re: Apache Lucene Analysis

2012-10-05 Thread selvakumar netaji
docs of the analyser > docs/core/org/apache/lucene/analysis/package-summary.html. > > > Here they have given a code snippet,I've ambiguities in the add attribute > method. Should it be added to the token stream instance? > > Version matchVersion = Version.LUC

Re: When does Query Parser do its analysis ?

2012-02-02 Thread Paul Taylor
ywordAnalyzer was applied, while at indexing time additional logic of removing spaces was (first) applied, therefore the different results at indexing and search. Doron Hi, sort of I had an error in the reusableTokenStream() method of my analyzer, so it wasn't doing the full analysis at que

Re: When does Query Parser do its analysis ?

2012-02-01 Thread Doron Cohen
> > In my particular case I add album catalogsno to my index as a keyword > field, but of course if the catalog number contains a space as they often > do (i.e. cad 6) there is a mismatch. I've now changed my indexing to index > the value as 'cad6', removing spaces. Now if the query sent to the quer

Re: When does Query Parser do its analysis ?

2012-02-01 Thread Paul Taylor
On 01/02/2012 22:03, Robert Muir wrote: On Wed, Feb 1, 2012 at 4:32 PM, Paul Taylor wrote: So it seems like it just broke the text up at spaces, and does text analysis within getFieldQuery(), but how can it make the assumption that text should only be broken at whitespace ? you are right, see

Re: When does Query Parser do its analysis ?

2012-02-01 Thread Chris Hostetter
: So it seems like it just broke the text up at spaces, and does text analysis : within getFieldQuery(), but how can it make the assumption that text should : only be broken at whitespace ? whitespace is a significant metacharacter to the Queryparser - it is used to distinguish multiple clauses

Re: When does Query Parser do its analysis ?

2012-02-01 Thread Robert Muir
On Wed, Feb 1, 2012 at 4:32 PM, Paul Taylor wrote: > > So it seems like it just broke the text up at spaces, and does text analysis > within getFieldQuery(), but how can it make the assumption that text should > only be broken at whitespace ? you are right, see this bug r

When does Query Parser do its analysis ?

2012-02-01 Thread Paul Taylor
the analyser I use removes accents. So it seems like it just broke the text up at spaces, and does text analysis within getFieldQuery(), but how can it make the assumption that text should only be broken at whitespace? This seemed to be confirmed when I pass it the query 'dug/up': it just

Re: Analysis

2011-08-22 Thread Graham Sugden
Caveat to the below is that I am very new to lucene. (That said though, following the below strategy, after a couple of days work I have a set of per field analyzers for various languages, using various custom filters, caching of initial analysis; and capable of outputting stemmed, reversed

Re: Analysis

2011-08-22 Thread Mihai Caraman
http://snowball.tartarus.org/ for stemming 2011/8/22 Saar Carmi > Hi > Where can I find a guide for building analyzers, filters and tokenizers? > > Saar >

Analysis

2011-08-22 Thread Saar Carmi
Hi Where can I find a guide for building analyzers, filters and tokenizers? Saar

Re: read more tokens during analysis

2010-02-12 Thread Rohit Banga
Thanks, will try the code and get back if I have any problems. Rohit Banga On Fri, Feb 12, 2010 at 10:38 PM, Ahmet Arslan wrote: > > > i want to consider the current word > > & the next as a single term. > > > > when analyzing "Arun Kumar" > > > > i want my analyzer to consider "Arun", "Arun

Re: read more tokens during analysis

2010-02-12 Thread Ahmet Arslan
> i want to consider the current word > & the next as a single term. > > when analyzing "Arun Kumar" > > i want my analyzer to consider "Arun",  "Arun Kumar" > as synonyms. > > in the tokenstream method, how do we read the next token > "Kumar" > i am going through the setPositionIncrements meth

Re: read more tokens during analysis

2010-02-10 Thread Grant Ingersoll
On Feb 10, 2010, at 8:33 AM, Rohit Banga wrote: > basically i want to use my own filter wrapping around a standard analyzer. > > the kind explained on page 166 of Lucene in Action, uses input.next() which > is perhaps not available in lucene 3.0 > > what is the substitute method. captureState(

Re: read more tokens during analysis

2010-02-10 Thread Rohit Banga
basically i want to use my own filter wrapping around a standard analyzer. the kind explained on page 166 of Lucene in Action, uses input.next() which is perhaps not available in lucene 3.0 what is the substitute method. Rohit Banga On Wed, Feb 10, 2010 at 6:46 PM, Rohit Banga wrote: > i want

read more tokens during analysis

2010-02-10 Thread Rohit Banga
i want to consider the current word & the next as a single term. when analyzing "Arun Kumar" i want my analyzer to consider "Arun", "Arun Kumar" as synonyms. in the tokenstream method, how do we read the next token "Kumar" i am going through the setPositionIncrements method for considering them
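The effect being asked for (emitting "Arun" and "Arun Kumar" stacked at the same position, which Lucene's ShingleFilter can produce) can be modeled without the Lucene API. `emit` and the "term/positionIncrement" string encoding are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramBigram {
    // Emits "term/positionIncrement" pairs: each word advances the
    // position (increment 1), and the word+successor bigram is stacked
    // at the same position (increment 0), the way a synonym would be.
    static List<String> emit(String[] words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(words[i] + "/1");
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1] + "/0");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit(new String[]{"Arun", "Kumar"}));
    }
}
```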

Solr Analysis Webinar Jan 28, 2010

2010-01-20 Thread Jay Hill
My colleague at Lucid Imagination, Tom Hill, will be presenting a free webinar focused on analysis in Lucene/Solr. If you're interested, please sign up and join us. Here is the official notice: We'd like to invite you to a free webinar our company is offering next Thursday, 28 Janua

Lucene Search Performance Analysis Workshop

2009-08-26 Thread Andrzej Bialecki
Hi all, I am giving a free talk/ workshop next week on how to analyze and improve Lucene search performance for native lucene apps. If you've ever been challenged to get your Java Lucene search apps running faster, I think you might find the talk of interest. Free online workshop: Thursday,

RE: Language Detection for Analysis?

2009-08-10 Thread Teruhiko Kurosaka
Original Message- > From: Bradford Stephens [mailto:bradfordsteph...@gmail.com] > Sent: Thursday, August 06, 2009 12:46 PM > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Subject: Language Detection for Analysis? > > Hey there, > > We're trying to

Re: Language Detection for Analysis?

2009-08-09 Thread Lucas F. A. Teixeira
Google Translate just released (last week) its language API with translation and LANGUAGE DETECTION. :) It's very simple to use, and you can query it with some text to determine which language it is. Here is a simple example using groovy, but all you need is the url to query: http://groovyconsole.ap

Re: Language Detection for Analysis?

2009-08-07 Thread Grant Ingersoll
There are several free Language Detection libraries out there, as well as a few commercial ones. I think Karl Wettin has even written one as a plugin for Lucene. Nutch also has one, AIUI. I would just Google "language detection". Also see http://www.lucidimagination.com/search/?q=languag

Re: Analysis Question

2009-08-07 Thread Ian Lea
You could write your own analyzer that worked out a boost as it analyzed the document fields and had a getBoost() method that you would call to get the value to add to the document as a separate field. If you write your own you can pass it what you like and it can do whatever you want. -- Ian.

Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
, NER, IR - Original Message > From: Bradford Stephens > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Sent: Thursday, August 6, 2009 3:46:21 PM > Subject: Language Detection for Analysis? > > Hey there, > > We're trying to add foreign

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Thanks Robert for the explanation. I thought that you meant something different, like doing stemming in some sophisticated manner by somehow detecting the language. Doing these normalizations makes sense of course, especially if the letters look similar. Thanks again, Shai On Thu, Aug 6, 2009 at

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Shai, I mean doing language-agnostic things that apply to all of these since they are based on the same writing system, like normalizing all yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form, removing harakat, the kinds of things in ArabicNormalizationFilter and PersianNormaliza
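The script-level folding described here can be sketched as a plain character map. `YehNormalizer` is a hypothetical name, and this covers only a tiny slice of what Lucene's real ArabicNormalizationFilter does:

```java
public class YehNormalizer {
    // Fold Farsi yeh (U+06CC) and alef maksura (U+0649) into Arabic yeh
    // (U+064A), one of the normalizations mentioned in the thread.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u06CC' || c == '\u0649') c = '\u064A';
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u06CC").equals("\u064A"));
    }
}
```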

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Robert - can you elaborate on what you mean by "just treat it at the script level"? On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote: > Bradford, there is an arabic analyzer in trunk. for farsi there is > currently a patch available: > http://issues.apache.org/jira/browse/LUCENE-1628 > > one o

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Bradford, there is an arabic analyzer in trunk. for farsi there is currently a patch available: http://issues.apache.org/jira/browse/LUCENE-1628 one option is not to detect languages at all. it could be hard for short queries due to the languages you mentioned borrowing from each other. but you do

Language Detection for Analysis?

2009-08-06 Thread Bradford Stephens
Hey there, We're trying to add foreign language support into our new search engine -- languages like Arabic, Farsi, and Urdu (that don't work with standard analyzers). But our data source doesn't tell us which languages we're actually collecting -- we just get blocks of text. Has anyone here worke

RE: Analysis Question

2009-08-06 Thread Christopher Condit
Hi Anshum- > You might want to look at writing a custom analyzer or something and > add a > document boost (while indexing) for documents containing those terms. Do you know how to access the document from an analyzer? It seems to only have access to the field... Thanks, -Chris ---

Re: Analysis Question

2009-08-06 Thread Anshum
rms or phrases. What's the best way > to accomplish this? > Thanks, > -Chris > > > -Original Message- > > From: Christopher Condit [mailto:con...@sdsc.edu] > > Sent: Tuesday, July 21, 2009 2:48 PM > > To: java-user@lucene.apache.org > > Subject:

RE: Analysis Question

2009-08-05 Thread Christopher Condit
e- > From: Christopher Condit [mailto:con...@sdsc.edu] > Sent: Tuesday, July 21, 2009 2:48 PM > To: java-user@lucene.apache.org > Subject: Analysis Question > > I'm trying to implement an analyzer that will compute a score based on > vocabulary terms in the indexed content

Analysis Question

2009-07-21 Thread Christopher Condit
I'm trying to implement an analyzer that will compute a score based on vocabulary terms in the indexed content (ie a document field with more terms in the vocabulary will score higher). Although I can see the tokens I can't seem to access the document from the analyzer to set a new field on it a

Re: analysis filter wrapper

2009-05-14 Thread Joel Halbert
e((ts.next(token)) != null) { String t = new String(token.termBuffer()).substring(0, token.termLength()); System.out.println("Got token " + t); } -Original Message- From: Marek Rei Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: ana

analysis filter wrapper

2009-05-14 Thread Marek Rei
Hi, I'm rather new to Lucene and could use some help. My Analyzer uses a set of filters (PorterStemFilter, LowerCaseFilter, WhitespaceTokenizer). I need to replicate the effect of these filters outside of the normal Lucene pipeline. Basically I would like to input a String from one end and get a
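Replicating such a chain outside Lucene can be sketched in plain Java: tokenize on whitespace, lowercase, then stem. The `stem` method here is a crude stand-in for the Porter stemmer, not its real rules, and `MiniPipeline` is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MiniPipeline {
    // Crude suffix stripper standing in for PorterStemFilter; the real
    // Porter algorithm has many more rules and conditions.
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    // Whitespace tokenize -> lowercase -> stem, one plausible ordering
    // of the filters named in the mail.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            out.add(stem(token.toLowerCase(Locale.ROOT)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Indexing Documents"));
    }
}
```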

Streaming results of analysis to shards ... possible?

2009-03-24 Thread Cass Costello
response times decent, but to maintain performance during peak write rates, we've had to make N a much larger number than we'd like. One idea we're floating would be to do all the analysis centrally, perhaps on N/4 machines, and then stream the raw tokens and data directly t

Re: Lucene for Sentiment Analysis

2008-03-07 Thread Bob Carpenter
Aaron Schon wrote: ...I was wondering if taking a bag of words approach might work. For example chunking the sentences to be analyzed and running a Lucene query against an index storing sentiment polarity. Has anyone had success with this approach? I do not need a super accurate system, someth

Re: bigram analysis

2008-03-03 Thread John Byrne
Yes, this makes sense to me. I think I'll just keep all words, including stop words, and if performance ever becomes an issue, I'll look at bigrams again. But I think there's a good chance that I'll never see significant impact either way. Thanks guys! Grant Ingersoll wrote: Yep, still good r

Re: bigram analysis

2008-03-03 Thread Grant Ingersoll
Yep, still good reasons like I said, but becoming less important as the hardware, etc. gets faster and cheaper, IMO, especially in the context of more advanced search capabilities. On Mar 3, 2008, at 10:49 AM, Mathieu Lecarme wrote: Not sure, you might want to ask on Nutch. From a strict

Re: bigram analysis

2008-03-03 Thread Mathieu Lecarme
Not sure, you might want to ask on Nutch. From a strict language standpoint, the notion of a stopword in my mind is a bit dubious. If the word really has no meaning, then why does the language have it to begin with? In a search context, it has been treated as of minimal use in the early da

Re: bigram analysis

2008-03-03 Thread Grant Ingersoll
On Mar 3, 2008, at 5:40 AM, John Byrne wrote: Hi, I need to use stop-word bigrams, like the Nutch analyzer, as described in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it keep the original stop word intact? I can see great advantage to being able to search

bigram analysis

2008-03-03 Thread John Byrne
Hi, I need to use stop-word bigrams, like the Nutch analyzer, as described in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it keep the original stop word intact? I can see great advantage to being able to search for a combination of stop word + real word, but I don't

Re: Lucene for Sentiment Analysis

2008-03-01 Thread Aaron Schon
an <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, March 1, 2008 6:35:04 AM Subject: Re: Lucene for Sentiment Analysis We've been working on Sentiment Analysis. We use GATE and Wordnet for the lexical / semantic analysis and J Free Charts for the visualization. The domain

Re: Lucene for Sentiment Analysis

2008-03-01 Thread Vivek Balaraman
We've been working on Sentiment Analysis. We use GATE and Wordnet for the lexical / semantic analysis and J Free Charts for the visualization. The domain is reviews on retail banking and in general our accuracy is around 75% and recall around 25% We tried out lingpipe as well which also gave

Re: Lucene for Sentiment Analysis

2008-02-29 Thread Srikant Jakilinki
some TF-IDF statistical information from Lucene Index) and it worked well. Maybe, you can do the same for sentiment analysis i.e. use LingPipe capabilities but enhance it with the corpus statistics that the Lucene index provides - and which are more powerful now. HTH, Srikant Aaron Schon wrote

Lucene for Sentiment Analysis

2008-02-29 Thread Aaron Schon
Hello, I was interested to learn about using Lucene for text analytics work such as for Sentiment Analysis. Has anyone done work along these lines? if so, could you share your approach, experiences, accuracy levels obtained etc. Thanks, AS

Re: Analysis/tokenization of compound words (German, Chinese, etc.)

2006-11-21 Thread Bob Carpenter
eks dev wrote: Depends what you need to do with it, if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting then it gets complicated. This is a very good point. Stemming for high recall is mu

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
Thanks for the pointers, Pasquale! Otis - Original Message From: Pasquale Imbemba <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 23, 2006 4:24:16 AM Subject: Re: Analysis/tokenization of compound words Otis, I forgot to mention that I make use of

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
ation or just n-gram the input). Guess who their biggest customer is? Hint: starts with the letter G. Otis - Original Message From: Marvin Humphrey <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 23, 2006 11:14:49 AM Subject: Re: Analysis/toke

Re: Analysis/tokenization of compound words

2006-09-23 Thread Marvin Humphrey
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote: Writing a decomposer is difficult as you need both a large dictionary *without* compounds and a set of rules to avoid splitting at too many positions. Conceptually, how different is the problem of decompounding German from tokenizing languag

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
have used the one Maaten De Rijke and Christof Monz have published in /Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian /(website here <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here <http://www.dcs.qmul.ac.uk/%7Echristof/publicat

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
suggested, I have used the lexicon of German nouns extracted from Morphy (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the splitting algorithm, I have used the one Maaten De Rijke and Christof Monz have published in /Shallow Morphological Analysis in Monolingual Information

RE: Analysis/tokenization of compound words

2006-09-21 Thread Binkley, Peter
Libraries Edmonton, Alberta Canada T6G 2J8 Phone: (780) 492-3743 Fax: (780) 492-9243 e-mail: [EMAIL PROTECTED] -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 19, 2006 10:22 AM To: java-user@lucene.apache.org Subject: Analysis/tokenization of

Re: Analysis/tokenization of compound words

2006-09-20 Thread karl wettin
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote: > > How do people typically analyze/tokenize text with compounds (e.g. > German)? I took a look at GermanAnalyzer hoping to see how one can > deal with that, but it turns out GermanAnalyzer doesn't treat > compounds in any special way at

Re: Analysis/tokenization of compound words

2006-09-20 Thread Daniel Naber
On Tuesday 19 September 2006 22:15, eks dev wrote: > Daniel Naber made some work with German dictionaries as well, if I > recall well, maybe he has something that helps The company I work for offers a commercial Java component for decomposing and lemmatizing German words, see http://demo.intrafi

Re: Analysis/tokenization of compound words

2006-09-19 Thread Daniel Naber
On Tuesday 19 September 2006 22:41, eks dev wrote: > ahh, another one, when you strip a suffix, check if the last char on the remaining > "stem" is "s" (magic thing in German), delete it if not the only > letter do not ask why, long unexplained mystery of the German language This is called "Fugenelement" a

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
OTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 19 September, 2006 10:15:04 PM Subject: Re: Analysis/tokenization of compound words Hi Otis, Depends what you need to do with it, if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that comple

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
mething similar ages ago ("stemming like" splitting of word in German) Have fun, e. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 19 September, 2006 6:21:55 PM Subject: Analysis/tokenization of compound words Hi, Ho

Re: Analysis/tokenization of compound words

2006-09-19 Thread Marvin Humphrey
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote: How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all. O

Re: Analysis/tokenization of compound words

2006-09-19 Thread Jonathan O'Connor
netic <[EMAIL PROTECTED]> 19/09/2006 17:21 Please respond to java-user@lucene.apache.org To java-user@lucene.apache.org cc Subject Analysis/tokenization of compound words Hi, How do people typically analyze/tokenize text with compounds (e.g. Germ

Analysis/tokenization of compound words

2006-09-19 Thread Otis Gospodnetic
Hi, How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all. One way to go about this is to have a word dictionary and
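The dictionary-based approach sketched here can be illustrated with a greedy longest-match splitter. `Decompounder` and the toy dictionary are hypothetical, and a production splitter (such as Lucene's DictionaryCompoundWordTokenFilter) needs more care around overlapping matches and the German Fugenelement discussed later in the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Decompounder {
    // Greedy longest-match splitter over a dictionary of simple words;
    // returns the whole word unchanged when it cannot be fully covered.
    static List<String> split(String word, Set<String> dict) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            int end = -1;
            for (int j = word.length(); j > pos; j--) { // longest match first
                if (dict.contains(word.substring(pos, j))) {
                    end = j;
                    break;
                }
            }
            if (end < 0) return Collections.singletonList(word); // give up
            parts.add(word.substring(pos, end));
            pos = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        System.out.println(split("donaudampfschiff", dict));
    }
}
```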

Re: Phrase Frequency For Analysis

2006-06-22 Thread Bob Carpenter
Adding to this growing thread, there's really no reason to index all the term bigrams, trigrams, etc. It's not only slow, it's very memory/disk intensive. All you need to do is two passes over the collection. Pass One Collect counts of bigrams (or trigrams, or whatever -- if size is an
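Pass one of the two-pass scheme described above can be sketched in a few lines of plain Java; the names are illustrative:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramCounts {
    // Pass one: count every adjacent word pair across the collection.
    // A second pass would keep only pairs above some frequency threshold
    // instead of indexing all bigrams up front.
    static Map<String, Integer> countBigrams(List<String[]> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tokens : docs) {
            for (int i = 0; i + 1 < tokens.length; i++) {
                counts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[]{"new", "york", "city"},
                new String[]{"new", "york", "times"});
        System.out.println(countBigrams(docs).get("new york")); // counted in both docs
    }
}
```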

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Nader Akhnoukh wrote: Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irre

Re: Phrase Frequency For Analysis

2006-06-22 Thread Kamal Abou Mikhael
little more detail. Are you > suggesting manually traversing each document and doing a search on each > phrase? That seems very intensive as I have tens of thousands of documents. > > Thanks. > package analysis; import stem.*; import java.util.Vector; import java.io.IOExceptio

Re: Phrase Frequency For Analysis

2006-06-22 Thread Nader Akhnoukh
Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irregular basis and could ru

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Chris Hostetter wrote: I think either you missunderstood Nader's question or I did: I belive the goal is to determine what the most frequently occuring phrases are -- not determine how frequently a particular input phrase appears. Isn't the latter a pre-requisite for the former ? ;) Regardi

Re: Phrase Frequency For Analysis

2006-06-22 Thread Chris Hostetter
: > I am trying to get the most frequently occurring phrases in a document and : > in the index as a whole. The goal is compare the two to get something like : > Amazon's SIPs. : Other than indexing the phrases directly, you could use a SpanNearQuery : over the words, use getSpans() on its SpanS

Re: Phrase Frequency For Analysis

2006-06-22 Thread Paul Elschot
high ratio indicates that the term appears in this doc much more > than the other docs on average. > > Does anyone have an idea of how to do this with phrases of say 1 to 3 words? > > Just to be clear, in this case I am only using Lucene for its built-in > frequency

Phrase Frequency For Analysis

2006-06-21 Thread Nader Akhnoukh
ne have an idea of how to do this with phrases of say 1 to 3 words? Just to be clear, in this case I am only using Lucene for its built-in frequency analysis, I'm not actually using it to search for anything that is indexed. Thanks, NSA

RE: How to do analysis when creating a query programmatically?

2006-05-18 Thread Satuluri, Venu_Madhav
Thanks very much Erik. The QueryParser method was pretty useful in writing my own one. -Venu -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, May 18, 2006 7:09 PM To: java-user@lucene.apache.org Subject: Re: How to do analysis when creating a query
