Re: When does QueryParser create PhraseQueries

2008-02-29 Thread duiduder


Thanks a lot for your help, Daniel; I have found a solution :)

The 'token' field is public inside QueryParser, and in
'token.image' you can read the original string, including the quotes.

Thus, I can distinguish between the two situations - and simply
return a BooleanQuery in the case where there are no quotes.

best regards


Christian


Daniel Noll schrieb:
| On Wednesday 27 February 2008 00:50:04 [EMAIL PROTECTED] wrote:
|> It looks like this is really hard-coded behaviour, and not Analyzer-specific.
|
| The whitespace part is coded into QueryParser.jj, yes.  So are the quotes
| and : and other query-specific things.
|
|> I want to search for directories without tokenizing them, e.g.
|> /home/reuschling - this seems not to be possible with the current
|> queryparser.
|
| That's possible by changing the analyser.  For instance StandardAnalyzer will
| tokenise that as two terms, but WhitespaceAnalyzer will tokenise it as one.
|
|> | If you subclass QueryParser then you can override this method and modify
|> | it to do whatever evil trick you want to do.
|>
|> Overriding getFieldQuery() will not work because I can't distinguish between
|> "home/reuschling", which should trigger a PhraseQuery, and home/reuschling
|> without quotes, which should trigger a BooleanQuery... I will see
|> whether I can find a better place for this :)
|
| That much is true.  Likewise, there is no difference between quoting "cat" and
| typing cat without quotes.
|
| You could possibly override the parse(String) method and mangle the string in
| some way so that you know.  So if the user enters /a/b it could pass
| down /a/b, but if they enter "/a/b" it could pass down "SOMETHING/a/b", and
| you then detect the SOMETHING in getFieldQuery.  Just have to make sure the
| something isn't tokenised out by the analyser.
|
| Or you could clone QueryParser.jj itself and modify it to call different
| methods for the two situations.
|
| Daniel
|




Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Mathieu Lecarme

Petite Abeille wrote:

A proposal for a Lua entry for the "Google Summer of Code" '08:
A Lua implementation of Lucene.
For me, Lua is just glue between C-coded objects, a super config file, 
as used in lighttpd or WoW.

Lulu will work on top of Lucy?
Did I miss something?

M.




RE: how to all documents from

2008-02-29 Thread Shailendra Sharma
Create a match-all-docs query like the following:
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
And then search as you would with any other query -
searcher.search(matchAllDocsQuery) - it returns the hits.
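
(For completeness, a minimal self-contained version of this against the
Lucene 2.3-era Hits API; the index path is illustrative:)

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class DumpAllDocs {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(new MatchAllDocsQuery());
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);  // load the i-th matching document
            System.out.println(doc);
        }
        searcher.close();
    }
}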

Thanks,
Shailendra

-Original Message-
From: sandyg [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 29, 2008 4:24 PM
To: java-user@lucene.apache.org
Subject: how to all documents from


How do I retrieve all the documents from the index directory?
Please, someone help me.

-- 
View this message in context:
http://www.nabble.com/how-to-all-documents-from-tp15756174p15756174.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





how to all documents from

2008-02-29 Thread sandyg

How do I retrieve all the documents from the index directory?
Please, someone help me.

-- 
View this message in context: 
http://www.nabble.com/how-to-all-documents-from-tp15756174p15756174.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: How do I get a text summary

2008-02-29 Thread Karl Wettin

h t wrote:

Where can I find an introduction to the algorithm below? Thanks.


I can't recall where I picked it up, but something like this:

Score terms by count and distribution. A term occurring 20 times in the 
same paragraph is not as important as a term occurring 20 times over 10 
paragraphs. Similar terms affect each other's score (I used ngrams and I 
also detected abbreviations). Be sure to remove stop words using some 
term reduction algorithm. Language detection makes sense. Rank sentences 
by length-normalized term score. The top n sentences are your summary. If 
enough of these sentences are parts of the same paragraph and the 
paragraph is small enough, that is your summary instead.
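
(A rough sketch of that ranking idea; the exact weighting and all names
here are illustrative guesses, not Karl's actual code:)

import java.util.*;

public class SummarySketch {

    // Score terms by count and paragraph spread: a term occurring 20 times
    // across 10 paragraphs outscores one occurring 20 times in one paragraph.
    static Map<String, Double> scoreTerms(Map<String, Integer> termCounts,
                                          Map<String, Set<Integer>> termParagraphs) {
        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            int spread = termParagraphs.get(e.getKey()).size();
            scores.put(e.getKey(), e.getValue() * Math.log(1 + spread));
        }
        return scores;
    }

    // Rank sentences by length-normalized term score; the top n are the summary.
    static List<String> topSentences(List<String> sentences,
                                     final Map<String, Double> termScores, int n) {
        List<String> ranked = new ArrayList<String>(sentences);
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(score(b, termScores), score(a, termScores));
            }
        });
        return ranked.subList(0, Math.min(n, ranked.size()));
    }

    static double score(String sentence, Map<String, Double> termScores) {
        String[] words = sentence.toLowerCase().split("\\W+");
        double sum = 0.0;
        for (String w : words) {
            Double s = termScores.get(w);
            if (s != null) sum += s;
        }
        return words.length == 0 ? 0.0 : sum / words.length; // length-normalized
    }
}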


I hope this helps.


karl



"Very simple algorithmic solutions usually involve ranking top senstances
by looking at distribution of terms in sentances, paragraphs and the
whole document. I implemented something like this a couple of years back
that worked fairly well."



2008/2/29, Karl Wettin <[EMAIL PROTECTED]>:

[EMAIL PROTECTED] wrote:


If you want something from an index it has to be IN the
index. So, store a
summary field in each document and make sure that field is part of the
query.

And how could one create such a summary automatically?
Taking the first 2 lines of a document does not always make much sense.
How does Google do this?


Google doesn't summarize; they highlight parts that match the query. See
previous responses.

If you really want to summarize, there are a number of more and less
scientific ways to figure out what's important and what's not.

Very simple algorithmic solutions usually involve ranking top sentences
by looking at the distribution of terms in sentences, paragraphs and the
whole document. I implemented something like this a couple of years back
that worked fairly well.

Citeseer is a great source for papers on pretty much any IR-related
subject:



karl












Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Grant Ingersoll


On Feb 29, 2008, at 5:39 AM, Mathieu Lecarme wrote:


Petite Abeille wrote:

A proposal for a Lua entry for the "Google Summer of Code" '08:
A Lua implementation of Lucene.
For me, Lua is just glue between C-coded objects, a super config  
file, as used in lighttpd or WoW.

Lulu will work on top of Lucy?


That implies that Lucy actually is under development...  Perhaps they  
will take up work on Lucy...




RE: how to all documents from

2008-02-29 Thread sandyg

Thanks for the immediate response, Shailendra Sharma; you saved me a lot of
time.
Once again, thanks for your reply.

Shailendra Sharma wrote:
> 
> Create a match-all-docs query like the following:
>   MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
> And then search as you would with any other query -
>   searcher.search(matchAllDocsQuery) - it returns the hits.
> 
> Thanks,
> Shailendra
> 
> -Original Message-
> From: sandyg [mailto:[EMAIL PROTECTED] 
> Sent: Friday, February 29, 2008 4:24 PM
> To: java-user@lucene.apache.org
> Subject: how to all documents from
> 
> 
> How do I retrieve all the documents from the index directory?
> Please, someone help me.
> 
> -- 
> View this message in context:
> http://www.nabble.com/how-to-all-documents-from-tp15756174p15756174.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-all-documents-from-tp15756174p15757977.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Mathieu Lecarme

Grant Ingersoll wrote:


On Feb 29, 2008, at 5:39 AM, Mathieu Lecarme wrote:


Petite Abeille wrote:

A proposal for a Lua entry for the "Google Summer of Code" '08:
A Lua implementation of Lucene.
For me, Lua is just glue between C-coded objects, a super config 
file, as used in lighttpd or WoW.

Lulu will work on top of Lucy?


That implies that Lucy actually is under development...  Perhaps they 
will take up work on Lucy...

On the other hand, a Lucy with C for persistence and parsing, and Lua for 
filters and other fine configuration could be great.


M.




MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread JensBurkhardt

Hey everybody,

I read that it's possible to generate a query like:

(title:term1 OR author:term1) AND (title:term2 OR author:term2)

and so on. I also read that BooleanClause.Occur should help me handle this
problem, but I have to admit that I totally don't understand how to use it.
If someone can explain it or has a link to an explanation, that would be
terrific.

Thanks and best regards

Jens Burkhardt
-- 
View this message in context: 
http://www.nabble.com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread Donna L Gresh
I believe something like the following will do what you want:

QueryParser parserTitle = new QueryParser("title", analyzer);
QueryParser parserAuthor = new QueryParser("author", analyzer);

BooleanQuery overallquery = new BooleanQuery();

BooleanQuery firstQuery = new BooleanQuery();
Query q1= parserTitle.parse(term1);
Query q2= parserAuthor.parse(term1);
firstQuery.add(q1, BooleanClause.Occur.SHOULD); //should is like an OR
firstQuery.add(q2, BooleanClause.Occur.SHOULD); 

BooleanQuery secondQuery = new BooleanQuery();
Query q3= parserTitle.parse(term2);
Query q4= parserAuthor.parse(term2);
secondQuery.add(q3, BooleanClause.Occur.SHOULD);
secondQuery.add(q4, BooleanClause.Occur.SHOULD);

overallquery.add(firstQuery, BooleanClause.Occur.MUST); //MUST is like an AND
overallquery.add(secondQuery, BooleanClause.Occur.MUST);

Donna  Gresh



JensBurkhardt <[EMAIL PROTECTED]> wrote on 02/29/2008 10:46:51 AM:

> 
> Hey everybody,
> 
> I read that it's possible to generate a query like:
> 
> (title:term1 OR author:term1) AND (title:term2 OR author:term2)
> 
> and so on. I also read that BooleanClause.Occur should help me handle 
this
> problem.
> But i have to admit that i totally don't understand how to use it.
> If someone can explain or has a link to an explanation, this would be
> terrific
> 
> Thanks and best regards
> 
> Jens Burkhardt
> -- 
> View this message in context: http://www.nabble.
> com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 


Re: Indexing source code files

2008-02-29 Thread Bill Au
There is an opensource project, OpenGrok, that uses Lucene for indexing and
searching source code:

http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files.  It does not link
source code to requirements, but you can take a look at its source code
to see how it does the indexing.

Bill

On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler <[EMAIL PROTECTED]>
wrote:

> >I am working on some sort of search mechanism to link a requirement (i.e.,
> >a query) to source code files (i.e., documents). For that purpose, I
> >indexed the source code files using Lucene. Contrary to the traditional
> >natural language search scenario, we search for code files that are
> >relevant to a given requirement. One problem here is that the source
> >files usually contain a lot of abbreviations, words joined by _, or
> >combinations of words and/or abbreviations (e.g., getAccountBalanceTbl).
> >I am wondering whether anyone of you has already indexed (source) files
> >or documents which contain that kind of words.
>
> Yes, that's been something we've spent a fair amount of time on...see
> http://www.krugle.org (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
> In the code, you might further want to treat literals (strings, etc)
> differently than other terms.
>
> And in "real" code terms, then you want to do essentially synonym
> processing, where youhttp://opensolaris.org/os/project/opengrok/ turn a
> single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
>
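
(A hedged sketch of the kind of term splitting Ken describes above -
underscores, camelCase, and letter/digit transitions; this is illustrative,
not Krugle's actual tokenizer:)

import java.util.ArrayList;
import java.util.List;

public class IdentifierSplitter {
    // split("getAccountBalanceTbl") -> [get, account, balance, tbl]
    public static List<String> split(String identifier) {
        List<String> parts = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (char c : identifier.toCharArray()) {
            boolean boundary =
                c == '_' || c == '-'                                         // explicit separators
                || (Character.isUpperCase(c) && Character.isLowerCase(prev)) // camelCase boundary
                || (prev != 0 && Character.isLetterOrDigit(c) && Character.isLetterOrDigit(prev)
                    && Character.isDigit(c) != Character.isDigit(prev));     // letter/digit transition
            if (boundary && current.length() > 0) {
                parts.add(current.toString().toLowerCase());
                current.setLength(0);
            }
            if (c != '_' && c != '-') current.append(c);
            prev = c;
        }
        if (current.length() > 0) parts.add(current.toString().toLowerCase());
        return parts;
    }
}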


Re: MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread Paul Elschot
On Friday 29 February 2008 18:04:47, Donna L Gresh wrote:
> I believe something like the following will do what you want:
>
> QueryParser parserTitle = new QueryParser("title", analyzer);
> QueryParser parserAuthor = new QueryParser("author", analyzer);
>
> BooleanQuery overallquery = new BooleanQuery();
>
> BooleanQuery firstQuery = new BooleanQuery();
> Query q1= parserTitle.parse(term1);
> Query q2= parserAuthor.parse(term1);
> firstQuery.add(q1, BooleanClause.Occur.SHOULD); //SHOULD is like an OR
> firstQuery.add(q2, BooleanClause.Occur.SHOULD);
>
> BooleanQuery secondQuery = new BooleanQuery();
> Query q3= parserTitle.parse(term2);
> Query q4= parserAuthor.parse(term2);
> secondQuery.add(q3, BooleanClause.Occur.SHOULD);
> secondQuery.add(q4, BooleanClause.Occur.SHOULD);
>
> overallquery.add(firstQuery, BooleanClause.Occur.MUST); //MUST is
> like an AND
> overallquery.add(secondQuery, BooleanClause.Occur.MUST);

There is no need for a QueryParser in this case when using a
TermQuery instead of a Query for q1, q2, q3 and q4:

TermQuery q1 = new TermQuery(new Term("title", "term1"));
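
Spelled out for the query above, that would look something like this (a
sketch assuming the literal terms need no analysis):

BooleanQuery first = new BooleanQuery();
first.add(new TermQuery(new Term("title", "term1")), BooleanClause.Occur.SHOULD);
first.add(new TermQuery(new Term("author", "term1")), BooleanClause.Occur.SHOULD);

BooleanQuery second = new BooleanQuery();
second.add(new TermQuery(new Term("title", "term2")), BooleanClause.Occur.SHOULD);
second.add(new TermQuery(new Term("author", "term2")), BooleanClause.Occur.SHOULD);

BooleanQuery overall = new BooleanQuery();
overall.add(first, BooleanClause.Occur.MUST);
overall.add(second, BooleanClause.Occur.MUST);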

Regards,
Paul Elschot


>
> Donna  Gresh
>
> JensBurkhardt <[EMAIL PROTECTED]> wrote on 02/29/2008 10:46:51 AM:
> > Hey everybody,
> >
> > I read that it's possible to generate a query like:
> >
> > (title:term1 OR author:term1) AND (title:term2 OR author:term2)
> >
> > and so on. I also read that BooleanClause.Occur should help me
> > handle
>
> this
>
> > problem.
> > But i have to admit that i totally don't understand how to use it.
> > If someone can explain or has a link to an explanation, this would
> > be terrific
> >
> > Thanks and best regards
> >
> > Jens Burkhardt
> > --
> > View this message in context: http://www.nabble.
> > com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243
> >.html Sent from the Lucene - Java Users mailing list archive at
> > Nabble.com.
> >
> >






Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

2008-02-29 Thread Markus Fischer

Hi

Mathieu Lecarme wrote:
On a related topic, I'm also searching for a way to suggest alternate 
spellings of words to the user when we find a word which is used very 
infrequently in the index or is not in the index at all. I'm based in 
Austria; when I e.g. search for "retthich" (a misspelling of "rettich", 
which is radish), Google suggests the properly spelled word. I'm 
searching for a way to figure out how to accomplish this, but again this 
may be lucene off-topic and/or I should probably start a separate 
thread ...

you can use the ngram pattern and Levenshtein distance to find near words.
I tried with phonetics and an aspell dictionary.


Thanks for the dictionary hint. So obvious, but still I haven't thought about 
it until you mentioned it!


I've had really great success with the following pattern:

* after the Lucene index is created, I generate a myspell-compatible 
dictionary from it

* when the search returns no result, every term is run through myspell 
suggestion

* the top five myspell suggestions for each term are re-sorted by their 
frequency in the Lucene index

Additionally: when the user only entered a single term, I also provide the 
second-best result as an alternative (did you mean x or y?).
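
(A minimal sketch of that re-sort step; myspell.suggest() stands in for
whatever MySpell binding is used, and the field name is illustrative:)

// re-sort suggestions by document frequency in the index, most frequent first
String[] suggestions = myspell.suggest(term);  // hypothetical MySpell wrapper
final IndexReader reader = IndexReader.open("/path/to/index");
Arrays.sort(suggestions, new Comparator<String>() {
    public int compare(String a, String b) {
        try {
            return reader.docFreq(new Term("contents", b))
                 - reader.docFreq(new Term("contents", a));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
});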


The results have been very good so far, thanks again!

- Markus




Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

2008-02-29 Thread Mathieu Lecarme



Hi

Mathieu Lecarme wrote:
On a related topic, I'm also searching for a way to suggest  
alternate spellings of words to the user when we find a word  
which is used very infrequently in the index or is not in the index  
at all. I'm based in Austria; when I e.g. search for  
"retthich" (a misspelling of "rettich", which is radish), Google  
suggests the properly spelled word. I'm searching for a way to  
figure out how to accomplish this, but again this may be lucene off- 
topic and/or I should probably start a separate thread ...
you can use the ngram pattern and Levenshtein distance to find near  
words.

I tried with phonetics and an aspell dictionary.


Thanks for the dictionary hint. So obvious, but still I haven't  
thought about it until you mentioned it!


I've had really great success with the following pattern:

* after the Lucene index is created, I generate a myspell-compatible  
dictionary from it

* when the search returns no result, every term is run through  
myspell suggestion

* the top five myspell suggestions for each term are re-sorted by  
their frequency in the Lucene index

Additionally: when the user only entered a single term, I also  
provide the second-best result as an alternative (did you mean x or  
y?).


The results have been very good so far, thanks again!

- Markus


I submitted a patch for doing that nicely:
https://issues.apache.org/jira/browse/LUCENE-1190

Here is some example code:

// build a lexicon from an existing index, using an ngram analyzer
LexiconReader lexiconReader = new DirectoryReader(directory);
Lexicon lexicon = new Lexicon(new RAMDirectory());
lexicon.addAnalyser(new NGramAnalyzer());
lexicon.read(lexiconReader);

// parse a query containing a misspelled term ("brawn" instead of "brown")
QueryParser parser = new QueryParser("txt", new WhitespaceAnalyzer());
Query query = parser.parse("bio:brawn");

// search with suggestions enabled
SuggestiveSearcher searcher = new SuggestiveSearcher(
    new IndexSearcher(directory), lexicon);
SuggestiveHits hits = searcher.searchWithSuggestions(query);
System.out.println(hits.getSuggestedQuery());

In this example, a lexicon is built from a directory, with an ngram  
analyzer.

The directory contains the term "brown" in the field "bio".
Hits is final, so you can't inherit from it; that's why there is a  
SuggestiveHits.
SuggestiveHits has a threshold: if there are too few answers, a  
similar-word search is triggered.


Using a myspell dictionary is a good idea, I'll implement this.

M.




Re: MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread Donna L Gresh
Paul-
Thanks (that was one of my ulterior motives for answering the question; I 
figured if there was something inefficient or unnecessary about my 
approach, I'd hear about it :) )

Donna Gresh


Paul Elschot <[EMAIL PROTECTED]> wrote on 02/29/2008 01:42:57 PM:

> On Friday 29 February 2008 18:04:47, Donna L Gresh wrote:
> > I believe something like the following will do what you want:
> >
> > QueryParser parserTitle = new QueryParser("title", analyzer);
> > QueryParser parserAuthor = new QueryParser("author", analyzer);
> >
> > BooleanQuery overallquery = new BooleanQuery();
> >
> > BooleanQuery firstQuery = new BooleanQuery();
> > Query q1= parserTitle.parse(term1);
> > Query q2= parserAuthor.parse(term1);
> > firstQuery.add(q1, BooleanClause.Occur.SHOULD); //SHOULD is like an OR
> > firstQuery.add(q2, BooleanClause.Occur.SHOULD);
> >
> > BooleanQuery secondQuery = new BooleanQuery();
> > Query q3= parserTitle.parse(term2);
> > Query q4= parserAuthor.parse(term2);
> > secondQuery.add(q3, BooleanClause.Occur.SHOULD);
> > secondQuery.add(q4, BooleanClause.Occur.SHOULD);
> >
> > overallquery.add(firstQuery, BooleanClause.Occur.MUST); //MUST is
> > like an AND
> > overallquery.add(secondQuery, BooleanClause.Occur.MUST);
> 
> There is no need for a QueryParser in this case when using a
> TermQuery instead of a Query for q1, q2, q3 and q4:
> 
> TermQuery q1 = new TermQuery(new Term("title", "term1"));
> 
> Regards,
> Paul Elschot
> 
> 
> >
> > Donna  Gresh
> >
> > JensBurkhardt <[EMAIL PROTECTED]> wrote on 02/29/2008 10:46:51 AM:
> > > Hey everybody,
> > >
> > > I read that it's possible to generate a query like:
> > >
> > > (title:term1 OR author:term1) AND (title:term2 OR author:term2)
> > >
> > > and so on. I also read that BooleanClause.Occur should help me
> > > handle
> >
> > this
> >
> > > problem.
> > > But i have to admit that i totally don't understand how to use it.
> > > If someone can explain or has a link to an explanation, this would
> > > be terrific
> > >
> > > Thanks and best regards
> > >
> > > Jens Burkhardt
> > > --
> > > View this message in context: http://www.nabble.
> > > com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243
> > >.html Sent from the Lucene - Java Users mailing list archive at
> > > Nabble.com.
> > >
> > >
> 
> 
> 
> 


Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Petite Abeille


On Feb 29, 2008, at 11:39 AM, Mathieu Lecarme wrote:

For me, Lua is just a glue between C coded object, a super config  
file. Like used in lighttpd or WoW.


This is a rather narrow view of what one could do with Lua.

For example, this wiki engine is written exclusively in Lua:

http://alt.textdrive.com/nanoki/
http://dev.alt.textdrive.com/browser/HTTP



Lulu will work on top of Lucy?


No.


Did I miss something?


Yes.

Lulu is meant to be a pure Lua implementation of Lucene.

Cheers,

PA.







Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Petite Abeille


On Feb 29, 2008, at 1:09 PM, Grant Ingersoll wrote:

That implies that Lucy actually is under development...  Perhaps they  
will take up work on Lucy...


Lulu has nothing to do with Lucy... the name goes back to something from  
high school or something...






Most efficient way to find related terms

2008-02-29 Thread Martin Bayly
I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.

By related I mean terms which appear in a specific document field that
also contains the target term X.

e.g. Document has a keyword field, field1, that can contain multiple
keywords.

document1 - field1 HAS key1, key2, key3
document2 - field1 HAS key2, key4
document3 - field1 HAS key5

If I want to find terms related to key2, I need to return key1, key3,
key4.

Obviously I can do a search for key2, iterate all the docs and collect
their field1 terms manually.

But presumably a more efficient way is to use TermDocs:

1. TermDocs termDocs = reader.termDocs(new Term("field1", "key2"));

2. Iterate the term docs to get the documents containing that term.

3. Now this is the bit I'm not sure of:

a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want field1.

b. Presumably better to call Document doc = IndexReader.document(n,
fieldSelector).

c. Or would I be better off turning on term frequency vectors for this
field so I can call IndexReader.getTermFreqVector(n, "field1") - I don't
particularly care about the frequencies as they will always be 1 for a
particular doc.
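
(A sketch of option (c), assuming term vectors were enabled on field1 at
indexing time:)

IndexReader reader = IndexReader.open("/path/to/index");
TermDocs termDocs = reader.termDocs(new Term("field1", "key2"));
Set<String> related = new HashSet<String>();
while (termDocs.next()) {
    // the term vector gives this doc's field1 terms without loading the document
    TermFreqVector tfv = reader.getTermFreqVector(termDocs.doc(), "field1");
    if (tfv != null) {
        String[] terms = tfv.getTerms();
        for (int i = 0; i < terms.length; i++) {
            if (!terms[i].equals("key2")) related.add(terms[i]);
        }
    }
}
termDocs.close();
reader.close();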

 

Other approaches?

I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.

Thanks

Martin



Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Petite Abeille


On Feb 29, 2008, at 3:42 PM, Mathieu Lecarme wrote:

On the other hand, a Lucy with C for persistence and parsing, and Lua  
for filters and other fine configuration could be great.


Who is that Lucy you keep talking about? :P

Cheers,

PA.





Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Petite Abeille


On Feb 29, 2008, at 8:37 PM, Simon Willnauer wrote:


or go to http://lucene.apache.org/lucy/


Looks rather, hmmm, inactive:

http://svn.apache.org/viewvc/lucene/lucy/

Is there any working code anywhere?





Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Simon Willnauer
go to www.google.com
type Lucy lucene

feel lucky

or go to http://lucene.apache.org/lucy/

simon



On Fri, Feb 29, 2008 at 8:24 PM, Petite Abeille <[EMAIL PROTECTED]> wrote:
>
>  On Feb 29, 2008, at 3:42 PM, Mathieu Lecarme wrote:
>
>  > On the other hand, a Lucy with C for persistence and parsing, and Lua
>  > for filters and other fine configuration could be great.
>
>  Who is that Lucy you keep talking about? :P
>
>  Cheers,
>
>  PA.
>
>
>
>



Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Petite Abeille


On Feb 29, 2008, at 11:39 AM, Mathieu Lecarme wrote:

For me, Lua is just glue between C-coded objects, a super config  
file, as used in lighttpd or WoW.


Here is an online demo of a wiki engine implemented purely in Lua:

http://svr225.stepx.com:3388/a
http://svr225.stepx.com:3388/nanoki


Cheers,

PA.





Lucene for Sentiment Analysis

2008-02-29 Thread Aaron Schon
Hello, I was interested to learn about using Lucene for text analytics work 
such as Sentiment Analysis. Has anyone done work along these lines? If so, 
could you share your approach, experiences, accuracy levels obtained, etc.?

Thanks,
AS


  





Corrupted Indexes Under Lucene 2.3 (and 2.3.1)

2008-02-29 Thread Tyler V
After upgrading to Lucene 2.3 (and subsequently 2.3.1), our
application has experienced sporadic index corruptions on our larger
(and more frequently updated) indexes. These indexes experienced no
corruptions under any prior version of Lucene (which we have been
using for several years).

The pattern of failure seems to be consistent.  First, we receive an
exception like the following:

java.lang.IndexOutOfBoundsException: Index: 4788, Size: 4762
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at 
org.apache.lucene.index.DocumentsWriter$ThreadState.init(DocumentsWriter.java:749)
at 
org.apache.lucene.index.DocumentsWriter.getThreadState(DocumentsWriter.java:2391)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2434)
at 
org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1424)
at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java:134)
at java.lang.Thread.run(Thread.java:619)

When we experience this error, we run a writer.flush() then a writer.close().

Then, we get this exception when trying to re-open the index:

org.apache.lucene.index.CorruptIndexException: doc counts differ for
segment _c2z13: fieldsReader shows 2 but segmentInfo shows 3
at 
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:197)
at 
org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:55)
at 
org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:75)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
at 
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java:107)
at java.lang.Thread.run(Thread.java:619)

Running the check index application included with 2.3 enables us to
remove the bad documents from the index, but this workaround is less
than desirable.  It would be greatly appreciated if anyone could shed
some light on our issue.

Regards,

Tyler




Re: Corrupted Indexes Under Lucene 2.3 (and 2.3.1)

2008-02-29 Thread Michael McCandless


Not good!  (I'm sorry).

That first exception is worrisome.  It's the root cause here.

Can you describe your documents?  That exception, if I'm reading it  
right, seems to imply that you have documents with 4762 fields.  Is  
that right?


Are you using multiple threads?  Is it possible that you call  
addDocument with a given document, but another thread is removing fields  
from that document (or, removing elements from the List returned by  
Document.getFields())?  That's the only way I can explain that first  
exception.


Mike

Tyler V wrote:


After upgrading to Lucene 2.3 (and subsequently 2.3.1), our
application has experienced sporadic index corruptions on our larger
(and more frequently updated) indexes. These indexes experienced no
corruptions under any prior version of Lucene (which we have been
using for several years).

The pattern of failure seems to be consistent.  First, we receive an
exception like the following:

java.lang.IndexOutOfBoundsException: Index: 4788, Size: 4762
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.lucene.index.DocumentsWriter$ThreadState.init 
(DocumentsWriter.java:749)
at org.apache.lucene.index.DocumentsWriter.getThreadState 
(DocumentsWriter.java:2391)
at org.apache.lucene.index.DocumentsWriter.updateDocument 
(DocumentsWriter.java:2434)
at org.apache.lucene.index.DocumentsWriter.addDocument 
(DocumentsWriter.java:2422)
at org.apache.lucene.index.IndexWriter.addDocument 
(IndexWriter.java:1445)
at org.apache.lucene.index.IndexWriter.addDocument 
(IndexWriter.java:1424)
at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java: 
134)

at java.lang.Thread.run(Thread.java:619)

When we experience this error, we run a writer.flush() then a  
writer.close().


Then, we get this exception when trying to re-open the index:

org.apache.lucene.index.CorruptIndexException: doc counts differ for
segment _c2z13: fieldsReader shows 2 but segmentInfo shows 3
at org.apache.lucene.index.SegmentReader.initialize 
(SegmentReader.java:313)
at org.apache.lucene.index.SegmentReader.get 
(SegmentReader.java:262)
at org.apache.lucene.index.SegmentReader.get 
(SegmentReader.java:197)
at org.apache.lucene.index.MultiSegmentReader.<init> 
(MultiSegmentReader.java:55)
at org.apache.lucene.index.DirectoryIndexReader$1.doBody 
(DirectoryIndexReader.java:75)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run 
(SegmentInfos.java:636)
at org.apache.lucene.index.DirectoryIndexReader.open 
(DirectoryIndexReader.java:63)
at org.apache.lucene.index.IndexReader.open 
(IndexReader.java:209)
at org.apache.lucene.index.IndexReader.open 
(IndexReader.java:192)
at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java: 
107)

at java.lang.Thread.run(Thread.java:619)

Running the check index application included with 2.3 enables us to
remove the bad documents from the index, but this workaround is less
than desirable.  It would be greatly appreciated if anyone could shed
some light on our issue.

Regards,

Tyler








Re: How do I get a text summary

2008-02-29 Thread Bob Carpenter

Mathieu Lecarme wrote:

[EMAIL PROTECTED] wrote:



And how could one create automatically such a summary?


Here's a site with some pointers to the literature
and some systems out there to do summarization:

http://www.summarization.com/

This is actually whole-document or even
multiple-document summarization.

Snippet production's a rather different problem,
which needs to be sensitive to the query.
What to show isn't so easy when there are many
instances of the query terms in the document
and very limited space.

Have a look at http://alias-i.com/lingpipe/index.html
or http://www.nzdl.org/Kea/

We haven't written any kind of text summarization
package for LingPipe.  Kea works at a keyphrase
level, not a doc summary level, though they reference
a paper on using it for summarization:

http://www.hicss.hawaii.edu/HICSS_35/HICSSpapers/PDFdocuments/DDUAC04.pdf

Both LingPipe and Kea are able to find significant
phrases, which is useful for query refinement or
summarizing sets of search results, but not so
useful for individual documents.  It can be a huge
help to add part-of-speech information to these
kinds of approaches.

- Bob Carpenter
  Alias-i




Re: Corrupted Indexes Under Lucene 2.3 (and 2.3.1)

2008-02-29 Thread Tyler V
Mike -- Thanks so much for the prompt reply.

You are right, we are accessing these documents with multiple threads
(and have always been). However, I am wondering if the increased
indexing speed in 2.3 has revealed a hidden concurrency issue.

I am going to add some additional concurrency checks; if I have
questions I will post to the list again.

Thanks again for the reply.

Tyler

On Fri, Feb 29, 2008 at 2:46 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
>  Not good!  (I'm sorry).
>
>  That first exception is worrisome.  It's the root cause here.
>
>  Can you describe your documents?  That exception, if I'm reading it
>  right, seems to imply that you have documents with 4762 fields.  Is
>  that right?
>
>  Are you using multiple threads?  Is it possible that you call
>  addDocument with a given document, but another thread is removing fields
>  from that document (or, removing elements from the List returned by
>  Document.getFields())?  That's the only way I can explain that first
>  exception.
>
>  Mike
>
>
>
>  Tyler V wrote:
>
>  > After upgrading to Lucene 2.3 (and subsequently 2.3.1), our
>  > application has experienced sporadic index corruptions on our larger
>  > (and more frequently updated) indexes. These indexes experienced no
>  > corruptions under any prior version of Lucene (which we have been
>  > using for several years).
>  >
>  > The pattern of failure seems to be consistent.  First, we receive an
>  > exception like the following:
>  >
>  > java.lang.IndexOutOfBoundsException: Index: 4788, Size: 4762
>  > at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>  > at java.util.ArrayList.get(ArrayList.java:322)
>  > at org.apache.lucene.index.DocumentsWriter$ThreadState.init
>  > (DocumentsWriter.java:749)
>  > at org.apache.lucene.index.DocumentsWriter.getThreadState
>  > (DocumentsWriter.java:2391)
>  > at org.apache.lucene.index.DocumentsWriter.updateDocument
>  > (DocumentsWriter.java:2434)
>  > at org.apache.lucene.index.DocumentsWriter.addDocument
>  > (DocumentsWriter.java:2422)
>  > at org.apache.lucene.index.IndexWriter.addDocument
>  > (IndexWriter.java:1445)
>  > at org.apache.lucene.index.IndexWriter.addDocument
>  > (IndexWriter.java:1424)
>  > at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java:
>  > 134)
>  > at java.lang.Thread.run(Thread.java:619)
>  >
>  > When we experience this error, we run a writer.flush() then a
>  > writer.close().
>  >
>  > Then, we get this exception when trying to re-open the index:
>  >
>  > org.apache.lucene.index.CorruptIndexException: doc counts differ for
>  > segment _c2z13: fieldsReader shows 2 but segmentInfo shows 3
>  > at org.apache.lucene.index.SegmentReader.initialize
>  > (SegmentReader.java:313)
>  > at org.apache.lucene.index.SegmentReader.get
>  > (SegmentReader.java:262)
>  > at org.apache.lucene.index.SegmentReader.get
>  > (SegmentReader.java:197)
>  > at org.apache.lucene.index.MultiSegmentReader.<init>
>  > (MultiSegmentReader.java:55)
>  > at org.apache.lucene.index.DirectoryIndexReader$1.doBody
>  > (DirectoryIndexReader.java:75)
>  > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run
>  > (SegmentInfos.java:636)
>  > at org.apache.lucene.index.DirectoryIndexReader.open
>  > (DirectoryIndexReader.java:63)
>  > at org.apache.lucene.index.IndexReader.open
>  > (IndexReader.java:209)
>  > at org.apache.lucene.index.IndexReader.open
>  > (IndexReader.java:192)
>  > at com.myapp.indexing.IndexerRunner.run(IndexerRunner.java:
>  > 107)
>  > at java.lang.Thread.run(Thread.java:619)
>  >
>  > Running the check index application included with 2.3 enables us to
>  > remove the bad documents from the index, but this workaround is less
>  > than desirable.  It would be greatly appreciated if anyone could shed
>  > some light on our issue.
>  >
>  > Regards,
>  >
>  > Tyler
>  >



Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Marvin Humphrey


On Feb 29, 2008, at 4:09 AM, Grant Ingersoll wrote:

That implies the Lucy actually is under development...  Perhaps they  
will take up work on Lucy...



:)

I haven't been feeding back my KinoSearch commits into the Lucy  
repository, true.  Not much has changed status-wise since this:  (Link to mail-archives.apache.org).  I miss Dave :( but work  
continues apace.  I just figured I wouldn't bring anything up here  
until I had some experimental Java bindings for you folks to play  
with.  Some of the stuff I'm working on with KS isn't compatible with  
Lucene, and it doesn't make sense to impede progress by imposing that  
constraint.  Better to finish it up, present a polished version and a  
coherent explanation of the design, then have Mike McCandless riff off  
of it -- that worked pretty well for indexing speed improvements in  
Lucene 2.3.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





RE: Lucene Search Performance

2008-02-29 Thread Andreas Guther
Just a comment, and I understand that you cannot change your index:

What we did is to organize our indexes based on the creation date of entries.
We limit our search to a given number of years starting from the current
year.  Organizing the index that way allows us to retire outdated
information.  We can provide the user an option to search across older
indexes as well, if wanted.
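
(A sketch of how such per-period indexes can be searched together; Lucene
2.3-era API, paths illustrative:)

IndexSearcher current = new IndexSearcher("/indexes/2008");
IndexSearcher older = new IndexSearcher("/indexes/2007");
// search both periods at once when the user asks for a wider range
MultiSearcher searcher = new MultiSearcher(new Searchable[] { current, older });
Hits hits = searcher.search(query);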

Andreas

-Original Message-
From: Jamie [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 27, 2008 10:17 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Search Performance

Hi

Thanks for the suggestions. This would require us to change the index, 
and right now we literally have millions of documents stored in the current 
index format. I'll bear it in mind, but I am not entirely sure how I 
would go about implementing the change at this point.

Much appreciate

Jamie


h t wrote:
> 1. redefine the archivedate field in yyyyMMdd format,
> 2. add another field holding a timestamp, for sorting.
> 3. use a RangeFilter to get the results and then sort by timestamp.
>
> 2008/2/27, Jamie <[EMAIL PROTECTED]>:
>   
>> Hi Michael & Others
>>
>> Ok. I've gathered some more statistics from a different machine for your
>> analysis. (I had to switch machines because the original one was in
>> production and my tests were interfering.)
>>
>> Here are the statistics from the new machine:
>>
>> Total Documents: 1.2 million
>> Results Returned:  900k
>> Store Size 238G (size of original documents)
>> Index Size 1.2G (lucene index size)
>> Index / Store Ratio 0.5%
>>
>> The search query is as follows:
>>
>> archivedate:[d2007122901 TO d20080228235900]
>> ~~~ why is there an extra 'd'?
>> As you can see, I am using a range query to search between specific dates.
>> Question: should this query rather be moved to a filter? I did not do
>> this as I needed to have the option to sort on date.
>>
>> There are no other specific filters applied and in this example
sorting
>> is turned off.
>>
>> On this particular machine the search time varies between 2.64 seconds
>> and about 5 seconds.
>>
>> The limitations of this machine are that it uses a normal IDE drive
>> to house the index, not a SATA drive.
>>
>> IOStat Statistics
>>
>> Linux 2.6.20-15-server 27/02/2008
>>
>> avg-cpu:  %user   %nice   %system   %iowait   %steal   %idle
>>           20.25   0.00    3.23      0.34      0.00     76.19
>>
>> Device:  tps    Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>> sda      7.12   50.67        186.41       38936841   143240688
>>
>> See attached for hardware info and the CPU call tree (taken from YourKit).
>>
>> I would appreciate your recommendations.
>>
>>
>> Jamie
>>
>>
>> h t wrote:
>> Hi Michael,
>> I guess the hotspot of lucene is
>> org.apache.lucene.search.IndexSearcher.search()
>>
>> Hi Jamie,
>> What's the original text size of a million emails?
>> I estimate the size of an email is around 100k, is this true?
>> When you do a search, what kind of keywords did you input, words or short
>> sentences?
>> How many results were returned?
>> Did you use a filter to shrink the result size?
>>
>> 2008/2/27, Michael Stoppelman <[EMAIL PROTECTED]>:
>>   So you're saying searches are taking 10 seconds on a 5G index? If so,
>> that seems ungodly slow.
>> If you're on *nix, have you watched your iostat statistics? Maybe
>> something is hammering your HDs.
>> Something seems amiss.
>>
>> What lucene methods were pointed to as hotspots by YourKit?
>>
>>
>>
>>
>>
>>
>> 
>
>   


-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-366-8072 ext 3
UK Tel: +44-20-80991035 ext 3
Email: [EMAIL PROTECTED]
Web: http://www.mailarchiva.com

To receive MailArchiva Enterprise Edition product announcements, send a
message to: <[EMAIL PROTECTED]> 
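
(A sketch of the filter-plus-sort approach h t outlines above; Lucene
2.3-era API, field names illustrative:)

// filter on the yyyyMMdd field, sort on the separate timestamp field
Filter dateFilter = new RangeFilter("archivedate", "20071229", "20080228", true, true);
Sort byTimestamp = new Sort(new SortField("timestamp", SortField.LONG, true));
Hits hits = searcher.search(query, dateFilter, byTimestamp);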





Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Marvin Humphrey


On Feb 29, 2008, at 11:22 AM, Petite Abeille wrote:


Lulu is meant to be a pure Lua implementation of Lucene.


How fast is Lua's method dispatch, compared to Java's?  That has a  
huge impact on performance, since *everything* is a method in Lucene  
-- down to writeByte().


There have been several attempts at pure dynamic language ports of  
Lucene, including Plucene (Perl), Lupy (Python), and the original pure- 
Ruby implementation of Ferret.  All of those have proven unacceptably  
slow.  The ports which have achieved good performance, including  
CLucene (C++), Lucene.NET, the latter-day Ferret (C/Ruby), and  
KinoSearch (C/Perl) (and possibly others, I haven't reviewed them all)  
all get close to the metal.  None rely on hash-based method dispatch  
for inner-loop code, and all manipulate primitive types directly.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Corrupted Indexes Under Lucene 2.3 (and 2.3.1)

2008-02-29 Thread Yonik Seeley
On Fri, Feb 29, 2008 at 7:05 PM, Tyler V <[EMAIL PROTECTED]> wrote:
> Mike -- Thanks so much for the prompt reply.
>
>  You are right, we are accessing these documents with multiple threads
>  (and have always been). However, I am wondering if the increased
>  indexing speed in 2.3 has revealed a hidden concurrency issue.

You are modifying the documents from multiple threads?

My fault... I removed the synchronization on Document (changed from
Vector to ArrayList).  It was never guaranteed to be thread-safe for
modification, and it almost never makes sense without external
synchronization anyway.

If you really need to modify a single document from multiple threads,
please synchronize.
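
(A minimal sketch of that external synchronization; the shared doc and the
field are illustrative:)

// writer thread: hold the document's lock while it is being indexed
synchronized (doc) {
    writer.addDocument(doc);
}

// any other thread that mutates the same Document instance
synchronized (doc) {
    doc.add(new Field("tag", "value", Field.Store.YES, Field.Index.UN_TOKENIZED));
}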

That explains the first exception, but not the second.  I assume you
aren't still changing the document while it's being indexed?
It appears as if the original exception causes the corruption.

-Yonik




More IndexDeletionPolicy questions

2008-02-29 Thread Tim Brennan
Is there a direct way to ask an IndexReader what segment it is pointing at?  
That would make implementing custom deletion policies a LOT easier.

It seems like it should be pretty simple -- keep a list of open IndexReaders, 
track what Segment files they're pointing to, and in onCommit don't delete 
those segments.  
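
(A hedged sketch of that bookkeeping as an IndexDeletionPolicy; the
inUseSegments set is an assumed external registry of the segments_N names
that open readers still reference:)

import java.util.List;
import java.util.Set;
import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;

public class KeepOpenReadersPolicy implements IndexDeletionPolicy {
    private final Set<String> inUseSegments; // segments_N names held by open readers

    public KeepOpenReadersPolicy(Set<String> inUseSegments) {
        this.inUseSegments = inUseSegments;
    }

    public void onInit(List commits) {
        onCommit(commits);
    }

    public void onCommit(List commits) {
        // commits are ordered oldest first; always keep the most recent one,
        // and keep an older commit only while an open reader still uses it
        for (int i = 0; i < commits.size() - 1; i++) {
            IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
            if (!inUseSegments.contains(commit.getSegmentsFileName())) {
                commit.delete();
            }
        }
    }
}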

Unfortunately it ends up being very difficult to directly determine what 
segment an IndexReader is pointing to.  Is there some straightforward way that 
I'm missing?  All I've managed to do so far is to remember the most recent one 
from onCommit/onInit and use that one... that works OK, but makes bootstrapping 
a pain if you try to open a reader before you've opened the writer once.

Also, when I use IndexReader.reopen(), can I assume that the newly returned 
reader is pointing at the "most recent" segment?  I think so...


Here's a sequence of steps that happen in my app:

0) open writer, onInit tells me that seg_1 is the most recent segment

1) Open reader, assume it is pointing to seg_1 from (0)

2) New write commits into seg_2, don't delete seg_1 b/c of (1)

3) Call reader.reopen() on the reader from (1)... the new reader is pointing to 
seg_2 now?

4) seg_1 stays around until the next time I open or commit a writer, then it is 
removed.


Does that seem reasonable?


--tim










Re: Lucene for Sentiment Analysis

2008-02-29 Thread Srikant Jakilinki

Hi

I remember trying this once a long time ago (Lucene 1.9) but could not get 
anywhere since term vectors and such were not easily accessible then.


However, if text analytics is what you are after, perhaps you already 
know about LingPipe. I used it for named entity extraction (enhanced 
with some TF-IDF statistical information from the Lucene index) and it 
worked well.


Maybe you can do the same for sentiment analysis, i.e. use LingPipe's 
capabilities but enhance them with the corpus statistics that the Lucene 
index provides - and which are more powerful now.


HTH,
Srikant

Aaron Schon wrote:

Hello, I was interested to learn about using Lucene for text analytics work 
such as Sentiment Analysis. Has anyone done work along these lines? If so, 
could you share your approach, experiences, accuracy levels obtained, etc.?

Thanks,
AS



