Re: Phrase Highlighting

2009-06-04 Thread Michael McCandless
Mark, is this because the highlighter package doesn't include enough information as to why the fragmenter picked a given fragment? Because... the SpanScorer is in fact doing all the work to properly locate the full span for the phrase (I think?), so it's a shame that, because there's no way for it t…

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
KK, OK, so you only really want to stem the English. This is good. Is it possible for you to consider using Solr? Solr's default analyzer for type 'text' will be good for your case. It will do the following: 1. tokenize on whitespace, 2. handle both Indian-language and English punctuation, 3. lowercase…

Re: Phrase Highlighting

2009-06-04 Thread Mark Miller
Yeah, the highlighter framework as-is is certainly limiting. When I first did the SpanHighlighter, without trying to fit it into the old Highlighter (an early, incomplete prototype anyway), I made them merge right off the bat because it was very easy. That was because I could just use t…

RE: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Uwe Schindler
You can also re-use the Solr analyzers, as far as I found out. There is an issue in JIRA and discussion on java-dev to merge them. - Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
Yes, this is true. For starters, KK, it might be good to start up Solr and look at http://localhost:8983/solr/admin/analysis.jsp?highlight=on. If you want to stick with Lucene, the WordDelimiterFilter is the piece you will want for your text, mainly for punctuation but also for format characters such as…
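[A rough sketch of where WordDelimiterFilter would sit in a Lucene analysis chain. In this era the filter ships with Solr, so the Solr jar must be on the classpath, and the int flags below are illustrative assumptions; check the WordDelimiterFilter javadoc of your Solr version for the exact constructor signature.]

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.solr.analysis.WordDelimiterFilter;

    public class DelimiterAwareAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        // Assumed flag layout: generateWordParts=1, generateNumberParts=1,
        // catenateWords=1, catenateNumbers=1, catenateAll=0.
        return new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
      }
    }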

P2P Lucene

2009-06-04 Thread Shashi Kant
Hi all, I am writing to gauge the group's interest level in building a P2P application using Lucene. Nothing fancy, just good old-fashioned P2P search across one's social network or work network (very unlike Gnutella, Kazaa, etc.). The obvious business cases for this could be many, such as document s…

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
Thank you all. To be frank, I was using Solr in the beginning, half a month ago. The problem [rather, bug] with Solr was creation of a new index on the fly. Though they have a RESTful method for the same, it was not working. If I remember properly, one of the Solr committers, "Noble Paul" [I don't know his real…

RE: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Uwe Schindler
> I request Uwe to give me some more ideas on using the analyzers from solr that will do the job for me, handling a mix of both english and non-english content. Look here: http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html As you see, the Solr analyzers are just…

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
Uwe, thanks for your lightning-fast response :-). I'm looking into that; let me see how far I can go... Also, I request Muir to point me to the exact analyzer he mentioned in the previous mail. Thanks, KK

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
KK, for your case you don't really need to go to the effort of detecting whether fragments are English or not, because the English stemmers in Lucene will not modify your Indic text, and neither will the LowerCaseFilter. What you want to do is create a custom analyzer that works like this: -White…
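[A minimal sketch of such an analyzer against the 2.4-era Lucene API. The class name MixedContentAnalyzer is made up, and Robert's full recipe would also include the WordDelimiterFilter shown earlier.]

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class MixedContentAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new LowerCaseFilter(ts);   // leaves non-Latin characters alone
        ts = new PorterStemFilter(ts);  // suffix rules only match Latin letters,
                                        // so Indic tokens pass through unchanged
        return ts;
      }
    }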

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
Uwe, what KK needs here is 'proper Unicode handling'. Since the latest WordDelimiterFilter has pretty good handling of Unicode categories, combining this with WhitespaceTokenizer effectively gives you a pretty good solution for Unicode tokenization. KK doesn't need detection of anything; the Porter…

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
Thanks, Muir. Thanks for letting me know that I don't need language identifiers. I'll have a look and will try to write the analyzer. For my case I think it won't be that difficult. BTW, can you point me to some sample code/tutorials on writing custom analyzers? I could not find anything in LIA 2nd Edn.

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
KK, well, you can always get some good examples from the Lucene contrib codebase. For example, look at the DutchAnalyzer, especially: TokenStream tokenStream(String fieldName, Reader reader). See how it combines a specified tokenizer with various filters? This is what you want to do, except of cours…
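[To see what a chain actually produces, a small hypothetical driver like this prints each token. It uses the 2.4-era TokenStream.next()/termText() API, which was later replaced by incrementToken() and the attribute API.]

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class AnalyzerDemo {
      public static void main(String[] args) throws IOException {
        // MixedContentAnalyzer is the sketch from earlier in this thread.
        TokenStream ts = new MixedContentAnalyzer().tokenStream(
            "body", new StringReader("भारत Running dogs"));
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
          System.out.println(tok.termText()); // e.g. "भारत", "run", "dog"
        }
      }
    }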

Re: Extending StandardAnalyzer considered harmful

2009-06-04 Thread Michael McCandless
Hmm, sorry about that, and thank you for raising it. This is indeed not good and should be considered a break in back-compat, since things silently change and it's not easy for you to discover that. I think we should at least deprecate Analyzer.tokenStream, and fix QueryParser (and any others) to…
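[A sketch of the trap under discussion; the subclass shape is assumed, not quoted from the thread. Consumers such as QueryParser may call reusableTokenStream(), which bypasses a subclass's tokenStream() override unless both methods are overridden.]

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MyAnalyzer extends StandardAnalyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // Extra filtering the subclass intends to apply everywhere.
        return new PorterStemFilter(super.tokenStream(fieldName, reader));
      }
      // Without this override, anything calling reusableTokenStream() gets
      // StandardAnalyzer's cached stream and silently skips the stemmer.
      public TokenStream reusableTokenStream(String fieldName, Reader reader)
          throws IOException {
        return tokenStream(fieldName, reader);
      }
    }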

Custom sorting!

2009-06-04 Thread vanshi
I am doing custom sorting within Lucene using the overloaded searcher.search(query, sort). First precedence is to sort based on 'last name' and then on 'network status', where 'INN' is better than 'OUT'. Fields are stored in the indexes like this: FIRST_NAME(Field.Store.NO, Field.Index.NO_NORMS), LAS…
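[A hedged sketch of the two-level sort described; the field names follow the post and the 100-hit cap is arbitrary. Conveniently, an ascending string sort already orders "INN" before "OUT" lexicographically, so no custom comparator is needed for that part.]

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopFieldDocs;

    public class SortExample {
      static TopFieldDocs search(IndexSearcher searcher, Query query)
          throws IOException {
        // Primary key: last name; secondary key: network status.
        Sort sort = new Sort(new SortField[] {
            new SortField("LAST_NAME", SortField.STRING),
            new SortField("NETWORK_STATUS", SortField.STRING) });
        return searcher.search(query, null, 100, sort);
      }
    }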

Query:Adding all docs at once or creating smaller indexes and merge

2009-06-04 Thread Tarandeep Singh
Hi, >From efficiency point of view, what will be more efficient- Creating a single big index (big enough for one machine) by adding all documents in it at once or Creating smaller indexes and then merge them to make one bigger index? And if there is a performance penalty, then any rough estima

Re: P2P Lucene

2009-06-04 Thread Otis Gospodnetic
Big +1! :) It would make for a cool case study for Lucene in Action, 3rd edition ;) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

cannot retrieve the values of a field is not stored in the index

2009-06-04 Thread Alex Steward
Hi, is there a way I can retrieve the value of a field that is not stored in the index?

    private static void indexFile(IndexWriter writer, File f)
        throws IOException {
      if (f.isHidden() || !f.exists() || !f.canRead()) {
        return;
      }
      System.out.println("Indexing " + f.getC…
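[In general, a field's original value can only be read back from the index if it was stored at indexing time; an unstored field is searchable, but its text is not recoverable short of re-deriving it from term vectors or the source document. A minimal sketch, with an illustrative field name:]

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class StoredFieldExample {
      static Document makeDoc(String text) {
        Document doc = new Document();
        doc.add(new Field("contents", text,
            Field.Store.YES,         // keep the original value for retrieval
            Field.Index.ANALYZED));  // and analyze it so it is searchable
        return doc;
      }
    }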

Re: P2P Lucene

2009-06-04 Thread Ye Minjiao
I guess Sixearch might be the thing you are looking for... http://sixearch.org/ Sixearch is a collaborative peer network application which aims to address the scalability and context limitations of centralized search engines, and also provides a complementary approach to Web search. Sixearch uses t…

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
Hello Robert, I was thinking of a kind of chaining of analyzers; does this sound logical? Currently I'm using the whitespace analyzer, which tokenizes on whitespace only. As you mentioned earlier, I don't need to use language identifiers, which means I have to pass the full content through, say, first via whi…