Mark, is this because the highlighter package doesn't include enough
information as to why the fragmenter picked a given fragment?
Because... the SpanScorer is in fact doing all the work to properly
locate the full span for the phrase (I think?), so it's a shame that,
because there's no way for it to…
KK, OK, so you only really want to stem the English. This is good.
Is it possible for you to consider using Solr? Solr's default analyzer for
type 'text' will be good for your case. It will do the following (see the
schema sketch after this list):
1. tokenize on whitespace
2. handle both Indian-language and English punctuation
3. lowercase…
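For reference, the 'text' field type in the Solr example schema of that era
looked roughly like this; the exact filters and attribute values vary by Solr
version, so treat this as a sketch rather than the shipped file:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>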
Yeah, the highlighter framework as it stands is certainly limiting. When I
first did the SpanHighlighter without trying to fit it into the old
Highlighter (an early, incomplete prototype anyway), I made
them merge right off the bat because it was very easy. That was because
I could just use t…
You can also re-use the Solr analyzers, as far as I found out. There is an
issue in JIRA and a discussion on java-dev about merging them.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@g
Yes, this is true. For starters, KK, it might be good to start up Solr and
look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on
If you want to stick with Lucene, the WordDelimiterFilter is the piece you
will want for your text, mainly for punctuation but also for format
characters such as…
Hi all,
I am writing to gauge the group's interest level in building a P2P
application using Lucene. Nothing fancy, just good old-fashioned P2P
search across one's social network or work network (very unlike
Gnutella, Kazaa, etc.). The obvious business case for this could be
many, such as document s…
Thank you all.
To be frank, I was using Solr in the beginning, half a month ago. The
problem [rather, a bug] with Solr was creation of a new index on the fly.
They have a RESTful method for the same, but it was not working. If I
remember properly, one of the Solr committers, "Noble Paul" [I don't know his real…
> I request Uwe to give me some more ideas on using the analyzers from solr
> that will do the job for me, handling a mix of both english and
> non-english content.
Look here:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
As you see, the Solr analyzers are just factories that assemble standard
Lucene Tokenizers and TokenFilters…
Uwe, thanks for your lightning-fast response :-).
I'm looking into that; let me see how far I can go... Also, I request Muir
to point me to the exact analyzer he mentioned in the previous mail.
Thanks,
KK
On Thu, Jun 4, 2009 at 6:10 PM, Uwe Schindler wrote:
> > I request Uwe to give me some
KK, for your case, you don't really need to go to the effort of detecting
whether fragments are English or not,
because the English stemmers in Lucene will not modify your Indic text, and
neither will the LowerCaseFilter.
What you want to do is create a custom analyzer that works like this:
- WhitespaceTokenizer…
Uwe, what KK needs here is 'proper unicode handling'.
Since the latest WordDelimiterFilter has pretty good handling of unicode
categories, combining this with WhitespaceTokenizer effectively gives you a
pretty good solution for unicode tokenization (roughly the sketch below).
KK doesn't need detection of anything; the Porter stemmer will simply leave
his non-English tokens alone…
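A minimal sketch of that chain, assuming the Lucene 2.x analyzer API; the
class name MixedAnalyzer is made up, and the WordDelimiterFilter step is left
out only because its constructor flags differ across Solr versions:

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.PorterStemFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;

  public class MixedAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream stream = new WhitespaceTokenizer(reader); // split on whitespace only
      stream = new LowerCaseFilter(stream);  // no-op for Indic characters
      stream = new PorterStemFilter(stream); // stems English; leaves Indic tokens alone
      return stream;
    }
  }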
Thanks Muir.
Thanks for letting me know that I don't need language identifiers.
I'll have a look and will try to write the analyzer. For my case I think it
won't be that difficult.
BTW, can you point me to some sample code/tutorials on writing custom
analyzers? I could not find anything in LIA 2nd edn.
KK, well, you can always get some good examples from the Lucene contrib
codebase.
For example, look at the DutchAnalyzer, especially:
TokenStream tokenStream(String fieldName, Reader reader)
See how it combines a specified tokenizer with various filters? This is what
you want to do, except of course…
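Once you have such a class, wiring it in is the same as for any analyzer. A
sketch, assuming the Lucene 2.4-era API and the hypothetical MixedAnalyzer
from the earlier sketch (the path and field name are placeholders):

  Analyzer analyzer = new MixedAnalyzer(); // hypothetical custom analyzer
  IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
      analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
  // ... writer.addDocument(...) calls ...
  writer.close();
  // use the same analyzer at query time so query tokens match indexed tokens
  QueryParser parser = new QueryParser("content", analyzer);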
Hmm, sorry about that, and thank you for raising it.
This is indeed not good, and it should be considered a break in
back-compat, since things change silently and it's not easy for you to
discover that.
I think we should at least deprecate Analyzer.tokenStream, and fix
QueryParser (and any others) to…
I am doing custom sorting within Lucene using the overloaded
searcher.search(query, sort). First precedence is to sort on 'last
name' and then on 'network status', where 'INN' is better than 'OUT'.
Fields are stored in the indexes like this:
FIRST_NAME(Field.Store.NO, Field.Index.NO_NORMS)
LAST_NAME(…
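A sketch of that two-key sort, assuming the Lucene 2.x API and that the field
names below match the constants above; note that plain ascending string order
already happens to put 'INN' before 'OUT':

  Sort sort = new Sort(new SortField[] {
      new SortField("LAST_NAME", SortField.STRING),     // primary key
      new SortField("NETWORK_STATUS", SortField.STRING) // secondary: "INN" < "OUT"
  });
  TopFieldDocs docs = searcher.search(query, null, 100, sort); // no filter, top 100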
Hi,
From an efficiency point of view, what will be more efficient:
creating a single big index (big enough for one machine) by adding all
documents to it at once,
or
creating smaller indexes and then merging them to make one bigger index?
And if there is a performance penalty, then any rough estimate…
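For reference, the merge route looks roughly like this in the 2.x API (the
paths and the analyzer are placeholders; this is a sketch of the mechanics,
not an answer to the benchmark question):

  Directory target = FSDirectory.getDirectory("/path/to/big-index");
  IndexWriter writer = new IndexWriter(target, analyzer, true,
      IndexWriter.MaxFieldLength.UNLIMITED);
  writer.addIndexesNoOptimize(new Directory[] {
      FSDirectory.getDirectory("/path/to/small-1"),
      FSDirectory.getDirectory("/path/to/small-2")
  });                // merges the smaller indexes into the target
  writer.optimize(); // optional: collapse to a single segment
  writer.close();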
Big +1 ! :)
It would make for a cool case study for Lucene in Action 3rd edition ;)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
> From: Shashi Kant
> To: java-user@lucene.apache.org
> Sent: Thursday, June 4, 2009 8:03:56 AM
> Subject: P2P Lu…
Hi,
Is there a way I can retrieve the value of a field that is not stored in the
Index?
private static void indexFile(IndexWriter writer, File f)
    throws IOException {
  if (f.isHidden() || !f.exists() || !f.canRead()) {
    return;
  }
  System.out.println("Indexing " + f.getCanonicalPath());
  Document doc = new Document();
  // indexed and tokenized, but NOT stored -- so it cannot be retrieved later
  doc.add(new Field("contents", new FileReader(f)));
  writer.addDocument(doc);
}
I guess Sixearch might be the thing you are looking for:
http://sixearch.org/
Sixearch is a collaborative peer network application which aims to address
the scalability and context limitations of centralized search engines and
also provides a complementary way to do Web search.
Sixearch uses t…
Hello Robert,
I was thinking of chaining analyzers; does this sound logical?
Currently I'm using the WhitespaceAnalyzer, which tokenizes on whitespace
only. As you mentioned earlier, I don't need to use language identifiers,
which means I have to pass the full content first through, say, the
whi…
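For what it's worth, the usual pattern is not to chain whole Analyzers but to
chain TokenFilters around a single Tokenizer inside one Analyzer; a sketch of
the idea, same 2.x assumptions as before:

  TokenStream stream = new WhitespaceTokenizer(reader); // one tokenizer at the head
  stream = new LowerCaseFilter(stream);   // each filter wraps the previous stream
  stream = new PorterStemFilter(stream);  // so the "chaining" happens at the filter level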