Thanks, Erick.
I agree that it might be unlikely to reconstruct from an existing index, but I
think document boosting (that is, one document has a higher boost factor than
other documents) as well as field boosting is specified during indexing.
Our use case is performance/results tuning. We hav
Because I wanted to use the javaCC input code from Lucene. 99.99% of
what the standard parser did was VERY GOOD. Having worked with
computer-generated compilers in the past, I realized that if I were to
modify the parser itself, I would eventually get into real trouble. So
I took the time to
On Aug 29, 2006, at 3:54 PM, Mag Gam wrote:
"Index the date". Do you mean, index date, or the document date?
Could this be in a LIA book?
This is entirely up to you. What gets indexed is entirely within the
developer's control. What date do you want indexed? I presume by
"document date
Your problem is that StandardTokenizer doesn't fit your requirements.
Since you know how to implement a new one, just do it.
If you just want to modify StandardTokenizer, you can get the source code,
rename it to your own class, and then change whatever you dislike. I think
it's such simple stuff, why
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote:
2. The ParseException that is generated when making the
StandardAnalyzer must be killed because there is another
ParseException class (maybe in queryparser?) that must be used
instead. The lucene build file excludes the StandardAnalyzer
Parse
Have you looked at the MoreLikeThis class in the similarity package?
On 8/30/06, Winton Davies <[EMAIL PROTECTED]> wrote:
Hi All,
I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
There is clearly the sin
A couple of things..
1> I don't think you set the boost when indexing. You set the boost when
querying, so you don't need to re-index for boosting.
2> A recurring theme is that you can't do an update-in-place for a lucene
document. You might search the mail archive for a discussion of this. The
Bill Taylor wrote:
I have copied Lucene's StandardTokenizer.jj into my directory, renamed
it, and did a global change of the names to my class name, LogTokenizer.
The issue is that the generated LogTokenizer.java does not compile for
2 reasons:
1) in the constructor, this(new FastCharStream(
I have copied Lucene's StandardTokenizer.jj into my directory, renamed
it, and did a global change of the names to my class name,
LogTokenizer.
The issue is that the generated LogTokenizer.java does not compile for
2 reasons:
1) in the constructor, this(new FastCharStream(reader)); fails bec
Hi,
Got a question. Here is what I want to achieve:
Create a new index from an existing index, to change the boosting factor for
some of the documents (and potentially some other tweaks), without reindexing
it from the source.
Are there any tools or ways to do this?
Thanks!
Xiaocheng Luan
Tucked away in the contrib section of Lucene (I'm using 2.0) there is
org.apache.lucene.index.memory.PatternAnalyzer
which takes a regular expression and tokenizes with it. Would that help?
Word of warning... the regex determines what is NOT a token, not what IS a
token (as I remember),
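The warning above (the regex describes what is NOT a token) can be illustrated with plain java.util.regex. This is only a sketch of the two conventions, not PatternAnalyzer itself; the class and method names are hypothetical and nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
final class SeparatorRegexDemo {

    // The regex describes the SEPARATORS: tokens are whatever lies between matches.
    static List<String> tokensBetween(Pattern separator, String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : separator.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // The regex describes the TOKENS: each match is a token.
    static List<String> tokensMatching(Pattern token, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = token.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "alpha, beta;gamma";
        Pattern separators = Pattern.compile("[,;\\s]+");
        Pattern words = Pattern.compile("[a-z]+");

        System.out.println(tokensBetween(separators, text)); // [alpha, beta, gamma]
        System.out.println(tokensMatching(words, text));     // [alpha, beta, gamma]
        // Passing a "what is a token" regex where separators are expected
        // returns the punctuation instead of the words.
        System.out.println(tokensBetween(words, text));      // [, , ;]
    }
}
```

If a tokenizer follows the separator convention, a pattern written to match the tokens themselves will silently yield the gaps between them, which is the mistake the warning is about.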
: mmdd so that it can be sorted. Having done that, however, I am
: unsure how to ask Lucene to sort on that date, but I'll figure it out
: in time or someone will tell me.
you don't need to wait ... it's already been explained in this thread,
look at the Sort class and the methods in IndexSea
I gave each of my documents a special field named date and I put in a
normalized Lucene date with a precision of one day. This date is
mmdd so that it can be sorted. Having done that, however, I am
unsure how to ask Lucene to sort on that date, but I'll figure it out
in time or someone wi
"Index the date". Do you mean, index date, or the document date?
Could this be in a LIA book?
On 8/29/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
On Aug 29, 2006, at 11:50 AM, Mag Gam wrote:
> Is it possible to sort results by date of the document?
Sure, check out the Sort class and the ov
Bill Taylor wrote:
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
ri
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
: Have a look at PerFieldAnalyzerWrapper:
:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
...which can be specified in the constructors for IndexWriter and
QueryParser.
As I understand
: However, I noticed that when I say "Field.Store.YES" it stores the
: original, pre-tokenized version, so it seems like I'm doubling up here.
: Is there a better way to do this?
if you are doubling up to get the benefit of two separate Analyzers, then
there is no need to "Store.YES" in both fiel
Hi All,
I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
There is clearly the single TermScorer - but I can't find the class
that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
with the tf
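For readers unfamiliar with the accumulator idea mentioned in the question, here is a minimal plain-Java sketch of term-at-a-time TF.IDF cosine scoring. It does not answer which Lucene class does this and uses no Lucene API. The class name, the in-memory postings map, and the idf formula ln(N/df) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of Lucene.
final class CosineAccumulatorDemo {

    // Assumed weighting: idf = ln(numDocs / docFreq).
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // postings maps term -> (docId -> term frequency).
    // Returns docId -> score, normalized by document vector length.
    // The query length is the same for every document, so it is omitted.
    static Map<Integer, Double> score(Map<String, Map<Integer, Integer>> postings,
                                      int numDocs, List<String> queryTerms) {
        // Squared length of each document vector under tf*idf weights.
        Map<Integer, Double> squaredNorms = new HashMap<>();
        for (Map<Integer, Integer> docs : postings.values()) {
            double termIdf = idf(numDocs, docs.size());
            for (Map.Entry<Integer, Integer> posting : docs.entrySet()) {
                double weight = posting.getValue() * termIdf;
                squaredNorms.merge(posting.getKey(), weight * weight, Double::sum);
            }
        }

        // Term-at-a-time: walk one posting list per query term and add
        // (document weight * query weight) into that document's accumulator.
        Map<Integer, Double> accumulators = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in the index
            }
            double termIdf = idf(numDocs, docs.size());
            for (Map.Entry<Integer, Integer> posting : docs.entrySet()) {
                double partial = (posting.getValue() * termIdf) * termIdf;
                accumulators.merge(posting.getKey(), partial, Double::sum);
            }
        }

        // Divide each accumulator by the document's vector length.
        for (Map.Entry<Integer, Double> entry : accumulators.entrySet()) {
            double norm = Math.sqrt(squaredNorms.get(entry.getKey()));
            entry.setValue(norm == 0.0 ? 0.0 : entry.getValue() / norm);
        }
        return accumulators;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("lucene", new HashMap<>());
        postings.get("lucene").put(1, 2);
        postings.get("lucene").put(2, 1);
        postings.put("sort", new HashMap<>());
        postings.get("sort").put(2, 1);
        postings.get("sort").put(3, 3);
        postings.put("date", new HashMap<>());
        postings.get("date").put(3, 1);

        // Four documents in total; document 4 contains none of these terms.
        System.out.println(score(postings, 4, Arrays.asList("lucene", "sort")));
    }
}
```

Only documents that share at least one term with the query ever get an accumulator, which is what keeps this approach cheap compared with scoring every document.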
: Have a look at PerFieldAnalyzerWrapper:
:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
...which can be specified in the constructors for IndexWriter and
QueryParser.
-Hoss
--
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
that almos
: Ok, for the sort object, but my problem is I don't know how to retrieve (or
: store) information about the sell rate of the products (the sell rate depends
: on the QUERY! The sort is different for each query.)
:
: I imagine to connect to the DB and get sell rate of products for this
: specific
Thanks. Sort of what I was thinking of was the fact that document X,
field N, was built via tokenizer/analyzer N. If I need to search an
index of document Xs, then I should be using tokenizer/analyzer N
without having to "know" that it was built that way.
-Original Message-
From: Steven
The only reason you need to store a token is if you need to retrieve it from
the document, storing is completely unnecessary for answering the question
"is this term in the document?". So I guess I'm wondering why you don't just
use Field.Store.NO on *both* of the fields...
On 8/29/06, Furash
I'm in a real rush here, so pardon my brevity, but one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
Same kind of thing for a Query.
Erick
On 8/29/06, Bill Taylor <[EM
On Aug 29, 2006, at 11:50 AM, Mag Gam wrote:
Is it possible to sort results by date of the document?
Sure, check out the Sort class and the overloaded IndexSearcher.search
() methods that take a Sort. You will need to index the date in a
sortable way. DateTools provides handy methods for t
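As a plain-Java illustration of what "a sortable way" means here: if dates are written as fixed-width, year-first strings, ordinary string ordering matches chronological ordering. The sketch below uses java.time rather than Lucene's DateTools, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of Lucene.
final class SortableDateDemo {

    // Year-first, zero-padded key with day precision, e.g. 20060829.
    static String toSortableKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Sorts the keys as plain strings; no date parsing is involved.
    static List<String> sortedKeys(List<LocalDate> dates) {
        List<String> keys = new ArrayList<>();
        for (LocalDate date : dates) {
            keys.add(toSortableKey(date));
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(2006, 8, 29),
                LocalDate.of(2005, 12, 31),
                LocalDate.of(2006, 1, 2));
        // String order equals chronological order.
        System.out.println(sortedKeys(dates)); // [20051231, 20060102, 20060829]
    }
}
```

A month-first or unpadded format would not have this property, which is why the field has to be normalized before it is indexed.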
The behavior I want is that if I store a name (Gary Furash), a user who
searches for "Gary Furash" gets a strong hit, whereas a user who searches
for "Gray Furish" gets a moderate hit. I currently achieve this by
1. using a custom analyzer on insertion/search that tokenizes a
"soundex" version of t
Have a look at PerFieldAnalyzerWrapper:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
Quoting Bill Taylor <[EMAIL PROTECTED]>:
> is there some way to get the standard Field constructor to use, say,
> the Whitespace Tokenizer as opposed to the s
Is it possible to sort results by date of the document?
is there some way to get the standard Field constructor to use, say,
the Whitespace Tokenizer as opposed to the standard tokenizer?
On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
There has been a long-running thread on the java-dev list about how to
allow application-specific "extra stuff" to be placed in the index, at
multiple levels of granularity. Some of this conversation is captured
on the Wiki at:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Maybe you cou
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
You need to basically come up with your own Tokenizer (You can always
write a corresponding JavaCC grammar and compiling it would give the
Tokenizer)
Then you need to extend org.apache.lu
Furash Gary wrote:
I'm sure this is just a design point that I'm missing, but is there a
way to have my document objects know more about themselves?
At the time I create my document, I know a bit about how information is
being stored in it (e.g., this field represents a SOUNDEX copy, etc.),
yet
I'm sure this is just a design point that I'm missing, but is there a
way to have my document objects know more about themselves?
At the time I create my document, I know a bit about how information is
being stored in it (e.g., this field represents a SOUNDEX copy, etc.),
yet the logic for that ki
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmentese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so
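The effect described above can be reproduced with two crude splitting rules in plain Java. Neither rule is the StandardTokenizer grammar; the second is only a rough stand-in for any tokenizer that treats dashes as breaks, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Lucene.
final class ReferenceTokenDemo {

    // Split only on whitespace, so a reference such as 310N-P-Q stays whole.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Split on every run of characters that are not ASCII letters or digits,
    // so the dashes break the reference into pieces.
    static List<String> splitOnNonAlphanumerics(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "see 310N-P-Q for details";
        System.out.println(splitOnWhitespace(text));       // [see, 310N-P-Q, for, details]
        System.out.println(splitOnNonAlphanumerics(text)); // [see, 310N, P, Q, for, details]
    }
}
```

Whitespace-only splitting keeps the reference intact but also leaves trailing punctuation attached to words, so a real tokenizer for this material would probably need a rule somewhere between the two.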
I would recommend using the open source project HTMLParser (
http://htmlparser.sourceforge.net/). It provides an excellent API for
parsing html files and extracting the relevant text.
-drj
On 8/29/06, James liu <[EMAIL PROTECTED]> wrote:
I want to index HTML, but it has images, Flash, JavaScript,
I want to index HTML, but it has images, Flash, and JavaScript, and I want to
make indexing quick,
but I don't know how to get the plain-text content.
Can anyone help me?
Stanislav Jordanov wrote:
What might be the possible reason for an IndexReader failing to open
properly,
because it can not find a .fnm file that is expected to be there:
This means the segments file is referencing a segment named _1j8s and
in trying to load that segment, the first thing Luc
On Aug 28, 2006, at 3:51 PM, d rj wrote:
I think that the best method to transfer Document objects across the wire
from Lucene to Lucene.Net is to write the appropriate xml schema using xsdl,
then write the necessary translation code for both Java and C# that would
marshall Lucene Document
Hello,
Ok, for the sort object, but my problem is I don't know how to retrieve (or
store) information about the sell rate of the products (the sell rate depends
on the QUERY! The sort is different for each query.)
I imagine to connect to the DB and get sell rate of products for this
specific qu
What might be the possible reason for an IndexReader failing to open
properly,
because it can not find a .fnm file that is expected to be there:
java.io.FileNotFoundException: E:\index4\_1j8s.fnm (The system cannot
find the file specified)
at java.io.RandomAccessFile.open(Native Method)