Re: Creating a new index from an existing index

2006-08-29 Thread Xiaocheng Luan
Thanks, Erick. I agree that it might be unlikely to reconstruct from an existing index, but I think document boosting (that is, one document has a higher boost factor than other documents) as well as field boosting is specified during indexing. Our use case is performance/results tuning. We hav
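
For reference, index-time boosting is set on the Document (or an individual Field) before it is added to the writer; a minimal sketch, with made-up paths and field names:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexBoostSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        Field contents = new Field("contents", "some text", Field.Store.NO, Field.Index.TOKENIZED);
        doc.add(contents);
        doc.setBoost(2.0f);          // document-level boost, folded into each field's norm
        // contents.setBoost(1.5f);  // or boost just one field instead
        writer.addDocument(doc);
        writer.close();
    }
}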

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
Because I wanted to use the JavaCC input code from Lucene. 99.99% of what the standard parser did was VERY GOOD. Having worked with computer-generated compilers in the past, I realized that if I were to modify the parser itself, I would eventually get into real trouble. So I took the time to

Re: Sort by Date

2006-08-29 Thread Erik Hatcher
On Aug 29, 2006, at 3:54 PM, Mag Gam wrote: "Index the date". Do you mean, index date, or the document date? Could this be in a LIA book? This is entirely up to you. What gets indexed is entirely within the developer's control. What date do you want indexed? I presume by "document date

Re: Installing a custom tokenizer

2006-08-29 Thread yueyu lin
Your problem is that StandardTokenizer doesn't fit your requirements. Since you know how to implement a new one, just do it. If you just want to modify StandardTokenizer, you can get the code and rename it to your class, then modify whatever you dislike. I think it's such a simple thing, why

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote: 2. The ParseException that is generated when making the StandardAnalyzer must be killed because there is another ParseException class (maybe in queryparser?) that must be used instead. The Lucene build file excludes the StandardAnalyzer Parse

Re: Straight TF-IDF cosine similarity?

2006-08-29 Thread Jason Polites
Have you looked at the MoreLikeThis class in the similarity package? On 8/30/06, Winton Davies <[EMAIL PROTECTED]> wrote: Hi All, I'm scratching my head - can someone tell me which class implements an efficient multiple term TF.IDF Cosine similarity scoring mechanism? There is clearly the sin
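
If MoreLikeThis (in the contrib similarity package, org.apache.lucene.search.similar) does fit, usage is roughly the sketch below; the field name and index path are placeholders, and the exact contrib layout can differ between releases:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MltSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" });  // fields to mine for interesting terms
        Query query = mlt.like(5);                       // "more like" document number 5
        Hits hits = new IndexSearcher(reader).search(query);
        System.out.println(hits.length() + " similar documents");
        reader.close();
    }
}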

Re: Creating a new index from an existing index

2006-08-29 Thread Erick Erickson
A couple of things... 1> I don't think you set the boost when indexing. You set the boost when querying, so you don't need to re-index for boosting. 2> A recurring theme is that you can't do an update-in-place for a Lucene document. You might search the mail archive for a discussion of this. The
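
For completeness, query-time boosting looks roughly like this (the field and terms are made up); any Query subclass accepts setBoost(), and the query-parser syntax term^3.0 does the same thing:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class QueryBoostSketch {
    public static void main(String[] args) {
        TermQuery important = new TermQuery(new Term("contents", "lucene"));
        important.setBoost(3.0f);   // weight matches on this clause more heavily
        TermQuery ordinary = new TermQuery(new Term("contents", "index"));

        BooleanQuery query = new BooleanQuery();
        query.add(important, BooleanClause.Occur.SHOULD);
        query.add(ordinary, BooleanClause.Occur.SHOULD);
        System.out.println(query);  // prints roughly: contents:lucene^3.0 contents:index
    }
}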

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream(

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream(reader)); fails bec

Creating a new index from an existing index

2006-08-29 Thread Xiaocheng Luan
Hi, Got a question. Here is what I want to achieve: create a new index from an existing index, to change the boosting factor for some of the documents (and potentially make some other tweaks), without reindexing from the source. Are there any tools or ways to do this? Thanks! Xiaocheng Luan
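
As noted elsewhere in this thread, copying documents out of an existing index only works if every field you care about was stored; under that assumption, a rough rebuild-with-new-boosts sketch (paths and the boost rule are made up) could look like:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class RebuildSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/old-index");
        IndexWriter writer = new IndexWriter("/path/to/new-index", new StandardAnalyzer(), true);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;
            Document doc = reader.document(i);    // only stored fields come back
            if ("special".equals(doc.get("category"))) {
                doc.setBoost(2.0f);               // hypothetical boost rule
            }
            writer.addDocument(doc);              // re-analyzed with the writer's analyzer
        }
        writer.optimize();
        writer.close();
        reader.close();
    }
}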

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
Tucked away in the contrib section of Lucene (I'm using 2.0) there is org.apache.lucene.index.memory.PatternAnalyzer, which takes a regular expression and tokenizes with it. Would that help? Word of warning... the regex determines what is NOT a token, not what IS a token (as I remember),
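
If that contrib class does fit, usage would be roughly as below. This assumes the 2.0-era constructor takes a java.util.regex.Pattern, a lowercase flag, and a stop-word set, so check the javadoc of your release before relying on it; note the pattern matches the separators between tokens, not the tokens themselves:

import java.util.regex.Pattern;
import org.apache.lucene.index.memory.PatternAnalyzer;

public class PatternAnalyzerSketch {
    public static void main(String[] args) {
        // split on runs of whitespace or commas; keep case; no stop words (assumed signature)
        PatternAnalyzer analyzer =
            new PatternAnalyzer(Pattern.compile("[\\s,]+"), false, null);
        System.out.println(analyzer);
    }
}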

Re: Sort by Date

2006-08-29 Thread Chris Hostetter
: yyyymmdd so that it can be sorted. Having done that, however, I am : unsure how to ask Lucene to sort on that date, but I'll figure it out : in time or someone will tell me. you don't need to wait ... it's already been explained in this thread, look at the Sort class and the methods in IndexSearcher
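
Concretely, once the date is indexed as a lexicographically sortable string (the field name below is just an example), the overloaded search call looks like:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class SortSketch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Sort byDate = new Sort(new SortField("date", SortField.STRING, true)); // true = newest first
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")), byDate);
        System.out.println(hits.length() + " hits, sorted by date");
        searcher.close();
    }
}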

Re: Sort by Date

2006-08-29 Thread Bill Taylor
I gave each of my documents a special field named date and I put in a normalized Lucene date with a precision of one day. This date is yyyymmdd so that it can be sorted. Having done that, however, I am unsure how to ask Lucene to sort on that date, but I'll figure it out in time or someone wi

Re: Sort by Date

2006-08-29 Thread Mag Gam
"Index the date". Do you mean, index date, or the document date? Could this be in a LIA book? On 8/29/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: On Aug 29, 2006, at 11:50 AM, Mag Gam wrote: > Is it possible to sort results by date of the document? Sure, check out the Sort class and the ov

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you ri

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote: : Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. As I understand

Re: Field.Store.YES and Field.Store.Tokenized with CustomAnalyzer - Double Hit

2006-08-29 Thread Chris Hostetter
: However, I noticed that when I say "Field.Store.YES" it stores the : original, pre-tokenized version, so it seems like I'm doubling up here. : Is there a better way to do this? if you are doubling up to get the benefit of two separate Analyzers, then there is no need to "Store.YES" in both fiel
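
In other words, a common layout (field names here are illustrative, not from the original mails) is one stored-but-unindexed copy for display plus one indexed-but-unstored copy per analyzer, rather than storing the text twice:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NameFieldsSketch {
    public static void main(String[] args) {
        String name = "Gary Furash";
        Document doc = new Document();
        // stored verbatim for display only, never searched
        doc.add(new Field("name_display", name, Field.Store.YES, Field.Index.NO));
        // indexed for ordinary matching, not stored
        doc.add(new Field("name", name, Field.Store.NO, Field.Index.TOKENIZED));
        // indexed through the soundex analyzer (via a PerFieldAnalyzerWrapper), not stored
        doc.add(new Field("name_soundex", name, Field.Store.NO, Field.Index.TOKENIZED));
        System.out.println(doc);
    }
}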

Straight TF-IDF cosine similarity?

2006-08-29 Thread Winton Davies
Hi All, I'm scratching my head - can someone tell me which class implements an efficient multiple term TF.IDF Cosine similarity scoring mechanism? There is clearly the single TermScorer - but I can't find the class that would do a bucketed TF.IDF cosine - i.e. fill an accumulator with the tf

Re: Installing a custom tokenizer

2006-08-29 Thread Chris Hostetter
: Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. -Hoss --

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. That almos

Re: Sorting based on a selling rate

2006-08-29 Thread Chris Hostetter
: Ok, for the sort object, but my problem is I don't know how to retrieve (or : store) information on the sell rate of the products (the sell rate depends : on the QUERY! The sort is different for each query.) : : I imagine connecting to the DB to get the sell rate of products for this : specific

RE: Documents that know more?

2006-08-29 Thread Furash Gary
Thanks. Sort of what I was thinking of was the fact that document X, field N, was built via tokenizer/analyzer N. If I need to search an index of document Xs, then I should be using tokenizer/analyzer N without having to "know" that it was built that way. -Original Message- From: Steven

Re: Field.Store.YES and Field.Store.Tokenized with CustomAnalyzer - Double Hit

2006-08-29 Thread Erick Erickson
The only reason you need to store a token is if you need to retrieve it from the document, storing is completely unnecessary for answering the question "is this term in the document?". So I guess I'm wondering why you don't just use Field.Store.NO on *both* of the fields... On 8/29/06, Furash

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. Same kind of thing for a Query. Erick On 8/29/06, Bill Taylor <[EM
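
A minimal sketch of that wiring (field names and the per-field analyzer choice are just examples); handing the same wrapper to QueryParser keeps query analysis consistent with indexing:

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldSketch {
    public static void main(String[] args) throws Exception {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("docref", new WhitespaceAnalyzer()); // keep terms like 310N-P-Q intact

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        // ... add documents here ...
        writer.close();

        Query q = new QueryParser("docref", analyzer).parse("310N-P-Q");
        System.out.println(q);
    }
}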

Re: Sort by Date

2006-08-29 Thread Erik Hatcher
On Aug 29, 2006, at 11:50 AM, Mag Gam wrote: Is it possible to sort results by date of the document? Sure, check out the Sort class and the overloaded IndexSearcher.search() methods that take a Sort. You will need to index the date in a sortable way. DateTools provides handy methods for t
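
A sketch of indexing the date in a sortable way with DateTools (the field name is an example); day resolution yields strings like 20060829, which sort correctly as plain strings:

import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldSketch {
    public static void main(String[] args) {
        String day = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
        Document doc = new Document();
        // untokenized so the whole yyyymmdd string is indexed as a single term
        doc.add(new Field("date", day, Field.Store.YES, Field.Index.UN_TOKENIZED));
        System.out.println(day);   // e.g. 20060829
    }
}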

Field.Store.YES and Field.Store.Tokenized with CustomAnalyzer - Double Hit

2006-08-29 Thread Furash Gary
The behavior I want is that if I store a name (Gary Furash), a user who searches for "Gary Furash" gets a strong hit, whereas a user who searches for "Gray Furish" gets a moderate hit. I currently achieve this by 1. using a custom analyzer on insertion/search that tokenizes a "soundex" version of t

Re: Installing a custom tokenizer

2006-08-29 Thread Ronnie Kolehmainen
Have a look at PerFieldAnalyzerWrapper: http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html Citerar Bill Taylor <[EMAIL PROTECTED]>: > is there some way to get the standard Field constructor to use, say, > the Whitespace Tokenizer as opposed to the s

Sort by Date

2006-08-29 Thread Mag Gam
Is it possible to sort results by date of the document?

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
is there some way to get the standard Field constructor to use, say, the Whitespace Tokenizer as opposed to the standard tokenizer? On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote: I suspect that my issue is getting the Field constructor to use a different tokenizer. Can anyone help?

Re: Documents that know more?

2006-08-29 Thread Steven Rowe
There has been a long-running thread on the java-dev list about how to allow application-specific "extra stuff" to be placed in the index, at multiple levels of granularity. Some of this conversation is captured on the Wiki at: http://wiki.apache.org/jakarta-lucene/FlexibleIndexing Maybe you cou

RE: Installing a custom tokenizer

2006-08-29 Thread Krovi, DVSR_Sarma
> I suspect that my issue is getting the Field constructor to use a > different tokenizer. Can anyone help? You basically need to come up with your own Tokenizer (you can always write a corresponding JavaCC grammar; compiling it would give you the Tokenizer). Then you need to extend org.apache.lu
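
If the tokenization rules are simple enough, JavaCC can be skipped entirely; a sketch of an analyzer built on CharTokenizer that keeps dashes inside terms like 310N-P-Q (the class names here are made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class DocRefAnalyzer extends Analyzer {
    // treat letters, digits and '-' as token characters; everything else separates tokens
    private static class DocRefTokenizer extends CharTokenizer {
        DocRefTokenizer(Reader in) { super(in); }
        protected boolean isTokenChar(char c) {
            return Character.isLetterOrDigit(c) || c == '-';
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new DocRefTokenizer(reader));
    }
}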

Re: Documents that know more?

2006-08-29 Thread Michael D. Curtin
Furash Gary wrote: I'm sure this is just a design point that I'm missing, but is there a way to have my document objects know more about themselves? At the time I create my document, I know a bit about how information is being stored in it (e.g., this field represents a SOUNDEX copy, etc.), yet

Documents that know more?

2006-08-29 Thread Furash Gary
I'm sure this is just a design point that I'm missing, but is there a way to have my document objects know more about themselves? At the time I create my document, I know a bit about how information is being stored in it (e.g., this field represents a SOUNDEX copy, etc.), yet the logic for that ki

Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmentese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks this "word" at the dashes so

Re: how to write this regexps?

2006-08-29 Thread d rj
I would recommend using the open source project HTMLParser ( http://htmlparser.sourceforge.net/). It provides an excellent API for parsing html files and extracting the relevant text. -drj On 8/29/06, James liu <[EMAIL PROTECTED]> wrote: i wanna index html,,,but it have image,flash,javascript,
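
With HTMLParser, text extraction is roughly the following; the URL is a placeholder and the visitor class name is from memory, so verify it against the HTMLParser javadoc before relying on it:

import org.htmlparser.Parser;
import org.htmlparser.visitors.TextExtractingVisitor;

public class HtmlTextSketch {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://example.com/page.html");
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        parser.visitAllNodesWith(visitor);
        String text = visitor.getExtractedText();   // plain text, ready to index
        System.out.println(text);
    }
}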

how to write this regexps?

2006-08-29 Thread James liu
I want to index HTML, but it contains images, Flash, and JavaScript, and I want to build the index quickly. I don't know how to get the plain-text content. Can anyone help me?

Re: Reviving a dead index

2006-08-29 Thread Michael McCandless
Stanislav Jordanov wrote: What might be the possible reason for an IndexReader failing to open properly, because it can not find a .fnm file that is expected to be there: This means the segments file is referencing a segment named _1j8s and in trying to load that segment, the first thing Luc

Re: Sharing Documents between Lucene and DotLucene

2006-08-29 Thread Erik Hatcher
On Aug 28, 2006, at 3:51 PM, d rj wrote: I think that the best method to transfer Document objects across the wire from Lucene to Lucene.Net is to write the appropriate XML schema using XSDL, then write the necessary translation code for both Java and C# that would marshal Lucene Document

Re: Sorting based on a selling rate

2006-08-29 Thread John Pailet
Hello, Ok, for the sort object, but my problem is I don't know how to retrieve (or store) information on the sell rate of the products (the sell rate depends on the QUERY! The sort is different for each query.) I imagine connecting to the DB to get the sell rate of products for this specific qu

Reviving a dead index

2006-08-29 Thread Stanislav Jordanov
What might be the possible reason for an IndexReader failing to open properly, because it can not find a .fnm file that is expected to be there: java.io.FileNotFoundException: E:\index4\_1j8s.fnm (The system cannot find the file specified) at java.io.RandomAccessFile.open(Native Method)