RE: Indexing multiple languages

2005-06-03 Thread Bruce Ritchie
> Tansley, Robert wrote: > > What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separ

Re: deleting on a keyword field

2005-06-03 Thread Ernesto De Santis
From the Javadoc: public final int delete(Term term) throws IOException. Deletes all documents containing term. This is useful if one uses a document field to hold a

Re: deleting on a keyword field

2005-06-03 Thread Daniel Naber
On Friday 03 June 2005 18:50, Max Pfingsthorn wrote: > reader.delete(new Term(URI_FIELD, uri)); > > This does not remove anything. Do I have to make the uri a normal field? How do you know nothing was deleted? Are you aware that you need to re-open your IndexSearcher/Reader in order to see the c

Re: deleting on a keyword field

2005-06-03 Thread Erik Hatcher
On Jun 3, 2005, at 12:50 PM, Max Pfingsthorn wrote: Hi! I'm trying to delete a document from the index. Somehow it doesn't work. I made a Field.Keyword out of my document's URI and would now like to delete a document with a certain uri like so: reader.delete(new Term(URI_FIELD, uri)); T

Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

2005-06-03 Thread Doug Cutting
Fred Toth wrote: I'm thinking we need something like "HTMLTokenizer" which bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful for many problems. It could work this way: Given this input: H

deleting on a keyword field

2005-06-03 Thread Max Pfingsthorn
Hi! I'm trying to delete a document from the index. Somehow it doesn't work. I made a Field.Keyword out of my document's URI and would now like to delete a document with a certain uri like so: reader.delete(new Term(URI_FIELD, uri)); This does not remove anything. Do I have to make the uri a n

Re: managing docids for ParallelReader

2005-06-03 Thread Doug Cutting
Sebastian Marius Kirsch wrote: I took up your suggestion to use a ParallelReader for adding more fields to existing documents. I now have two indexes with the same number of documents, but different fields. Does search work using the ParallelReader? One field is duplicated (the id field.) Wh

Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting
Tansley, Robert wrote: What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher
On Jun 3, 2005, at 8:50 AM, Corey Keith wrote: With this approach all work is done at the word level. When we have a phrase query the results will contain pages with the entire phrase but when we go to highlight the document _all_ words in the phrase regardless of being in the phrase will

Re: managing docids for ParallelReader

2005-06-03 Thread Sebastian Marius Kirsch
Hi Doug, I took up your suggestion to use a ParallelReader for adding more fields to existing documents. I now have two indexes with the same number of documents, but different fields. One field is duplicated (the id field.) I wrote a small class to merge those two indexes into one index; it is a

RE: Indexing multiple languages

2005-06-03 Thread Max Pfingsthorn
Hi, you could use the ParallelReader for this if you have all documents in all languages. Then, the metadata fields can be stored in one of the field data files, while each language gets its own field data file... max -Original Message- From: Paul Libbrecht [mailto:[EMAIL PROTECTED] Se

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Grant Ingersoll
There has been enough interest in this in the past that, if you can write one, a patch that exposes the wi information would probably be useful to others (not that I am saying it would be committed, as I can't speak for the committers on the project). >>> [EMAIL PROTECTED] 6/3/2005 8:19:16 AM >>> Thanks f

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Max Pfingsthorn
Aha :) So you want to do blind relevance feedback? I guess the term vectors will be the way to go then. Otherwise, I don't know how to access the terms of a document. And: Are you sure you need the TF.IDF weights for each term? Maybe it would be enough to just use TF for sorting, as that is a

Re: Indexing multiple languages

2005-06-03 Thread Paul Libbrecht
Robert, Le 2 juin 05, à 21:42, Tansley, Robert a écrit : It seems that there are even more options -- 4/ One index, with a separate Lucene document for each (item,language) combination, with one field that specifies the language 5/ One index, one Lucene document per item, with field names that

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Richard Krenek
Corey, I have one off-the-wall approach that may or may not work for you. You could convert your scanned images to PDF, then use something like Acrobat to convert those PDFs into PDFs with hidden text (the OCR data). You can then tell Acrobat Reader via XML what to highlight when your user opens the

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Corey Keith
With this approach all work is done at the word level. When we have a phrase query the results will contain pages with the entire phrase but when we go to highlight the document _all_ words in the phrase regardless of being in the phrase will be highlighted. Is that correct? It would also be

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Andrew Boyd
Thanks for the reply. It looks like I can use parts of Similarity. I'll post back once I get it working or at least closer ;-) Andrew -Original Message- From: Grant Ingersoll <[EMAIL PROTECTED]> Sent: Jun 3, 2005 6:51 AM To: java-user@lucene.apache.org Subject: RE: calculate wi = tfi *

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Andrew Boyd
Thanks for bearing with me, Max. I do understand that the hits come back sorted by descending score after their Similarity has been computed relative to the query vector. What I was hoping to do was use the built-in functionality of Lucene to calculate some term weights, specifically wi = ti * I
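
For readers following this thread, the weight named in the subject line, wi = tfi * IDFi, can be sketched in plain Java. This is only an illustration under stated assumptions: it uses the textbook idf = ln(N / df), which is not necessarily what Lucene's Similarity computes, and the class name TfIdfSketch and its methods are hypothetical, not part of Lucene.

```java
class TfIdfSketch {
    /** Textbook inverse document frequency: ln(numDocs / docFreq). An assumption, not Lucene's formula. */
    static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("docFreq must be in 1..numDocs");
        }
        return Math.log((double) numDocs / docFreq);
    }

    /** Weight of term i in one document: w_i = tf_i * idf_i. */
    static double weight(int termFreq, int numDocs, int docFreq) {
        return termFreq * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the term occurs 3 times in the document
        // and appears in 10 of the 1000 documents in the collection.
        System.out.println(weight(3, 1000, 10));
    }
}
```

A term that occurs in every document gets idf = ln(1) = 0 under this definition, so its weight is zero however often it appears in a single document.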

Re: Indexing multiple languages

2005-06-03 Thread Grant Ingersoll
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages >>> [EMAIL PROTECTED] 6/3/2005 6:03:31 AM >>> On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: > Btw, I did try running the lucene demo (web template) to index the HTML files after I added one including English and Chinese characters

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Grant Ingersoll
I think the TermFreqVector (reader.getTermVector) has the info you want per document. You will need to sort it by frequency to get the top terms in each document. It doesn't give you the wi, just tfi, but the whole score is implied by the fact that you have the top 10 documents, I think. -Grant
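
The "sort it by frequency" step can be sketched without any Lucene dependency. The sketch assumes the terms and their frequencies arrive as two parallel arrays, which is how a term vector typically exposes them; the class TopTerms, the method topTerms, and the alphabetical tie-break are illustrative choices, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class TopTerms {
    /** Returns up to k terms ordered by descending frequency; ties are broken alphabetically. */
    static List<String> topTerms(String[] terms, int[] freqs, int k) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        // Sort indices rather than the arrays themselves so the two arrays stay aligned.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            order.add(i);
        }
        order.sort((a, b) -> {
            int byFreq = Integer.compare(freqs[b], freqs[a]); // higher frequency first
            return byFreq != 0 ? byFreq : terms[a].compareTo(terms[b]);
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            top.add(terms[order.get(i)]);
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical term vector for one document.
        String[] terms = {"index", "lucene", "term"};
        int[] freqs = {2, 5, 2};
        System.out.println(topTerms(terms, freqs, 2));
    }
}
```

Sorting indices keeps the input arrays untouched, which matters if the same arrays are reused for other per-term lookups.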

Re: Indexing multiple languages

2005-06-03 Thread Erik Hatcher
On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: Btw, I did try running the lucene demo (web template) to index the HTML files after I added one including English and Chinese characters. I was not able to search for any Chinese in that HTML file (returned no hits). I wonder whether I need to

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher
On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote: This is a pretty interesting problem. I envy you. I would avoid the existing highlighter for your purposes -- highlighting in token space is a very different problem from "highlighting" in 2D space. Based on the XML sample you provided, it

Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-06-03 Thread Markus Wiederkehr
On 5/31/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > > I have wondered about this as well. Are there any *sure fire* ways of > > creating (and updating) two indices so that doc numbers in one index > > deliberately correspond to doc numbers in the other index? > > If you add the documents in the

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Max Pfingsthorn
Hi, when IndexSearcher.search gives you a Hits object back, all results are already sorted by their score, which is computed internally using the Similarity. You can access it via Hits.score(n) (see http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Hits.html). This is also shown

Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote: > For the StandardAnalyzer, will it have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong. Chinese data may come in 3 different encodings: Big5, GB and UTF8. What is the default encod