Re: Extract the text that was indexed

2008-12-30 Thread Karl Wettin
On 30 Dec 2008 at 17:13, Lebiram wrote: Hi Lebiram, contrib/misc contains a couple of tools that might be of help. Just wanted to reconstruct a new index based on an existing index (but turning off norms), that's all. If you want to create an identical index but without norms use FieldNormModi

Re: Filtering accents

2008-12-30 Thread Otis Gospodnetic
Tom: Have a look at ASCIIFoldingFilter. o...@lesina:~/workspace/asf-lucene$ svn log ./src/java/org/apache/lucene/analysis/ASCIIFoldingFilter.java r724053 | markrmiller | 2008-12-06 18:25:42 -0500 (Sat, 06 Dec 2008) | 1 line

Re: Extract the text that was indexed

2008-12-30 Thread Lebiram
Hi All, Thanks for the reply. I just wanted to reconstruct a new index based on an existing index (but turning off norms), that's all. However, as it is nearly impossible to extract the terms of unstored fields, we might think of other ways. Thanks for the input, guys!

Re: Extract the text that was indexed

2008-12-30 Thread Erick Erickson
Actually, you can reconstruct the text, but it's a lossy process. Stop words aren't in the index for instance. And it's very time-consuming. Luke makes a "best guess" at this process, so you might want to take a look at that code. But even the very bright folks who put Luke together caution that it

Re: Filtering accents

2008-12-30 Thread Erick Erickson
You might want to take a look at using the ISOLatin1AccentFilter or similar at both index and query time. It basically folds accented characters into their unaccented form. Matthew: You wrote: <<>> I also did this before realizing that the second field is unnecessary. Storing is orthogonal to in

Re: Extract the text that was indexed

2008-12-30 Thread Greg Shackles
That is my understanding of it too. Terms in the index will point to the position of the tokens they map to. Since one index term can point at any number of tokens, this isn't a sequence map, but just a search map. If you still have the text that was indexed you could run it through an analyzer
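A minimal sketch of the idea discussed in this thread, assuming the term positions are actually available: given a map from each term to the positions where it occurs, the token sequence can be rebuilt by sorting on position. The class name PositionalRebuild and the "_" gap marker are hypothetical and not part of Lucene; the sketch uses plain Java collections rather than the Lucene API. It also shows why the process is lossy: positions whose tokens were dropped at index time (stop words, for instance) come back only as gaps, and the result holds analyzed terms, not the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class PositionalRebuild {
    private PositionalRebuild() {}

    /**
     * Rebuilds a token sequence from a term -> positions map.
     * Positions with no term are emitted as "_" (hypothetical gap marker).
     * If two terms share a position, an arbitrary one of them is kept.
     */
    public static String rebuild(Map<String, List<Integer>> postings) {
        // Invert the map: position -> term, sorted by position.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.put(position, entry.getKey());
            }
        }

        List<String> tokens = new ArrayList<>();
        int expected = 0; // next position we expect to see
        for (Map.Entry<Integer, String> entry : byPosition.entrySet()) {
            // Fill any gap left by tokens that never reached the index.
            for (; expected < entry.getKey(); expected++) {
                tokens.add("_");
            }
            tokens.add(entry.getValue());
            expected = entry.getKey() + 1;
        }
        return String.join(" ", tokens);
    }
}
```

For a field whose first token was removed as a stop word, the rebuilt string starts with a gap marker instead of the missing word.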

Re: Filtering accents

2008-12-30 Thread Greg Shackles
Just thought I'd comment since I had to do word processing before indexing in my application as well. Matt's method is pretty similar to what I did. I wrote a filter that transforms the tokens as they get indexed (and also use that for searching). Since I am indexing a block of words, rather than

Re: Filtering accents

2008-12-30 Thread Matthew Hall
If you are constrained in such a way as to not use the French Analyzer you might instead consider transforming the input as an additional step at both search/indexing time. Use something like a regex that looks for é and always replaces it with e in the index, and at search time. (expand this
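As a rough sketch of the "transform the input at both index and search time" approach, the snippet below folds accents with the JDK's java.text.Normalizer instead of a hand-written regex per character. This is a swapped-in technique, not the poster's code, and the class name AccentFolding is hypothetical. It assumes Java 6 or later, and it only handles characters that decompose into a base letter plus combining marks; letters such as "ø" or "œ" are left unchanged.

```java
import java.text.Normalizer;

public final class AccentFolding {
    private AccentFolding() {}

    /**
     * Decomposes the input into base characters plus combining marks (NFD),
     * then removes the combining marks, so "m\u00e9tal" becomes "metal".
     */
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

The same fold call would have to be applied to the text before it is analyzed for indexing and to the query string before it is parsed; otherwise the two sides would not match.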

Filtering accents

2008-12-30 Thread legrand thomas
Dear all, I'd like my Lucene searches to be insensitive to (French) accents. For example, given an indexed term "métal", I want to match it when searching for "metal" or "métal". I use lucene-2.3.2 and the searches are performed with: IndexSearcher.search(query,filter,sorter), Another filte

Re: Extract the text that was indexed

2008-12-30 Thread Alexander Aristov
I am not sure, but from my understanding, fields that are only indexed and not stored do not keep positions. So even if you get back all terms of a field for a given document, you won't be able to reconstruct the original word sequence. And remember that not all words are indexed. Alex 2008/12/30 Lebi

Re: IndexCommit#getFileNames() returning duplicates?

2008-12-30 Thread Michael McCandless
OK I think I see what's going on here... I'll open an issue & fix it. Thanks Shalin! Mike Shalin Shekhar Mangar wrote: > Hello, > > Solr uses IndexCommit#getFileNames() to get a list of files for > replication. > One windows user reported an exception which looks like it may have been > caused

Re: Lucene retrieval model

2008-12-30 Thread Paul Elschot
On Tuesday 30 December 2008 10:03:03, Claudia Santos wrote: > Hello, > > I would like to know more about Lucene's retrieval model, more > specifically about the boolean model. > Is that a standard model or an extended model? I mean, it returns > just documents that match the boolean expression or

Extract the text that was indexed

2008-12-30 Thread Lebiram
Hi All, Is it possible to extract the text that was indexed but not stored for a field in a document? Right now, reader.document() returns only the fields that were stored. However, I'd also like to get the text of the indexed-only field... I'd appreciate your help

Re: duplication checking while indexing

2008-12-30 Thread Chris Lu
JDBM is surely a better way than in memory hash map. But I feel since all previous documents are already in the index, although not closed yet, there should be a way to read all previous terms. It's ok to use additional data structure, like JDBM or hash map, to duplicate the terms, in order to look
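For illustration only, the in-memory variant mentioned here can be as small as a set of keys already seen during the indexing run. The class name SeenKeys and the notion of a per-document "key" string are assumptions made for this sketch; it does not touch the Lucene index and, as the thread notes, an on-disk store such as JDBM would scale better for large key sets.

```java
import java.util.HashSet;
import java.util.Set;

public final class SeenKeys {
    private final Set<String> seen = new HashSet<>();

    /**
     * Records the key and reports whether it is new.
     * Returns true the first time a key is offered, false for duplicates.
     */
    public boolean firstTime(String key) {
        return seen.add(key);
    }
}
```

A caller would add a document to the index only when firstTime returns true for that document's key.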

Lucene retrieval model

2008-12-30 Thread Claudia Santos
Hello, I would like to know more about Lucene's retrieval model, more specifically about the boolean model. Is that a standard model or an extended model? I mean, it returns just documents that match the boolean expression or include in the search result all Documents which correspond to the gi