answer may be a bit off.
Search the internet for log-structured storage. There you will find
why rewriting an entry is better than updating an existing entry in place.
LevelDB, Cassandra, and Bigtable use it; maybe search for those terms as well.
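To make the log-structured idea concrete, here is a minimal sketch of an append-only store. It is only a toy illustration, not how Lucene or any of the systems named above is actually implemented; the class name AppendOnlyStore and its methods are made up for this example. An "update" appends a new record and repoints a lookup table, and a later compaction pass throws away the superseded records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only store (hypothetical): updates append a record, nothing is edited in place. */
public class AppendOnlyStore {
    // The log only grows between compactions; existing records are never modified.
    private final List<String[]> log = new ArrayList<>();
    // Maps each key to the log position of its most recent record.
    private final Map<String, Integer> latest = new HashMap<>();

    /** Insert or "update": always an append, followed by repointing the key. */
    public void put(String key, String value) {
        log.add(new String[] {key, value});
        latest.put(key, log.size() - 1);
    }

    /** Returns the most recent value for the key, or null if the key is unknown. */
    public String get(String key) {
        Integer pos = latest.get(key);
        return pos == null ? null : log.get(pos)[1];
    }

    /** Number of records in the log, including superseded ones. */
    public int logSize() {
        return log.size();
    }

    /** Compaction: keep only the live record of each key and drop the rest. */
    public void compact() {
        List<String[]> live = new ArrayList<>();
        Map<String, Integer> newLatest = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            String[] record = log.get(i);
            if (latest.get(record[0]) == i) {
                newLatest.put(record[0], live.size());
                live.add(record);
            }
        }
        log.clear();
        log.addAll(live);
        latest.clear();
        latest.putAll(newLatest);
    }
}
```

Writes stay sequential and cheap because nothing already written has to be touched; the price is the stale records that pile up until compact() runs.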
2012/10/22 Shaya Potter
so there are lots of Qs that are asked about wanting to modify a lucene
document (i.e. remove fields, add fields) but are told that one
needs to reindex.
No one ever answers the technical Q of why this is, and I'm interested
in that. presumably because documents aren't stored as document
On 08/19/2012 08:07 PM, Shaya Potter wrote:
On 08/15/2012 02:34 PM, Ahmet Arslan wrote:
Is there an easy way to figure out
the most common tokens and then remove those tokens from the
documents.
Probably this :
http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html
unsure how to use this
as far as I can
-
From: Shaya Potter [mailto:spot...@gmail.com]
Sent: Wednesday, August 15, 2012 8:44 PM
To: java-user@lucene.apache.org
Subject: Re: easy way to figure out most common tokens?
On 08/15/2012 02:34 PM, Ahmet Arslan wrote:
Is there an easy way to figure out
the most common tokens and then remove those tokens from the
documents.
Probably this :
http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html
ah, that's a good part 1. Then the Q w
On 08/15/2012 02:29 PM, Erick Erickson wrote:
I don't see how you could without indexing everything first,
since you can't know what the most frequent terms are until
you've processed all your documents
exactly
If you know these terms in advance, it seems like you could
just call them stopword
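The two-pass approach discussed above (process everything once to find the frequent terms, then treat them as stopwords) can be sketched in plain Java without Lucene. This is only an illustration under simple assumptions: tokens are split on whitespace and lowercased, "most common" means highest document frequency, and the class and method names (CommonTokens, documentFrequencies, topTokens, removeTokens) are invented for the example. The addresses use the example.com placeholder domain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-pass helper: count tokens over all documents, then strip the most common ones. */
public class CommonTokens {

    /** Pass 1: for each token, count how many documents contain it. */
    public static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // A set, so a token repeated inside one document is counted once.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            for (String token : seen) {
                if (!token.isEmpty()) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** Returns the n tokens with the highest count; ties are broken alphabetically. */
    public static List<String> topTokens(Map<String, Integer> df, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Pass 2: drop every token of the document that is in the stop set. */
    public static String removeTokens(String doc, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String token : doc.split("\\s+")) {
            if (!token.isEmpty() && !stop.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

With a real index the counts would come from the index itself rather than from re-tokenizing the raw text, but the shape is the same: the stop set can only be built after every document has been seen.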
Is there an easy way to figure out the most common tokens and then
remove those tokens from the documents.
use case: imagine one is indexing a mailing list (such as this
java-user) and is extracting all e-mail addresses in the messages and
adding them to a doc.
What that means is that one wi