Re: understanding the need to reindex a document

2012-10-22 Thread Shaya Potter
answer may be a bit off. Search the internet for log-structured storage. There you will find why rewriting an entry is better than updating an existing entry. LevelDB/Cassandra/Bigtable use it. Maybe search these terms as well. 2012/10/22 Shaya Potter so there are lots of Qs that are asked
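
A minimal sketch of the log-structured idea mentioned in the reply above, assuming a toy in-memory store. The class and method names (AppendOnlyIndex, add, get, compact) are hypothetical and are not Lucene's API. It only illustrates the general pattern: an "update" never edits an existing entry, it marks the old one as deleted and appends a complete new one.

```python
class AppendOnlyIndex:
    """Toy log-structured store: entries are appended, never edited in place."""

    def __init__(self):
        self.log = []         # append-only list of (doc_id, fields) entries
        self.live = {}        # doc_id -> position of the current entry in the log
        self.deleted = set()  # positions of superseded entries

    def add(self, doc_id, fields):
        # An update is "mark the old entry deleted, then append a full new one".
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])
        self.log.append((doc_id, dict(fields)))
        self.live[doc_id] = len(self.log) - 1

    def get(self, doc_id):
        return self.log[self.live[doc_id]][1]

    def compact(self):
        # Rewrite the log without superseded entries (loosely like a merge).
        kept = [entry for pos, entry in enumerate(self.log)
                if pos not in self.deleted]
        self.log = kept
        self.live = {doc_id: pos for pos, (doc_id, _) in enumerate(kept)}
        self.deleted.clear()
```

In this model, removing a field from "msg-1" means calling add("msg-1", ...) again with the full set of remaining fields. The old entry stays in the log until compact() runs, which is why the caller has to supply the whole document again rather than a partial change.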

understanding the need to reindex a document

2012-10-22 Thread Shaya Potter
so there are lots of Qs that are asked about wanting to modify a Lucene document (i.e. remove fields, add fields) but are told that one needs to reindex. No one ever answers the technical Q of why this is, and I'm interested in that. presumably because documents aren't stored as document

Re: easy way to figure out most common tokens?

2012-08-19 Thread Shaya Potter
On 08/19/2012 08:07 PM, Shaya Potter wrote: On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove those tokens from the documents. Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc

Re: easy way to figure out most common tokens?

2012-08-19 Thread Shaya Potter
On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove those tokens from the documents. Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html unsure how to use this as far as I can

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
- From: Shaya Potter [mailto:spot...@gmail.com] Sent: Wednesday, August 15, 2012 8:44 PM To: java-user@lucene.apache.org Subject: Re: easy way to figure out most common tokens? On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove those tokens from the documents. Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html ah, that's a good part 1. Then the Q w

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
On 08/15/2012 02:29 PM, Erick Erickson wrote: I don't see how you could without indexing everything first since you can't know what the most frequent terms are until you've processed all your documents exactly If you know these terms in advance, it seems like you could just call them stopword
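
A rough sketch of the two-pass approach this thread circles around, assuming documents are already tokenized into lists of strings. The function names (most_common_tokens, strip_tokens) are hypothetical, and this uses plain Python rather than Lucene's HighFreqTerms: a first pass counts how many documents each token appears in, and a second pass treats the top tokens as stopwords.

```python
from collections import Counter


def most_common_tokens(docs, top_n):
    """Pass 1: return the top_n tokens ranked by document frequency."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return {token for token, _ in doc_freq.most_common(top_n)}


def strip_tokens(docs, stopwords):
    """Pass 2: drop the chosen tokens from every document."""
    return [[t for t in tokens if t not in stopwords] for tokens in docs]


# Example: the list address appears in every message, so it carries no signal.
docs = [
    ["list@example.org", "alice@example.org"],
    ["list@example.org", "bob@example.org"],
    ["list@example.org"],
]
common = most_common_tokens(docs, top_n=1)
cleaned = strip_tokens(docs, common)
```

Counting per document rather than per occurrence is a design choice here: it targets tokens that show up in nearly every message, which matches the mailing-list use case in the original question.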

easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
Is there an easy way to figure out the most common tokens and then remove those tokens from the documents? Use case: imagine one is indexing a mailing list (such as this java-user) and is extracting all e-mail addresses in the messages and adding them to a doc. What that means is that one wi