Hi Erik,
Yes, they help with exactly this problem - they encode every possible
Long as a 14-character string. So they free me from the limitation that
I wrote about (years 1970 - 2280).
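To illustrate the idea of a fixed-width radix-36 encoding, here is a minimal sketch. It is not the NumberTools source: the class name, the '0' prefix character and the restriction to non-negative values are assumptions made for the example. It only shows why such strings sort lexicographically in the same order as the numbers they encode.

```java
// Sketch only: a fixed-width radix-36 encoding for non-negative longs.
// Radix36Sketch and its methods are hypothetical names, not Lucene API.
final class Radix36Sketch {
    static final int RADIX = 36;
    // Long.MAX_VALUE needs 13 radix-36 digits; one prefix character
    // brings the total width to 14.
    static final int WIDTH = 14;

    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        String digits = Long.toString(value, RADIX);
        StringBuilder sb = new StringBuilder(WIDTH);
        sb.append('0'); // assumed prefix marking a non-negative value
        for (int i = digits.length(); i < WIDTH - 1; i++) {
            sb.append('0'); // left-pad so every encoded value has the same length
        }
        return sb.append(digits).toString();
    }

    static long decode(String encoded) {
        // Drop the prefix character, then parse the remaining 13 digits.
        return Long.parseLong(encoded.substring(1), RADIX);
    }

    public static void main(String[] args) {
        long utcMillis = 1189703940000L; // a UTC millisecond value from September 2007
        System.out.println(encode(utcMillis));
        System.out.println(encode(Long.MAX_VALUE));
        // Same-length strings compare in the same order as the numbers.
        System.out.println(encode(utcMillis).compareTo(encode(Long.MAX_VALUE)) < 0);
    }
}
```

Because every encoded value has the same length and the digit characters sort before the letters, plain string comparison gives the numeric order.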
So now the advantage I get from treating fields that represent UTC
dates as numbers (rather than as strings) seems to be somewhat smaller.
If I store the UTCs in radix 36, then in my real situation (I have only
2 UTC fields, which hold dates from these years) I will save
approximately 20MB per 1 million docs, which is 0.5 - 2% of the index
size.
But maybe the approach that I described in my first mail (and have
tested successfully by now) will help someone who uses float values in
the index, or longs that vary considerably in length.
Best Regards,
Ivan
Erik Hatcher wrote:
Ivan - have you considered using NumberTools?
<http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/NumberTools.html>
I'm curious if those utility methods solve the same problem you're
working on.
Erik
On Sep 13, 2007, at 1:19 PM, Ivan Vasilev wrote:
Hi All,
I have made some changes in my Lucene source so that the values of
numeric fields are treated as numbers rather than as Strings. After
testing, everything seems to work correctly, but I would still like to
know your opinion about this.
So my approach is the following:
1. Since the terms are ordered by their values during the indexing
process, I changed the methods
TermBuffer.compareTo(TermBuffer other) and Term.compareTo(Term other)
so that when the field contains numeric data the comparison is made
using numeric logic (Integer.compareTo(..), Float.compareTo(..)).
This orders the terms numerically rather than lexicographically.
2. To make range searches work correctly, similar changes were made in
the RangeFilter.bits(IndexReader reader) and
RangeQuery.rewrite(IndexReader reader) methods.
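The difference between the two orderings can be sketched outside Lucene as follows. This is not the actual patch to Term, TermBuffer, RangeFilter or RangeQuery. The class and method names are hypothetical, and the term texts are assumed to be plain decimal longs.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch only: compares lexicographic and numeric ordering of term texts.
// All names here are hypothetical and none of this is Lucene API.
final class NumericTermOrderSketch {
    // Default behavior: terms compared as strings.
    static final Comparator<String> LEXICOGRAPHIC = Comparator.naturalOrder();
    // Assumed numeric behavior: term texts parsed as longs before comparing.
    static final Comparator<String> NUMERIC = Comparator.comparingLong(Long::parseLong);

    static String[] sorted(Comparator<String> order, String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy, order);
        return copy;
    }

    // Inclusive range test, the kind of check a range search depends on.
    static boolean inRange(Comparator<String> order, String term, String lower, String upper) {
        return order.compare(lower, term) <= 0 && order.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted(LEXICOGRAPHIC, "9", "10", "100")));
        System.out.println(Arrays.toString(sorted(NUMERIC, "9", "10", "100")));
        // Is "10" inside the range 9..100 under each ordering?
        System.out.println(inRange(LEXICOGRAPHIC, "10", "9", "100"));
        System.out.println(inRange(NUMERIC, "10", "9", "100"));
    }
}
```

With string comparison "9" sorts after "100", so a range from 9 to 100 is empty. That is why the term ordering and the range code have to use the same numeric comparison.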
The changes seem to be very simple, and I have not found a case where
they lead to wrong behavior.
Before these changes, to make range queries on numeric fields work I
gave the values of those fields a fixed length, so that the
lexicographical order was the same as the numeric one. I had to keep
dates in some fields, so I made them 13-character fields that hold the
UTC representation of the date. When the UTC value was shorter I
prefixed it with zeroes. This made range searches work correctly for
the time interval 1970 - 2280 approximately. Now with the new
implementation I do not have any restrictions on this time interval.
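The old fixed-length scheme can be sketched like this, assuming the UTC value is a millisecond count since 1970 (which is what makes current dates 13 digits long). The class and method names are made up for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Locale;

// Sketch only: zero-padding UTC millisecond values to 13 characters.
// ZeroPaddedUtcSketch is a hypothetical name, not Lucene API.
final class ZeroPaddedUtcSketch {
    static final int WIDTH = 13;
    static final long MAX_VALUE = 9_999_999_999_999L; // largest 13-digit value

    static String pad(long utcMillis) {
        if (utcMillis < 0 || utcMillis > MAX_VALUE) {
            throw new IllegalArgumentException("value does not fit in " + WIDTH + " digits");
        }
        // Locale.ROOT keeps the digits ASCII whatever the default locale is.
        return String.format(Locale.ROOT, "%013d", utcMillis);
    }

    // Year of the last instant that still fits into 13 digits.
    static int lastRepresentableYear() {
        return Instant.ofEpochMilli(MAX_VALUE).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        System.out.println(pad(0L));             // start of 1970
        System.out.println(pad(1189703940000L)); // a value from September 2007
        System.out.println(lastRepresentableYear());
    }
}
```

Any value outside the 13-digit range is rejected here, which corresponds to the limited time interval described above.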
Some time ago I dug through the Lucene forum and saw that there were
some discussions about this. So if anybody else uses such an approach,
or has had a bad experience with it, please tell me.
Thanks in advance.
Best Regards,
Ivan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]