[
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-5914:
--------------------------------
Attachment: LUCENE-5914.patch
Some updates:
* port to trunk apis (i guess this was outdated?)
* fix some javadoc bugs
* nuke lots of now-unused stuff in .compressing, only still used for term
vectors
* improve float/double compression: it was not very effective and was wasteful.
floats now take 1..5 bytes and doubles 1..9 bytes.
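The message does not show how the patch encodes floats. As a rough illustration of how a 1..5 byte range can arise, here is a minimal sketch of one possible scheme. It assumes small integral values get a single tagged byte, other non-negative floats keep their raw 4 bytes, and everything else gets a marker byte plus 4 bytes. The class and method names are hypothetical and nothing here is taken from the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a variable-length float encoding (1, 4 or 5 bytes).
// Assumption: this is one possible scheme, not the code from the patch.
final class ZFloatSketch {
  private static final int NEGATIVE_ZERO_BITS = Float.floatToIntBits(-0.0f);

  static byte[] encode(float f) {
    ByteBuffer buf = ByteBuffer.allocate(5); // big-endian by default
    int intVal = (int) f;
    int bits = Float.floatToIntBits(f);
    if (f == intVal && intVal >= -1 && intVal <= 125 && bits != NEGATIVE_ZERO_BITS) {
      // Small integral value in [-1, 125]: one byte in the range 0x80..0xFE.
      buf.put((byte) (0x80 | (intVal + 1)));
    } else if ((bits >>> 31) == 0) {
      // Sign bit is 0, so the first byte is below 0x80: raw 4 bytes.
      buf.putInt(bits);
    } else {
      // Sign bit is 1: 0xFF marker followed by the raw 4 bytes.
      buf.put((byte) 0xFF).putInt(bits);
    }
    byte[] out = new byte[buf.position()];
    buf.flip();
    buf.get(out);
    return out;
  }

  static float decode(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    int first = buf.get(0) & 0xFF;
    if (first == 0xFF) {
      return Float.intBitsToFloat(buf.getInt(1)); // marker + raw bits
    }
    if ((first & 0x80) != 0) {
      return (first & 0x7F) - 1; // single-byte small integer
    }
    return Float.intBitsToFloat(buf.getInt(0)); // raw non-negative float
  }
}
```

A double variant could follow the same idea with an 8-byte payload, which would give the 1..9 byte range, but that is likewise an assumption.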
We should try to do more cleanup:
* I don't like the delegator. maybe it's the best solution, but at least it
should not write its own file. I think we should revive SI.attributes
(properly: so it rejects any attribute puts on dv updates) and use that.
* The delegator shouldn't actually need to delegate the writer? If I add this
code, all tests pass:
{code}
final StoredFieldsWriter in = format.fieldsWriter(directory, si, context);
if (true) return in; // wrapper below is useless
return new StoredFieldsWriter() {
{code}
This seems to be all about delegating some manual file deletion on abort()? Do
we really need to do this? If we have some bugs around IndexFileDeleter where
it doesn't do the right thing, enough to warrant such APIs, then we should have
tests for it. Such tests would also show the current code deletes the wrong
filename:
{code}
// formatName is NOT the file the delegator writes
IOUtils.deleteFilesIgnoringExceptions(directory, formatName);
{code}
But this is obsolete if we add back SI.attributes.
* The header check logic should be improved. I don't know why we need the
Reader.checkHeader method: why can't we just check it with the other files?
* We should try to use checkFooter(Input, Throwable) for better corruption
messages, with this type of logic. It does more and appends suppressed
exceptions when things go wrong:
{code}
try (ChecksumIndexInput input = (...)) {
  Throwable priorE = null;
  try {
    // ... read a bunch of stuff ...
  } catch (Throwable exception) {
    // remember the failure so checkFooter can attach checksum details to it
    priorE = exception;
  } finally {
    CodecUtil.checkFooter(input, priorE);
  }
}
{code}
* Any getChildResources() should return an immutable list: that doesn't seem to
always be the case. Maybe AssertingCodec can be improved to actually test this
automatically.
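To make the immutability point concrete, here is a minimal sketch of returning a read-only view of the children, plus the kind of probe an asserting wrapper might run. The `Resource` interface and all class names are hypothetical stand-ins, not Lucene's actual API, and the probe is only an assumption about how such a check could work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; Resource stands in for whatever type exposes
// getChildResources() and is NOT Lucene's real interface.
final class ChildResourcesSketch {
  interface Resource {
    List<Resource> getChildResources();
  }

  static final class Leaf implements Resource {
    @Override
    public List<Resource> getChildResources() {
      return Collections.emptyList(); // already immutable
    }
  }

  static final class Parent implements Resource {
    private final List<Resource> children = new ArrayList<>();

    Parent(List<Resource> initial) {
      children.addAll(initial); // defensive copy of the caller's list
    }

    @Override
    public List<Resource> getChildResources() {
      // Read-only view: callers cannot modify the internal list.
      return Collections.unmodifiableList(children);
    }
  }

  // Probe an asserting wrapper could use: try to add an element and
  // expect a rejection. If the add succeeds, it is undone again.
  static boolean isImmutable(List<Resource> list) {
    try {
      list.add(new Leaf());
    } catch (UnsupportedOperationException e) {
      return true;
    }
    list.remove(list.size() - 1);
    return false;
  }
}
```

A test wrapper could call `isImmutable` on every returned list and fail when it reports false.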
I will look more tomorrow.
> More options for stored fields compression
> ------------------------------------------
>
> Key: LUCENE-5914
> URL: https://issues.apache.org/jira/browse/LUCENE-5914
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 5.0
>
> Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have heard
> from about as many users complaining that compression was too aggressive as
> users complaining that it was too light.
> I think it is due to the fact that we have users that are doing very
> different things with Lucene. For example if you have a small index that fits
> in the filesystem cache (or is close to), then you might never pay for actual
> disk seeks and in such a case the fact that the current stored fields format
> needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like
> log analytics, and in that case you have huge amounts of data for which you
> don't care much about stored fields performance. However it is very
> frustrating to notice that the data that you store takes several times less
> space when you gzip it compared to your index although Lucene claims to
> compress stored fields.
> For that reason, I think it would be nice to have some kind of option that
> would allow trading speed for compression in the default codec.