mbien opened a new pull request, #302:
URL: https://github.com/apache/maven-indexer/pull/302
This is a draft intended as a talking point. It does the following:
- uses `IndexUpdateRequest` as configuration for `IndexDataReader` (tmp
folder, factory, etc.)
- moves the filtering from post extraction to the read phase
- makes filtering really fast (and multi threaded too), it is no longer
an extra step
- actually affects the on-disk index size (I've learned that Lucene
doesn't really remove things, since all files are immutable)
I do realize that this is not quite the same behavior as before. To get the
exact same behavior back, we could add this as an additional filter: one during
read (new), one after extraction (old).
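To make the difference between the two stages concrete, here is a minimal sketch that uses plain Java collections instead of Lucene. Everything in it is hypothetical and not part of maven-indexer: the class `TwoStageFilterSketch`, the methods `readPhase` and `postExtraction`, and the modeling of records as simple maps. The size of the `written` list merely stands in for the on-disk index size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in types; this is not the maven-indexer API.
final class TwoStageFilterSketch {

    // Stage 1 (new): the filter runs while records are read,
    // so rejected records are never written at all.
    static List<Map<String, String>> readPhase(
            List<Map<String, String>> source,
            Predicate<Map<String, String>> readFilter) {
        List<Map<String, String>> written = new ArrayList<>();
        for (Map<String, String> rec : source) {
            if (readFilter.test(rec)) {
                written.add(rec);
            }
        }
        return written;
    }

    // Stage 2 (old): the filter runs over records that were already written.
    // The written list is left untouched; only the returned view shrinks.
    static List<Map<String, String>> postExtraction(
            List<Map<String, String>> written,
            Predicate<Map<String, String>> postFilter) {
        List<Map<String, String>> visible = new ArrayList<>(written);
        visible.removeIf(postFilter.negate());
        return visible;
    }
}
```

In this sketch only the read-phase filter reduces what gets written, which mirrors the index size effect described above. The second stage only changes what is visible afterwards.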
Example filter:
```java
final Instant cutoff = ZonedDateTime.now().minusYears(2).toInstant();
iur.setDocumentFilter((doc) -> {
    IndexableField field = doc.getField("m"); // usually never null
    return field != null
            && Instant.ofEpochMilli(Long.parseLong(field.stringValue())).isAfter(cutoff);
});
```
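The cutoff logic of the example can also be tried without Lucene on the classpath. The following is a minimal sketch under stated assumptions: the class `CutoffFilterSketch` and the method `newerThan` are hypothetical, a record is modeled as a plain `Map`, and the `"m"` value is assumed to hold epoch milliseconds as a string, as the parsing in the example implies. Rejecting malformed timestamps rather than throwing is a choice made for this sketch only.

```java
import java.time.Instant;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper; a Map stands in for a Lucene Document.
final class CutoffFilterSketch {

    // Returns a predicate that accepts records whose "m" value
    // (assumed: epoch millis as a string) lies strictly after the cutoff.
    static Predicate<Map<String, String>> newerThan(Instant cutoff) {
        return rec -> {
            String m = rec.get("m");
            if (m == null) {
                return false; // no timestamp: reject, like the null check in the example
            }
            try {
                return Instant.ofEpochMilli(Long.parseLong(m)).isAfter(cutoff);
            } catch (NumberFormatException e) {
                return false; // sketch-only choice: reject malformed timestamps
            }
        };
    }
}
```

Note that `isAfter` is strict, so a record stamped exactly at the cutoff is rejected.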
Results (single-threaded, since MT has an index size penalty due to merge
overhead):
```
full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB
```
I also tried removing some fields from within the filter (e.g.
description), but this had no impact at all (again, this is probably
just me not understanding Lucene). Intuitively, 1.4 GB for one year of Maven
artifacts still sounds like a bit more than it should be.
`mvn spotless:apply` is responsible for the formatting changes.
Would fix MINDEXER-185.