[
https://issues.apache.org/jira/browse/LUCENE-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gergő Törcsvári updated LUCENE-5736:
------------------------------------
Attachment: CachingNaiveBayesClassifier.java
The attached class is a working copy!
This is a caching version of the SimpleNaiveBayes classifier. The cache is a
hash map: when a word is needed, we search the index for it in all classes and
put the result into the map. Next time, we read it from the cache instead of
searching the index again.
Cache (re)initialization recalculates docsWithClassSize, clears the hash maps,
and prepares new ones. Two maps and a list are needed: the first map contains
term -> classes -> termInClassOccurrence (this is the cache), the list
contains the class names, and the second map contains
class -> avgUniqueTermNumber. The last two are fully preloaded; the first is
built dynamically during searches.
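
To illustrate the structures described above, here is a minimal sketch. It is
not the attached CachingNaiveBayesClassifier: the class name
TermClassCacheSketch, the methods reInit and countsFor, and the indexLookup
function (a stand-in for the real index search) are all hypothetical names
chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of the cache layout; not the attached class.
class TermClassCacheSketch {
    // term -> (class name -> occurrences of the term in that class).
    // This is the cache; it is filled lazily during searches.
    private final Map<String, Map<String, Integer>> termCache = new HashMap<>();
    // Fully preloaded on (re)initialization.
    private final List<String> classNames = new ArrayList<>();
    private final Map<String, Double> classAvgUniqueTerms = new HashMap<>();
    private int docsWithClassSize;

    // Assumption: stands in for the index search, (term, class) -> count.
    private final BiFunction<String, String, Integer> indexLookup;
    // Counts how often the stand-in index was actually queried.
    int indexHits = 0;

    TermClassCacheSketch(BiFunction<String, String, Integer> indexLookup) {
        this.indexLookup = indexLookup;
    }

    // Drops all cached terms and reloads the two preloaded structures.
    void reInit(int docsWithClassSize, Map<String, Double> avgUniqueTermsPerClass) {
        this.docsWithClassSize = docsWithClassSize;
        termCache.clear();
        classNames.clear();
        classAvgUniqueTerms.clear();
        classNames.addAll(avgUniqueTermsPerClass.keySet());
        classAvgUniqueTerms.putAll(avgUniqueTermsPerClass);
    }

    // Returns the per-class counts for a term, querying the index only on a miss.
    Map<String, Integer> countsFor(String term) {
        Map<String, Integer> cached = termCache.get(term);
        if (cached != null) {
            return cached;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String className : classNames) {
            indexHits++;
            counts.put(className, indexLookup.apply(term, className));
        }
        termCache.put(term, counts);
        return counts;
    }

    int getDocsWithClassSize() {
        return docsWithClassSize;
    }
}
```

In this sketch, the first call to countsFor for a term queries the index once
per class; later calls for the same term are served from the map until reInit
is called again.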
If there are a lot of terms and/or classes, the cache needs a lot of memory,
so there is a built-in option for limiting the cache size. If a term is really
rare, we expect that it will rarely appear in other documents too, so it is
left out of the cache. There is also an option to leave such terms out of the
classification calculation entirely.
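
One way such a cutoff could look is sketched below. This is an assumption for
illustration, not the logic of the attached class: the name
RareTermPolicySketch, the minDocFreq threshold, and the dropRareFromScoring
flag are hypothetical, and using document frequency as the rarity measure is
also an assumption.

```java
// Hypothetical sketch of a rare-term cutoff; not the attached class.
class RareTermPolicySketch {
    // Assumption: a term is "rare" if it occurs in fewer documents than this.
    private final int minDocFreq;
    // If true, rare terms are also ignored when computing the class score.
    private final boolean dropRareFromScoring;

    RareTermPolicySketch(int minDocFreq, boolean dropRareFromScoring) {
        this.minDocFreq = minDocFreq;
        this.dropRareFromScoring = dropRareFromScoring;
    }

    boolean isRare(int docFreq) {
        return docFreq < minDocFreq;
    }

    // Rare terms are never stored in the cache.
    boolean shouldCache(int docFreq) {
        return !isRare(docFreq);
    }

    // Rare terms still take part in scoring unless the flag says otherwise.
    boolean shouldScore(int docFreq) {
        return !(dropRareFromScoring && isRare(docFreq));
    }
}
```

With dropRareFromScoring set to false, a rare term would still be looked up in
the index on every use, just never cached; with it set to true, the term would
be skipped altogether.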
> Separate the classifiers to online and caching where possible
> -------------------------------------------------------------
>
> Key: LUCENE-5736
> URL: https://issues.apache.org/jira/browse/LUCENE-5736
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: modules/classification
> Reporter: Gergő Törcsvári
> Attachments: CachingNaiveBayesClassifier.java
>
>
> The Lucene classifier implementations are now near-online if they get a near
> real-time reader. This is good for users who have a continuously changing
> dataset, but slow for datasets that do not change.
> The idea is: what if we implement a cache and speed up the results where
> possible?
--
This message was sent by Atlassian JIRA
(v6.2#6252)