On Thu, Aug 8, 2013 at 12:54 PM, Anna Björk Nikulásdóttir
<a...@skerpa.com> wrote:
>
> On 8.8.2013 at 12:37, Michael McCandless <luc...@mikemccandless.com> wrote:
>
>> <snip>
>>> What would help in my case, since I use the same FST for both suggesters, 
>>> is if the same FST object could be shared between them. So what I am doing 
>>> is to use AnalyzingSuggester.store() and use the stored file for 
>>> AnalyzingSuggester.load() and FuzzySuggester.load().
>>
>> That's interesting ... so you mean you sometimes want fuzzy
>> suggestions and sometimes non-fuzzy ones, off the same built
>> suggester?  I believe AnalyzingSuggester and FuzzySuggester in fact
>> use the same FST (not certain) ... are you able to do
>> FuzzySuggester.load from a previous AnalyzingSuggester.store and it
>> works?  And that's still too much RAM?
>>
>
> Yes it works like a charm.

That's good to know!
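
For anyone following along, the pattern is roughly this (just a sketch;
"suggest.bin", StandardAnalyzer and the Version constant are stand-ins for
whatever you actually build with):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
    import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
    import org.apache.lucene.util.Version;

    public class SharedSuggesterSketch {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);

        // Build once, persist once:
        AnalyzingSuggester builder = new AnalyzingSuggester(analyzer);
        // builder.build(...);  // from your dictionary / term iterator
        builder.store(new FileOutputStream("suggest.bin"));

        // Later (e.g. on the device), load the same stored automaton into
        // both suggesters; this is what currently costs 2x the FST RAM:
        AnalyzingSuggester exact = new AnalyzingSuggester(analyzer);
        exact.load(new FileInputStream("suggest.bin"));

        FuzzySuggester fuzzy = new FuzzySuggester(analyzer);
        fuzzy.load(new FileInputStream("suggest.bin"));
      }
    }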

> I use it for auto-completion of non-English language terms. Often the typed 
> beginning of a term can be used as is, and then AnalyzingSuggester gives the 
> best results, whereas FuzzySuggester would give too many results that need a 
> lot of post-processing. If the user is lazy, or because the Android keyboard 
> doesn't always provide easy access to specific letters, e.g. 'æ', 'ä', 'ß', 
> etc., or if they mistype some letters, I use FuzzySuggester as a fallback when 
> AnalyzingSuggester doesn't yield appropriate results. It's a bit of a kludge 
> because FuzzySuggester doesn't boost minimal Levenshtein-distance terms.

This (not giving a better score for lookups that require fewer edits)
was a concern on the original FuzzySuggester issue ... can you open a
separate issue to explore this?  Really it should score
"appropriately", in which case maybe you could just use
FuzzySuggester on its own?  I don't know if anyone has time right now
to work out a patch, but we should at least open the issue ...
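
In the meantime, the fallback you describe would look roughly like this
(continuing the sketch above; userInput and the result count are made up):

    import java.util.List;

    import org.apache.lucene.search.suggest.Lookup.LookupResult;

    // Try the exact (analyzing) suggester first, and only pay the fuzzy
    // cost, and its looser matches, when it comes back empty:
    List<LookupResult> hits = exact.lookup(userInput, false, 10);
    if (hits.isEmpty()) {
      hits = fuzzy.lookup(userInput, false, 10);
    }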

> Performance-wise this is absolutely no problem on Android, but memory-wise it 
> means 2x the FST memory. At the moment one FST needs ~20MB. If, for example, I 
> wanted to support multiple languages simultaneously, it's not going to work 
> this way.

OK.

> Ideally all this could be done on disk/flash only. But that would need changes 
> along the lines of your earlier proposal via DirectByteBuffer. Do you think 
> going this way would yield acceptable performance? And wouldn't mapping a file 
> into memory fill the DRAM with the complete content of the file over time? 
> Are "normal" Lucene indexes accessed this way?

Well, we'd need to test performance.  Unfortunately, access to the FST
is rather random, so if the OS hasn't already pulled the pages into
RAM, ie if the seeks are "cold", performance will suffer.  But it
could be fine in your case.  Still, this (accessing the FST from disk)
is a biggish change ...
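
(And yes, "normal" Lucene index files are commonly read this way, via
MMapDirectory.)  To be concrete, the mapping itself is just the stock
java.nio call; the file name below is hypothetical, and the real work
would be teaching the FST to read its bytes from the mapped buffer
instead of an in-heap byte[]:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    RandomAccessFile raf = new RandomAccessFile("suggest.bin", "r");
    FileChannel channel = raf.getChannel();
    // The OS pages bytes in on demand: frequently touched pages stay
    // cached in RAM, while cold (random) seeks go to disk/flash.
    MappedByteBuffer fstBytes =
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    raf.close();  // the mapping stays valid after the file is closed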

>>> Unfortunately there is no immutable FST class, but as I do not use it in a 
>>> multithreaded environment, that is probably not a problem, no? A quick fix 
>>> could be to copy the analyzer classes, change them to behave this way, and 
>>> reuse the FST object. Does this make sense functionality-wise, or do I have 
>>> to expect problems?
>>
>> Sharing an FST across analyzing and fuzzy suggesters does seem
>> worthwhile; it may "just work" today…
>
> I will try it then. Do you have any indication that it might stop working at 
> some point in the future?

Can you also open a separate issue for this (allowing both fuzzy and
non-fuzzy access to one FST)?  Today the formats are in fact
identical, but unless we make an effort to support this (it could be
as easy as accepting maxEdits=0 ... hmm, is that allowed / does it
"just work" today?) they can easily diverge over time.  It's
crazy that you have to load the same FST twice today...

Maybe we just merge the two suggesters ... who knows :)  These classes
are all very new and experimental so we should feel free to do heavy
iterating!

Mike McCandless

http://blog.mikemccandless.com
