Re: [Python-ideas] Use lazy loading with hashtable in python gettext module

Serge Ballesta via Python-ideas Sun, 23 Dec 2018 09:31:21 -0800

Hi all,

The feedback on my initial mail convinced me that it was important to allow the current behaviour of eagerly loading the whole catalog, and that keeping the files open should also be optional.


All that led to this proposal:

Features:
========

The gettext module should be able to load catalogs lazily from mo files. This lazy loading should be optional, and should make use of the hash tables from mo files when they are present or fall back to a binary search. The translation strings should be cached for better performance.


API changes:
============

Three functions from the gettext module would gain two new optional parameters named caching and keepopen:


gettext.bindtextdomain(domain, localedir=None) would become
gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False)

gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None) would become
gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None, caching=None, keepopen=False)

gettext.install(domain, localedir=None, codeset=None, names=None) would become
gettext.install(domain, localedir=None, codeset=None, names=None, caching=None, keepopen=False)


The new caching parameter could receive the following values:

caching=None: revert to the previous eager loading of the full catalog. It would be the default, so that existing applications see no change

caching=1: lazy loading with unlimited cache

caching=n where n is a positive (>=0) integer value: lazy loading with an LRU cache limited to n strings


The keepopen parameter would be a boolean:

keepopen=False (default): the mo file is only opened before loading a translation string and closed immediately after. It is also opened once when the GNUTranslations class is initialized, to load the file description.

keepopen=True: the mo file is kept open during the lifetime of the GNUTranslations object.

This parameter is ignored if caching is None.

Implementation:
==============

The current GNUTranslations class loads the content of the mo file to build a dictionary where the original strings are the keys and the translated strings the values. Plural forms get special processing: the key is a 2-tuple (singular original string, order), and the value is the corresponding translated string. order=0 is normally for the singular translated string.
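To make the dictionary layout concrete, here is a small sketch that builds a tiny mo image in memory and inspects what GNUTranslations stores. The helper build_mo is hypothetical and written only for this illustration, and _catalog is a private attribute of the class, read here purely for inspection.

```python
import gettext
import io
import struct


def build_mo(entries):
    """Pack {msgid: msgstr} (bytes) into a minimal little-endian mo image.

    Hypothetical helper for this example: no hash table is written.
    """
    items = sorted(entries.items())
    count = len(items)
    key_table = 28                        # right after the 7-word header
    value_table = key_table + 8 * count
    data_start = value_table + 8 * count
    index, blob = [], b""
    # All original strings first, then all translations, each NUL-terminated.
    for text in [k for k, _ in items] + [v for _, v in items]:
        index.append(struct.pack("<2I", len(text), data_start + len(blob)))
        blob += text + b"\0"
    header = struct.pack("<7I", 0x950412DE, 0, count,
                         key_table, value_table, 0, 0)
    return header + b"".join(index) + blob


entries = {
    b"": b"Content-Type: text/plain; charset=UTF-8\n",
    b"Hello": b"Bonjour",
    # Plural entry: both msgids and both forms are separated by NUL.
    b"%d file\0%d files": b"%d fichier\0%d fichiers",
}
translations = gettext.GNUTranslations(io.BytesIO(build_mo(entries)))
catalog = translations._catalog           # private, for illustration only

print(catalog["Hello"])
print(catalog[("%d file", 0)], "/", catalog[("%d file", 1)])
```

Any replacement mapping would have to answer the same two kinds of keys: plain strings, and (singular string, order) tuples.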

The proposed implementation would simply replace this dictionary with a special mapping subclass when caching is not None. That subclass would use the same keys as the original dictionary and would:

- first search in its cache

- if not found in the cache and the hash table has a non-zero size, search for the original string by hash

- if not found in the cache and the hash table has a zero size, search for the original string with a binary search algorithm

- if a string is found, feed it to the LRU cache, possibly evicting the oldest entry (or entries)
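The steps above could be sketched roughly as follows. This is only an illustration of the idea, not the proposed patch: the class name LazyCatalog and its lookup and keys callables are assumptions, with lookup standing in for whichever file search (hash or binary) applies.

```python
from collections import OrderedDict
from collections.abc import Mapping


class LazyCatalog(Mapping):
    """Hypothetical read-only mapping that fetches translations on demand."""

    def __init__(self, lookup, keys, maxsize=None):
        # lookup:  callable(key) -> translation or None; stands in for the
        #          hash-table or binary search in the mo file.
        # keys:    callable() -> iterable of all keys, used only to iterate.
        # maxsize: None means an unlimited cache, otherwise an LRU bound.
        self._lookup = lookup
        self._keys = keys
        self._maxsize = maxsize
        self._cache = OrderedDict()

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)          # mark as most recently used
            return self._cache[key]
        value = self._lookup(key)
        if value is None:
            raise KeyError(key)
        self._cache[key] = value
        if self._maxsize is not None:
            while len(self._cache) > self._maxsize:
                self._cache.popitem(last=False)   # evict least recently used
        return value

    def __iter__(self):
        return iter(self._keys())

    def __len__(self):
        return sum(1 for _ in self._keys())
```

Because Mapping supplies get() on top of __getitem__, code that only calls catalog.get(key, default) or indexes the catalog would keep working against such an object.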

That should allow the new feature to be implemented with minimal refactoring of the gettext module.


On 18/12/2018 at 10:10, Serge Ballesta via Python-ideas wrote:

In a project of mine, I have used the gettext module from the Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (po) one: pybabel (http://babel.pocoo.org/en/latest/), msgfmt.py from Python tools or the original msgfmt from GNU gettext.
I found that only the original msgfmt was able to generate a hashtable, and that in any case the Python gettext module loaded everything in memory and did not use it. But I also found a TODO note saying
# TODO:
# - Lazy loading of .mo files.  Currently the entire catalog is loaded into
#   memory, but that's probably bad for large translated programs.  Instead,
#   the lexical sort of original strings in GNU .mo files should be exploited
#   to do binary searches and lazy initializations.  Or you might want to use
#   the undocumented double-hash algorithm for .mo files with hash tables, but
#   you'll need to study the GNU gettext code to do this.
I have studied the GNU gettext code and found that implementing the hashing algorithm in Python would not be that hard.
The undocumented features required for implementation are:
- the version number can safely stay at 0 when processing Python code
- the size of the hash table is the first odd prime greater than or equal to 4 * n / 3, where n is the number of strings
- the first hashing function uses a variant of the PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~high is replaced with h = h ^ high, and using 32-bit integers. The index in the table is the result of the function modulo the size of the hash table
- when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used:
  - let h be the result of the PJW variant hash function and size be the size of the hash table; an increment value is set to 1 + (h % (size - 2))
  - that increment is repeatedly added to the index in the hash table (modulo the table size) until an empty slot is found (or the correct original string is found)
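As an illustration of the rules listed above, here is a minimal sketch of the hash, the table size and the probing sequence on an in-memory table. It follows the description in this message and has not been checked against output from GNU msgfmt. All function names are hypothetical, and three points are assumptions: 4 * n / 3 is taken as an integer division, the table is never smaller than 3 slots, and a slot holds the string number plus one, with 0 meaning empty.

```python
def hash_pjw32(data: bytes) -> int:
    """PJW variant as described above, kept within 32 bits."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
            h ^= high                 # instead of h & ~high
    return h


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def table_size(n: int) -> int:
    """First odd prime >= 4 * n // 3 (assumed minimum of 3)."""
    size = max(3, 4 * n // 3) | 1
    while not is_prime(size):
        size += 2
    return size


def probe(key: bytes, size: int):
    """Yield the slots to inspect for key, in order."""
    h = hash_pjw32(key)
    index = h % size
    step = 1 + h % (size - 2)
    while True:
        yield index
        index = (index + step) % size


def build_table(keys):
    """Fill a table for keys; slot value is string number + 1, 0 is empty."""
    size = table_size(len(keys))
    table = [0] * size
    for number, key in enumerate(keys, start=1):
        for index in probe(key, size):
            if table[index] == 0:
                table[index] = number
                break
    return table


def find(table, keys, key):
    """Return the position of key in keys, or None if it is absent."""
    for index in probe(key, len(table)):
        number = table[index]
        if number == 0:
            return None               # empty slot: the key is not present
        if keys[number - 1] == key:
            return number - 1
```

Since the table size is prime and the increment is between 1 and size - 2, the probe sequence visits every slot, so a lookup always ends on either the key or an empty slot.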
For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and to use the hashtable to access the strings.
Remaining problems:
- I had to read GPL copyrighted code to find the undocumented features. I have of course written my own code from scratch, but may I use an Apache Free License 2.1 on it?
- the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file open to be able to access it with the mmap module. There could be use cases where the first option is better
- I should either rely on the current way (load everything in memory) or implement a binary search algorithm for the case where the hash table is not present (it is of course optional)
- it would be an important change, and I think that options should allow choosing between eager and lazy access
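For the case without a hash table, a binary search over the table of original strings could look like the sketch below. It assumes a little-endian mo image whose original strings are sorted bytewise, and it ignores plural entries. The names find_in_mo and build_mo are hypothetical, and build_mo exists only to produce a small image to search.

```python
import struct

LE_MAGIC = 0x950412DE


def build_mo(pairs):
    """Pack (msgid, msgstr) byte pairs into a small mo image, no hash table."""
    pairs = sorted(pairs)
    count = len(pairs)
    key_table = 28
    value_table = key_table + 8 * count
    data_start = value_table + 8 * count
    index, blob = [], b""
    for text in [k for k, _ in pairs] + [v for _, v in pairs]:
        index.append(struct.pack("<2I", len(text), data_start + len(blob)))
        blob += text + b"\0"
    header = struct.pack("<7I", LE_MAGIC, 0, count, key_table, value_table, 0, 0)
    return header + b"".join(index) + blob


def find_in_mo(buf, msgid: bytes):
    """Binary search for msgid; return its translation bytes or None."""
    magic, _version, count, key_table, value_table = struct.unpack_from("<5I", buf, 0)
    if magic != LE_MAGIC:
        raise ValueError("this sketch only handles little-endian mo images")
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        length, offset = struct.unpack_from("<2I", buf, key_table + 8 * mid)
        candidate = bytes(buf[offset:offset + length])
        if candidate == msgid:
            tlen, toff = struct.unpack_from("<2I", buf, value_table + 8 * mid)
            return bytes(buf[toff:toff + tlen])
        if candidate < msgid:
            lo = mid + 1
        else:
            hi = mid
    return None


image = build_mo([(b"zebra", b"zebre"), (b"apple", b"pomme"), (b"house", b"maison")])
print(find_in_mo(image, b"house"))
```

Only the two index entries and the candidate string are read at each step, so the same function should also work on an mmap object in place of bytes.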
Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be an interesting improvement or a dead end.
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

