It is not an unusual pattern, Thomas, to treat parts of an object selectively 
rather than handle every part just one way.

The history of computing has often been one of coping with the scarcity of 
expensive resources.

Consider the Python "list": a rather wasteful container that is best suited to 
diverse contents. As soon as all the contents are of the same basic type, you 
start wondering whether you would be better off with a restricted type that 
holds only one kind of item, often something like a numpy array. Or you start 
wondering whether a dictionary would be a better place to store the contents, 
since searching one is often far faster.
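
For instance, here is a rough sketch of that trade-off (assuming numpy is 
available; the numbers are illustrative only):

    import numpy as np

    values = list(range(1_000_000))            # general container of int objects
    packed = np.array(values, dtype=np.int32)  # one homogeneous block, 4 bytes each
    lookup = dict.fromkeys(values)             # hashed keys, O(1) membership

    print(packed.nbytes)          # 4000000 bytes for the array's data
    print(999_999 in values)      # True, but found by a linear scan
    print(999_999 in lookup)      # True, found by a hash lookup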

But fundamentally, there is nothing WRONG with using a Python list: it handles 
almost anything, and for small sample sizes it is often not worth spending 
time redoing it. Still, if you implement a 3-D matrix as a list of lists of 
lists, and use slicing and other techniques to do things like matrix 
multiplication, it gets unwieldy enough that it is arguably the wrong tool.
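
A small sketch of why: multiplying two matrices held as lists of lists takes 
a triple comprehension full of index bookkeeping, where numpy needs one 
operator:

    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]

    # Pure-list matrix multiply: indices everywhere.
    product = [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))]
               for i in range(len(a))]
    print(product)                     # [[19, 22], [43, 50]]

    import numpy as np
    print(np.array(a) @ np.array(b))   # same answer, and it scales to 3-D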

If you look at the history of Python, its designers deliberately left out many 
of the containers other programming languages used and stressed lists and 
tuples; vectors and arrays were not the focus. Things had to be added later as 
people noticed that generality has costs. If you look at a language like 
JavaScript, it is beyond weird how they decided to use the attributes of an 
object to make an array. An object might have attributes like "name" and 
"length" alongside ones like "1", "2", and "666"; when you want to treat the 
object as an array, the machinery looks at the attributes, ignores the 
non-numeric ones, and presents what looks like a sparsely populated array of 
items indexed from 1 to 666. You can do all kinds of operations, adding 
missing indices or removing some, and it still works like a typical array, but 
with weird costs: inserting a new item as the 1st or 3rd element means every 
index above it must be renumbered the hard way! It works, but it strikes me as 
a kludge.
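
If it helps to see the idea, here is a rough Python imitation of that 
JavaScript arrangement; the class and its renumbering are entirely my 
invention for illustration:

    class SparseArray:
        def __init__(self):
            self.slots = {}        # int index -> value, like JS numeric keys
            self.name = "demo"     # non-numeric attributes live alongside

        def __setitem__(self, i, value):
            self.slots[i] = value

        def __getitem__(self, i):
            return self.slots.get(i)   # missing indices read as None (JS: undefined)

        @property
        def length(self):
            return max(self.slots, default=-1) + 1

        def insert(self, i, value):
            # The "weird cost": everything at or above i gets renumbered.
            for j in sorted((k for k in self.slots if k >= i), reverse=True):
                self.slots[j + 1] = self.slots.pop(j)
            self.slots[i] = value

    s = SparseArray()
    s[1], s[2], s[666] = "a", "b", "c"
    print(s.length)      # 667, though only three slots are occupied
    s.insert(1, "new")   # shifts indices 1, 2 and 666 up by one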

If you look within Python at numpy and pandas and similar utilities, they are 
well aware of the idea of abstracting a concept away from its implementations. 
They use many Python tools that access different storage methods as needed, 
often through objects that implement various "protocols", to the point where 
manipulating them all in similar ways seems straightforward. In principle, you 
could create something like a pandas DataFrame and then call a function that 
examines the columns, and perhaps the rows, and returns a modified (or new) 
DataFrame whose contents have been adjusted so the storage is smaller or can 
be accessed more efficiently, guided by whatever preferences and hints you 
supply, such as whether it can be immutable. A column that is all 1s or all 
True can obviously be stored in other ways, and one whose values are all under 
256 can, again, use less storage.
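
A minimal sketch of such a shrinking pass, assuming pandas and using only its 
own downcasting tools; the name shrink_frame is mine:

    import numpy as np
    import pandas as pd

    def shrink_frame(df):
        # Return a copy where each numeric column is downcast to the
        # smallest integer/float type that still holds its values.
        out = df.copy()
        for col in out.select_dtypes(include="integer").columns:
            out[col] = pd.to_numeric(out[col], downcast="integer")
        for col in out.select_dtypes(include="float").columns:
            out[col] = pd.to_numeric(out[col], downcast="float")
        return out

    df = pd.DataFrame({"flag": np.ones(10_000, dtype=np.int64),   # all 1s
                       "small": np.arange(10_000) % 200})         # all under 256
    print(df.memory_usage(deep=True).sum())                 # before
    print(shrink_frame(df).memory_usage(deep=True).sum())   # after: much smaller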

Of course, this would need to be done very carefully, so that a change that 
violates the assumptions either triggers refactoring into a version that 
handles it or results in an error. And a long-running system could be set to 
keep track of how an object is used and perhaps make adjustments: after some 
number of accesses, say, you might change a function "live" so it begins 
caching, or have an object reconfigure itself to be faster to search even if 
it occupies more room.
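
As a toy sketch of that last idea (the threshold and the names are mine, not 
any standard API):

    import functools

    def cache_after(threshold):
        # Leave func uncached until it has been called `threshold` times,
        # then switch "live" to an lru_cache'd version.  Arguments must be
        # hashable once caching kicks in.
        def decorator(func):
            cached = functools.lru_cache(maxsize=None)(func)
            calls = 0

            @functools.wraps(func)
            def wrapper(*args):
                nonlocal calls
                calls += 1
                if calls > threshold:
                    return cached(*args)   # hot path: cached
                return func(*args)         # cold path: plain call
            return wrapper
        return decorator

    @cache_after(100)
    def expensive(x):
        return sum(i * i for i in range(x))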

Back to Python lists: a typical change is simply to convert one to a tuple, 
which is cheaper to store and sometimes faster to work with. And if the items 
are unique, now that dictionaries maintain insertion order, you may sometimes 
want to convert the list to a dict.
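
For example:

    items = ["red", "green", "blue", "green"]

    frozen = tuple(items)            # immutable, hashable, a bit leaner
    unique = dict.fromkeys(items)    # de-duplicates, keeps first-seen order (3.7+)

    print("blue" in unique)          # True, via a hash lookup, no linear scan
    print(list(unique))              # ['red', 'green', 'blue']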

But I hasten to note that some "improvements" may not really improve. In a 
language like R, many operations, such as replacing one vector in a 
data.frame, are done lazily, and the new copy still shares most of its storage 
with the old. Some changes can have negative effects: a common problem people 
hit is that storing an object in an existing vector works, except that 
afterwards the entire vector has been transformed into one that can carry any 
of the contents. A vector of integers may become a vector of doubles, or even 
a vector of character strings with entries like "777". The flexibility comes 
with lots of programming errors!
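
numpy users see a similar silent promotion, so the point is not unique to R; 
a sketch:

    import numpy as np

    v = np.array([1, 2, 3])      # dtype is a machine integer
    print(np.append(v, 2.5))     # [1.  2.  3.  2.5]   -- promoted to float
    print(np.append(v, "777"))   # ['1' '2' '3' '777'] -- now character strings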

Note how this can cause problems for the original idea here of caching 
strategies. Imagine a function that checks the environment for the encoding, 
the human language, and so on, and produces text accordingly. If you cache its 
results in something like a dictionary under a key, and the user changes the 
environment while the program keeps running, the cache may now contain invalid 
results. You might need to keep track of the environment and empty the cache 
when it changes, or start a new cache and switch to it (a sketch of one 
approach follows the Translate examples below).

An example would be the way I use Google Translate. I am sometimes studying or 
using a language and want to look up a word, a phrase, or an entire sentence. 
If Google Translate keeps track, it may notice repeated requests like 
"Valentines Day" and cache them for re-use. But I often click to switch 
languages and see whether the other one describes the meaning in a similar or 
different way, or uses something similar spelled differently. German does the 
latter with Valentinstag, a fairly literal translation, as do Dutch 
(Valentijnsdag) and Hungarian (Valentin nap).

But Hebrew calls it, more or less, the holiday of love (חג האהבה). Portuguese 
is similar but includes day as well as love (Dia dos Namorados).

Esperanto tosses in more about sainthood (Sankta Valentín), and in a sense 
Spanish does it both ways, with day and saint (Día de San Valentín).

You could keep playing with such phrases for quite a while, since Translate 
translates N languages into N languages. So even caching from English to 
anything is problematic: the same spelling can appear in many languages, and 
the translation of that word into English varies too.
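
Here is the sketch promised above of the "start a new cache" idea: fold the 
environment into the key, so a changed locale simply misses instead of 
returning stale text. The names render_text and _render are invented:

    import functools
    import locale

    @functools.lru_cache(maxsize=1024)
    def _render(message, loc):
        # Stand-in for expensive, locale-dependent text generation.
        return f"({loc}) {message}"

    def render_text(message):
        # The current locale is part of the key, so entries made under an
        # old environment are never hit again and age out of the LRU.
        return _render(message, locale.getlocale())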

So if you had a function like ask_translate(from, to, phrase), the caching 
needs some thought. A plain @lru_cache() decorator would in fact key on all 
three arguments together, but to keep one busy language pair from evicting 
everything else you might prefer either combining the three arguments into one 
longer string to cache, or multiple caches selected through something like an 
NxN data structure.
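
A sketch of that NxN arrangement, with per-pair LRUs built lazily ("from" is a 
reserved word in Python, so the parameters are renamed; _call_service is an 
invented stand-in for the real request):

    from functools import lru_cache

    _caches = {}   # (src, dst) -> an independent LRU-cached function

    def ask_translate(src, dst, phrase):
        if (src, dst) not in _caches:
            @lru_cache(maxsize=256)
            def pair_cache(p, _src=src, _dst=dst):
                return _call_service(_src, _dst, p)   # invented backend call
            _caches[(src, dst)] = pair_cache
        return _caches[(src, dst)](phrase)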

And there are other considerations, such as not using naked queries as keys, 
but normalizing them first into some canonical format: removing leading and 
trailing whitespace as well as runs of whitespace between words, shifting 
everything to lower case, removing some punctuation, and perhaps replacing 
variant special characters with a single form. Otherwise, the cache may grow 
too large and often be less effective.
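
A sketch of such a normalizer:

    import re
    import unicodedata

    def canonical(phrase):
        phrase = unicodedata.normalize("NFC", phrase)  # one form per character
        phrase = phrase.strip().lower()                # trim ends, fold case
        phrase = re.sub(r"\s+", " ", phrase)           # collapse inner whitespace
        return re.sub(r"[.,!?\"']+$", "", phrase)      # drop trailing punctuation

    print(canonical("  Valentines   Day! "))   # valentines day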

As stated earlier, any analogies or comparisons I use are for the purpose of 
discussion only and opinions are mine.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On 
Behalf Of Thomas Passin
Sent: Saturday, February 18, 2023 4:00 PM
To: python-list@python.org
Subject: Re: Comparing caching strategies

On 2/18/2023 2:59 PM, avi.e.gr...@gmail.com wrote:
> I do not know the internals of any Roaring Bitmap implementation so 
> all I did gather was that once the problem is broken into accessing 
> individual things I chose to call zones for want of a more specific 
> name, then each zone is stored in one of an unknown number of ways depending 
> on some logic.

Somewhat tangential, but back quite a while ago there was a C library called 
IIRC "julia list". It implemented lists in several different ways, some quite 
sophisticated, depending on the size and usage pattern. I remembered it as 
soon as I took a look at Roaring Bitmap and saw that the latter can use 
different representations depending on size and access patterns.

-- 
https://mail.python.org/mailman/listinfo/python-list
