Thank you, Peter. Yes, setting up my own indexes is more or less the idea of the modular cache that I was considering. Seeing others think in the same direction makes it look more viable.

About Scalene, thank you for the pointer. I'll do some research.

Do you have any idea about the speed of a SELECT query against a 100k rows / 300 Mb Sqlite db?

Dino

On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
On 2023-01-14 23:26:27 -0500, Dino wrote:
Hello, I have built a PoC service in Python Flask for my work, and - now
that the point is made - I need to make it a little more performant (to be
honest, chances are that someone else will pick up from where I left off,
and implement the same service from scratch in a different language (GoLang?
.Net? Java?) but I am digressing).

Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)

300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with
PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
outlier).

The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.

You may be able to do that for your Python data structures, too: Can you
set up dicts which map to subsets you need often?

There are some specialized in-memory bitmap implementations which can be
used for filtering. I've used
[Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
past (mostly in Perl).
These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
the most popular. I see several packages on PyPI - but I haven't used
any of them yet, so no recommendation from me.

Numpy might also help. You will still have linear scans, but it is more
compact and many of the searches can probably be done in C and not in
Python.

As you can imagine, this is not very performant in its current form, but
performance was not the point of the PoC - at least initially.

For performanc optimization it is very important to actually measure
performance, and a good profiler helps very much in identifying hot
spots. Unfortunately until recently Python was a bit deficient in this
area, but [Scalene](https://pypi.org/project/scalene/) looks promising.

         hp


--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to