Hey,

before you start optimizing, I would suggest that you measure response times,
query times, data search times and so on. In order to save time, you have to
know where you "lose" time.

Does your service really have to load the whole table at once? Yes, that might
lead to quicker response times on requests, but databases are often very good
at caching themselves, so the first request might be slower than the following
requests with similar parameters. Are you using a database, or are you reading
from a file? Are you perhaps looping through your whole dataset on every
request instead of asking for the specific data?
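
To make that last question concrete, the difference can look roughly like this
(the "country" column is just an invented example):

from collections import defaultdict

# Looping through the whole dataset on every request:
def find_by_country_scan(rows, country):
    return [r for r in rows if r["country"] == country]

# Building an index once at startup and then asking for the specific data:
def build_country_index(rows):
    index = defaultdict(list)
    for r in rows:
        index[r["country"]].append(r)
    return index

def find_by_country_indexed(index, country):
    return index.get(country, [])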

Before you introduce a cache, with all its added complexity, ask yourself: do
you really need that cache?

You are talking about saving microseconds, which sounds a bit as if you might be
“overdoing” it. How many requests will you have in the future, at least in which
order of magnitude, and how quick do they have to be? You write about 1-4 seconds
on your laptop, but that does not tell you very much, because the service will
most probably run on a server. I am not saying that you should get a server or a
cloud instance to test against, but rather that you should talk with your
architect about that.

I totally understand your impulse to look as good as you can, but you have to
know where you really need to debug and optimize. It will not work to your
advantage if you start optimizing for optimizing's sake. Additionally, if your
service is a PoC, optimizing might not be the first thing to worry about;
rather, make sure that you keep everything as simple and readable as possible
and that you do not spend too much time just showing how it could work.

But of course, I do not know the tasks you have been given or the expectations
you have to fulfil. All I am trying to say is: reconsider where you really could
improve and how much you really need to improve.



Lars Liedtke
Software Developer


On 15.01.23 at 05:26, Dino wrote:

Hello, I have built a PoC service in Python Flask for my work, and - now that 
the point is made - I need to make it a little more performant (to be honest, 
chances are that someone else will pick up from where I left off, and implement 
the same service from scratch in a different language (GoLang? .Net? Java?) but 
I am digressing).

Anyway, my Flask service initializes by loading a big "table" of 100k rows and
40 columns or so (memory footprint: on the order of 300 MB) and then accepts
queries through a REST endpoint. Columns are strings, enums, and numbers. Once
initialized, the table is read-only. The endpoint parses the query and matches
it against column values (equality, inequality, greater than, etc.). Finally,
it returns a (JSON) list of all rows that satisfy all conditions in the query.
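
Heavily simplified, it looks roughly like this (the column names and the
equality-only matching below are just for illustration; the real code supports
more operators):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the real 100k-row table, loaded once at startup.
TABLE = [
    {"name": "alpha", "price": 10, "country": "DE"},
    {"name": "beta", "price": 25, "country": "US"},
]

@app.route("/query")
def query():
    # every query parameter is treated as an equality condition on a column
    conditions = request.args.to_dict()
    rows = [r for r in TABLE
            if all(str(r.get(col)) == val for col, val in conditions.items())]
    return jsonify(rows)

if __name__ == "__main__":
    app.run()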

As you can imagine, this is not very performant in its current form, but 
performance was not the point of the PoC - at least initially.

Before I deliver the PoC to a more experienced software architect who will look
at my code, though, I wouldn't mind looking a bit less lame and doing something
about performance in my own code first, possibly by bringing the average query
time down from where it is now (on the order of 1 to 4 seconds per query on my
laptop) to 1 or 2 milliseconds on average.

To be honest, I was already able to bring the time down to a handful of
microseconds thanks to a rudimentary cache that associates the "signature" of a
query with its result and serves it the next time the same query is received,
but this may not be good enough: 1) queries might be many and very different
from one another each time, AND 2) I am not sure the server will have a ton of
RAM if/when this thing - or whatever is derived from it - is placed into
production.
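
The cache is roughly this (simplified; run_query here stands for the actual
filtering code):

import json

_cache = {}

def query_signature(conditions):
    # canonical, hashable representation of the query parameters
    return json.dumps(conditions, sort_keys=True)

def cached_query(conditions, run_query):
    key = query_signature(conditions)
    if key not in _cache:
        _cache[key] = run_query(conditions)
    return _cache[key]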

How can I make my queries generally more performant, ideally also in case of a 
new query?

Here's what I have been considering:

1. Making my cache more "modular", i.e. caching the results of certain (wide)
queries. When a complex query comes in, I may be able to restrict my search to a
subset of the rows (as determined by a previously cached partial query). This
should keep the memory footprint under control.

2. Load my data into a numpy.array and use numpy array operations to slice and
dice my data (see the numpy sketch after this list).

3. Load my data into sqlite3 and use SELECT statements to query my table (see
the sqlite sketch after this list). I have never used sqlite, plus there's some
extra complexity because comparing certain columns requires custom logic, but I
wonder whether this architecture would also work well when dealing with a
300 MB database.

4. Other ideas?
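
For option 2, this is roughly what I have in mind (made-up columns, using a
structured numpy array and boolean masks instead of a Python-level loop):

import numpy as np

# a few made-up rows standing in for the real table
table = np.array(
    [("alpha", 10, "DE"), ("beta", 25, "US"), ("gamma", 25, "DE")],
    dtype=[("name", "U20"), ("price", "i4"), ("country", "U2")],
)

# each condition becomes a boolean mask; combining masks with & filters all
# rows in one vectorized pass
mask = (table["price"] >= 20) & (table["country"] == "DE")
matching = table[mask]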
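
And for option 3, roughly this (again with invented columns; the custom
comparison logic could be plugged in as a user-defined SQL function):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, price INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("alpha", 10, "DE"), ("beta", 25, "US"), ("gamma", 25, "DE")],
)

# placeholder for whatever the custom column comparison really does
def fuzzy_match(value, pattern):
    return pattern.lower() in value.lower()

conn.create_function("fuzzy_match", 2, fuzzy_match)

rows = conn.execute(
    "SELECT * FROM data WHERE price >= ? AND fuzzy_match(name, ?)",
    (20, "gam"),
).fetchall()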

Hopefully I made sense. Thank you for your attention

Dino