Re: Fast lookup of bulky "table"

2023-01-15 Thread Lars Liedtke

Hey,

Before you start optimizing, I would suggest that you measure response times, query 
times, data search times and so on. In order to save time, you have to know where you 
"lose" time.

Does your service really have to load the whole table at once? Yes, that might 
lead to quicker response times on requests, but databases are often very good 
at caching themselves, so the first request might be slower than subsequent 
requests with similar parameters. Do you use a database, or are you reading 
from a file? Are you maybe looping through your whole dataset on every request, 
instead of asking for the specific data?

Before you start introducing a cache and its added complexity, ask yourself: do 
you really need that cache?

You are talking about saving microseconds; that sounds a bit as if you might be 
“overdoing” it. How many requests will you have in the future, at least in which 
order of magnitude, and how fast do they have to be? You write about 1-4 seconds 
on your laptop, but that does not really tell you much, because the service will 
most probably run on a server. I am not saying that you should get a server or a 
cloud instance to test against, but rather that you should talk with your 
architect about that.

I totally understand your impulse to look as good as possible, but you have to 
know where you really need to debug and optimize. It will not be advantageous 
for you if you start optimizing for optimization's sake. Additionally, if your 
service is a PoC, optimizing now might not be the first thing to worry about; 
what matters more is that you keep everything as simple and readable as possible 
and that you do not spend too much time just showing how it could work.

But of course, I do not know the tasks given to you and the expectations you 
have to fulfil. All I am trying to say is to reconsider where you really could 
improve and how far you have to improve.



Lars Liedtke
Software Developer





On 15.01.23 at 05:26, Dino wrote:

Hello, I have built a PoC service in Python Flask for my work, and - now that 
the point is made - I need to make it a little more performant (to be honest, 
chances are that someone else will pick up from where I left off, and implement 
the same service from scratch in a different language (GoLang? .Net? Java?) but 
I am digressing).

Anyway, my Flask service initializes by loading a big "table" of 100k rows and 
40 columns or so (memory footprint: on the order of 300 MB) and then accepts queries 
through a REST endpoint. Columns are strings, enums, and numbers. Once initialized, the 
table is read-only. The endpoint will parse the query and match it against column values 
(equality, inequality, greater than, etc.). Finally, it will return a (JSON) list of all 
rows that satisfy all conditions in the query.

As you can imagine, this is not very performant in its current form, but 
performance was not the point of the PoC - at least initially.

Before I deliver the PoC to a more experienced software architect who will look 
at my code, though, I wouldn't mind looking a bit less lame and doing something 
about performance in my own code first, possibly by bringing the average query 
time down from where it is now (on the order of 1 to 4 seconds per query on my 
laptop) to 1 or 2 milliseconds.

To be honest, I was already able to bring the time down to a handful of microseconds 
thanks to a rudimentary cache that associates the "signature" of a query with its 
result and serves it the next time the same query is received, but this may not be 
good enough: 1) queries might be many and very different from one another each time, AND 
2) I am not sure the server will have a ton of RAM if/when this thing - or whatever is 
derived from it - is placed into production.
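Roughly, the signature cache I mean looks like this (a sketch only; full_scan stands 
in for my current slow path, and the query format is simplified):

import json

_cache = {}

def cached_lookup(query: dict):
    # Canonicalize the query into a hashable "signature".
    signature = json.dumps(query, sort_keys=True)
    if signature not in _cache:
        _cache[signature] = full_scan(query)   # hypothetical slow path over the 100k rows
    return _cache[signature]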

How can I make my queries generally more performant, ideally also in case of a 
new query?

Here's what I have been considering:

1. making my cache more "modular", i.e. cache the result of certain (wide) 
queries. When a complex query comes in, I may be able to restrict my search to a subset 
of the rows (as determined by a previously cached partial query). This should keep the 
memory footprint under control.

2. Load my data into a numpy.array and use numpy.array operations to slice and 
dice my data.

3. Load my data into sqlite3 and use SELECT statements to query it (sketched below).
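For option 3, roughly what I have in mind (table and column names are invented here, 
and rows stands for the 100k records already loaded in memory):

import sqlite3

conn = sqlite3.connect(":memory:")             # or a file on disk
conn.execute("CREATE TABLE items (id TEXT, city TEXT, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_items_city ON items (city)")   # index the filtered columns
conn.commit()

matches = conn.execute(
    "SELECT * FROM items WHERE city = ? AND price > ?", ("Karlsruhe", 100.0)
).fetchall()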

Re: Fast lookup of bulky "table"

2023-01-15 Thread Peter J. Holzer
On 2023-01-14 23:26:27 -0500, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now
> that the point is made - I need to make it a little more performant (to be
> honest, chances are that someone else will pick up from where I left off,
> and implement the same service from scratch in a different language (GoLang?
> .Net? Java?) but I am digressing).
> 
> Anyway, my Flask service initializes by loading a big "table" of 100k rows
> and 40 columns or so (memory footprint: order of 300 Mb)

300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with
PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
outlier).

The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.

You may be able to do that for your Python data structures, too: Can you
set up dicts which map to subsets you need often?
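For example, something like this (column names are just for illustration; rows is the
in-memory list of records):

from collections import defaultdict

by_city = defaultdict(list)
by_status = defaultdict(list)
for i, row in enumerate(rows):
    by_city[row["city"]].append(i)
    by_status[row["status"]].append(i)

# An equality query on both columns becomes a set intersection of row indices:
hits = set(by_city["Karlsruhe"]) & set(by_status["active"])
matching_rows = [rows[i] for i in hits]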

There are some specialized in-memory bitmap implementations which can be
used for filtering. I've used
[Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
past (mostly in Perl).
These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
the most popular. I see several packages on PyPI - but I haven't used
any of them yet, so no recommendation from me.
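The idea itself, sketched with plain Python ints rather than any particular package
(one bit per row; column names invented):

is_karlsruhe = 0
is_active = 0
for i, row in enumerate(rows):
    if row["city"] == "Karlsruhe":
        is_karlsruhe |= 1 << i
    if row["status"] == "active":
        is_active |= 1 << i

both = is_karlsruhe & is_active                    # combine filters with a bitwise AND
hits = [i for i in range(len(rows)) if (both >> i) & 1]

A dedicated bitmap library does the same thing with compressed storage and much faster
set operations.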

Numpy might also help. You will still have linear scans, but it is more
compact and many of the searches can probably be done in C and not in
Python.
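For example, with each column held as a NumPy array (names invented), a filter becomes
a vectorised boolean mask:

import numpy as np

price = np.array([10.0, 250.0, 99.0, 42.0])        # one array per column, built at startup
city = np.array(["KA", "B", "KA", "M"])

mask = (price < 100.0) & (city == "KA")            # comparisons run in C, not in a Python loop
matching_indices = np.nonzero(mask)[0]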

> As you can imagine, this is not very performant in its current form, but
> performance was not the point of the PoC - at least initially.

For performance optimization it is very important to actually measure
performance, and a good profiler helps very much in identifying hot
spots. Unfortunately until recently Python was a bit deficient in this
area, but [Scalene](https://pypi.org/project/scalene/) looks promising.
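Even without a profiler, timing individual queries with the standard library already
narrows things down, e.g. (run_query standing in for whatever handles one request):

import time

t0 = time.perf_counter()
result = run_query(query)                 # hypothetical function under test
elapsed = time.perf_counter() - t0
print(f"{len(result)} rows in {elapsed * 1000:.2f} ms")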

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Thomas Passin

On 1/15/2023 6:14 AM, Peter J. Holzer wrote:

On 2023-01-14 23:26:27 -0500, Dino wrote:

Hello, I have built a PoC service in Python Flask for my work, and - now
that the point is made - I need to make it a little more performant (to be
honest, chances are that someone else will pick up from where I left off,
and implement the same service from scratch in a different language (GoLang?
.Net? Java?) but I am digressing).

Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)


300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with
PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
outlier).

The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.


I have an (inherited) server program that uses about 30 MB of data in a 
MySQL database. It services queries received over the network. It too 
had performance problems, to which adding indexes and smarter joins 
helped but not enough.


I changed the program so that at startup it imports much of the data 
into Python dictionaries that are structured to support the kinds of 
queries that need the help.  Response time to queries dropped 
dramatically.  Some kinds of queries needed more help, and I collected 
auxiliary collections of (usually highly pre-processed) data into 
ordinary files, and those too get imported into dictionaries during startup.


Note that these dictionaries do not always match the table structures. 
Some of them change the structure to make queries easier to process. You 
may be able to do that with Python code, or by creating SQL views in the 
database and importing directly from the views (database views take 
almost no database memory).
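A small sketch of the "import from a view" idea with SQLite (the table, view and column
names here are made up):

import sqlite3

conn = sqlite3.connect("data.db")
conn.execute(
    "CREATE VIEW IF NOT EXISTS active_by_city AS "
    "SELECT city, id, name FROM items WHERE status = 'active'"
)
# Import the view into a dict shaped the way the queries need it.
by_city = {}
for city, id_, name in conn.execute("SELECT city, id, name FROM active_by_city"):
    by_city.setdefault(city, []).append((id_, name))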


The drawback is that all that data is now stored in memory while the 
program is running.  In my case, many hundreds of MB.  But if it would 
be too much memory for you - you would need to prototype it to know - 
you should let the database engine do the work. It is more highly 
optimized and efficient for searches than your code could ever be. But 
there will be a price to pay. The price is in denormalizing the database 
table design.  This means including redundant data, organized to match 
the kinds of queries that will be made. No more 3rd normal form! Your 
SQL queries will need to be designed to take advantage of this new 
structure. This will be a cost because the database will be larger, but 
also because the redundancies will make it much harder to update the 
data correctly.  Fortunately you do not need to do that during normal 
operation (my program's data was also static like yours).


PostgreSQL would probably be a better choice than Sqlite, since it 
supports features such as foreign keys, and has a function definition 
capability.




--
https://mail.python.org/mailman/listinfo/python-list


RE: To clarify how Python handles two equal objects

2023-01-15 Thread avi.e.gross
<<< Frank Millman>>> My 'aha' moment came when I understood that a python 
object has only three properties - a type, an id, and a value. It does *not* 
have a name.

Yes, Frank, it is a bit like how some people need to wrap their minds around a 
concept like an anonymous function. It has no name and for many purposes needs 
no name, but at the same time can be bound to one or more names if needed. 

Some people will encounter a situation where they need to pass a function to be 
called, such as a key function for determining the sort order, and they will 
create a new function first the usual way, with a name, and then pass it along by 
name. That works fine, but the function then persists while not being used again, 
and even choosing a name can have consequences if the same name is already in use, 
and so on.

Experienced programmers might instead use a lambda expression in-line: the function 
object is created, passed along as a reference to be used, and probably discarded 
once nothing refers to it.

Similar things can happen if an object is created as part of a larger structure 
such as a list or deque. Each item can be referenced unambiguously without having 
a formal name. You can pop off the next one and use it, or ask to take the fifth. 
It may acquire and lose names as the program runs, but sometimes it can have many 
names or none.

Getting people to see that an object exists whether or not it has a name, and to 
speak about objects with that understanding, can be a problem. We as humans seem to 
think we are our names. But I know my names have changed over the years, partially 
because I moved between countries, and I have several additional names used just 
for special purposes, and yet for some purposes I am simply a number. If I raise my 
hand in a lecture hall where my name is not known, they may point at me or ask the 
guy in the blue shirt to talk.  At the Department of Motor Vehicles, my name, like 
everyone else's, is NEXT!

The real point is what Python does, not what other situations require. That two 
objects are equal does not always mean what you think or want it to mean. What 
we are discussing here is two objects, loosely of type "name", that are being 
compared not for whether THEY are themselves equal, as in being composed of the 
same symbols and perhaps living in the same namespace. The question was about 
two objects that contain references to another object, or even to different 
parts of another object. As it happens, that can be a deeper question, in which 
some parts are not a required aspect of the language and may be specific to one of 
many possible implementations. And, as noted, the names can be detached from 
that reference too, so a link is not permanent.

Have we beaten this one to death yet?


-Original Message-
From: Python-list  On 
Behalf Of Frank Millman
Sent: Sunday, January 15, 2023 12:47 AM
To: python-list@python.org
Subject: Re: To clarify how Python handles two equal objects





--
https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Dino


Thank you for your answer, Lars. Just a clarification: I am already 
doing a rough measuring of my queries.


A fresh query without any caching: < 4s.

Cached full query: < 5 micro-s (i.e. 6 orders of magnitude faster)

Desired speed for my POC: 10

Also, I didn't want to ask a question with way too many "moving parts", 
but when I talked about the "table", it's actually a 100k-long list of 
IDs. I can then use each ID to invoke an API that will return those 40 
attributes. The API is fast, but still, I am bound to loop through the 
whole thing to respond to the query, unless I pre-load the data into 
something that allows faster access.


Also, as you correctly observed, "looking good with my colleagues" is a 
nice-to-have feature at this point, not really an absolute requirement :)


Dino

On 1/15/2023 3:17 AM, Lars Liedtke wrote:

Hey,

before you start optimizing. I would suggest, that you measure response 
times and query times, data search times and so on. In order to save 
time, you have to know where you "lose" time.


Does your service really have to load the whole table at once? Yes that 
might lead to quicker response times on requests, but databases are 
often very good with caching themselves, so that the first request might 
be slower than following requests, with similar parameters. Do you use a 
database, or are you reading from a file? Are you maybe looping through 
your whole dataset on every request? Instead of asking for the specific 
data?


Before you start introducing a cache and its added complexity, do you 
really need that cache?


You are talking about saving microseconds, that sounds a bit as if you 
might be “overdoing” it. How many requests will you have in the future? 
At least in which magnitude and how quick do they have to be? You write 
about 1-4 seconds on your laptop. But that does not really tell you that 
much, because most probably the service will run on a server. I am not 
saying that you should get a server or a cloud-instance to test against, 
but to talk with your architect about that.


I totally understand your impulse to appear as good as can be, but you 
have to know where you really need to debug and optimize. It will not be 
advantageous for you, if you start to optimize for optimizing's sake. 
Additionally if your service is a PoC, optimizing now might not be the 
first thing you have to worry about, but about that you made everything 
as simple and readable as possible and that you do not spend too much 
time for just showing how it could work.


But of course, I do not know the tasks given to you and the expectations 
you have to fulfil. All I am trying to say is to reconsider where you 
really could improve and how far you have to improve.




--
https://mail.python.org/mailman/listinfo/python-list


Re: To clarify how Python handles two equal objects

2023-01-15 Thread Mark Bourne

Jen Kris wrote:

Avi,

Your comments go farther afield than my original question, but you made some 
interesting additional points.  For example, I sometimes work with the C API 
and sys.getrefcount may be helpful in deciding when to INCREF and DECREF.  But 
that’s another issue.

The situation I described in my original post is limited to a case such as x = y where both "x" and "y" are arrays – whether they are lists in Python, or from 
the array module – and the question in a compiled C extension is whether the assignment can be done simply by "x" taking the pointer to "y" rather than moving 
all the data from "y" into the memory buffer for "x" which, for a wide array, would be much more time consuming than just moving a pointer.  The other 
advantage to doing it that way is if, as in my case, we perform a math operation on any element in "x" then Python expects that the same change to be reflected in 
"y."  If I don’t use the same pointers then I would have to perform that operation twice – once for "x" and once  for "y" – in addition to the 
expense of moving all the data.

The answers I got from this post confirmed that I can use the pointer if "y" is not re-defined 
to something else during the lifespan of "x."  If it is, then "x" has to be restored to 
its original pointer.  I did it that way, and helpfully the compiler did not overrule me.


I haven't done much with C extensions, but I don't think you'd need to 
do anything with "x" in that case.  If something else is assigned to 
"y", "x" would still be a reference to the original object - why would 
it need to be "restored" to anything?  Unless I've misunderstood what's 
going on here...


--
Mark.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Dino



Thank you, Peter. Yes, setting up my own indexes is more or less the 
idea of the modular cache that I was considering. Seeing others think in 
the same direction makes it look more viable.


About Scalene, thank you for the pointer. I'll do some research.

Do you have any idea about the speed of a SELECT query against a 100k 
rows / 300 Mb Sqlite db?


Dino

On 1/15/2023 6:14 AM, Peter J. Holzer wrote:

On 2023-01-14 23:26:27 -0500, Dino wrote:

Hello, I have built a PoC service in Python Flask for my work, and - now
that the point is made - I need to make it a little more performant (to be
honest, chances are that someone else will pick up from where I left off,
and implement the same service from scratch in a different language (GoLang?
.Net? Java?) but I am digressing).

Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)


300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with
PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
outlier).

The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.

You may be able to do that for your Python data structures, too: Can you
set up dicts which map to subsets you need often?

There are some specialized in-memory bitmap implementations which can be
used for filtering. I've used
[Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
past (mostly in Perl).
These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
the most popular. I see several packages on PyPI - but I haven't used
any of them yet, so no recommendation from me.

Numpy might also help. You will still have linear scans, but it is more
compact and many of the searches can probably be done in C and not in
Python.


As you can imagine, this is not very performant in its current form, but
performance was not the point of the PoC - at least initially.


For performance optimization it is very important to actually measure
performance, and a good profiler helps very much in identifying hot
spots. Unfortunately until recently Python was a bit deficient in this
area, but [Scalene](https://pypi.org/project/scalene/) looks promising.

 hp



--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Weatherby,Gerard
That’s about what I got using a Python dictionary on random data on a high 
memory machine.

https://github.com/Gerardwx/database_testing.git

It’s not obvious to me how to get it much faster than that.

From: Python-list  on 
behalf of Dino 
Date: Sunday, January 15, 2023 at 1:29 PM
To: python-list@python.org 
Subject: Re: Fast lookup of bulky "table"

Thank you for your answer, Lars. Just a clarification: I am already
doing a rough measuring of my queries.

A fresh query without any caching: < 4s.

Cached full query: < 5 micro-s (i.e. 6 orders of magnitude faster)

Desired speed for my POC: 10

> Hey,
>
> before you start optimizing. I would suggest, that you measure response
> times and query times, data search times and so on. In order to save
> time, you have to know where you "lose" time.
>
> Does your service really have to load the whole table at once? Yes that
> might lead to quicker response times on requests, but databases are
> often very good with caching themselves, so that the first request might
> be slower than following requests, with similar parameters. Do you use a
> database, or are you reading from a file? Are you maybe looping through
> your whole dataset on every request? Instead of asking for the specific
> data?
>
> Before you start introducing a cache and its added complexity, do you
> really need that cache?
>
> You are talking about saving microseconds, that sounds a bit as if you
> might be “overdoing” it. How many requests will you have in the future?
> At least in which magnitude and how quick do they have to be? You write
> about 1-4 seconds on your laptop. But that does not really tell you that
> much, because most probably the service will run on a server. I am not
> saying that you should get a server or a cloud-instance to test against,
> but to talk with your architect about that.
>
> I totally understand your impulse to appear as good as can be, but you
> have to know where you really need to debug and optimize. It will not be
> advantageous for you, if you start to optimize for optimizing's sake.
> Additionally if your service is a PoC, optimizing now might not be the
> first thing you have to worry about, but about that you made everything
> as simple and readable as possible and that you do not spend too much
> time for just showing how it could work.
>
> But of course, I do not know the tasks given to you and the expectations
> you have to fulfil. All I am trying to say is to reconsider where you
> really could improve and how far you have to improve.
>
>
--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Peter J. Holzer
On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
> On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
> > On 2023-01-14 23:26:27 -0500, Dino wrote:
> > > Anyway, my Flask service initializes by loading a big "table" of 100k rows
> > > and 40 columns or so (memory footprint: order of 300 Mb)
> > 
> > 300 MB is large enough that you should at least consider putting that
> > into a database (Sqlite is probably simplest. Personally I would go with
> > PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
> > outlier).
> > 
> > The main reason for putting it into a database is the ability to use
> > indexes, so you don't have to scan all 100 k rows for each query.
> 
> I have an (inherited) server program that uses about 30 MB of data in a
> MySQL database. It services queries received over the network. It too had
> performance problems, to which adding indexes and smarter joins helped but
> not enough.
> 
> I changed the program so that at startup it imports much of the data into
> Python dictionaries that are structured to support the kinds of queries that
> need the help.  Response time to queries dropped dramatically.

This is to be expected: Firstly, because you don't have disk accesses
any more, secondly because you don't have network latency any more and
thirdly, because you structured the data to fit the queries.

The thing to keep in mind is that the relational database model was
invented to have a uniform and simple way to model all data, and that
RDBMSs are designed to handle all workloads (from a single tiny table to
thousands of tables with hundreds of terabytes) reasonably well. For any
given application you can always find a more efficient solution than
using an RDBMS. Sometimes it's simple (just load all the data into a
dict and serve from there), sometimes it's a major research project.
The nice thing about RDBMSs isn't that they are the optimal solution for
anything but that they are a "good enough" solution for a large class of
problems.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Weatherby,Gerard
I think any performance improvements would have to come from a language change 
or better indexing of the data.

From: Python-list  on 
behalf of Weatherby,Gerard 
Date: Sunday, January 15, 2023 at 2:25 PM
To: Dino , python-list@python.org 
Subject: Re: Fast lookup of bulky "table"
That’s about what I got using a Python dictionary on random data on a high 
memory machine.

https://github.com/Gerardwx/database_testing.git

It’s not obvious to me how to get it much faster than that.

From: Python-list  on 
behalf of Dino 
Date: Sunday, January 15, 2023 at 1:29 PM
To: python-list@python.org 
Subject: Re: Fast lookup of bulky "table"

Thank you for your answer, Lars. Just a clarification: I am already
doing a rough measuring of my queries.

A fresh query without any caching: < 4s.

Cached full query: < 5 micro-s (i.e. 6 orders of magnitude faster)

Desired speed for my POC: 10

> Hey,
>
> before you start optimizing. I would suggest, that you measure response
> times and query times, data search times and so on. In order to save
> time, you have to know where you "lose" time.
>
> Does your service really have to load the whole table at once? Yes that
> might lead to quicker response times on requests, but databases are
> often very good with caching themselves, so that the first request might
> be slower than following requests, with similar parameters. Do you use a
> database, or are you reading from a file? Are you maybe looping through
> your whole dataset on every request? Instead of asking for the specific
> data?
>
> Before you start introducing a cache and its added complexity, do you
> really need that cache?
>
> You are talking about saving microseconds, that sounds a bit as if you
> might be “overdoing” it. How many requests will you have in the future?
> At least in which magnitude and how quick do they have to be? You write
> about 1-4 seconds on your laptop. But that does not really tell you that
> much, because most probably the service will run on a server. I am not
> saying that you should get a server or a cloud-instance to test against,
> but to talk with your architect about that.
>
> I totally understand your impulse to appear as good as can be, but you
> have to know where you really need to debug and optimize. It will not be
> advantageous for you, if you start to optimize for optimizing's sake.
> Additionally if your service is a PoC, optimizing now might not be the
> first thing you have to worry about, but about that you made everything
> as simple and readable as possible and that you do not spend too much
> time for just showing how it could work.
>
> But of course, I do not know the tasks given to you and the expectations
> you have to fulfil. All I am trying to say is to reconsider where you
> really could improve and how far you have to improve.
>
>
--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Thomas Passin

On 1/15/2023 2:39 PM, Peter J. Holzer wrote:

On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:

On 1/15/2023 6:14 AM, Peter J. Holzer wrote:

On 2023-01-14 23:26:27 -0500, Dino wrote:

Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)


300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with
PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
outlier).

The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.


I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too had
performance problems, to which adding indexes and smarter joins helped but
not enough.

I changed the program so that at startup it imports much of the data into
Python dictionaries that are structured to support the kinds of queries that
need the help.  Response time to queries dropped dramatically.


This is to be expected: Firstly, because you don't have disk accesses
any more, secondly because you don't have network latency any more and
thirdly, because you structured the data to fit the queries.


Of course: that's exactly why I made those changes.  The tradeoff is 
using more memory for your program, sometimes a lot more.



The thing to keep in mind is that the relational database model was
invented to have a uniform and simple way to model all data, and that
RDBMSs are designed to handle all workloads (from a single tiny table to
thousands of tables with hundreds of terabytes) reasonably well. For any
given application you can always find a more efficient solution than
using an RDBMS. Sometimes it's simple (just load all the data into a
dict and serve from there), sometimes it's a major research project.
The nice thing about RDBMSs isn't that they are the optimal solution for
anything but that they are a "good enough" solution for a large class of
problems.


Often the solution is careful (and not very normalized) table design to 
support your queries. In the case I'm discussing, it was easier for me 
to make Python do the work, and I could afford the memory load.  In 
other cases, you have to put in the work on the database side.  Often 
for slow queries, disk latency and I/O are not the limiting factors, but 
you have to put in the work and do the testing to make sure.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread dn via Python-list

On 16/01/2023 08.36, Weatherby,Gerard wrote:

I think any performance improvements would have to come from a language change 
or better indexing of the data.


Exactly!

Expanding on @Peter's post: databases (relational or not) are best 
organised according to use. Some must accept rapid insert/updates. 
Others are about look-ups (data-retrieval).


A basic RDBMS, just as a Python dict, may only offer a single key for 
efficient retrieval.


Postgres and MySQL (for example) enable the establishment of multiple 
and sophisticated indices/indexes, and the aptly-named "views" of data.


If the queries can be grouped according to the manner in which the data 
must be accessed, a view could be built for each. At which time, even if 
every row must be accessed, the retrieval will be made more efficient 
and/or the response better-organised.


Thus, suppose we have a DB of people. Many retrievals are likely to utilise 
an index on 'name'. However, if at times interest is limited to place or 
suburb, an index and view of such will speed things up from O(n). 
Similarly, if a query is only to return people with a dog license.


Some programmers don't realise that SQL can also be used for 
calculations, eg the eponymous COUNT(), which saves (CPU-time and 
coding-effort) over post-processing in Python.
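For example (table and column names invented), letting the engine do the counting
instead of fetching rows and calling len() in Python:

import sqlite3

conn = sqlite3.connect("people.db")
# One aggregate result comes back instead of all the matching rows.
(n,) = conn.execute("SELECT COUNT(*) FROM people WHERE suburb = ?", ("Ponsonby",)).fetchone()
print(n)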


If there are many ways to group/examine the data, then this may not be 
possible, but that depends upon the count of views cf the value of 
speedy-response - and bearing in mind that the demands for response-time 
may vary by type of query/data-access.


So many variables to consider ...

--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread rbowman
On Sun, 15 Jan 2023 08:27:29 -0500, Dino wrote:


> Do you have any idea about the speed of a SELECT query against a 100k
> rows / 300 Mb Sqlite db?

https://www.sqlite.org/speed.html

The site is old but has a number of comparisons. I have not used SQLite 
with Python yet but with both C and C# I've been impressed with the speed 
versus Postgres or MSSQL. One thing to watch is insertions. By default 
each insertion is a transaction. There is a dramatic speed increase for 
multiple insertions if you explicitly start a transaction, do the inserts, 
and end the transaction.
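In Python's sqlite3 module the same applies; using the connection as a context manager
wraps a batch of inserts in one transaction (table name invented for illustration):

import sqlite3

conn = sqlite3.connect("data.db")
rows = [(i, f"name{i}") for i in range(100_000)]

with conn:                                   # one transaction for the DDL and the whole batch
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)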

My usage is a result of ESRI's dropping their C++/C# Engine API. My new 
approach uses queries against the ArcGIS Server REST interface for much of 
the functionality but some spatial queries can be replaced with 
predetermined tabular data rather than runtime spatial queries. For 
example, for a given dataset you can determine the coordinates of every 
intersection beforehand and the intersection of PINE ST and MAPLE AVE 
becomes a simple search in the SQLite database.

ESRI's ArcPy is the closest replacement for the legacy C++ API so I assume 
in the future I will be using it in conjunction with SQLite. The actual 
geodata will still need to be in a spatially aware RDBMS like SQL Server 
or PostgreSQL/PostGIS but SQLite so far is the fastest and easiest to 
implement for non-spatial data. Also, it is in the public domain which 
avoids the complexities of MySQL and its derivatives for commercial 
applications.


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Weatherby,Gerard
With Postgresql, one can also do pre-processing in Python. 
https://www.postgresql.org/docs/15/plpython.html

While it’s not as convenient to develop as client-side Python, it can be used 
to implement complicated constraints or implement filtering on the server side, 
which reduces the amount of data that has to be sent back to the client.


From: Python-list  on 
behalf of Stefan Ram 
Date: Sunday, January 15, 2023 at 5:03 PM
To: python-list@python.org 
Subject: Re: Fast lookup of bulky "table"

dn  writes:
>Some programmers don't realise that SQL can also be used for
>calculations, eg the eponymous COUNT(), which saves (CPU-time and
>coding-effort) over post-processing in Python.

  Yes, I second that! Sometimes, people only re-invent things
  in Python because they don't know SQL well enough, or they
  do not normalize their tables because they have not properly
  learned how to do this.

  I'd always start out with normalized tables and do as many
  operations in SQL as possible. I would then hesitate to
  de-normalize anything or transfer data operations into
  the programming language unless I am very sure that this
  is really advantageous.

  Once I had the task of writing VBA code to query and analyze
  data from a Jet engine (i.e., Microsoft Access). I ended up
  writing 90 % of the code in SQL and a thin layer of 10 % in VBA.
  And it was fast.


--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Thomas Passin

On 1/15/2023 4:49 PM, Stefan Ram wrote:

dn  writes:

Some programmers don't realise that SQL can also be used for
calculations, eg the eponymous COUNT(), which saves (CPU-time and
coding-effort) over post-processing in Python.


   Yes, I second that! Sometimes, people only re-invent things
   in Python because they don't know SQL well enough, or they
   do not normalize their tables because they have not properly
   learned how to do this.

   I'd always start out with normalized tables and do as many
   operations in SQL as possible. I would then hesitate to
   de-normalize anything or transfer data operations into
   the programming language unless I am very sure that this
   is really advantageous.


Yes, if you get the indexes and joins right, sometimes you can get a 
very large speed-up.  It takes some experimenting and use of EXPLAIN, 
but it's worth doing.  You especially want to avoid letting the database 
engine do full-table scans over and over. And you never want to send a 
lot of rows to Python and do post-filtering on them if you can avoid it.


Use WHERE instead of HAVING if possible (HAVING works post-scan, WHERE 
works during row retrieval).
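With SQLite, for instance, you can check the plan straight from Python (table and
column names invented); look for "SEARCH ... USING INDEX" rather than "SCAN" in the
output:

import sqlite3

conn = sqlite3.connect("data.db")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM items WHERE city = ?", ("Karlsruhe",)
).fetchall()
print(plan)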



   Once I had the task of writing VBA code to query and analyze
   data from a Jet engine (i.e., Microsoft Access). I ended up
   writing 90 % of the code in SQL and a thin layer of 10 % in VBA.
   And it was fast.




--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Greg Ewing

On 16/01/23 2:27 am, Dino wrote:
Do you have any idea about the speed of a SELECT query against a 100k 
rows / 300 Mb Sqlite db?


That depends entirely on the nature of the query and how the
data is indexed. If it's indexed in a way that allows sqlite to
home in directly on the wanted data, it will be very fast. If
it has to fall back on a linear search, it probably won't be
significantly faster than your existing Python implementation.
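For example, if most queries filter on one particular column, a single statement lets
sqlite home in on the matching rows instead of scanning all of them (table and column
names invented):

import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE INDEX IF NOT EXISTS idx_items_city ON items (city)")
conn.commit()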

--
Greg

--
https://mail.python.org/mailman/listinfo/python-list


Re: Python - working with xml/lxml/objectify/schemas, datatypes, and assignments

2023-01-15 Thread aapost

On 1/3/23 22:57, aapost wrote:
I am trying to wrap my head around how one goes about working with and 
editing xml elements ... Back to 
contemplating and tinkering..


For anyone in a similar situation, xmlschema is actually quite nice.

It didn't have the features I was looking for out of the box, but it 
does have a to_objects function and I have learned quite a bit while 
picking it apart. I am able to patch it to be good enough for my 
requirements.


Below is the patch for anyone interested:

#
# Contribution for the xmlschema & elementpath python modules which are
# Copyright (c), 2016-2020, SISSA (International School for Advanced Studies).

# All rights reserved.
#
# This file is distributed under the terms of the MIT License.
# See the file 'LICENSE' in the root directory of the present
# distribution, or http://opensource.org/licenses/MIT.
#

# Patching and expansion of the xmlschema.dataobjects.DataElement object features
# to get the best demonstration, change the schema variable to your .xsd, and xmlobj to your .xml files

# then run this as $ python -i filename.py

from typing import Any, Optional, Union, Tuple
#from types import MethodType

class ValueLockedError(Exception):
  def __init__(self, obj, variable_name):
    self.message = "Can't set ." + variable_name + \
                   "\nThe object:\n" + str(obj) + \
                   "\nis Locked (._locked is set to True)"
    super().__init__(self.message)

# importing in order necessary for intended monkey patch
import elementpath.etree as ep_etree

# Monkey patching additional static functions to the import of elementpath.etree


# for namespace management of xml.etree.ElementTree code paths (which use
# the global variable register_namespace._namespace_map for namespace registering)

def etree_remove_registered_namespace(elem: ep_etree.ElementProtocol,
                                      uri: str = '') -> None:
  etree_module: Any
  if not ep_etree.is_etree_element(elem):
    raise TypeError(f"{elem!r} is not an Element")
  elif isinstance(elem, ep_etree.PyElementTree.Element):
    etree_module = ep_etree.PyElementTree
  elif not hasattr(elem, 'nsmap'):
    etree_module = ep_etree.ElementTree
  else:
    import lxml.etree as etree_module  # type: ignore[no-redef]

  if not hasattr(elem, 'nsmap'):
    if uri in etree_module.register_namespace._namespace_map:
      del etree_module.register_namespace._namespace_map[uri]
  else:
    # TODO research this for better understanding
    # _namespace_map is uri->prefix
    # DataElement.nsmap prefix->uri
    # lxml etree .nsmap ?->?
    # not using lxml anyway so not really an issue as
    # this condition shouldn't be met
    for key, value in elem.nsmap.items():
      # research - can there be multiple instances of uri to prefix?..
      # or are they intended to be 1:1?..
      if value == uri:
        if key in elem.nsmap:
          del elem.nsmap[key]

#patching
setattr(ep_etree, "etree_remove_registered_namespace",
staticmethod(etree_remove_registered_namespace))

# for namespace management of xml.etree.ElementTree code paths (which use
# the global variable register_namespace._namespace_map for namespace registering)

def etree_get_registered_namespaces(elem: ep_etree.ElementProtocol) -> dict:
  etree_module: Any
  if not ep_etree.is_etree_element(elem):
    raise TypeError(f"{elem!r} is not an Element")
  elif isinstance(elem, ep_etree.PyElementTree.Element):
    etree_module = ep_etree.PyElementTree
  elif not hasattr(elem, 'nsmap'):
    etree_module = ep_etree.ElementTree
  else:
    import lxml.etree as etree_module  # type: ignore[no-redef]

  if not hasattr(elem, 'nsmap'):
    return etree_module.register_namespace._namespace_map
  else:
    return elem.nsmap  # shouldn't be met

#patching
setattr(ep_etree, "etree_get_registered_namespaces",
staticmethod(etree_get_registered_namespaces))

# for namespace management of xml.etree.ElementTree code paths (which use
# the global variable register_namespace._namespace_map for namespace registering)

def etree_register_namespace(elem: ep_etree.ElementProtocol,
                             prefix: str = None,
                             uri: str = None) -> None:
  etree_module: Any
  if not ep_etree.is_etree_element(elem):
    raise TypeError(f"{elem!r} is not an Element")
  elif isinstance(elem, ep_etree.PyElementTree.Element):
    etree_module = ep_etree.PyElementTree
  elif not hasattr(elem, 'nsmap'):
    etree_module = ep_etree.ElementTree
  else:
    import lxml.etree as etree_module  # type: ignore[no-redef]

  if prefix != None and uri != None:
    if not hasattr(elem, 'nsmap'):
      etree_module.register_namespace(prefix, uri)
    else:
      # TODO research this for better understanding
      # _namespace_map is uri->prefix
      # DataElement.nsmap prefix->uri
      # lxml etree .nsmap ?->?
      # not using lxml anyway so not really an issue as
      # this condition shouldn't be met
      elem.nsmap[prefix] = uri

Re: Python - working with xml/lxml/objectify/schemas, datatypes, and assignments

2023-01-15 Thread aapost

On 1/11/23 13:21, Dieter Maurer wrote:

aapost wrote at 2023-1-10 22:15 -0500:

On 1/4/23 12:13, aapost wrote:

On 1/4/23 09:42, Dieter Maurer wrote:
...

You might have a look at `PyXB`, too.
It tries hard to enforce schema restrictions in Python code.

...

Unfortunately picking it apart for a while and diving deeper in to a
rabbit hole, PyXB looks to be a no-go.

PyXB while interesting, and I respect it's complexity and depth, is
lacking in design consistency in how it operates if you are trying to
modify and work with the resulting structure intuitively.
... problem with simple types ...


I use `PyXB` in `dm.saml2` and `dm.zope.saml2`, i.e. with
the SAML2 schema definitions (which include those
of XML signature and XML encryption).
I had no problems with simple types. I just assign them to attributes
of the Python objects representing the XML elements.
`PyXB` does the right thing when it serializes those objects into XML.


It does do a lot of good things, and I am sad to see all the good work 
in it not get used, but for me it really boils down to what it sums 
up about itself in a couple of comments from the author in its first file 
(and I appreciate the author's honesty, because those are comments I could 
see myself writing in a similar situation)...


##
class cscRoot (object):
"""This little bundle of joy exists because in Python 2.6 it
became an error to invoke C{object.__init__} with parameters (unless
you also override C{__new__}, in which case it's only a warning.
Whatever.).  Since I'm bloody not going to check in every class
whether C{super(Myclass,self)} refers to C{object} (even if I could
figure out how to do that, 'cuz the obvious solutions don't work),
we'll just make this thing the root of all U{cooperative super calling} hierarchies.
##

##
def __init__ (self, *args, **kw):
# Oh gross.  If this class descends from list (and probably dict), we
# get here when object is *not* our direct superclass.  In that case,
# we have to pass the arguments on up, or the strings don't get
# created right.  Below is the only way I've figured out to detect the
# situation.
#
# Note that we might also get here if you mix-in a class that used
# object as a parent instead of cscRoot.  Don't do that.  Printing the
# mro() is a decent way of identifying the problem.
##

Using that suggestion you can see that on simple types
>>> pyxbxmlroot.SomeString._mro()
[<class '...'>, <class 'pyxb.binding.basis.simpleTypeDefinition'>, <class 'pyxb.binding.basis._TypeBinding_mixin'>, <class 'pyxb.utils.utility.Locatable_mixin'>, <class 'pyxb.utils.utility._DeconflictSymbols_mixin'>, <class 'pyxb.binding.basis._DynamicCreate_mixin'>, <class 'str'>, <class 'object'>]


it has a python type that it sends all the way up right next to object, 
when that doesn't actually occur until after simpleType in class string 
(basis.simpleTypeDefinition, str):


This makes the object dependent on its parent: since it itself IS the 
value, I can't assign to it or do anything to it by itself, or it and all 
the other stuff goes away. As designed it is very hard to change 
anything in it without breaking something.


Working with xmlschema pretty much confirmed my assumption 
that it doesn't need to be that way. I was able to follow what was going 
on and tweak xmlschema fairly easily.


That and the fact that PyXB was abandoned 5-6 years ago make it a strong 
no-go to use in a project. It would need to be adopted with fresh 
development, stripped of the python2 stuff, and the object structure 
redesigned in a more uniform way with functionality properly 
containerized instead of all stuffed together...

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread Dino

On 1/15/2023 2:23 PM, Weatherby,Gerard wrote:

That’s about what I got using a Python dictionary on random data on a high 
memory machine.

https://github.com/Gerardwx/database_testing.git

It’s not obvious to me how to get it much faster than that.


Gerard, you are a rockstar. This is going to be really useful if I do 
decide to adopt sqlite3 for my PoC, as I understand what's going on 
conceptually, but never really used sqlite (nor SQL in a long long 
time), so this may save me a bunch of time.


I created a 300 Mb DB using your script. Then:

$ ./readone.py
testing 2654792 of 4655974
Found somedata0002654713 for 1ed9f9cd-0a9e-47e3-b0a7-3e1fcdabe166 in 
0.23933520219 seconds


$ ./prefetch.py
Index build 4.42093784897 seconds
testing 3058568 of 4655974
Found somedata202200 for 5dca1455-9cd6-4e4d-8e5a-7e6400de7ca7 in 
4.443999403715e-06 seconds


So, if I understand right:

1) once I built a dict out of the DB (in about 4 seconds), I was able to 
lookup an entry/record in 4 microseconds(!)


2) looking up a record/entry using a Sqlite query took 0.2 seconds (i.e. 
tens of thousands of times slower)


Interesting. Thank you for this. Very informative. I really appreciate 
that you took the time to write this.


The conclusion seems to me that I probably don't want to go the Sqlite 
route, as I would be placing my data into a database just to extract it 
back into a dict when I need it if I want it fast.


Ps: a few minor fixes to the README as this may be helpful to others.

./venv/... => ./env/..

i.e.
 ./env/bin/pip install -U pip
 ./env/bin/pip install -e .

Also add part in []

Run create.py [size of DB in bytes] prior to running readone.py and/or 
prefetch.py


BTW, can you tell me what is going on here? what's := ?

   while (increase := add_some(conn,adding)) == 0:

https://github.com/Gerardwx/database_testing/blob/main/src/database_testing/create.py#L40

Dino
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fast lookup of bulky "table"

2023-01-15 Thread David
On Mon, 16 Jan 2023 at 16:15, Dino  wrote:

> BTW, can you tell me what is going on here? what's := ?
>
> while (increase := add_some(conn,adding)) == 0:

See here:
  https://docs.python.org/3/reference/expressions.html#assignment-expressions
  https://realpython.com/python-walrus-operator/
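A minimal illustration (add_some here is just a stand-in for the function in create.py):

import random

def add_some():
    # Pretend work that sometimes reports no progress.
    return random.choice([0, 0, 5])

# := assigns the return value to increase AND uses it in the loop test,
# so the call doesn't have to be written twice.
while (increase := add_some()) == 0:
    pass

print(increase)   # the last value bound by the walrus expression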
-- 
https://mail.python.org/mailman/listinfo/python-list