On 29Mar2021 03:12, Alexey <zen.supag...@gmail.com> wrote:
>I'm experiencing problems with memory consumption.
>
>I have a class which is doing ETL job. What's happening inside:
> - fetching existing objects from DB via SQLAlchemy

Do you need to? Or do you only need to fetch their ids? Or do you only need to fetch a subset of the objects? It is easy to accidentally suck in way too many db session entity objects, or at any rate more than you need to.

> - iterate over raw data

Can you prescan the data to determine which objects you care about, reducing the number of objects you need to obtain?

> - create new/update existing objects

Depending on what you're doing, you may not need to "create new/update existing objects". You could collate changes and do an UPSERT (the incantation varies a little depending on the SQL dialect behind SQLAlchemy).

> - commit changes

Do you discard the SQLAlchemy session after this? Otherwise it may lurk and hold onto the objects. Commit doesn't forget the objects.

For my current client we have a script to import historic data from a legacy system. It has many of the issues you're dealing with: the naive (ORM) way consumes gobs of memory, and can be very slow too (updating objects in an ad hoc manner tends to issue individual UPDATE SQL commands, which is very latency laden).

I wrote a generic batch UPSERT function which took an accrued list of changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The main script hands it the accrued updates and it runs them in batches (which lets us do progress reporting). Orders of magnitude faster, _and_ it does not require storing the db objects.

On the subject of "fetching existing objects from DB via SQLAlchemy": you may not need to do that, either. Can you identify _which_ objects are of interest? Associated with the same script I've got a batch_select function: it takes an iterable of object ids and collects the objects in batches, where before we were effectively scanning the whole db because we had an arbitrary scattering of relevant object ids from the raw data. It basically collects ids into batches and runs a SELECT...WHERE id IN (batch-of-ids). It's really fast considering, and it also scales _way_ down when the set of arbitrary ids is small.

I'm happy to walk through the mechanics of these with you; the code at this end uses Django's ORM (the project dictated the ORM here), but I prefer SQLAlchemy anyway.

>Before processing data I create internal cache(dictionary) and store all
>existing objects in it.
>Every 10000 items I do bulk insert and flush. At the end I run commit command.

Yah. I suspect the session data are not being released. Also, SQLAlchemy may be caching sessions or something across runs, since this is a celery worker which survives from one task to the next. You could try explicitly creating a new SQLAlchemy session around your task.

>Problem. Before executing, my interpreter process weighs ~100Mb, after first
>run memory increases up to 500Mb and after second run it weighs 1Gb. If I
>continue to run this class, memory won't increase, so I think it's not a
>memory leak, but rather Python won't release allocated memory back to the OS.
>Maybe I'm wrong.

I don't know enough about Python's "release OS memory" phase. But reducing the task memory footprint will help regardless.

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
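
Cameron's batch UPSERT helper itself isn't shown in the thread; a minimal sketch of the same idea in SQLAlchemy Core follows. It assumes PostgreSQL, a Table object whose primary key identifies conflicts, and incoming rows as plain dicts sharing the same keys; the name batch_upsert and the batch size are made up for illustration.

    from sqlalchemy.dialects.postgresql import insert

    BATCH_SIZE = 10000  # arbitrary; tune for your workload

    def batch_upsert(engine, table, rows):
        """UPSERT `rows` (a list of dicts with the same keys) into `table`.

        Issues INSERT ... ON CONFLICT DO UPDATE keyed on the primary key,
        so no ORM objects are created or kept in memory.
        """
        with engine.begin() as conn:
            for start in range(0, len(rows), BATCH_SIZE):
                batch = rows[start:start + BATCH_SIZE]
                stmt = insert(table).values(batch)
                # on conflict, overwrite every non-key column with the incoming value
                update_cols = {
                    c.name: stmt.excluded[c.name]
                    for c in table.columns
                    if not c.primary_key
                }
                conn.execute(
                    stmt.on_conflict_do_update(
                        index_elements=[c.name for c in table.primary_key],
                        set_=update_cols,
                    )
                )

Because the statement carries plain values rather than ORM instances, nothing accumulates in a session identity map between batches.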
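
A rough sketch of a batch_select along the same lines, assuming SQLAlchemy 1.4+ and a table with an `id` column; the function name and default batch size are illustrative.

    from itertools import islice

    from sqlalchemy import select

    def batch_select(conn, table, ids, batch_size=1000):
        """Yield rows of `table` whose id is in `ids`, fetching in batches."""
        ids = iter(ids)
        while True:
            batch = list(islice(ids, batch_size))
            if not batch:
                return
            # one round trip per batch: SELECT ... WHERE id IN (batch-of-ids)
            for row in conn.execute(select(table).where(table.c.id.in_(batch))):
                yield row

Only one batch of ids is materialised at a time, so the memory footprint tracks the batch size rather than the whole id set, and a small id set costs only a single small query.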
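
As for "explicitly creating a new SQLAlchemy session around your task": one possible shape, using the Session context manager from SQLAlchemy 1.4+; run_etl_task and do_the_etl are placeholder names for the celery task and its body.

    from sqlalchemy.orm import Session

    def do_the_etl(session, raw_data):
        """Placeholder for the actual ETL work done against `session`."""
        ...

    # the engine is created once per worker process; the session is per task
    def run_etl_task(engine, raw_data):
        with Session(engine) as session:
            try:
                do_the_etl(session, raw_data)
                session.commit()
            except Exception:
                session.rollback()
                raise
        # leaving the block closes the session, discarding its identity map
        # and any state it was holding across flushes

Scoping the session to the task means its objects become garbage as soon as the task ends, though CPython may still not hand all of that freed memory back to the OS straight away.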