On Wednesday, 31 March 2021 at 05:45:27 UTC+3, cameron...@gmail.com wrote:
> Since everyone is talking about vague OS memory use and not at all about
> working set size of Python objects, let me ...
> On 29Mar2021 03:12, Alexey <zen.su...@gmail.com> wrote:
> >I'm experiencing problems with memory consumption.
> >
> >I have a class which is doing an ETL job. What's happening inside:
> > - fetching existing objects from DB via SQLAlchemy
> Do you need to? Or do you only need to fetch their ids? Or do you only
> need to fetch a subset of the objects?

I really need all the objects because I'm performing update and create
operations. If I fetch them on the fly, it will take hours or even days
to complete.

> It is easy to accidentally suck in way too many db session entity
> objects, or at any rate, more than you need to.
> > - iterate over raw data
> Can you prescan the data to determine which objects you care about,
> reducing the number of objects you need to obtain?

In this case I still need to iterate over both the raw and the old data.
As I said before, if I try it without caching it will take days.

> > - create new/update existing objects
> Depending on what you're doing, you may not need to "create new/update
> existing objects". You could collate changes and do an UPSERT (the
> incantation varies a little depending on the SQL dialect behind
> SQLAlchemy).

Good advice.
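If I understand the idea, for PostgreSQL it would look roughly like the
sketch below. A rough, untested sketch only: items_table and its "id"
column are placeholders, not my real schema.

    # Batched UPSERT with SQLAlchemy Core on the PostgreSQL dialect.
    # items_table and its "id" primary key stand in for the real table.
    from sqlalchemy.dialects.postgresql import insert

    def upsert_batch(connection, items_table, rows):
        """Insert `rows` (a list of dicts), updating existing rows on id conflict."""
        stmt = insert(items_table).values(rows)
        # On conflict, overwrite every non-key column with the incoming value.
        update_cols = {
            col.name: stmt.excluded[col.name]
            for col in items_table.columns
            if col.name != "id"
        }
        stmt = stmt.on_conflict_do_update(
            index_elements=[items_table.c.id],
            set_=update_cols,
        )
        connection.execute(stmt)

Then the main loop would just accumulate plain dicts and call this every few
thousand rows, instead of creating and flushing ORM objects.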
> > - commit changes
> >
> Do you discard the SQLAlchemy session after this? Otherwise it may lurk
> and hold onto the objects.

Commit doesn't forget the objects. I tried expire_all() and expunge_all().
Should I try rollback()?

> For my current client we have a script to import historic data from a
> legacy system. It has many of the issues you're dealing with: the naive
> (ORM) way consumes gads of memory, and can be very slow too (updating
> objects in an ad hoc manner tends to do individual UPDATE SQL commands,
> very latency laden).
>
> I wrote a generic batch UPSERT function which took an accrued list of
> changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The
> main script hands it the accrued updates and it runs batches (which lets
> us do progress reporting). Orders of magnitude faster, _and_ it does not
> require storing the db objects.
>
> On the subject of "fetching existing objects from DB via SQLAlchemy": you
> may not need to do that, either. Can you identify _which_ objects are of
> interest? Associated with the same script I've got a batch_select
> function: it takes an iterable of object ids and collects them in
> batches, where before we were really scanning the whole db because we
> had an arbitrary scattering of relevant object ids from the raw data.

I'll try to analyze whether it's possible to rewrite the code this way.
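Presumably the fetching side would then have roughly this shape. Again a
rough, untested sketch; Item stands in for my actual model:

    # Fetch only the objects whose ids appear in the raw data, in chunks,
    # instead of loading the whole table into the cache up front.
    # Item is a placeholder ORM model with an integer primary key "id".
    def fetch_existing(session, Item, wanted_ids, batch_size=1000):
        """Yield existing Item rows whose id is in wanted_ids, in batches."""
        wanted_ids = list(wanted_ids)
        for start in range(0, len(wanted_ids), batch_size):
            chunk = wanted_ids[start:start + batch_size]
            for obj in session.query(Item).filter(Item.id.in_(chunk)):
                yield obj

The ids would come from a prescan of the raw data, so the cache would only
hold rows the run can actually touch.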
> It basically collected ids into batches, and ran a SELECT...WHERE id in
> (batch-of-ids). It's really fast considering, and also scales _way_ down
> when the set of arbitrary ids is small.
>
> I'm happy to walk through the mechanics of these with you; the code at
> this end is Django's ORM, but I prefer SQLAlchemy anyway - the project
> dictated the ORM here.
> >Before processing data I create an internal cache (dictionary) and store
> >all existing objects in it.
> >Every 10000 items I do bulk insert and flush. At the end I run commit
> >command.
> Yah. I suspect the session data are not being released. Also, SQLAlchemy
> may be caching sessions or something across runs, since this is a celery
> worker which survives from one task to the next.

I tried to dig in this direction. I created a few graphs with "objgraph",
but there are so many references under the hood. I'll try to measure the
size of the session object before and after building the cache.

> You could try explicitly creating a new SQLAlchemy session around your
> task.
> >Problem. Before executing, my interpreter process weighs ~100Mb, after
> >the first run memory increases up to 500Mb and after the second run it
> >weighs 1Gb. If I continue to run this class, memory won't increase, so
> >I think it's not a memory leak, but rather Python won't release
> >allocated memory back to the OS. Maybe I'm wrong.
> I don't know enough about Python's "release OS memory" phase. But
> reducing the task memory footprint will help regardless.

Definitely. I'll think about it. Thank you!
-- 
https://mail.python.org/mailman/listinfo/python-list