On Friday, 23 September 2016 at 12:02:47 UTC+2, Chris Angelico wrote:
> On Fri, Sep 23, 2016 at 7:05 PM, Christian <mining.fa...@gmail.com> wrote:
> > I'm wondering why Python blows up a dictionary structure so much.
> >
> > The ids and cat substructures can have 0..n entries, but in most cases
> > they have <= 10; t is limited to <= 6.
> >
> > Example:
> >
> > {'0a0f7a3a0e09826caef1bff707785662': {'ids':
> > {'aa316b86-8169-11e6-bab9-0050563e2d7c',
> > 'aa3174f0-8169-11e6-bab9-0050563e2d7c',
> > 'aa319408-8169-11e6-bab9-0050563e2d7c',
> > 'aa3195e8-8169-11e6-bab9-0050563e2d7c',
> > 'aa319732-8169-11e6-bab9-0050563e2d7c',
> > 'aa319868-8169-11e6-bab9-0050563e2d7c',
> > 'aa31999e-8169-11e6-bab9-0050563e2d7c',
> > 'aa319b06-8169-11e6-bab9-0050563e2d7c'},
> > 't': {'type1', 'type2'},
> > 'dt': datetime.datetime(2016, 9, 11, 15, 15, 54, 343000),
> > 'nids': 8,
> > 'ntypes': 2,
> > 'cat': [('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''),
> > ('ABC', 'aa3174f0-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> > ('ABC', 'aa319408-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> > ('ABC', 'aa3195e8-8169-11e6-bab9-0050563e2d7c', '3', 'type2'),
> > ('ABC', 'aa319732-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> > ('ABC', 'aa319868-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> > ('ABC', 'aa31999e-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> > ('ABC', 'aa319b06-8169-11e6-bab9-0050563e2d7c', '3', 'type2')]},
> >
> > sys.getsizeof(superdict)
> > 50331744
> > len(superdict)
> > 941272
>
> So... you have a million entries in the master dictionary, each of
> which has an associated collection of data, consisting of half a dozen
> things, some of which have subthings. The very smallest an object will
> ever be on a 64-bit Linux system is 16 bytes:
>
> >>> sys.getsizeof(object())
> 16
>
> and most of these will be much larger:
>
> >>> sys.getsizeof(8)
> 28
> >>> sys.getsizeof(datetime.datetime(2016, 9, 11, 15, 15, 54, 343000))
> 48
> >>> sys.getsizeof([])
> 64
> >>> sys.getsizeof(('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''))
> 80
> >>> sys.getsizeof('aa316b86-8169-11e6-bab9-0050563e2d7c')
> 85
> >>> sys.getsizeof({})
> 240
>
> (Bear in mind that sys.getsizeof counts only the object itself, not
> the things it references - that's why the tuple can take up less space
> than one of its members.)

Thanks for this clarification!
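Ah, that explains the gap - sys.getsizeof(superdict) only counts the outer
hash table, not the million value dicts, sets, strings and tuples hanging
off it. If I want to see the real footprint I'd have to walk the references
myself, along the lines of this rough, untested sketch (it only descends
into the container types used in my example):

import sys

def total_size(obj, seen=None):
    # Rough recursive footprint: sys.getsizeof of obj plus everything it
    # references, counting each object only once (so shared strings aren't
    # double-counted).
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key, seen)
            size += total_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += total_size(item, seen)
    return size

print(total_size(superdict))  # should land far nearer the 3GB peak than
                              # the 50331744 reported for the outer dict alone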

> I don't think your collections can average less than about 1KB (even
> the textual representation of your example data is about that big),
> and you have a million of them. That's a gigabyte of memory, right
> there. Your peak memory usage is showing 3GB, so most likely, my
> conservative estimates have put an absolute lower bound on this. Try
> doing everything exactly the same as you did, only without actually
> loading the pickle - then see what memory usage is. I think you'll
> find that the usage is fully legitimate.
>
> > Thanks for any advice to save memory.
>
> Use a database. I suggest PostgreSQL. You won't have to load
> everything into memory all at once that way, and (bonus!) you can even
> update stuff on disk without rewriting everything.

Yes, it seems I can't avoid that, especially because the example dict is no
smaller than it will be with real data. I'm facing a trade-off between
performance and scalability: the dict construction should be as fast as
possible, and going through reads+writes (using MongoDB) is a performance
drawback.

Christian

> ChrisA
--
https://mail.python.org/mailman/listinfo/python-list