Our referential world is very restricted, whatever area we are talking about.

On 17/11/2014 21:04, Alain Rastoul wrote:
You are saying that the zip ratio is somewhat related to how normalized the
data is; interesting view, and certainly true :)
And right, this effectively normalizes all fields, a technique used in
specialized columnstore databases (MonetDB and others), often BI
databases, where an id represents each value (that is what my experiments were about).
About DateTimes, I think they are no different from other values:
using a pointer to an interned value should be equivalent to using an
int, as it would be a 32-bit pointer, and with this approach using
compact records should not make a big difference either if there are not a
lot of distinct values.
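A minimal sketch of that interning idea in a workspace (the pool and block
names are only illustrative, not from any particular library):

  | pool intern a b |
  pool := Dictionary new.
  intern := [ :value | pool at: value ifAbsentPut: [ value ] ].
  a := intern value: (Date year: 2014 month: 11 day: 17).
  b := intern value: (Date year: 2014 month: 11 day: 17).
  a == b. "true: both records end up pointing to the same object"

Each distinct value is allocated once, and every further occurrence costs
only one pointer per record.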
The key point I mentioned here is that in real life this "normalizing ratio"
is very high for almost every kind of data, and that is what puzzles me
(not the technique).

Regards,
Alain

On 17/11/2014 10:47, Stephan Eggermont wrote:
Open the package contents of your VM,
open Contents,
and take a look at the info.plist:

    <key>SqueakMaxHeapSize</key>
    <integer>541065216</integer>

That value needs to be increased to be able to use more than ~512 MB.
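For example, to allow roughly 1 GB you could raise it to something like the
following (the exact ceiling depends on your 32-bit VM, so treat the number
as an illustration only):

    <key>SqueakMaxHeapSize</key>
    <integer>1073741824</integer>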

Alain wrote:
Let's say that is your current requirement and you want to do it like that;
here is a trick that may help you: during personal experiments with loading
database data into memory and computing statistics over it, I found that most
often 70 to 80% of real data is the same.

It is easy to confirm whether this is the case in your data: just zip the
CSV file.
Reasonably structured relational database output often reduces to 10%
of its original size.
With explicitly denormalized data I've seen 99% reduction.
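If you want to look at the ratio from inside Pharo, here is a quick sketch,
assuming you already zipped data.csv into data.zip with your OS tools (the
file names are just placeholders):

  | original compressed |
  original := 'data.csv' asFileReference size.
  compressed := 'data.zip' asFileReference size.
  Transcript
      show: 'compressed/original: ';
      show: ((compressed / original) asFloat roundTo: 0.01) printString;
      cr.

A ratio around 0.1 suggests there is a lot of repeated data to exploit.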

In addition, DateAndTime has a rather wasteful representation for your
purpose.
Just reduce it to one SmallInteger, or, with Pharo 4, use slots to get a more
compact record representation.
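A hedged sketch of the SmallInteger idea (the epoch and block names are
illustrative; pick an epoch that suits your data's range):

  | epoch encode decode |
  epoch := DateAndTime fromString: '2014-01-01T00:00:00+00:00'.
  encode := [ :dateAndTime | (dateAndTime - epoch) asSeconds truncated ].
  decode := [ :seconds | epoch + seconds seconds ].
  (encode value: DateAndTime now) class. "SmallInteger"

Seconds relative to a fixed epoch fit comfortably in a SmallInteger, so each
record carries one immediate integer instead of a full DateAndTime instance.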

Stephan