Hi Wilson,

Looking at the script I see some room for improvement. You currently declare %hash as a global variable and keep it around forever. With tens of millions of rows, that is quite a large structure to leave sitting around after you have built the %stat hash, so I would start by limiting the scope of %hash from global to just the code that fills %stat. Beyond that, there is the possibility of swapping the key and the value in your stats hash, making the value an array of item ids, or even a string of item ids (which might make sense if you never expect to print all of them, only a top 1000 or so).
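Roughly what I mean by limiting the scope, as a quick sketch (untested against your real data; I have wrapped it in a sub so the totals/counts hash is freed as soon as the sub returns):

```perl
use strict;
use warnings;

# Returns a hashref of item => average rate. The intermediate
# %hash of totals and counts lives only inside this sub, so Perl
# can reclaim that memory the moment we return.
sub average_rates {
    my ($file) = @_;
    my %hash;
    open my $hd, '<', $file or die "$file: $!";
    while (<$hd>) {
        # dataset: userId, itemId, rate, time
        my ($item, $rate) = (split /,/)[1, 2];
        $hash{$item}{total} += $rate;
        $hash{$item}{count}++;
    }
    close $hd;
    my %stat = map { $_ => $hash{$_}{total} / $hash{$_}{count} }
               keys %hash;
    return \%stat;
}

# my $stat = average_rates('rate.csv');   # %hash is already gone here
```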
As for sorting a huge hash in memory, there comes a point where one has to decide whether Perl is the right tool for the job. It can probably do it, but much like you can hammer a nail into a wall with a Swiss Army knife, Perl is likely not the best tool for the job here. Looking around, I mostly see people advising a database solution of some description, be that Berkeley DB, SQLite, or a similar relatively simple database; you are likely to find that a much faster option than sorting such large hashes in pure Perl. The reason Spark is so much faster at this, and more memory efficient at the same time, is that it was designed to handle huge datasets like this, which very often need to be sorted or counted in some way. You can definitely get the same output using Perl, but you will likely find that you are looking at a nail while holding a Swiss Army knife.

Generally speaking, keeping any variable around for any length of time after you are done with it is bad practice in all languages. In an interpreted language such as Perl there is no way for an optimization step to know you are never going to use a variable again, so Perl will hold on to it until the program exits. In a compiled language, a compiler might be able to see that this is the last time a variable is used and evict it from memory on your behalf, but that is only possible in some cases; in many others such optimisations will not kick in, because the compiler cannot be certain that, for instance, a function that uses the variable will never be called again. This is why global variables are generally a bad idea in all languages, but in interpreted languages even more so.

Hope that helps a bit,
Rob

On Sun, Apr 17, 2022 at 8:00 PM David Mertens <dcmertens.p...@gmail.com> wrote:

> I see nothing glaringly inefficient in the Perl.
> This would be fine on your system if you were dealing with 1 million
> items, but you could easily be pushing up against your system's limits
> with the generic data structures that Perl uses, especially since Perl
> is probably using 64-bit floats and ints, and storing the hash keys
> twice (because you have two hashes).
>
> You could try to use the Perl Data Language, PDL, to create large typed
> arrays with minimal overhead. However, I think a more Perlish approach
> would be to use a single hash to store the data, as you do (or maybe
> using pack/unpack to store the data using 32-bit floats and integers).
> Then instead of using sort, run through the whole collection and build
> your own top-20 list (or 50 or whatever) by hand. This way the final
> process of picking out the top 20 doesn't allocate new storage for all
> 80 million items.
>
> Does that make sense? I could bang out some code illustrating what I
> mean if that would help.
>
> David
>
> On Sun, Apr 17, 2022, 5:33 AM wilson <i...@bigcount.xyz> wrote:
>
>> hello experts,
>>
>> can you help check my script for how to optimize it?
>> currently it dies with "Out of memory".
>>
>> $ perl count.pl
>> Out of memory!
>> Killed
>>
>> My script:
>>
>> use strict;
>>
>> my %hash;
>> my %stat;
>>
>> # dataset: userId, itemId, rate, time
>> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>>
>> open HD,"rate.csv" or die $!;
>> while(<HD>) {
>>     my ($item,$rate) = (split /\,/)[1,2];
>>     $hash{$item}{total} += $rate;
>>     $hash{$item}{count} += 1;
>> }
>> close HD;
>>
>> for my $key (keys %hash) {
>>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
>> }
>>
>> my $i = 0;
>> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>>     print "$_: $stat{$_}\n";
>>     last if $i == 99;
>>     $i ++;
>> }
>>
>> The purpose is to aggregate and average the itemId's scores, and print
>> the result after sorting.
>>
>> The dataset has 80+ million items:
>>
>> $ wc -l rate.csv
>> 82677131 rate.csv
>>
>> And my memory is somewhat limited:
>>
>> $ free -m
>>                total        used        free      shared  buff/cache   available
>> Mem:            1992         152          76           0        1763        1700
>> Swap:           1023         802         221
>>
>> What confuses me is that Apache Spark can get this job done within
>> this limited memory; it finished the statistics within 2 minutes. But
>> I want to give Perl a try, since it's not that convenient to always
>> run a Spark job.
>>
>> The Spark implementation:
>>
>> scala> val schema = "uid STRING,item STRING,rate FLOAT,time INT"
>> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>>
>> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
>> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>>
>> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
>> +----------+--------+
>> |      item|avg_rate|
>> +----------+--------+
>> |0001061100|     5.0|
>> |0001543849|     5.0|
>> |0001061127|     5.0|
>> |0001019880|     5.0|
>> |0001062395|     5.0|
>> |0000143502|     5.0|
>> |000014357X|     5.0|
>> |0001527665|     5.0|
>> |000107461X|     5.0|
>> |0000191639|     5.0|
>> |0001127748|     5.0|
>> |0000791156|     5.0|
>> |0001203088|     5.0|
>> |0001053744|     5.0|
>> |0001360183|     5.0|
>> |0001042335|     5.0|
>> |0001374400|     5.0|
>> |0001046810|     5.0|
>> |0001380877|     5.0|
>> |0001050230|     5.0|
>> +----------+--------+
>> only showing top 20 rows
>>
>> I think it should be possible to optimize my Perl script to run this
>> job as well, so I am asking for your help.
>>
>> Thanks in advance.
>>
>> wilson
>>
>> --
>> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
>> For additional commands, e-mail: beginners-h...@perl.org
>> http://learn.perl.org/
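P.S. To make David's "build your own top-20 list by hand" suggestion concrete: something along these lines might work (a rough sketch with a hypothetical top_n helper of my own, not tested at 80-million-key scale; it assumes %stat already maps item => average rate). It keeps only a small buffer of the best entries instead of asking sort to order every key:

```perl
use strict;
use warnings;

# Scan all of %$stat once, keeping only the $n best (item, avg)
# pairs in a small array sorted descending by average. No copy of
# the full key list is ever made, unlike sort keys %stat.
sub top_n {
    my ($stat, $n) = @_;
    my @top;    # each element: [ $item, $avg ], best first
    while (my ($item, $avg) = each %$stat) {
        # cheap rejection: buffer is full and this one is no better
        # than the current worst entry
        next if @top == $n && $avg <= $top[-1][1];
        # simple insertion keeps @top sorted; fine for small $n
        my $i = 0;
        $i++ while $i < @top && $top[$i][1] >= $avg;
        splice @top, $i, 0, [ $item, $avg ];
        pop @top if @top > $n;
    }
    return \@top;
}

# my $best = top_n(\%stat, 20);
# printf "%s: %s\n", @$_ for @$best;
```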