On Sun, 2022-04-17 at 17:33 +0800, wilson wrote:
> hello the experts,
>
> can you help check my script for how to optimize it?
> currently it was going as "run out of memory".
>
> $ perl count.pl
> Out of memory!
> Killed

I would use a database like MariaDB for this, not only to create the
report but also to store the data. Check out https://dbi.perl.org/ and
phpMyAdmin. IIRC DBI can also work with CSV files as database tables
(DBD::CSV). There's also Text::CSV ...

If you want to re-invent the wheel, you can do it in Perl, going itemId
by itemId, line by line, using another file to store the results per
itemId, and then coming up with a variant of bubble sort that works on
files, too. That assumes you don't have nearly as many itemIds as you
have rows. Maybe you can make it faster by using an index on the itemId
field. Two rough sketches follow below the quoted mail.

> My script:
>
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open HD,"rate.csv" or die $!;
> while (<HD>) {
>     my ($item,$rate) = (split /\,/)[1,2];
>     $hash{$item}{total} += $rate;
>     $hash{$item}{count} += 1;
> }
> close HD;
>
> for my $key (keys %hash) {
>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>     print "$_: $stat{$_}\n";
>     last if $i == 99;
>     $i++;
> }
>
> The purpose is to aggregate and average the itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million items:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>               total   used   free   shared   buff/cache   available
> Mem:           1992    152     76        0         1763        1700
> Swap:          1023    802    221
>
> What confuses me is that Apache Spark can get this job done with this
> limited memory. It finished the statistics within 2 minutes. But I want
> to give Perl a try, since it's not that convenient to run a Spark job
> every time.
>
> The Spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100|     5.0|
> |0001543849|     5.0|
> |0001061127|     5.0|
> |0001019880|     5.0|
> |0001062395|     5.0|
> |0000143502|     5.0|
> |000014357X|     5.0|
> |0001527665|     5.0|
> |000107461X|     5.0|
> |0000191639|     5.0|
> |0001127748|     5.0|
> |0000791156|     5.0|
> |0001203088|     5.0|
> |0001053744|     5.0|
> |0001360183|     5.0|
> |0001042335|     5.0|
> |0001374400|     5.0|
> |0001046810|     5.0|
> |0001380877|     5.0|
> |0001050230|     5.0|
> +----------+--------+
> only showing top 20 rows
>
> I think it should be possible to optimize my Perl script to run this
> job as well. So I'm asking for your help.
>
> Thanks in advance.
>
> wilson
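To make the database route concrete, here is a rough, untested sketch using
DBI with the DBD::mysql driver (which also talks to MariaDB). The database
name "ratings", the credentials, the table layout and the use of LOAD DATA
LOCAL INFILE are all assumptions to adapt to your own setup:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Assumed connection details -- adjust database name, user and password.
my $dbh = DBI->connect(
    "DBI:mysql:database=ratings;host=localhost;mysql_local_infile=1",
    "user", "password",
    { RaiseError => 1 },
);

# One-time setup: a table matching the CSV columns, with an index on item.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS rate (
        uid  VARCHAR(32),
        item VARCHAR(16),
        rate FLOAT,
        time INT,
        INDEX (item)
    )
});

# Bulk-load the CSV (local_infile must be allowed on the server, too).
$dbh->do(q{
    LOAD DATA LOCAL INFILE 'rate.csv'
    INTO TABLE rate
    FIELDS TERMINATED BY ','
    (uid, item, rate, time)
});

# Let the database do the aggregation, sorting and limiting,
# just like the Spark job does.
my $sth = $dbh->prepare(q{
    SELECT item, AVG(rate) AS avg_rate
    FROM rate
    GROUP BY item
    ORDER BY avg_rate DESC
    LIMIT 100
});
$sth->execute;

while ( my ($item, $avg) = $sth->fetchrow_array ) {
    print "$item: $avg\n";
}

$dbh->disconnect;

The memory pressure goes away because the aggregation happens inside the
server; the Perl side only ever holds the 100 result rows.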
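And here is one way the "itemId by itemId, line by line, results to another
file" idea could look in plain Perl. It is only a sketch: it keeps a single
flat hash of packed total/count pairs (so it still assumes far fewer distinct
itemIds than rows), writes the per-item averages to a work file (avg.csv and
avg_sorted.csv are made-up names), and leans on the external sort(1) utility
instead of a hand-rolled file-based bubble sort:

#!/usr/bin/perl
use strict;
use warnings;

my %item;    # itemId => "running_total:count"

# Pass 1: stream the CSV and accumulate total and count per itemId.
# A flat hash with one packed string per item needs far less memory
# than the original hash of hashes.
open my $in, '<', 'rate.csv' or die "rate.csv: $!";
while (<$in>) {
    my ($id, $rate) = (split /,/)[1, 2];
    my ($total, $count) = split /:/, ($item{$id} // '0:0');
    $item{$id} = ($total + $rate) . ':' . ($count + 1);
}
close $in;

# Pass 2: write one "average,itemId" line per item to a work file ...
open my $out, '>', 'avg.csv' or die "avg.csv: $!";
for my $id (keys %item) {
    my ($total, $count) = split /:/, $item{$id};
    printf {$out} "%.6f,%s\n", $total / $count, $id;
}
close $out;

# ... and let sort(1) order it on disk (-n numeric, -r descending).
system('sort', '-t,', '-k1,1', '-nr', '-o', 'avg_sorted.csv', 'avg.csv') == 0
    or die "sort failed: $?";

# Finally print the top 100.
open my $sorted, '<', 'avg_sorted.csv' or die "avg_sorted.csv: $!";
while (<$sorted>) {
    last if $. > 100;
    chomp;
    my ($avg, $id) = split /,/;
    print "$id: $avg\n";
}
close $sorted;

If even the flat hash is too big for 2 GB, you would have to push the
accumulation step to disk as well, e.g. by sorting rate.csv on the itemId
field first and then averaging each group as it streams past.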