Hi wilson,

Try the File::Slurp module.

Regards,
Manikandan

On Sun, 17 Apr, 2022, 15:03 wilson, <i...@bigcount.xyz> wrote:

> hello the experts,
>
> can you help check my script for how to optimize it?
> currently it was going as "run out of memory".
>
> $ perl count.pl
> Out of memory!
> Killed
>
>
> My script:
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open HD,"rate.csv" or die $!;
> while(<HD>) {
>     my ($item,$rate) = (split /\,/)[1,2];
>     $hash{$item}{total} += $rate;
>     $hash{$item}{count} +=1;
> }
> close HD;
>
> for my $key (keys %hash) {
>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a}} keys %stat) {
>     print "$_: $stat{$_}\n";
>     last if $i == 99;
>     $i ++;
> }
>
> The purpose is to aggregate and average the itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million items:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>         total   used   free   shared   buff/cache   available
> Mem:     1992    152     76        0         1763        1700
> Swap:    1023    802    221
>
>
> What confused me is that Apache Spark can make this job done with this
> limited memory. It got the statistics done within 2 minutes. But I want
> to give perl a try since it's not that convenient to run a spark job
> always.
>
> The spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100|     5.0|
> |0001543849|     5.0|
> |0001061127|     5.0|
> |0001019880|     5.0|
> |0001062395|     5.0|
> |0000143502|     5.0|
> |000014357X|     5.0|
> |0001527665|     5.0|
> |000107461X|     5.0|
> |0000191639|     5.0|
> |0001127748|     5.0|
> |0000791156|     5.0|
> |0001203088|     5.0|
> |0001053744|     5.0|
> |0001360183|     5.0|
> |0001042335|     5.0|
> |0001374400|     5.0|
> |0001046810|     5.0|
> |0001380877|     5.0|
> |0001050230|     5.0|
> +----------+--------+
> only showing top 20 rows
>
>
> I think my perl script should be possible to be optimized to run this
> job as well. So ask for your helps.
>
> Thanks in advance.
>
> wilson
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
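
For the memory limit described in the quoted message, one option that may be worth trying is to let sort(1) group the rows by itemId first and then stream the sorted file. Perl then only needs one running total and a short list of candidates in memory, instead of a hash entry per item. The following is a minimal sketch under assumptions, not a tested solution for the real rate.csv: the sub name top_averages and the file name rate.sorted.csv are hypothetical, and it assumes no field contains an embedded comma, as in the sample row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read lines already grouped by itemId (2nd CSV field) from $fh and
# return the $n highest averages as [item, average] pairs, best first.
# Assumption: the input was pre-sorted, e.g.
#   sort -t, -k2,2 rate.csv > rate.sorted.csv
sub top_averages {
    my ($fh, $n) = @_;
    my @top;                          # never holds more than 2*$n candidates
    my ($current, $total, $count);    # state of the group being read

    # Close the running group and keep its average as a candidate.
    my $flush = sub {
        return unless defined $current;
        push @top, [ $current, $total / $count ];
        if (@top >= 2 * $n) {         # trim so the list stays small
            @top = sort { $b->[1] <=> $a->[1] } @top;
            splice @top, $n;
        }
    };

    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $item, $rate) = split /,/, $line;
        next unless defined $rate;    # skip malformed lines
        if (!defined $current || $item ne $current) {
            $flush->();               # a new itemId starts here
            ($current, $total, $count) = ($item, 0, 0);
        }
        $total += $rate;
        $count++;
    }
    $flush->();                       # do not forget the last group

    @top = sort { $b->[1] <=> $a->[1] } @top;
    splice @top, $n if @top > $n;
    return @top;
}

# Usage: perl top_avg.pl rate.sorted.csv
if (my $path = shift @ARGV) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    printf "%s: %s\n", @$_ for top_averages($fh, 100);
    close $fh;
}
```

The sort step does the heavy lifting on disk, so the Perl side stays small no matter how many distinct items there are. Note that File::Slurp's read_file loads the file contents into memory, so with 82 million lines and about 2 GB of RAM, reading line by line as above is probably the safer choice.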