Hi Wilson,
Try the File::Slurp module.
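
If reading the whole file at once still does not fit in memory, a line-by-line variant with flatter data structures might be worth a try. Below is a minimal sketch using only core Perl, assuming the same rate.csv layout as in your mail (userId, itemId, rate, time). The sub name top_items is hypothetical, and whether this fits in 2 GB depends on how many distinct itemIds the file contains, which I cannot tell from the numbers given.

```perl
use strict;
use warnings;

# Hypothetical helper: stream "userId,itemId,rate,time" rows from a
# filehandle and return the top $n items by average rating.
sub top_items {
    my ($fh, $n) = @_;
    my (%total, %count);    # two flat hashes instead of a hash of hashes

    while (my $line = <$fh>) {
        chomp $line;
        my ($item, $rate) = (split /,/, $line)[1, 2];
        next unless defined $rate;    # skip blank or malformed rows
        $total{$item} += $rate;
        $count{$item}++;
    }

    # Turn the totals into averages in place, so no third hash is built.
    $total{$_} /= $count{$_} for keys %total;
    %count = ();

    # Highest average first; ties are broken by itemId for stable output.
    my @sorted = sort { $total{$b} <=> $total{$a} or $a cmp $b } keys %total;
    $n = @sorted if $n > @sorted;
    return map { [ $_, $total{$_} ] } @sorted[ 0 .. $n - 1 ];
}

# Usage: perl count.pl rate.csv
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    printf "%s: %s\n", @$_ for top_items($fh, 100);
    close $fh;
}
```

The nested %hash in the original script allocates one small hash per itemId, and %stat then adds a third entry per item. Keeping two flat hashes and overwriting the totals with the averages avoids both, at the cost of still holding one key per distinct item.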

Regards,
Manikandan

On Sun, 17 Apr, 2022, 15:03 wilson, <i...@bigcount.xyz> wrote:

> Hello experts,
>
> Can you help me check my script and suggest how to optimize it?
> Currently it runs out of memory.
>
> $ perl count.pl
> Out of memory!
> Killed
>
>
> My script:
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> # three-argument open with a lexical filehandle
> open my $fh, '<', "rate.csv" or die $!;
> while(<$fh>) {
>      my ($item,$rate) = (split /,/)[1,2];
>      $hash{$item}{total} += $rate;
>      $hash{$item}{count} +=1;
> }
> close $fh;
>
> for my $key (keys %hash) {
>      $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a}} keys %stat) {
>      print "$_: $stat{$_}\n";
>      last if $i == 99;
>      $i ++;
> }
>
> The purpose is to aggregate and average the itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million rows:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>                total        used        free      shared  buff/cache   available
> Mem:           1992         152          76           0        1763        1700
> Swap:          1023         802         221
>
>
>
> What confuses me is that Apache Spark can get this job done with the
> same limited memory; it finished the statistics within 2 minutes. But I
> want to give Perl a try, since it's not always convenient to run a
> Spark job.
>
> The Spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100|     5.0|
> |0001543849|     5.0|
> |0001061127|     5.0|
> |0001019880|     5.0|
> |0001062395|     5.0|
> |0000143502|     5.0|
> |000014357X|     5.0|
> |0001527665|     5.0|
> |000107461X|     5.0|
> |0000191639|     5.0|
> |0001127748|     5.0|
> |0000791156|     5.0|
> |0001203088|     5.0|
> |0001053744|     5.0|
> |0001360183|     5.0|
> |0001042335|     5.0|
> |0001374400|     5.0|
> |0001046810|     5.0|
> |0001380877|     5.0|
> |0001050230|     5.0|
> +----------+--------+
> only showing top 20 rows
>
>
> I think my Perl script can be optimized to handle this job as well,
> so I am asking for your help.
>
> Thanks in advance.
>
> wilson
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
>
>
>
