Jeff Peng wrote:
Hello,
Hello,
Can the code (especially the regex) below be optimized to run faster?

#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {
++$i is usually faster than $i+=1, but you are not using the $i variable, so you don't really need it (your Ruby programs don't have one).
for ( 1 .. 1000 ) {
open HD,"index.html" or die $!;
You are opening the same file one thousand times, so the operating system is probably caching the file in memory and using that cached copy for the last 999 reads instead of doing actual disk IO. Your Ruby program doesn't test open() for failure, so the two programs are not equivalent.
while(<HD>) {
print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;
There is not much, if anything, you can optimize about that regular expression, other than possibly eliminating any backtracking. Perhaps try the same regular expression that the second Ruby program uses?
}
close HD;
}
Instead of reading line by line you could just read the whole file:

local $/;
local $\ = "\n";
local @ARGV = ( 'index.html' ) x 1000;
while ( <> ) {
    print $1 while /href="http:\/\/(.*?)\/.*" target="_blank"/g;
}
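To make the difference between the two reading styles concrete, here is a minimal sketch that runs both over the same data and collects the captured hosts. The sample HTML, the in-memory filehandle, and the sub names hosts_by_line and hosts_by_slurp are assumptions made for illustration; they stand in for the real index.html, which is not reproduced here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for index.html.
my $html = <<'HTML';
<a href="http://www.example.com/a/b.html" target="_blank">one</a>
<p>no link here</p>
<a href="http://news.example.org/index.html" target="_blank">two</a>
HTML

# Same pattern as in the original program, written with qr{} to avoid escaping "/".
my $re = qr{href="http://(.*?)/.*" target="_blank"};

# Line by line: one match attempt per input line.
sub hosts_by_line {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @hosts;
    while ( my $line = <$fh> ) {
        push @hosts, $1 if $line =~ $re;
    }
    close $fh;
    return @hosts;
}

# Slurp: undefine $/ to read everything at once, then walk the matches with /g.
sub hosts_by_slurp {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $content = do { local $/; <$fh> };
    close $fh;
    my @hosts;
    push @hosts, $1 while $content =~ /$re/g;
    return @hosts;
}

print "$_\n" for hosts_by_slurp($html);
```

Because "." does not match a newline by default, the greedy .* still stops at the end of each line even when the whole file is in one string, so both subs should return the same hosts for input like this.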
The "index.html" was fetched with:

wget http://www.265.com/Kexue_Jishu/

I ask this because someone posted a question on the ruby-talk list showing that Perl's regex is much faster than Ruby's.

[Quote]
#!/usr/bin/ruby
1000.times do
  File.open("index.html").each do |c|
    puts $1 if /href="http:\/\/(.*?)\/.*" target="_blank"/ =~ c
  end
end

time ./test.rb >/tmp/t
elap 6.511 user 6.336 syst 0.136 CPU 99.40%

#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {
  open HD,"index.html" or die $!;
  while(<HD>) {
    print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;
  }
  close HD;
}

time ./test.pl >/tmp/t
elap 0.864 user 0.844 syst 0.020 CPU 100.04%

So perl is 7 or 8 times faster here.
[/Quote]

But someone else optimized the Ruby code and used Ruby's built-in scan method, which makes the regex run a lot faster.

[Quote]
I get best results in Ruby with:

regexp = %r{href="http://([^"/]*)/[^"]*"\s+target="_blank"}
1000.times do
  puts File.read('index.html').scan(regexp)
Does scan() only print out the contents of the capturing parentheses, or the whole line, or the whole pattern? In other words, is the output the same as the other Ruby program's? It's obvious that the regular expression is not the same.
end

~/ruby/bench time ruby19 regex.rb > /dev/null
real 0m1.428s
user 0m1.359s
sys 0m0.056s

~/ruby/bench time perl5.10.0 regex.pl > /dev/null
real 0m1.189s
user 0m1.095s
sys 0m0.084s

It's still slower. Perl has regular expression magic beyond my imagination, though. I heard they take the most "rare" character in the literal part of the regex (let's say, the colon) and search for it using machine code, and then work their way backwards to the beginning of the regexp... Say what you want, but Perl rocks when it comes to text processing speed.
[/Quote]

So I'm asking what's Perl's optimization for that regex. I hope this doesn't disturb everyone, thanks.
Most of the regular expression is literal text which cannot be optimized. As in the second Ruby program, try changing '\/(.*?)\/.*"' to '\/([^"\/]*)\/[^"]*"'.
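A rough way to try that change is sketched below, using the core Benchmark module. The sample line, the variable names, and the hosts sub are hypothetical and exist only for illustration; timings on the real index.html may look quite different.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line with two links, standing in for index.html.
my $line = '<a href="http://a.example.com/x.html" target="_blank">1</a> '
         . '<a href="http://b.example.org/y.html" target="_blank">2</a>';

# Original pattern: lazy host capture followed by a greedy .*
my $dot_star = qr{href="http://(.*?)/.*" target="_blank"};

# Suggested pattern: negated character classes cannot run past the closing quote.
my $negated  = qr{href="http://([^"/]*)/[^"]*" target="_blank"};

# Collect every captured host for a given pattern.
sub hosts {
    my ( $re, $text ) = @_;
    my @hosts;
    push @hosts, $1 while $text =~ /$re/g;
    return @hosts;
}

print "dot-star: @{[ hosts($dot_star, $line) ]}\n";
print "negated:  @{[ hosts($negated,  $line) ]}\n";

# Pass any argument to time both patterns for about one CPU second each.
if (@ARGV) {
    my $text = join "\n", ($line) x 1000;
    cmpthese( -1, {
        dot_star => sub { my @h = hosts( $dot_star, $text ) },
        negated  => sub { my @h = hosts( $negated,  $text ) },
    } );
}
```

Note that the two patterns are not strictly equivalent. When a line holds more than one link, the greedy .* runs to the last target="_blank" on that line, so the original pattern reports only the first host, while the negated-class version stops at each closing quote and reports every host. Compare the output as well as the timing before swapping one for the other.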
John
--
The programmer is fighting against the two most destructive forces in the universe: entropy and human stupidity. -- Damian Conway