Package: libdpkg-perl
Version: 1.16.9
Severity: minor

Hi,

Trying to use Dpkg::Control on a "Packages_amd64" file[1], I got a bit
annoyed with its performance.  I believe the problem is basically that
Dpkg::Control(::Hash) uses its tied hash when parsing and thereby
"wasting" a lot of time.
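
The tie overhead on its own can be seen without Dpkg at all. The sketch below is only an illustration: it uses the core Tie::StdHash class (a pass-through tie with no extra bookkeeping) rather than the tie class used inside Dpkg::Control::Hash, and the fill() helper and the field count are made up for the example.

```perl
use strict;
use warnings;

use Benchmark qw(timeit timestr);
use Tie::Hash;    # core module; it also defines Tie::StdHash

# Store $n fields in the given hash, roughly what a parser does when it
# saves one field per input line.  Returns the number of keys stored.
sub fill {
    my ($hash, $n) = @_;
    $hash->{"Field-$_"} = "value $_" for 1 .. $n;
    return scalar keys %{$hash};
}

my %plain;
tie my %tied, 'Tie::StdHash';    # every access now goes through a method call

my $n = 50_000;                  # arbitrary size, chosen for illustration
my $t_plain = timeit(1, sub { %plain = (); fill(\%plain, $n) });
my $t_tied  = timeit(1, sub { %tied  = (); fill(\%tied,  $n) });

print 'plain: ', timestr($t_plain), "\n";
print 'tied:  ', timestr($t_tied),  "\n";
```

Each store into the tied hash is dispatched to a STORE method, so even this do-nothing tie is expected to be measurably slower than the plain hash. A tie that also tracks insert order and normalises field names has more work to do per store.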

I have attached a small "Benchmark" script to demonstrate the problem.
In this script I have devised 3 different ways of using the
Dpkg::Control object to parse the file.

 dpkg-std - This is the method I have seen used in e.g. devscripts
 dpkg-break-tie - My "poor man's" method of avoiding the tie overhead.
 dpkg-reuse-obj - The "break-tie" with an additional reuse of the
                  Dpkg::Control object.  A "->clear()" method or an extra
                  object to ->parse() could be the API for this case.

My testing suggests that dpkg-std and dpkg-break-tie are about a factor
of 2 apart.  I realise that this is not a perfect test, as the tied hash
also keeps track of insert order (and "corrects" the case of field
names).

For reference, the Lintian variant of this parses the same file in
about 3 seconds.  I don't think it is the underlying way of parsing
that makes the difference.  And the Dpkg::Control object has features
not provided by the Lintian variant (which just gives you a hashref
with lowercase keys).
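
To make "a hashref with lowercase keys" concrete, here is a minimal sketch of that style of parser. It is not Lintian's code: read_paragraph() is a hypothetical helper written for this example, it handles only plain "Field: value" lines and continuation lines, and it does none of the insert-order or field-case handling that Dpkg::Control provides.

```perl
use strict;
use warnings;

# Read one paragraph from $fh into a plain hashref with lowercased keys.
# Returns undef once the input is exhausted.  (Hypothetical helper.)
sub read_paragraph {
    my ($fh) = @_;
    my (%para, $last);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*$/) {
            last if %para;    # a blank line ends the current paragraph
            next;             # skip blank lines before a paragraph
        }
        if ($line =~ /^([^:\s]+):\s*(.*)$/) {
            $last = lc $1;    # no tie: store straight into a plain hash
            $para{$last} = $2;
        } elsif ($line =~ /^\s/ and defined $last) {
            $para{$last} .= "\n$line";    # continuation of previous field
        } else {
            die "syntax error at line $.: $line\n";
        }
    }
    return %para ? \%para : undef;
}

# Small in-memory sample instead of a real Packages file.
my $data = "Package: foo\nVersion: 1.0\nDescription: short\n long line\n\n"
         . "Package: bar\nVersion: 2.0\n";
open my $fh, '<', \$data or die "open: $!";
while (my $para = read_paragraph($fh)) {
    print "$para->{'package'} $para->{'version'}\n";
}
close $fh;
```

The point of the sketch is only that every field is stored with an ordinary hash assignment, which is the cost profile the dpkg-break-tie variant approximates.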

~Niels

[1] I believe my particular copy is a merge of the main, contrib and
non-free Packages_amd64 files, but I guess the one from main should be
large enough to reproduce this.
#!/usr/bin/perl
#
#Benchmark: timing 5 iterations of dpkg-break-tie, dpkg-reuse-obj, dpkg-std...
#dpkg-break-tie: 29 wallclock secs (28.54 usr +  0.06 sys = 28.60 CPU) @  0.17/s (n=5)
#dpkg-reuse-obj: 21 wallclock secs (20.72 usr +  0.06 sys = 20.78 CPU) @  0.24/s (n=5)
#  dpkg-std: 61 wallclock secs (60.58 usr +  0.06 sys = 60.64 CPU) @  0.08/s (n=5)
#               s/iter       dpkg-std dpkg-break-tie dpkg-reuse-obj
#dpkg-std         12.1             --           -53%           -66%
#dpkg-break-tie   5.72           112%             --           -27%
#dpkg-reuse-obj   4.16           192%            38%             --
#
#

use strict;
use warnings;

use Benchmark qw(cmpthese timethese);
use Dpkg::Control;

my ($file, $count) = @ARGV;

my $bench = timethese($count//1, {
    'dpkg-std' => sub {
        # No dirty tricks
        open my $fd, '<', $file or die "open $file: $!";
        my $ctrl;
        while ( defined ($ctrl = Dpkg::Control->new (type => CTRL_INDEX_PKG)) and
                    ($ctrl->parse ($fd, $file))) {
        }
        close $fd;
    },
    'dpkg-break-tie' => sub {
        # Break the tied object, but reconstruct the object each time
        # Already this cuts the runtime in half.
        # - Ideally, Dpkg::Control(::Hash) would bypass the tie to avoid the
        #   tie overhead.
        open my $fd, '<', $file or die "open $file: $!";
        while (1) {
            my $ctrl = Dpkg::Control->new (type => CTRL_INDEX_PKG);
            $$ctrl->{'fields'} = {};
            $ctrl->parse ($fd, $file) or last;
        }
        close $fd;
    },
    'dpkg-reuse-obj' => sub {
        # Re-use the Dpkg::Control object on top of breaking the tied object.
        # (Simulate the situation from above with a cheap "clear" method).
        open my $fd, '<', $file or die "open $file: $!";
        my $ctrl = Dpkg::Control->new (type => CTRL_INDEX_PKG);
        $$ctrl->{'fields'} = {};
        while ( $ctrl->parse ($fd, $file) ) {
            $$ctrl->{'fields'} = {};
        }
        close $fd;
    },
  });

cmpthese $bench;

exit 0;
