On Fri, Jul 24, 2015 at 03:05:24PM +0200, Simon Reinhardt wrote:
> Am 21.07.2015 um 22:35 schrieb Jim Gibson:
> > That is an unique data structure and transformation
> > requirement (as most are), so you are not likely to find an
> > existing set of code to do exactly what you want. However,
> > the transformation should not be too difficult, so you should
> > try to code up something yourself and ask for help if you get
> > stuck or would like to optimize your solution.
> 
> Ok, maybe I should give more background information. The aim of
> this project is to convert the ouput of djvused's print-outline
> command to YAML. The output is like this:
> 
> $ djvused file.djvu -e print-outline
> (bookmarks
>  ("title 1"
>   "#1"
>   ("title 1.1"
>    "#2" )
>   ("title 1.2"
>    "#3"
>    ("title 1.2.1"
>     "#4" ) ) )
>  ("title 2"
>   "#5" ) )
> 
> the nesting of outline items - 3 levels in this example - can
> become arbitrarily large.

> Performance is no issue, so I still use hashrefs instead of
> arrays.
> 
> This is what if have done so far and it now seems to work. Any
> comments?

I'll give you my 2 cents for whatever that's worth. >:)

> (Any option parsing, error checks, binmodes omitted. The full
> code is at
> github.com/simon0x5b/bookmark-djvu/blob/yaml/bookmark-djvu-extract)
> 
> #!/usr/bin/env perl
> 
> 
> use 5.010;
> use warnings;
> use strict;
> use YAML::XS;
> 
> 
> # start extracting bookmarks with djvused in a pipe
> open my $djvused_fd, "-|", ("djvused", "file.djvu", "-u", "-e",
>                 "print-outline");

I assume that "fd" is an abbreviation for "file descriptor", but in
reality the variable represents a file _handle_.

You can use the qw operator to write the command list more easily.

Note that you should be checking for failure and handling it
appropriately. A lazy approach is to just use the autodie
pragma, which will automatically make a failed open die().

    my @cmd = qw(djvused file.djvu -u -e print-outline);

    open my $djvused_fh, '-|', @cmd or die "open pipe: $!";

I'm not familiar with djvused(1).

> my $outline_items = parse_djvused_output ($djvused_fd);
> die "no outline found" unless @$outline_items;
> 
> dump_to_yaml ($outline_items);
> 
> 
> 
> # arg: open file descriptor to djvused

Nitpick: Again, you mean "file handle" here.

> # returns: arrayref of hashrefs, each describing level, title and page of an
> # outline item.
> sub parse_djvused_output {
>     my $djvused_fd = shift;
>     my $level = 0;
>     my $line = <$djvused_fd>;
> 
>     if (!$line or $line !~ /^\(bookmarks$/) {
>         warn "no bookmarks found\n";
>         return 1;

You're returning a number, but the caller dereferences this as an
array ref. Perhaps you should return an empty array reference
instead. Or modify the caller, but this is easier.

>     }
>     my $outline_items = [];
>     while ($line = <$djvused_fd>) {
>         chomp $line;

chomp() is optional here, but doesn't hurt.

>         $line =~ /^\s*\("(?<title>.+)"$/
>             or die "line '$line' in djvused output does not match.";
> 
>  
>         my $title = djvused_unescape ($+{title});
>  
>         $line = <$djvused_fd>;
> 
>         $line =~ /\s*"#(?<page>[0-9]+)"(?<close>(\s*\))*)\s*$/
>             or die "line '$line' in djvused output does not match.";
> 
>         push @$outline_items, {level => $level, title => $title,
>                    page => $+{page}};
>         $level -= ($+{close} =~ tr/\)//) - 1;
>     }
> 
>     return $outline_items;
> }
> 
> 
> 
> sub dump_to_yaml {
>     my $input_array = shift;
>     (my $array, undef) = get_recursive_array ($input_array, 0);
>     print  Dump ($array);
> }
> 
> # recursively convert flat array into nested data structure
> # args: input array ref, index to start
> # returns: output array ref, following index
> sub get_recursive_array {
>     my $input_array = shift;
>     my $start = shift;

A generally tidy way to capture arguments is list assignment.
That also wouldn't modify @_. It's entirely up to you, but
personally I find it much easier to read.

    my ($input_array, $start) = @_;

Especially because it separates that boilerplate from other variables
within the subroutine.

Alternatively, some purists would recommend specifying the array
to shift: shift @_.

>     my $index = $start;
>     my $last = $#$input_array;

Many people would prefer to read that dereference in brackets:
$#{$input_array}. I honestly didn't even know that you could do
that without them (so I guess I learned something).

>     my $level = $input_array->[$index]{level};
>     my $result_array = [];
>     while (1) {
>  
>         last if $index > $last;

I believe that you can move this condition into the while:

    while($index <= $last) {

>         my $next_level = $input_array->[$index]{level};

Instead of continuously dereferencing nested structures I prefer
(when the logic allows) to store references in lexical variables
to cut down on line length.

    my $node = $input_array->[$index];
    my $next_level = $node->{level};

>         die "error in get_recursive_array" if $next_level > $level;
> 
>         last if $next_level < $level;
> 
>         my $outline_item = {title => $input_array->[$index]{title},
>                     page => $input_array->[$index]{page}};

With an alias like above you'd get:

    my $outline_item =
            {title => $node->{title}, page => $node->{page}};

Alternatively, you could take advantage of slices and get all of
the elements of the hash at once:

    my ($next_level, $title, $page) =
            @{$node}{qw/level title page/};

    ...

    my $outline_item = {title => $title, page => $page};

>         if ($index + 1 <= $last
>             && $input_array->[$index + 1]{level} > $level) {
>             ($outline_item->{kids}, $index) =
>                 get_recursive_array ($input_array, $index + 1);
>         }

More personal preferences: It's a good idea to indent
continuation lines (wrapped lines) differently than the nested
levels so that it's easier to tell them apart. I also prefer
operators on the end of the line because I find that it's easier
to read that way.

        if($index + 1 <= $last &&
                $input_array->[$index + 1]{level} > $level) {
            ($outline_item->{kids}, $index) =
                    get_recursive_array ($input_array, $index + 1);
        }

>         else {
>             ++$index;
>         }

You could also store index + 1 into a variable. This could become
(shortening the length of the condition and making the entire
thing easier to read):

    my $next_index = $index + 1;

    my $next_node = $next_index <= $last ?
            $input_array->{$next_index} :
            undef;

    if(defined $next_node && $next_node->{$level} > $level) {
        ($outline_item->{kids}, $index) =
                get_recursive_array($input_array, $next_index);
    } else {
        ++$index;
    }

It's a matter of personal preference, but I find it easier to
read this way.

>         push @$result_array, $outline_item;
>     }
> 
>     return ($result_array, $index);
> }

I can't help thinking that the entire recursive subroutine could
be simplified and tidied up a bit by changing the direction of
it, but it looks like it should work and it's concise enough that
it's probably not worth the trouble.

> sub djvused_unescape {
>     # see djvused(1): "DJVUSED FILE FORMATS - Strings"
>     # and perlrebackslash
>     my $string = shift;
> 
>     # octal escape sequences
>     $string =~ s/\\([0-7]{3})/pack 'U', oct($1)/ge;
> 
>     $string =~ s/\\(.)/
>     if ($1 eq "\\" or $1 eq "\"") {

Consider using single quotes here for at least the double quote.
Also be careful substituting || and or because they do have a
different meaning: it makes no difference here, but you may
surprise yourself if you don't keep it in mind so it's probably a
bad habit to get into using it where it doesn't belong.

    if ($1 eq '\\' || $1 eq '"') {

>         $1;
>     }
>     elsif ($1 eq "a") {
>         "\a";
>     }
>     elsif ($1 eq "b") {
>         "\b";
>     }
>     elsif ($1 eq "t") {
>         "\t";
>     }
>     elsif ($1 eq "n") {
>         "\n";
>     }
>     elsif ($1 eq "v") {
>         "\x{0b}";
>     }
>     elsif ($1 eq "f") {
>         "\f";
>     }
>     elsif ($1 eq "r") {
>         "\r";
>     }
>     else {
>         die "unknown escape sequence in '$string'";
> 
>     }
>     /ge;

You should probably indent the replacement "body" here so that
it stands out from the surrounding code. You could also lose some
vertical space by joining } and elsif ... lines. Again, it's a
style thing, but if you're already going half-way towards sharing
a line I don't see the harm in going all the way. ;)

    $string =~ s/\\(.)/
            if ($1 eq "\\" or $1 eq "\"") {
                $1;
            } elsif ($1 eq "a") {
                "\a";
            } elsif ($1 eq "b") {
                "\b";
            } elsif ($1 eq "t") {
                "\t";
            } elsif ($1 eq "n") {
                "\n";
            } elsif ($1 eq "v") {
                "\x{0b}";
            } elsif ($1 eq "f") {
                "\f";
            } elsif ($1 eq "r") {
                "\r";
            } else {
                die "unknown escape sequence in '$string'";
            }
            /ge;

Additionally, consider putting it into a separate subroutine:

    $string =~ s/\\(.)/expand_escape_sequence($string)/ge;

    sub expand_escape_sequence {
        my ($string) = @_;

        if ($1 eq "\\" || $1 eq "\"") {
            return $1;
        } elsif ($1 eq "a") {
            return "\a";
        } elsif ($1 eq "b") {
            return "\b";
        } elsif ($1 eq "t") {
            return "\t";
        } elsif ($1 eq "n") {
            return "\n";
        } elsif ($1 eq "v") {
            return "\x{0b}";
        } elsif ($1 eq "f") {
            return "\f";
        } elsif ($1 eq "r") {
            return "\r";
        } else {
            die "Unknown escape sequence in '$string'";
        }
    }

>     return $string
> }

All of my suggestions are untested so copy-pasta chefs beware.
 >:)

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'

Attachment: signature.asc
Description: Digital signature

Reply via email to