On Fri, Jul 24, 2015 at 03:05:24PM +0200, Simon Reinhardt wrote:
> Am 21.07.2015 um 22:35 schrieb Jim Gibson:
> > That is an unique data structure and transformation
> > requirement (as most are), so you are not likely to find an
> > existing set of code to do exactly what you want. However,
> > the transformation should not be too difficult, so you should
> > try to code up something yourself and ask for help if you get
> > stuck or would like to optimize your solution.
>
> Ok, maybe I should give more background information. The aim of
> this project is to convert the ouput of djvused's print-outline
> command to YAML. The output is like this:
>
> $ djvused file.djvu -e print-outline
> (bookmarks
> ("title 1"
> "#1"
> ("title 1.1"
> "#2" )
> ("title 1.2"
> "#3"
> ("title 1.2.1"
> "#4" ) ) )
> ("title 2"
> "#5" ) )
>
> the nesting of outline items - 3 levels in this example - can
> become arbitrarily large.
> Performance is no issue, so I still use hashrefs instead of
> arrays.
>
> This is what if have done so far and it now seems to work. Any
> comments?
I'll give you my 2 cents for whatever that's worth. >:)
> (Any option parsing, error checks, binmodes omitted. The full
> code is at
> github.com/simon0x5b/bookmark-djvu/blob/yaml/bookmark-djvu-extract)
>
> #!/usr/bin/env perl
>
>
> use 5.010;
> use warnings;
> use strict;
> use YAML::XS;
>
>
> # start extracting bookmarks with djvused in a pipe
> open my $djvused_fd, "-|", ("djvused", "file.djvu", "-u", "-e",
> "print-outline");
I assume that "fd" is an abbreviation for "file descriptor", but in
reality the variable represents a file _handle_.
You can use the qw operator to write the command list more easily.
Note that you should be checking for failure and handling it
appropriately. A lazy approach is to just use the autodie
pragma, which will automatically make a failed open die().
my @cmd = qw(djvused file.djvu -u -e print-outline);
open my $djvused_fh, '-|', @cmd or die "open pipe: $!";
I'm not familiar with djvused(1).
> my $outline_items = parse_djvused_output ($djvused_fd);
> die "no outline found" unless @$outline_items;
>
> dump_to_yaml ($outline_items);
>
>
>
> # arg: open file descriptor to djvused
Nitpick: Again, you mean "file handle" here.
> # returns: arrayref of hashrefs, each describing level, title and page of an
> # outline item.
> sub parse_djvused_output {
> my $djvused_fd = shift;
> my $level = 0;
> my $line = <$djvused_fd>;
>
> if (!$line or $line !~ /^\(bookmarks$/) {
> warn "no bookmarks found\n";
> return 1;
You're returning a number, but the caller dereferences this as an
array ref. Perhaps you should return an empty array reference
instead. Or modify the caller, but this is easier.
> }
> my $outline_items = [];
> while ($line = <$djvused_fd>) {
> chomp $line;
chomp() is optional here, but doesn't hurt.
> $line =~ /^\s*\("(?<title>.+)"$/
> or die "line '$line' in djvused output does not match.";
>
>
> my $title = djvused_unescape ($+{title});
>
> $line = <$djvused_fd>;
>
> $line =~ /\s*"#(?<page>[0-9]+)"(?<close>(\s*\))*)\s*$/
> or die "line '$line' in djvused output does not match.";
>
> push @$outline_items, {level => $level, title => $title,
> page => $+{page}};
> $level -= ($+{close} =~ tr/\)//) - 1;
> }
>
> return $outline_items;
> }
>
>
>
> sub dump_to_yaml {
> my $input_array = shift;
> (my $array, undef) = get_recursive_array ($input_array, 0);
> print Dump ($array);
> }
>
> # recursively convert flat array into nested data structure
> # args: input array ref, index to start
> # returns: output array ref, following index
> sub get_recursive_array {
> my $input_array = shift;
> my $start = shift;
A generally tidy way to capture arguments is list assignment.
That also wouldn't modify @_. It's entirely up to you, but
personally I find it much easier to read.
my ($input_array, $start) = @_;
Especially because it separates that boilerplate from other variables
within the subroutine.
Alternatively, some purists would recommend specifying the array
to shift: shift @_.
> my $index = $start;
> my $last = $#$input_array;
Many people would prefer to read that dereference in brackets:
$#{$input_array}. I honestly didn't even know that you could do
that without them (so I guess I learned something).
> my $level = $input_array->[$index]{level};
> my $result_array = [];
> while (1) {
>
> last if $index > $last;
I believe that you can move this condition into the while:
while($index <= $last) {
> my $next_level = $input_array->[$index]{level};
Instead of continuously dereferencing nested structures I prefer
(when the logic allows) to store references in lexical variables
to cut down on line length.
my $node = $input_array->[$index];
my $next_level = $node->{level};
> die "error in get_recursive_array" if $next_level > $level;
>
> last if $next_level < $level;
>
> my $outline_item = {title => $input_array->[$index]{title},
> page => $input_array->[$index]{page}};
With an alias like above you'd get:
my $outline_item =
{title => $node->{title}, page => $node->{page}};
Alternatively, you could take advantage of slices and get all of
the elements of the hash at once:
my ($next_level, $title, $page) =
@{$node}{qw/level title page/};
...
my $outline_item = {title => $title, page => $page};
> if ($index + 1 <= $last
> && $input_array->[$index + 1]{level} > $level) {
> ($outline_item->{kids}, $index) =
> get_recursive_array ($input_array, $index + 1);
> }
More personal preferences: It's a good idea to indent
continuation lines (wrapped lines) differently than the nested
levels so that it's easier to tell them apart. I also prefer
operators on the end of the line because I find that it's easier
to read that way.
if($index + 1 <= $last &&
$input_array->[$index + 1]{level} > $level) {
($outline_item->{kids}, $index) =
get_recursive_array ($input_array, $index + 1);
}
> else {
> ++$index;
> }
You could also store index + 1 into a variable. This could become
(shortening the length of the condition and making the entire
thing easier to read):
my $next_index = $index + 1;
my $next_node = $next_index <= $last ?
$input_array->{$next_index} :
undef;
if(defined $next_node && $next_node->{$level} > $level) {
($outline_item->{kids}, $index) =
get_recursive_array($input_array, $next_index);
} else {
++$index;
}
It's a matter of personal preference, but I find it easier to
read this way.
> push @$result_array, $outline_item;
> }
>
> return ($result_array, $index);
> }
I can't help thinking that the entire recursive subroutine could
be simplified and tidied up a bit by changing the direction of
it, but it looks like it should work and it's concise enough that
it's probably not worth the trouble.
> sub djvused_unescape {
> # see djvused(1): "DJVUSED FILE FORMATS - Strings"
> # and perlrebackslash
> my $string = shift;
>
> # octal escape sequences
> $string =~ s/\\([0-7]{3})/pack 'U', oct($1)/ge;
>
> $string =~ s/\\(.)/
> if ($1 eq "\\" or $1 eq "\"") {
Consider using single quotes here for at least the double quote.
Also be careful substituting || and or because they do have a
different meaning: it makes no difference here, but you may
surprise yourself if you don't keep it in mind so it's probably a
bad habit to get into using it where it doesn't belong.
if ($1 eq '\\' || $1 eq '"') {
> $1;
> }
> elsif ($1 eq "a") {
> "\a";
> }
> elsif ($1 eq "b") {
> "\b";
> }
> elsif ($1 eq "t") {
> "\t";
> }
> elsif ($1 eq "n") {
> "\n";
> }
> elsif ($1 eq "v") {
> "\x{0b}";
> }
> elsif ($1 eq "f") {
> "\f";
> }
> elsif ($1 eq "r") {
> "\r";
> }
> else {
> die "unknown escape sequence in '$string'";
>
> }
> /ge;
You should probably indent the replacement "body" here so that
it stands out from the surrounding code. You could also lose some
vertical space by joining } and elsif ... lines. Again, it's a
style thing, but if you're already going half-way towards sharing
a line I don't see the harm in going all the way. ;)
$string =~ s/\\(.)/
if ($1 eq "\\" or $1 eq "\"") {
$1;
} elsif ($1 eq "a") {
"\a";
} elsif ($1 eq "b") {
"\b";
} elsif ($1 eq "t") {
"\t";
} elsif ($1 eq "n") {
"\n";
} elsif ($1 eq "v") {
"\x{0b}";
} elsif ($1 eq "f") {
"\f";
} elsif ($1 eq "r") {
"\r";
} else {
die "unknown escape sequence in '$string'";
}
/ge;
Additionally, consider putting it into a separate subroutine:
$string =~ s/\\(.)/expand_escape_sequence($string)/ge;
sub expand_escape_sequence {
my ($string) = @_;
if ($1 eq "\\" || $1 eq "\"") {
return $1;
} elsif ($1 eq "a") {
return "\a";
} elsif ($1 eq "b") {
return "\b";
} elsif ($1 eq "t") {
return "\t";
} elsif ($1 eq "n") {
return "\n";
} elsif ($1 eq "v") {
return "\x{0b}";
} elsif ($1 eq "f") {
return "\f";
} elsif ($1 eq "r") {
return "\r";
} else {
die "Unknown escape sequence in '$string'";
}
}
> return $string
> }
All of my suggestions are untested so copy-pasta chefs beware.
>:)
Regards,
--
Brandon McCaig <[email protected]> <[email protected]>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
signature.asc
Description: Digital signature
