On Fri, Jul 24, 2015 at 03:05:24PM +0200, Simon Reinhardt wrote: > Am 21.07.2015 um 22:35 schrieb Jim Gibson: > > That is an unique data structure and transformation > > requirement (as most are), so you are not likely to find an > > existing set of code to do exactly what you want. However, > > the transformation should not be too difficult, so you should > > try to code up something yourself and ask for help if you get > > stuck or would like to optimize your solution. > > Ok, maybe I should give more background information. The aim of > this project is to convert the ouput of djvused's print-outline > command to YAML. The output is like this: > > $ djvused file.djvu -e print-outline > (bookmarks > ("title 1" > "#1" > ("title 1.1" > "#2" ) > ("title 1.2" > "#3" > ("title 1.2.1" > "#4" ) ) ) > ("title 2" > "#5" ) ) > > the nesting of outline items - 3 levels in this example - can > become arbitrarily large.
> Performance is no issue, so I still use hashrefs instead of > arrays. > > This is what if have done so far and it now seems to work. Any > comments? I'll give you my 2 cents for whatever that's worth. >:) > (Any option parsing, error checks, binmodes omitted. The full > code is at > github.com/simon0x5b/bookmark-djvu/blob/yaml/bookmark-djvu-extract) > > #!/usr/bin/env perl > > > use 5.010; > use warnings; > use strict; > use YAML::XS; > > > # start extracting bookmarks with djvused in a pipe > open my $djvused_fd, "-|", ("djvused", "file.djvu", "-u", "-e", > "print-outline"); I assume that "fd" is an abbreviation for "file descriptor", but in reality the variable represents a file _handle_. You can use the qw operator to write the command list more easily. Note that you should be checking for failure and handling it appropriately. A lazy approach is to just use the autodie pragma, which will automatically make a failed open die(). my @cmd = qw(djvused file.djvu -u -e print-outline); open my $djvused_fh, '-|', @cmd or die "open pipe: $!"; I'm not familiar with djvused(1). > my $outline_items = parse_djvused_output ($djvused_fd); > die "no outline found" unless @$outline_items; > > dump_to_yaml ($outline_items); > > > > # arg: open file descriptor to djvused Nitpick: Again, you mean "file handle" here. > # returns: arrayref of hashrefs, each describing level, title and page of an > # outline item. > sub parse_djvused_output { > my $djvused_fd = shift; > my $level = 0; > my $line = <$djvused_fd>; > > if (!$line or $line !~ /^\(bookmarks$/) { > warn "no bookmarks found\n"; > return 1; You're returning a number, but the caller dereferences this as an array ref. Perhaps you should return an empty array reference instead. Or modify the caller, but this is easier. > } > my $outline_items = []; > while ($line = <$djvused_fd>) { > chomp $line; chomp() is optional here, but doesn't hurt. > $line =~ /^\s*\("(?<title>.+)"$/ > or die "line '$line' in djvused output does not match."; > > > my $title = djvused_unescape ($+{title}); > > $line = <$djvused_fd>; > > $line =~ /\s*"#(?<page>[0-9]+)"(?<close>(\s*\))*)\s*$/ > or die "line '$line' in djvused output does not match."; > > push @$outline_items, {level => $level, title => $title, > page => $+{page}}; > $level -= ($+{close} =~ tr/\)//) - 1; > } > > return $outline_items; > } > > > > sub dump_to_yaml { > my $input_array = shift; > (my $array, undef) = get_recursive_array ($input_array, 0); > print Dump ($array); > } > > # recursively convert flat array into nested data structure > # args: input array ref, index to start > # returns: output array ref, following index > sub get_recursive_array { > my $input_array = shift; > my $start = shift; A generally tidy way to capture arguments is list assignment. That also wouldn't modify @_. It's entirely up to you, but personally I find it much easier to read. my ($input_array, $start) = @_; Especially because it separates that boilerplate from other variables within the subroutine. Alternatively, some purists would recommend specifying the array to shift: shift @_. > my $index = $start; > my $last = $#$input_array; Many people would prefer to read that dereference in brackets: $#{$input_array}. I honestly didn't even know that you could do that without them (so I guess I learned something). > my $level = $input_array->[$index]{level}; > my $result_array = []; > while (1) { > > last if $index > $last; I believe that you can move this condition into the while: while($index <= $last) { > my $next_level = $input_array->[$index]{level}; Instead of continuously dereferencing nested structures I prefer (when the logic allows) to store references in lexical variables to cut down on line length. my $node = $input_array->[$index]; my $next_level = $node->{level}; > die "error in get_recursive_array" if $next_level > $level; > > last if $next_level < $level; > > my $outline_item = {title => $input_array->[$index]{title}, > page => $input_array->[$index]{page}}; With an alias like above you'd get: my $outline_item = {title => $node->{title}, page => $node->{page}}; Alternatively, you could take advantage of slices and get all of the elements of the hash at once: my ($next_level, $title, $page) = @{$node}{qw/level title page/}; ... my $outline_item = {title => $title, page => $page}; > if ($index + 1 <= $last > && $input_array->[$index + 1]{level} > $level) { > ($outline_item->{kids}, $index) = > get_recursive_array ($input_array, $index + 1); > } More personal preferences: It's a good idea to indent continuation lines (wrapped lines) differently than the nested levels so that it's easier to tell them apart. I also prefer operators on the end of the line because I find that it's easier to read that way. if($index + 1 <= $last && $input_array->[$index + 1]{level} > $level) { ($outline_item->{kids}, $index) = get_recursive_array ($input_array, $index + 1); } > else { > ++$index; > } You could also store index + 1 into a variable. This could become (shortening the length of the condition and making the entire thing easier to read): my $next_index = $index + 1; my $next_node = $next_index <= $last ? $input_array->{$next_index} : undef; if(defined $next_node && $next_node->{$level} > $level) { ($outline_item->{kids}, $index) = get_recursive_array($input_array, $next_index); } else { ++$index; } It's a matter of personal preference, but I find it easier to read this way. > push @$result_array, $outline_item; > } > > return ($result_array, $index); > } I can't help thinking that the entire recursive subroutine could be simplified and tidied up a bit by changing the direction of it, but it looks like it should work and it's concise enough that it's probably not worth the trouble. > sub djvused_unescape { > # see djvused(1): "DJVUSED FILE FORMATS - Strings" > # and perlrebackslash > my $string = shift; > > # octal escape sequences > $string =~ s/\\([0-7]{3})/pack 'U', oct($1)/ge; > > $string =~ s/\\(.)/ > if ($1 eq "\\" or $1 eq "\"") { Consider using single quotes here for at least the double quote. Also be careful substituting || and or because they do have a different meaning: it makes no difference here, but you may surprise yourself if you don't keep it in mind so it's probably a bad habit to get into using it where it doesn't belong. if ($1 eq '\\' || $1 eq '"') { > $1; > } > elsif ($1 eq "a") { > "\a"; > } > elsif ($1 eq "b") { > "\b"; > } > elsif ($1 eq "t") { > "\t"; > } > elsif ($1 eq "n") { > "\n"; > } > elsif ($1 eq "v") { > "\x{0b}"; > } > elsif ($1 eq "f") { > "\f"; > } > elsif ($1 eq "r") { > "\r"; > } > else { > die "unknown escape sequence in '$string'"; > > } > /ge; You should probably indent the replacement "body" here so that it stands out from the surrounding code. You could also lose some vertical space by joining } and elsif ... lines. Again, it's a style thing, but if you're already going half-way towards sharing a line I don't see the harm in going all the way. ;) $string =~ s/\\(.)/ if ($1 eq "\\" or $1 eq "\"") { $1; } elsif ($1 eq "a") { "\a"; } elsif ($1 eq "b") { "\b"; } elsif ($1 eq "t") { "\t"; } elsif ($1 eq "n") { "\n"; } elsif ($1 eq "v") { "\x{0b}"; } elsif ($1 eq "f") { "\f"; } elsif ($1 eq "r") { "\r"; } else { die "unknown escape sequence in '$string'"; } /ge; Additionally, consider putting it into a separate subroutine: $string =~ s/\\(.)/expand_escape_sequence($string)/ge; sub expand_escape_sequence { my ($string) = @_; if ($1 eq "\\" || $1 eq "\"") { return $1; } elsif ($1 eq "a") { return "\a"; } elsif ($1 eq "b") { return "\b"; } elsif ($1 eq "t") { return "\t"; } elsif ($1 eq "n") { return "\n"; } elsif ($1 eq "v") { return "\x{0b}"; } elsif ($1 eq "f") { return "\f"; } elsif ($1 eq "r") { return "\r"; } else { die "Unknown escape sequence in '$string'"; } } > return $string > } All of my suggestions are untested so copy-pasta chefs beware. >:) Regards, -- Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org> Castopulence Software <https://www.castopulence.org/> Blog <http://www.bambams.ca/> perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }. q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.}; tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
signature.asc
Description: Digital signature