Hello:

tl;dr: A program intended to split a large MS SQL script into
smaller ones appears to be losing data. See below for the script.
My guess is that I'm making a silly mistake with newlines or file
encoding or something...

My colleague is working with a relatively large generated SQL
script that Visual Studio doesn't seem to be able to handle. The
script is approximately 370 MB[1] in UTF-16 Little Endian (the
default encoding of whatever generated the script; likely SQL
Server Management Studio). Converted to UTF-8 (using Vim) it's
unsurprisingly about half of that[2].
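In Vim that conversion amounts to something like ':set
fileencoding=utf-8' followed by ':w'. Perl's own piconv (shipped
with the core Encode distribution) should do the same from a
shell, e.g.:

piconv -f UTF-16LE -t UTF-8 data/large.sql > data/large-utf8.sql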

There are various ways to solve this. My colleagues decided to
split the file up manually into manageable pieces, but that
seemed tedious to me, so I opted to hack up a Perl program to do
it for me. Initially (perhaps with insufficient thought) I
decided to split it up into "batches". I don't know all of the
details, but as far as I can tell the batch separator, by default
'GO', isn't really T-SQL at all: it's recognized client-side by
tools like SQL Server Management Studio and sqlcmd, and some
statements have to be the only thing in their batch. I figured
this was the most reliable boundary to split on so the resulting
scripts would still have a hope of executing without errors.
Anyway, within 30 minutes or so I had written a Perl program to
split on this boundary (there are probably easier ways, in
hindsight, though from Windows those ways might not be as easy as
on Unix-like platforms).

That resulted in some 12000+ files, which is obviously not
practical for a human to execute by hand. It might suffice to
execute the batches one by one automatically (a rough sketch of
what I have in mind follows), but I have practically no
experience talking to a DBMS from Perl (or with that much SQL at
a time) and I wouldn't be able to convince my colleagues to let
me try. It could also be dangerous if I got something wrong, so I
don't know that I'd really want to rush such a solution anyway,
but I digress.
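
Untested, and the connection string and credentials are
placeholders (real code would want transactions and much more
care), but roughly:

#!/usr/bin/perl

use strict;
use warnings;

use DBI;

# Placeholder connection details; assumes DBD::ODBC is installed.
my $dbh = DBI->connect(
        'dbi:ODBC:Driver={SQL Server};Server=myserver;Database=mydb',
        'user', 'password',
        { RaiseError => 1, AutoCommit => 1 }) or die $DBI::errstr;

# Send each file's batches one at a time, in order, since 'GO' is
# not itself T-SQL and can't be sent to the server.
for my $file (sort glob 'data/out_*.sql')
{
    open my $fh, '<:encoding(utf-8)', $file or die "$file: $!";
    my $sql = do { local $/; <$fh> };
    close $fh or warn "$file: $!";

    $dbh->do($_) for grep { /\S/ } split /^GO\r?$/m, $sql;
}

$dbh->disconnect;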

To cut the number of files down, I added a size parameter to the
program and modified it to bundle "batches" into files up to that
maximum size each. The program appears to work at first glance,
producing approximately the right number of files of
approximately the right sizes, each ending on a batch boundary
(by default, a line containing only 'GO'). I felt triumphant, but
realized that I should verify that the split scripts are exactly
equivalent to the single, extra large script. I decided to
calculate SHA-1 and MD5 hashes for the original SQL script and
for a concatenation of the split SQL files. AFAIK, those hashes
should match if my Perl program and the concatenation were done
properly. Unfortunately, the SHA-1 does not match.

I used MSYS cat to do the concatenation, and MSYS sha1sum to
calculate the hashes:

C:\Users\bamccaig\src\projectX>cat data/large.sql | sha1sum
[sha1sum of large.sql spit out]
C:\Users\bamccaig\src\projectX>cat data/out*.sql | sha1sum
[sha1sum of the concatenation of data/out_0000.sql - 
data/out_0016.sql]

These hashes don't match, and now I'm stuck either figuring out
why or abandoning this program and leaving my colleague to do it
by hand. Even if we don't use this program to solve this problem,
I would like to find my mistake(s) and correct them.
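
One thing I still mean to try is hashing in pure Perl instead, to
rule out MSYS cat or sha1sum doing something surprising with the
bytes. The core Digest::SHA module should make that easy
(untested sketch; the files have to be passed in the right
order):

#!/usr/bin/perl

use strict;
use warnings;

use Digest::SHA;

# Hash the named files as one continuous byte stream, the same
# way `cat files... | sha1sum` would.
my $sha = Digest::SHA->new(1);      # 1 => SHA-1
$sha->addfile($_, 'b') for @ARGV;   # 'b' => raw bytes, no newline munging
print $sha->hexdigest, "\n";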

The splitting program itself follows (also attached in case my
wrapping for mail breaks something):

#!/usr/bin/perl

use 5.010;
use strict;
use utf8;
use warnings;

use Data::Dumper;
use IO::Handle;
use Getopt::Long qw/GetOptionsFromArray HelpMessage
        :config auto_help bundling no_auto_abbrev
        no_getopt_compat no_ignore_case_always no_require_order
        permute/;

my $batch_keyword = 'GO';
my $output = 'out.sql';
my $max_size = '10M';

GetOptions(
        'keyword|K=s' => \$batch_keyword,
        'output|O=s' => \$output,
        'max-size|S=s' => \$max_size,
        );

$max_size = parse_stream_size($max_size);

my ($sql_file) = @ARGV;

open my $fh, '< :encoding(utf-8)', $sql_file or die "input: $!";

my $i = 0;

my $buf = '';
my $next_batch = '';

# Accumulate complete batches in $buf; when appending the next
# batch would push $buf past the maximum size, flush $buf to a
# new output file first.
while(my $line = <$fh>)
{
    $next_batch .= $line;
    chomp $line;
    if($line =~ /^\Q$batch_keyword\E$/)
    {
        if(length($buf) + length($next_batch) > $max_size)
        {
            open my $out_fh, '> :encoding(utf-8)',
                    next_file($output, $i++) or die "output: $!";
            $out_fh->print($buf);
            close $out_fh or warn "output: $!";
            $buf = '';
        }
        $buf .= $next_batch;
        $next_batch = '';
    }
}

close $fh or warn "input: $!";

# Derive the i-th output file name from the requested output
# name, e.g. out.sql -> out_0000.sql, out_0001.sql, ...
sub next_file
{
    my ($orig, $i) = @_;
    my $format;

    if($orig =~ /\.sql$/)
    {
        ($format = $orig) =~ s/\.sql$/_%04d.sql/;
    }
    else
    {
        $format = $orig . '_%04d';
    }

    return sprintf $format, $i;
}

# Parse a human-readable size such as '10M' or '512K' into a
# byte count (e.g., '10M' -> 10 * 1024**2 = 10485760).
sub parse_stream_size
{
    my ($size) = @_;
    my ($num, $mult) =
            $size =~ /((?:[0-9]+)?\.?(?:[0-9]+)?)\s*([BKMG])?/i;

    die "invalid stream size '$size'"
            unless defined $num && $num > 0;

    my %exp = ('B' => 0, 'K' => 1, 'M' => 2, 'G' => 3);
    my $exp = $exp{uc($mult // 'B')};

    $size = $num * 1024 ** $exp;

    return $size;
}

__END__
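
For completeness, the invocation looks something like this (the
script name and the size are made up for the example):

C:\Users\bamccaig\src\projectX>perl split_sql.pl --max-size 10M --output data/out.sql data/large.sql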

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'


[1] 369867745 bytes.
[2] 184955287 bytes (actually about 21415 bytes more than half;
presumably some characters need three bytes in UTF-8 where UTF-16
uses two).
