Hello,

tl;dr: A program intended to split a large MS SQL script into smaller ones appears to be losing data. The script is below. My guess is that I'm making a silly mistake with newlines or file encoding.
My colleague is working with a relatively large generated SQL script that Visual Studio doesn't seem to be able to handle. The script is approximately 370 MB[1] in UTF-16 Little Endian (the default encoding of whatever generated it; likely SQL Server Management Studio). Converted to UTF-8 (using Vim), it's unsurprisingly about half that size[2].

There are various ways to solve this. My colleagues decided to split the file up into manageable pieces by hand, but that seemed tedious to me, so I opted to hack up a Perl program to do it instead. Initially (perhaps with insufficient thought) I decided to split it into "batches". I don't really know the details, but MS SQL scripts are divided into batches by a separator keyword, 'GO' by default. I don't really know what it means or why it's used, but I know that it is sometimes necessary, so I figured it was the most reliable boundary to split on if the resulting scripts were to have any hope of executing without errors.

Within 30 minutes or so I had written a Perl program to split on this boundary (there are probably easier ways, in hindsight, albeit from Windows those ways might not be as easy as on Unix-like platforms). That produced some 12000+ files, which was obviously not practical for a human to execute manually. It might suffice to execute the batches automatically one by one, but I have practically no experience communicating with DBMSes from Perl (or with that much SQL at a time), and I wouldn't be able to convince my colleagues to let me try. It could also be dangerous if I got something wrong, so I don't know that I'd want to rush such a solution anyway; but I digress.

To cut down the number of files, I added a size parameter to the program and modified it to bundle batches into files up to that maximum size each. The program appears to work at first glance: it produces approximately the right number of files, of approximately the right sizes, each terminating on a batch boundary (by default, a line containing only 'GO'). I felt triumphant, but realized I should validate that the split scripts are exactly equivalent to the single, extra-large script. I decided to compare SHA-1 and MD5 hashes of the original SQL script and of a concatenation of the split SQL files; AFAIK, those hashes should match if my Perl program and the concatenation were done properly. Unfortunately, the SHA-1 hashes do not match. I used MSYS cat to do the concatenation and MSYS sha1sum to calculate the hashes:

C:\Users\bamccaig\src\projectX>cat data/large.sql | sha1sum
[sha1sum of large.sql spit out]

C:\Users\bamccaig\src\projectX>cat data/out*.sql | sha1sum
[sha1sum of the concatenation of data/out_0000.sql - data/out_0016.sql]

The hashes don't match, and now I'm stuck either figuring out why or abandoning this program and leaving my colleague to do it by hand. Even if we don't use this program to solve this problem, I would like to find my mistake(s) and correct them.
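In case the MSYS tools themselves are a factor (cat's glob ordering, text-mode translation, and so on), here is an untested sketch that should reproduce the comparison in pure Perl with the core Digest::SHA module, reading everything as raw bytes (paths are the same as above):

use 5.010;
use strict;
use warnings;
use Digest::SHA;

my @split_files = sort glob 'data/out*.sql';

# SHA-1 of the original file, read in binary mode so no I/O layer
# can alter the bytes.
my $orig = Digest::SHA->new(1);
$orig->addfile('data/large.sql', 'b');

# SHA-1 of the byte-wise concatenation of the split files, in
# sorted order.
my $concat = Digest::SHA->new(1);
$concat->addfile($_, 'b') for @split_files;

say 'original: ', $orig->hexdigest;
say 'split:    ', $concat->hexdigest;

# Comparing byte counts first narrows down the failure mode: if the
# totals differ, bytes were dropped or added somewhere, not merely
# reordered or re-encoded.
my $split_bytes = 0;
$split_bytes += -s $_ for @split_files;
say 'original bytes: ', -s 'data/large.sql';
say 'split bytes:    ', $split_bytes;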
The program follows (also attached in case my wrapping for mail breaks something):

#!/usr/bin/perl

use 5.010;
use strict;
use utf8;
use warnings;

use Data::Dumper;
use IO::Handle;
use Getopt::Long qw/GetOptionsFromArray HelpMessage
        :config auto_help bundling no_auto_abbrev no_getopt_compat
        no_ignore_case_always no_require_order permute/;

my $batch_keyword = 'GO';
my $output = 'out.sql';
my $max_size = '10M';

GetOptions(
    'keyword|K=s' => \$batch_keyword,
    'output|O=s' => \$output,
    'max-size|S=s' => \$max_size,
);

$max_size = parse_stream_size($max_size);

my ($sql_file) = @ARGV;

open my $fh, '< :encoding(utf-8)', $sql_file or die "input: $!";

my $i = 0;
my $buf = '';
my $next_batch = '';

while(my $line = <$fh>) {
    $next_batch .= $line;

    chomp $line;

    if($line =~ /^\Q$batch_keyword\E$/) {
        if(length($buf) + length($next_batch) > $max_size) {
            open my $out_fh, '> :encoding(utf-8)', next_file($output, $i++)
                    or die "output: $!";

            $out_fh->print($buf);

            close $out_fh or warn "output: $!";

            $buf = '';
        }

        $buf .= $next_batch;
        $next_batch = '';
    }
}

close $fh or warn "input: $!";

sub next_file {
    my ($orig, $i) = @_;

    my $format;

    if($orig =~ /\.sql$/) {
        ($format = $orig) =~ s/\.sql$/_%04d.sql/;
    } else {
        $format = $orig . '_%04d';
    }

    return sprintf $format, $i;
}

sub parse_stream_size {
    my ($size) = @_;

    my ($num, $mult) =
            $size =~ /((?:[0-9]+)?\.?(?:[0-9]+)?)\s*([BKMG])?/i;

    die "invalid stream size '$size'"
            unless defined $num && $num > 0;

    my %exp = ('B' => 0, 'K' => 1, 'M' => 2, 'G' => 3);

    my $exp = $exp{uc($mult // 'B')};

    $size = $num * 1024 ** $exp;

    return $size;
}

__END__

Regards,

-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'

[1] 369867745 bytes.
[2] 184955287 bytes (it's actually about 42840 bytes less than half; I don't know why..).
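P.S. For completeness, the invocation looks something like the following (the script name is hypothetical here; substitute whatever it was saved as):

perl split_sql.pl -O data/out.sql data/large.sql

With the default 10M max size and 'GO' keyword, that is roughly what produced data/out_0000.sql through data/out_0016.sql above.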