On Sat, Aug 13, 2016 at 09:04:32AM +0000, Eric Wong wrote:

> > Is there an easy way to get _just_ the list of message-ids you are
> > storing (I know I can download the whole archive, but it's big)?
> 
> XHDR (or HDR) over NNTP should do it (that's how I checked
> against gmane):
> --------8<-----
> use Net::NNTP;
> my $nntp = Net::NNTP->new($ENV{NNTPSERVER} || 'news.public-inbox.org');
> my ($num, $first, $last) = $nntp->group('inbox.comp.version-control.git');
> my $batch = 10000;
> my $i;
> for ($i = $first; $i < $last; $i += $batch) {
>       my $j = $i + $batch - 1;
>       $j = $last if $j > $last;
>       my $num2mid = $nntp->xhdr('Message-ID', "$i-$j");
>       for my $n ($i..$j) {
>               defined(my $mid = $num2mid->{$n}) or next;
>               print "$mid\n";
>       }
> }

Thanks, that's perfect.

I collected the message-ids from my archive. Interestingly, I had a
dozen or so that did not have message-ids at all. I think most of them
are from patches that put the "From " line in the body, like this one:

  http://public-inbox.org/git/20070311033833.gb10...@spearce.org/

and then they got corrupted on a round-trip through one of the bad mbox
formats (probably downloading from gmane, I'd guess; the export there
uses mbox, and I use maildir myself, so it probably got split badly
years ago). Anyway, public-inbox seems to get this case right, which is
good.

I had several hundred message ids that you didn't. About half of them
were spam or other junk. I weeded them out manually (mostly by picking
through the subjects, so possibly there's some error). The end result is
279 messages that I think are legitimate that you don't have.

I'll send them to you off-list, as the mbox is about 300K, which the
list will reject.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to