Hi Asheesh,

I'm currently testing your code from the git repository. Interestingly, it
also covers the future ;-) :

  ...
  Computed upload history for 2020-05
  Computed upload history for 2020-06
  Computed upload history for 2020-07
  Computed upload history for 2020-08
  Computed upload history for 2020-09
  Computed upload history for 2020-10
  Computed upload history for 2020-11
  Computed upload history for 2020-12
  Computed upload history for 2021-01
  Computed upload history for 2021-02
  Computed upload history for 2021-03
  Computed upload history for 2021-04
  Computed upload history for 2021-05
  Computed upload history for 2021-06
  Computed upload history for 2021-07
  Computed upload history for 2021-08
  Computed upload history for 2021-09
  Computed upload history for 2021-10
  Computed upload history for 2021-11
  Computed upload history for 2021-12

At first glance the result looks sensible:

sqlite> select * from upload_history where maintainer like '%debian-med-packaging%' limit 2 ;
e1jawxz-000605...@ries.debian.org|1199391582|gnumed-client|0.2.8.1-1|Andreas Tille <ti...@debian.org>|Andreas Tille|ti...@debian.org|Debian-Med Packaging Team <debian-med-packag...@lists.alioth.debian.org>|Debian-Med Packaging Team|debian-med-packag...@lists.alioth.debian.org|0| gnumed-client (0.2.8.1-1) unstable; urgency=low
 .
   * New upstream version
e1japsm-0006xr...@ries.debian.org|1199462003|probcons|1.12-4|Charles Plessy <charles-debian-nos...@plessy.org>|Charles Plessy|charles-debian-nos...@plessy.org|Debian-Med Packaging Team <debian-med-packag...@lists.alioth.debian.org>|Debian-Med Packaging Team|debian-med-packag...@lists.alioth.debian.org|0| probcons (1.12-4) unstable; urgency=low
 .
   - Allowed upload by Debian Maintainers.
   - Checked the compliance with Policy 3.7.3
   * debian/patches:
     - swiched to quilt
     - added a fix to build with GCC 4.3 (Closes: #455625)
   * debian/rules:
     - modify Main-RNA.cc so that it uses Defaults-RNA.h (Closes: #458926)
   * debian/copyright:
     - converted to machine-readable format.
 .
   [ David Paleino ]
   * debian/probcons.1, debian/probcons-RNA.1, debian/pc-compare.1,
     debian/pc-makegnuplot.1, debian/pc-project.1 added
     - these have been statically built.
   * debian/control:
     - B-D updated
     - added myself to Uploaders
   * debian/rules:
     - manpages statically built
     - minor changes

But I guess you consider this table to be partly in a debugging state;
otherwise I do not see a good reason to store the full changelog paragraph.
You are also storing message_id. That's fine from a data consumption point
of view, but I do not see any real use for this field at the moment.

I would love to see the same table structure as in UDD:

  source | version | date | changed_by | changed_by_name | changed_by_email | maintainer | maintainer_name | maintainer_email | nmu | signed_by | signed_by_name | signed_by_email | key_id | distribution | file | fingerprint

What I'm missing are the signed_by* fields. I have no idea what key_id
means; I have never used it. Distribution might be good to have as well; I
have no idea what file might have contained. Fingerprint also seems
sensible since it could be a link to the carnivore table.

Regarding the decision to parse the web archives rather than mboxes: I
don't know which is better. I agree that accessing public data is an
advantage, but if it comes at the expense of more complex code I would
rather stick to the mbox parsing.

BTW, formerly the data went back at least to 2000. Here is the graph for
pkg-perl:

  http://blends.debian.net/liststats/uploaders_pkg-perl.png

Currently you encode the date as an integer in sqlite, so I need to think
about how to translate this. For the target query I want to run for my talk
it would be more convenient to have date or datetime values.

So much for my review. Thanks a lot for your work on this. It's really
appreciated!

Kind regards

      Andreas.

On Wed, Aug 19, 2020 at 11:03:40PM -0700, Asheesh Laroia wrote:
> Hi Andreas & Lucas & all,
>
> Lucas -- I'm making progress on re-implementing this.
> I'd love your input
> by email or IRC about my approach, but if you're busy, feel free to ignore
> this and I'll mention you again when I submit a patch.
>
> Andreas -- The codebase at
> https://github.com/paulproteus/debian-devel-changes-history-extractor can
> be run on your system and generate a "upload_history" table. Would you be
> willing to try it out and let me know if it meets your needs?
>
> The README at the URL above has some information about how to use it.
>
> https://drive.google.com/drive/folders/1hF_zuc_03m3a_VwOO5hpjp5vETNjVxMx?usp=sharing
> is a Google Drive folder (owned by me) which contains an
> upload_history.sqlite file you can use. This would allow you to query the
> current database without using the code to create it. (Feel free to also
> use the code to create your own DB.)
>
> I'm happy to discuss by IRC or private email or BTS email what you would
> need next. I do hope to resolve the issues listed in the bug tracker on
> GitHub, but I haven't yet, and feedback will help me prioritize.
>
> Per the info in the README, I'd like to get this merged into UDD in the
> long run, and be happy to have a discussion about the best way to do so.
> There are a few issues I want to fix before formally submitting it -- see
> https://github.com/paulproteus/debian-devel-changes-history-extractor/issues
> for
> a list.
>
> Cheers,
>
> Asheesh.

-- 
http://fam-tille.de
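
PS: On translating the integer date column, here is a minimal sketch. It
assumes the integers are Unix timestamps in seconds (1199391582 would then
fall in early January 2008, which fits the gnumed-client upload shown
above). The three-column table below is only a hypothetical stand-in for the
real upload_history table; in particular the column names source, version
and date are assumptions borrowed from the UDD layout, and the helper name
with_readable_dates is made up for illustration.

```python
import sqlite3


def with_readable_dates(conn):
    """Return (source, version, uploaded_at) rows, rendering the integer
    date as a UTC 'YYYY-MM-DD HH:MM:SS' string via SQLite's datetime()."""
    query = """
        SELECT source, version, datetime(date, 'unixepoch') AS uploaded_at
        FROM upload_history
        ORDER BY date
    """
    return conn.execute(query).fetchall()


# In-memory stand-in for upload_history.sqlite; the column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upload_history (source TEXT, version TEXT, date INTEGER)"
)
conn.executemany(
    "INSERT INTO upload_history VALUES (?, ?, ?)",
    [
        ("gnumed-client", "0.2.8.1-1", 1199391582),
        ("probcons", "1.12-4", 1199462003),
    ],
)

rows = with_readable_dates(conn)
for row in rows:
    print(row)
```

If the assumption about the timestamp unit holds, the same datetime(date,
'unixepoch') expression could also be used directly in the sqlite shell or
in a view, so the stored integers would not need to change.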