Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
>
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samp...@ngyro.com> wrote:
> > zimoun <zimon.touto...@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question. A good part of the size comes from the
> > representation rather than the data. Compression helps a lot here. I
> > have a database of 3,912 packages. It’s 295M uncompressed (which is a
> > little better than your estimation). If I pass each file through Lzip,
> > it shrinks down to 60M. That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K). I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
>
> Thank you for these numbers. Really interesting!
>
> First, I do not know if the database needs to be stored with Git. What
> should be the advantage? (naive question :-))
>
>
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size. Nice!
>
> Moreover, the format is a long list, e.g.,
>
> --8<---------------cut here---------------start------------->8---
> (headers
How about
  (X-v1-headers
(borrowing from RFC 2045 MIME usage indicating as-yet-not-a-formal-standard)

The idea is to make it easy to script the change to "(headers" once there is
consensus for declaring a new standard.

The "v1-" part could allow a simultaneous "(X-v2-headers" alternative for
zimoun's concise suggestion, or even a base64 of a compressed format.

There's lots that could be borrowed from the MIME RFCs :)
--8<---------------cut here---------------start------------->8---
6.3. New Content-Transfer-Encodings

   Implementors may, if necessary, define private
   Content-Transfer-Encoding values, but must use an x-token, which is
   a name prefixed by "X-", to indicate its non-standard status, e.g.,
   "Content-Transfer-Encoding: x-my-new-encoding". Additional
   standardized Content-Transfer-Encoding values must be specified by
   a standards-track RFC. The requirements such specifications must
   meet are given in RFC 2048. As such, all content-transfer-encoding
   namespace except that beginning with "X-" is explicitly reserved to
   the IETF for future use.

   Unlike media types and subtypes, the creation of new
   Content-Transfer-Encoding values is STRONGLY discouraged, as it
   seems likely to hinder interoperability with little potential
   benefit
--8<---------------cut here---------------end--------------->8---

>   ((name "raptor2-2.0.15/")
>    (mode 493)

If you want to be more human-readable with mode, I would put a chmod argument
in place of 493 :)
--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by opportunistic code buried in a later-executed apparently unrelated app?
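
To make the chmod-argument idea a bit more concrete, here is a rough sketch of
the conversion in both directions. The function names mode_to_chmod and
chmod_to_mode are made up for illustration and are not part of any existing
tool. The range check only rejects values that cannot be a permission mode at
all; it cannot tell whether a valid-looking mode is the one that was intended,
so it does not answer the typo question above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are invented for this sketch.

# Decimal mode, as stored in the headers, to a chmod-style octal string.
mode_to_chmod() {
    case "$1" in
        # Reject empty input, non-digits, and leading zeros
        # (printf would read a leading zero as an octal prefix).
        ''|*[!0-9]*|0[0-9]*) return 1 ;;
    esac
    # 4095 is 07777, the largest value chmod accepts.
    [ "$1" -le 4095 ] || return 1
    printf '%o\n' "$1"
}

# chmod-style octal string back to the decimal value used in the headers.
chmod_to_mode() {
    case "$1" in
        # Reject empty input and anything that is not an octal digit.
        ''|*[!0-7]*) return 1 ;;
    esac
    # The leading 0 makes printf parse the argument as octal.
    value=$(printf '%d' "0$1") || return 1
    [ "$value" -le 4095 ] || return 1
    printf '%s\n' "$value"
}

mode_to_chmod 493   # prints 755
chmod_to_mode 755   # prints 493
```

Keeping the decimal value in the file and converting only for display would
leave the on-disk format unchanged, at the cost of needing such a helper
whenever a human reads it.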
>    (mtime 1414909500)

One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$
--8<---------------cut here---------------end--------------->8---

>    (chksum 4225)
>    (typeflag 53))
>   ((name "raptor2-2.0.15/build/")
>    (mode 493)
>    (mtime 1414909497)
>    (chksum 4797)
>    (typeflag 53))
>   ((name "raptor2-2.0.15/build/ltversion.m4")
>    (size 690)
>    (mtime 1414908273)
>    (chksum 5958))
>
>   […])
> --8<---------------cut here---------------end--------------->8---
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
> (X-v2-headers ;; ;-)
>   ((na "raptor2-2.0.15/")
>    (mo 493)
>    (mt 1414909500)
>    (ch 4225)
>    (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.
>
> Or even plain list:
> (X-v3-headers
>   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
>
> where the first element provides the “type” of list to ease the reader.
>
>
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have an human-readable format?
>
>
> Thank you again for pushing forward this topic. :-)
>
> All the best,
> simon
>
> [1] https://forge.softwareheritage.org/T2430#47522
>
>
>
Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
-- 
Regards,
Bengt Richter