Bug#570623: reprepro: please add multiple version management

Benjamin Drung Mon, 26 May 2014 08:40:21 -0700

Am Freitag, den 23.05.2014, 23:39 +0200 schrieb Bernhard R. Link:
> * Benjamin Drung <benjamin.dr...@profitbricks.com> [140521 16:48]:
> > I got distracted by different tasks, but now I have time to work on
> > reprepro again.
> > 
> > Am Dienstag, den 04.02.2014, 23:23 +0100 schrieb Bernhard R. Link:
> > > * Benjamin Drung <benjamin.dr...@profitbricks.com> [140203 13:15]:
> > > > Okay. Attached the patch for my prototype. Be aware: It's just a
> > > > prototype that is just able to run the commands that I wanted to test,
> > > > but isn't near to be ready for mainlining. The prototype implements case
> > > > 2 just because that was my initial idea, but now I tend to think that
> > > > case 1 might be easier/cleaner.
> > >
> > > Thanks. I'll take a look this weekend.
> >
> > Any feedback so far?
> 
> It's too long ago for me to remember. AFAIR it mostly was that libdb
> trick that is quite hard to understand.


My current approach is to use the secondary DB again (case 2). The
primary DB stores packagename+version (e.g. "hello|2.5-1") and does not
allow duplicates. The secondary DB stores just the packagename (e.g.
"hello") and does allow duplicates. The duplicates are sorted by their
version.

The function get_package_name is used to derives the secondary key from
the primary key by extracting the packagename (e.g. "hello|2.5-1" ->
"hello").

This approach has the advantage that these operations are easy to
implement/cost efficient:

1) get the latest package version (first entry in the secondary DB for a
given package name)
2) get a specific package version (entry in the primary DB for a given
packagename+version combination)

For example adding a new package would do this:
Get the latest package version for the given package name (1). If the
result is NULL or the returned package version is older than the new
package, add the new package. If the version is the same, reject the new
package. If the version is newer than the new package, check if the
package version is already present in the archive (2) and reject it in
this case. Otherwise add the new package if downgrades are allowed.

> > > > > It sounds quite slow either way. Perhaps the way to go is instead
> > > > > changing the data format, like having the version first (perhaps even 
> > > > > in
> > > > > preparsed format to speed things up).
> > > >
> > > > Good idea, but is this function really time critical? It should be only
> > > > called when comparing duplicate keys (which shouldn't happen that often,
> > > > does it?).
> > > 
> > > It might also happen when updating some value otherwise. (And if the
> > > version is in some meta-data first one also does not have to
> > > differentiate between binaries and sources that much). One could also
> > > take the opportunity of a format change to allow for other possible
> > > future meta data (like the first added timestamp).
> > 
> > How flexible should the new data structure be? What meta data besides
> > the timestamp could be relevant?
> 
> It would be nice if it allowed other fields to be added later without
> breaking the format again. Like some (field-length-content)* format.

Fields to store (for the beginning):
1) package version string
2) control chunk string
3) added timestamp

Here are different, but similar on-disk format proposals:

First variant
=============

Physical order of the data
--------------------------

field 1
field 2
...
field n
sizeof(field n) [stored as 32-bit unsigned int]
...
sizeof(field 2) [stored as 32-bit unsigned int]
sizeof(field 1) [stored as 32-bit unsigned int]
number of field [stored as 8-bit (or 16-bit?) unsigned int]

In our case this would be
-------------------------

[variable] package version string + '\0'
[variable] control chunk string + '\0'
[8 bytes] timestamp (which format? 64-bit Unix timestamp?)
[4 bytes] sizeof(timestamp) = 8 (in case of 64-bit timestamp)
[4 bytes] strlen(control chunk) + 1
[4 bytes] strlen(package version) + 1
[1 byte] 3

Access to the individual entries for a given void *data and size_t len:
size_t end = (size_t)data + len;

char *version = (char*)data;
size_t version_len = *(uint32_t)(end - sizeof(uint8_t) - sizeof(uint32_t));
fail if *(uint8_t)(end - sizeof(uint8_t)) < 1

char chunk = (char*)((size_t)data + version_len);
size_t chunk_len = *(uint32_t)(end - sizeof(uint8_t) - 2 * sizeof(uint32_t));
fail if *(uint8_t)(end - sizeof(uint8_t)) < 2

time_t timestamp = *(int64_t)((size_t)data + version_len + chunk_len);
fail if *(uint8_t)(end - sizeof(uint8_t)) < 3

Additional checks:
size_t timestamp_len = *(uint32_t)(end - sizeof(uint8_t) - 3 * 
sizeof(uint32_t));
timestamp_len == sizeof(int64_t)
len == version_len + chunk_len + timestamp_len + sizeof(uint8_t) + 3 * 
sizeof(uint32_t)

Second variant
==============

The second variant is similar to the first, but only the size of fields
are stored that have a variable length. In our case, we would store the
length of the version and control chunk, but not the length of the
timestamp.

third variant
=============

Instead of storing the length of the field, the offset is stored:

Physical order of the data
--------------------------

field 1
field 2
...
field n
offset field n [stored as 32-bit unsigned int]
...
offset field 3 [stored as 32-bit unsigned int]
offset field 2 [stored as 32-bit unsigned int]
number of field [stored as 8-bit (or 16-bit?) unsigned int]

In our case this would be
-------------------------

[variable] package version string + '\0'
[variable] control chunk string + '\0'
[8 bytes] timestamp (which format? 64-bit Unix timestamp?)
[4 bytes] strlen(package version) + strlen(control chunk) + 2
[4 bytes] strlen(package version) + 1
[1 byte] 3

Access to the individual entries for a given void *data and size_t len:
size_t end = (size_t)data + len;

char *version = (char*)data;
size_t chunk_offset = *(uint32_t)(end - sizeof(uint8_t) - sizeof(uint32_t));
size_t version_len = chunk_offset;
fail if *(uint8_t)(end - sizeof(uint8_t)) < 1

char chunk = (char*)((size_t)data + chunk_offset);
size_t timestamp_offset = *(uint32_t)(end - sizeof(uint8_t) - 2 * 
sizeof(uint32_t));
size_t chunk_len = timestamp_offset - chunk_offset;
fail if *(uint8_t)(end - sizeof(uint8_t)) < 2

time_t timestamp = *(int64_t)((size_t)data + timestamp_offset);
fail if *(uint8_t)(end - sizeof(uint8_t)) < 3

Additional checks:
chunk_offset < timestamp_offset
timestamp_offset < len
len == version_len + chunk_len + timestamp_len + sizeof(uint8_t) + 3 * 
sizeof(uint32_t)

Modified first/second/third variant
===================================

This variant can be a modification to the first, second, or third
variant and therefore become variant four till the sixth. Instead of
storing the number of field at the end of the data block, the number of
fields are stored at the beginning of the data block:

number of field [stored as 8-bit (or 16-bit?) unsigned int]
field 1
field 2
...
field n
sizeof/offset m
...
sizeof/offset 2
sizeof/offset 1

Question
========

Which on-disk format do you prefer?

-- 
Benjamin Drung
System Developer

ProfitBricks GmbH - The IaaS-Company
Greifswalder Str. 207
D - 10405 Berlin

Mail: benjamin.dr...@profitbricks.com
Fax:  +49 30 577 008 598
URL:  http://www.profitbricks.com

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Bug#570623: reprepro: please add multiple version management

Reply via email to