On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:
> > > Documentation/git-fast-export.txt | 7 +++++++
> > > builtin/fast-export.c | 20 +++++++++++++++-----
> > > fast-import.c | 17 +++++++++++++++++
> > > t/t9350-fast-export.sh | 17 +++++++++++++++++
> > > 4 files changed, 56 insertions(+), 5 deletions(-)
> >
> > The fast-import format is documented in Documentation/git-fast-import.txt.
> > It might need an update to cover the new format.
>
> We document the format in both fast-import.c and
> Documentation/git-fast-import.txt? Maybe we should delete the long
> comments in fast-import.c so this isn't duplicated?
Yes, that is probably worth doing (see the comment at the top of
fast-import.c). Some information might need to be migrated.
If we're going to have just one spot, I think it needs to be the
user-facing documentation. This is a public interface that other people
are building compatible implementations for (including your new tool).
> > > +--show-original-ids::
> > > + Add an extra directive to the output for commits and blobs,
> > > + `originally <SHA1SUM>`. While such directives will likely be
> > > + ignored by importers such as git-fast-import, it may be useful
> > > + for intermediary filters (e.g. for rewriting commit messages
> > > + which refer to older commits, or for stripping blobs by id).
> >
> > I'm not quite sure how a blob ends up being rewritten by fast-export (I
> > get that commits may change due to dropping parents).
>
> It doesn't get rewritten by fast-export; it gets rewritten by other
> intermediary filters, e.g. in something like this:
>
> git fast-export --show-original-ids --all | intermediary_filter |
> git fast-import
>
> The intermediary_filter program may want to strip out blobs by id, or
> remove filemodify and filedelete directives unless they touch certain
> paths, etc.
OK, that matches my understanding. So why does fast-export need to print
the blob ids? If the intermediary is rewriting blobs, it can then
produce the "originally" line itself, can't it?
The more interesting case I guess is your "strip out blobs by id"
example. There the intermediary _could_ do so itself, but it would
require recomputing the object id of each blob.
If you use "--no-data", then this just works (we specify tree entries by
object id, rather than by mark). But I can see how it would be useful to
have the information even without "--no-data" (i.e., if you are doing
multiple kinds of rewrites on a single stream).
I think the thing that confused me is that this "originally" is doing
two things:
- mentioning blob ids as an optimization / convenience for the reader
- mentioning rewritten commit (and presumably tag?) ids that were
rewritten as part of a partial history export. I suppose even trees
could be rewritten that way, too, but fast-import doesn't generally
consider trees to be a first-class item.
So I'm OK with it, but I wonder if there is an easier way to explain it.
-Peff