Re: Commit messages and the move to git

Richard Earnshaw (lists) Mon, 18 Nov 2019 07:32:54 -0800

On 09/11/2019 06:01, Eric S. Raymond wrote:

Richard Earnshaw (lists) <richard.earns...@arm.com>:

Which makes me wonder if, given a commit log of the form:



2019-10-30  Richard Biener  <rguent...@suse.de>

         PR tree-optimization/92275
         * tree-vect-loop-manip.c (slpeel_update_phi_nodes_for_loops):
         Copy all loop-closed PHIs.

         * gcc.dg/torture/pr92275.c: New testcase.

Where the first line is a ChangeLog style date and author, we could spot the
PR line below that and hoist it up as a more useful summary (perhaps by
copying it rather than moving it).

It wouldn't fix all commits, but even just doing this for those that have
PRs would be a help.


Speaking from lots of experience with converting old repositories that
exhibited similar comment conventions, I would be nervous about trying
to do this entirely mechanically.  I think the risk of mangling text
that is not fornatted as you expect - and not noticing that until the
friction cost of fixing it has escalated - is rather high.

On the other hand, reposurgeon allows a semi-neechanized attack on
the problem that I think would work well, because I've done similar
things in ither coversions.

There's a pair of commands that allow you to (a) extract comments from
a range of commits into a message list that looks like an RFC822
mailbox file, (b) modify those comments, and (c) weave the the message
list reliably back into the repository.

If it were me doing this job, I'd write a reposurgeon command that
extracts all the comments containing PR strings into a message box
Then I'd write an Emacs macro that moves to the next nessage and
hoists its PR line.

Then I'd walk through the comments applying the macro and keeping an eye on
them for cases where what the macro doesn't do quite the right thing and
using undo and hand-editing to recover.  Human eyes are very good at
spotting anomalies in an expected flow of textm and once you've gotten
into the rhythm of a task like this is is easily possible to filter
approximately a message per second. In round numbers, providing
the anomaly rate isn't high, that's upwards of 3000 messages per hour.

The point is that for this kind of task a hnman being who undertands
what he's reading is likely to have a lower rate of mangling errors than
a program that doesn't.

The hook into reposurgeon is even better... it allows a more intelligenthook to be produced, for example the attached.


In a reposurgeon lift script you can write something like

~gcc2 msgout >gcc.commitlog
shell ./fixbugmessages
msgin <gcc.fixedprs

I wrote this script for two reasons

1) To learn some python (finally I had a good reason to go and dothis :)2) To try to improve some of our legacy commit messages, especiallywhere they appear in git as just the name of the author (informationthat is already readily available in other components.

It works by scanning for commits that match a traditional ChangeLogstyle author line and then, if the commit mentions a PR, looking thatup in the Bugzilla database to extract the original component and thesummary line from the PR. It then produces a single line summary basedon the PR and the summary associated with it. So for example, a commitsuch as this one:


2019-10-30  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/92275
        * tree-vect-loop-manip.c (slpeel_update_phi_nodes_for_loops):
        Copy all loop-closed PHIs.

        * gcc.dg/torture/pr92275.c: New testcase.


Would be matched and the line

PR tree-optimization/92275 ICE: error: definition in block 11 doesnot dominate use in block 15 since r277566


added as a summary

Since there are a number of errors that over the years have crept intothe commit logs, there are also features to try to match (where they arespecified) the bug component against that listed in the commit log.That matching is fuzzy, since it too can be inaccurate: many bugs, forexample, are classified as target in the PR, but subsequently get acommit component that reflects where the fix was applied. So there aresome filters to simplify reporting to just those categories that lookobviously suspicious (like a fortran bug with a C component reported).There are also facilities to mark some commits that have been manuallyverified as clean so that they are no-longer reported.

The script generates new summary lines for nearly 43000 commits thatwould mostly have just shown the author line. Of those about 560 areflagged as potentially suspicious and will need some manual checking.

R.

#!/usr/bin/python2

import requests
import json
import os.path
import codecs
import email.message, email.parser
import re
import signal

# Files that we read or write.
buglist = "bugids.list"
bugdbfile = "bugs.json"
commits_in = "gcc.commitlog"
commits_out = "gcc.fixedprs"

bugdb = {}

# Dictionary indexed by bugzilla component.  If the claimed PR component
# matches against the real component, then assume this probably an OK
# matching.  We avoid a lot of false positive warnings this way.
# A special match of "*" means we don't care in this case.  Many bugs
# are filed under, for example, bootstrap or other when in fact the
# real component will ultimately be shown in the commit log.
compalias = {
    "rtl-optimization": ["opt", "optimization", "optimisation", "optimize",
                         "middle-end", "rtl-opt", "rtl_optimization",
                         "rtl-optimizations", "rtl"],
    "tree-optimization": ["opt", "optimization", "optimisation", "optimize",
                          "middle-end", "tree-opt", "tree-optimzation",
                          "graphite", "lto", "tree-optimizations",
                          "tree-optimizer", "ipa", "tree-optimize",
                          "tree-optimizatio"],
    "java": ["gcj", "libgcj", "libjava"],
    "fastjar": ["java"],
    "libgcj": ["java", "libjava"],
    "classpath": ["libgcj", "libjava", "xml", "crypto"],
    "awt": ["classpath", "libjava"],
    "xml": ["classpath", "libjava"],
    "crypto": ["classpath", "libjava"],
    "objc++": ["obj-c++"],
    "c++": ["libstdc++", "diagnostic", "cp"],
    "c": ["diagnostic", "c-family"],
    "libstdc++": ["c++", "libstdc++-v3"],
    "libmudflap": ["mudflap"],
    "preprocessor": ["cpp", "pch"],
    "bootstrap": ["*"],
    "testsuite": ["*"],
    "regression": ["*"],
    "sanitizer": ["sanitize", "ubsan"],
    "other": ["*"],
    "middle-end": ["optimization", "rtl-optimization", "tree-optimization",
                   "tree-opt", "rtl-opt", "ipa", "lto", "middle", "middle-ed"],
    "ipa": ["middle-end", "lto"],
    "lto": ["middle-end", "ipa", "tree-optimization"],
    "target": ["*"],
    "plugins": ["*"],
    "libf2c": ["fortran", "gfortran", "libgfortran", "libfortran"],
    "fortran": ["gfortran", "libgfortran", "libfortran", "forrtran"],
    "libfortran": ["fortran", "gfortran", "libgfortran", "libfofortran"],
    "libgomp": ["fortran"],
    "gcov-profile": ["profile", "gcov"],
    "libffi": ["other"],
    "web": ["documentation"]}

# List of Legacy-IDs where the PR is correct, dispite the anomaly
# checks we perform.
whitelist = ["52862", "53125", "57958", "57959", "58705", "51884", "58535",
             "58542",
             "61329", "61770", "61960", "61984", "63783", "65127", "65128",
             "67653", "68236", "68239",
             "70062", "70542", "70625", "70932", "72072", "72259", "72260",
             "73504", "73940", "73940", "77602", "78312", "78313",
             "80793", "82490", "82518", "83609", "84217", "84295", "84531",
             "84532", "84887", "84895", "86396", "86429", "86458", "86459",
             "86460", "86461", "87066", "87189", "87511", "87512", "87632",
             "88922", "89175", "89921",
             "90004", "91308", "91504", "92158", "92585", "92658", "94054",
             "94226", "94317", "94805", "95382", "95383", "95428", "95673",
             "96396", "96626", "97373", "97375", "97537", "98412", "98926",
             "99480", "99514",
             "100206", "100207", "100209", "100371", "100851", "100961",
             "101264", "101273", "102155", "102348", "102539", "102567",
             "102799", "102920", "103268", "103279", "103308", "104948",
             "104949", "104988", "105052", "105274", "106269", "106582",
             "107816", "108045", "109166", "109167", "109168", "109169",
             "109643", "109644", "109818",
             "110131", "110132", "110627", "110975", "111818", "111947",
             "111952", "112296", "112298", "112611", "112612", "112819",
             "113095", "113119", "113121", "113137", "113627", "113628",
             "113661", "113662",
             "113701", "113702", "113703", "114482", "115200",
             "116613", "116633", "116633", "117824",
             "117852", "118025", "118034", "118142",
             "118742", "118743", "119248", "119485", "119486",
             "120497", "120498", "120499", "120523", "121334", "121338",
             "121949", "122640", "122641", "122700", "122743", "122751",
             "122798", "124362", "125025",
             "125108", "125844", "126675", "127812", "127688", "127689",
             "128272", "128573", "128649", "129871",
             "130016", "131781", "132577", "132587",
             "141646", "142724", "142725", "142833", "144976",
             "152550", "157199", "157784",
             "160456", "160476", "163777",
             "176083", "176167", "176623", "176624",
             "182943", "187842",
             "190100", "193620", "194102", "195610", "199972",
             "201377", "201378", "202642", "202937", "205753", "209250",
             "210999", "211000", "211001", "211002", "214734", "214958",
             "215345", "215386", "215804", "215806", "215807",
             "221207", "221958",
             "233033",
             "242543", "242795",
             "259105",
             "268434",
             "276571", "276987", "276993"
]

# Indexed by Legacy-ID.  Each fixup may contain either the correct PR
# for this commit or an alternative summary string.  If the latter, this
# will form the entire summary for the commit.
fixups = {
    "57986": {"PR": "7792"},
    "52818": {"PR": "6479"},
    "65177": {"PR": "8808"},
    "65181": {"PR": "8808"},
    "68117": {"PR": "10712"},
    "68118": {"PR": "10712"},
    "71157": {"PR": "11867"},
    "71159": {"PR": "11867"},
    "71856": {"PR": "11415"},
    "75457": {"PR": "12815"},
    "81771": {"PR": "15204"},
    "82028": {"summary": "Multiple fixes: PRs 14692, 15274 and 15463"},
    "84508": {"PR": "16455"},
    "84522": {"PR": "15754"},
    "85966": {"PR": "16935"},
    "86431.1": {"summary": "Revert earlier fix for PR 14029"},
    "86431.2": {"summary": "Revert earlier fix for PR 14029"},
    "93620": {"PR": "19009"},
    "94658": {"PR": "19736"},
    "94889": {"PR": "19634"},
    "96560": {"PR": "20490"},
    "107451": {"PR": "24236"},
    "110063": {"summary": "PR25024, PR20881, PR23308, PR25538 and PR25710 - Procedure references"},
    "114667": {"PR": "27648"},
    "115858": {"PR": "28452"},
    "122127": {"PR": "30795"},
    "122980": {"PR": "31025"},
    "141612": {"PR": "32519"},
    "146134": {"PR": "38668"},
    "146593": {"PR": "39632"},
    "146650": {"PR": "39632"},
    "151112": {"PR": "28039"},
    "158393": {"PR": "43741"},
    "171924": {"PR": "48401"},
    "174228": {"PR": "46005"},
    "174235": {"PR": "46005"},
    "179230": {"PR": "45012"},
    "188077": {"PR": "52700"},
    "188078": {"PR": "52700"},
    "188079": {"PR": "52700"},
    "206073": {"PR": "35545"},
    "206074": {"PR": "35545"},
    "224527": {"PR": "56766"},
    "268301": {"PR": "89020"},
    "249711": {"summary": "Revert: backport PRs 80382, 80966"},
    "276627": {"PR": "47054"}
    }

def read_bugdb(dbfile):
    try:
        with open(dbfile, "r") as f:
            bugdb.update(json.load(f))
            print(len(bugdb))
            f.close()
    except:
        return

def write_bugdb(dbfile):
    try:
        os.delete(dbfile + ".bak")
    except:
        pass
    try:
        os.rename(dbfile, dbfile + ".bak")
    except:
        pass
    try:
        with open(dbfile, "w") as f:
            json.dump(bugdb, f)
            f.close()
    except:
        print("Failed to write" + dbfile)
        exit()

def fetch_sublist(sublist):
    url = 'https://gcc.gnu.org/bugzilla/rest.cgi/bug?id='
    url += ','.join([elem for elem in sublist])
    url += '&include_fields=id,summary,component,resolution'

    resp = requests.get(url)

    if resp.status_code != 200:
        print(resp.status_code)
    return resp.json()

def process(jdata):
    buglist = jdata['bugs']
    for bug in buglist:
        bugdb[str(bug['id'])] = {"summary": bug['summary'],
                                 "component": bug['component'],
                                 "resolution": bug['resolution']}

    print(len(bugdb))

def fetchall(bugids):
    f = open(bugids, "r")
    allbugids = f.read().splitlines()
    sublist = []
    for bugid in allbugids:
        if str(bugid) not in bugdb:
            #print("missing: " + id)
            sublist.append(bugid)
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    for fixup in fixups.values():
        if "PR" in fixup and fixup['PR'] not in bugdb:
            sublist.append(fixup['PR'])
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    if sublist:
        process(fetch_sublist(sublist))

# This is lifted from the python version of reposurgeon, since we are
# parsing objects it created.

# Since we assume python2
rspolicy = None

class RepoSurgeonEmail(email.message.Message, object):
    "Specialized email message with a distinguishing starter."
    Divider = 78 * "-"
    __hash__ = None
    def __init__(self, **kwargs):
        if rspolicy:
            # This activates the Python 3 email header handling hack
            kwargs['policy'] = rspolicy
        email.message.Message.__init__(self, **kwargs)
        self.set_unixfrom(RepoSurgeonEmail.Divider)
    @staticmethod
    def readmsg(fp):
        msg = ''
        firstline = fp.readline()
        if not firstline:
            return None
        elif not firstline.startswith(RepoSurgeonEmail.Divider):
            msg = firstline
        while True:
            line = fp.readline()
            if not line:
                break
            # Not impossible for this to get spoofed, but *highly* unlikely
            if line == "\n." + RepoSurgeonEmail.Divider + "\n":
                line = "\n" + RepoSurgeonEmail.Divider + "\n"
            if line.startswith(RepoSurgeonEmail.Divider):
                break
            msg += line
        return msg
    def payload(self):
        "Return message body - works because not multipart."
        return self.get_payload()
    def __str__(self):
        out = super(RepoSurgeonEmail, self).as_string(unixfrom=True)
        out = out.replace("\n" + RepoSurgeonEmail.Divider + "\n",
                          "\n." + RepoSurgeonEmail.Divider + "\n")
        out = out.replace("\n>From", "\nFrom")
        return out

# Close states that suggest that the bug is unlikely to be a correct
# target of a fix.
badclose = ["INVALID"]
# Would be nice to add ["WORKSFORME", "WONTFIX"], but too many false
# positives on early PRs.

# Fuzzy match a claimed bugcat against the component in the PR.  This avoids
# a number of likely false negatives without over relaxing the desire to
# detect possibly inaccurate PR numbers.
def component_mismatch(bugcat, bugid):
    # We're not bothered about case
    bugcat = bugcat.lower()
    component = bugdb[bugid]['component']
    # It's extremely unlikely that any bug classified as spam should match
    # a PR.  It's also pretty unlikely that a bug closed as INVALID would
    # do so either.  So flag up both cases.
    if (component == "spam" or bugdb[bugid]['resolution'] in badclose):
        return True
    # These blanket components lead to far too many false positives, so
    # just ignore them.  This is essentially the reverse map of the "*"
    # match later on.
    if (bugcat in ["boostrap", "bootstrap", "target", "plugins", "testsuite",
                   "other", "ice", "gcc", "translation", "regression", "build",
                   "diagnostic", "wrong-code"] or
        # These are nonesense categories, since they refer to
        # components outside of GCC.
        bugcat in ["binutils", "gas", "ld"] or
        bugcat == component):
        return False
    if component not in compalias:
        return True
    if ("*" in compalias[component] or
        bugcat in compalias[component]):
        return False
    return True

def processmsg(msg):
    multicommit = False
    payload = msg.get_payload()
    summary = None

    # Look to see if this event has a specific fixup.  Use it if so.
    if msg['Legacy-ID'] in fixups:
        fixup = fixups[msg['Legacy-ID']]
        if "PR" in fixup:
            summary = bugdb[fixup['PR']]['summary'].encode("utf-8")
            z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
            if z:
                summary = z.group(1)
            summary = ("PR " + bugdb[fixup['PR']]['component'].encode("utf-8")
                       + "/" + fixup['PR'] + ": " + summary)
        elif "summary" in fixup:
            summary = fixup['summary']
        else:
            print("Invalid fixup for Legacy-ID " + msg['Legacy-ID'])
            exit()

    # Look to see if this event is a backport of multiple PRs.  Just
    # set the summary to a quick list if so.
    if not summary:
        allprs = re.findall(r"\b\s*PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
        allprnos = list(dict.fromkeys([int(pr[2]) for pr in allprs]))
        allprnos.sort()
        if ((len(allprnos) > 3 and
             re.search(r"\b[Bb]ack ?port(ed)?\b", payload))
            or (len(allprnos) > 1 and
                re.search(r"\b[Bb]ack ?ports", payload))):
            prs = allprnos if len(allprnos) < 10 else allprnos[:10]
            #print prs[0]
            summary = "Backport PRs " + str(prs[0])
            for pr in prs[1:]:
                summary += ", " + str(pr)
            if len(allprnos) > 10:
                summary += " and more"

    if not summary:
        lines = payload.splitlines()
        while lines and lines[0] == "":
            lines = lines[1:]
        if not lines:
            return None
        x = re.match(r"\d{4}-\d{2}-\d{2}\s+.+\s<[\w.]+@[\w.-]+>",lines[0])
        if x:
            if len(lines) > 2:
                for line in lines[2:]:
                    if (line != lines[0] and
                        re.match(r"\d{4}-\d{2}-\d{2}\s+.+\s<[\w.]+@[\w.-]+>",
                                 line)):
                        multicommit = True
                        break
        else:
            # Some messages begin with just the PR number, we can
            # handle these the same way.  In this case, only match if
            # the line only contains the PR number.
            x = re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)\s*$", lines[0])
            if not x:
                return None
        if multicommit:
            summary = "[multiple commits]"
            bugid = ""
        else:
            y = re.search(r"(^|\s+|[/\(\[])PR(\s+([^/\s]+)/|\s*#?)(\d+)\b",
                          payload)
            if not y:
                return None
            bugcat = y.group(3)
            bugid = y.group(4)
            if bugid not in bugdb:
                #print("PR " + bugid + " not found in database")
                return None
            summary = bugdb[bugid]['summary'].encode("utf-8")
            # Early on there were multiple GNATS databases for bugs;
            # later they were merged into a single database, but the
            # PR numbers were changed.  Don't trust bugs below 1000
            # unless the component matches the one mentioned in the
            # log (if it even has one).
            if (int(bugid) < 1000 and
                (bugcat == None or bugcat != bugdb[bugid]['component'])):
                print("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
                      + "): Ignoring PR " + bugid
                      + " - no component or component mismatch")
                return None
            elif (bugcat != None and msg['Legacy-ID'] not in whitelist
                  and component_mismatch(bugcat, bugid)):
                problem = ("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
                           + "): '" + bugcat + "/" + bugid + "'")
                if (bugdb[bugid]['resolution'] in badclose):
                    problem += " closed as " + bugdb[bugid]['resolution'].encode("utf-8")
                    summary += (" [checkme: "
                                + bugdb[bugid]['resolution'].encode("utf-8")
                                + " SVN r" + msg['Legacy-ID'] + "]")
                else:
                    problem += " mismatch PR " + bugdb[bugid]['component'].encode("utf-8")
                    summary += (" [checkme: " + bugcat + " SVN r"
                                + msg['Legacy-ID'] + "]")
                print(problem)

            z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
            if z:
                #print(summary)
                summary = z.group(1)
            summary = ("PR " + bugdb[bugid]['component'].encode("utf-8")
                       + "/" + bugid + ": " + summary)

    newmsg = RepoSurgeonEmail()

    # All messages should have an event number, so this will hopefully
    # raise an exception if it fails
    newmsg['Event-Number'] = msg['Event-Number']
    if "Check-Text" in msg:
        newmsg['Check-Text'] = msg['Check-Text']
    # We must copy the tag name if it exists.
    if "Tag-Name" in msg:
        newmsg['Tag-Name'] = msg['Tag-Name']
    try:
        newmsg.set_payload(summary + "\n\n" + payload)
        return str(newmsg)
    except:
        print("Unicode??? Cannot format message for Event "
              + msg['Event-Number'] +
              (summary if bugid == "" else (" PR " + bugid)))
        return None

def main():
    read_bugdb(bugdbfile)
    try:
        fetchall(buglist)
    except KeyboardInterrupt:
        write_bugdb(bugdbfile)
        exit()

    write_bugdb(bugdbfile)

    commitlog = open(commits_in, "r")
    newlog = open(commits_out, "w")

    while True:
        msg = RepoSurgeonEmail.readmsg(commitlog)
        if not msg:
            break
        newmsg = processmsg(email.message_from_string(msg))
        if newmsg:
            newlog.write(newmsg)

    newlog.close()
    commitlog.close()

if __name__ == '__main__':
    main()

#!/bin/sh

# Note the largest bugid is currently less than 100000, so we strip
# off any above that as they must clearly be bogus

egrep "\bPR( ?#?[0-9]+|[        ]+[^/   ]+/[0-9]+)([[:punct:]]|[        ]|$)"\
      gcc.commitlog | \
    sed -E "s:\bPR( [a-zA-Z][-+a-zA-Z0-9]*/| ?#?)([0-9][0-9]*).*:x\2:" | \
    sed -E "s/.*x([0-9][0-9]*)/\1/" | \
    sort -n | \
    uniq | \
    sed -e "/.......*/d" > bugids.list

./bugdb.py

Re: Commit messages and the move to git

Reply via email to