[Bug gcov-profile/54487] [4.8 Regression] profiledbootstrap broken by r190952

tejohnson at google dot com Tue, 11 Sep 2012 10:25:45 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487

--- Comment #16 from Teresa Johnson <tejohnson at google dot com> 2012-09-11 
17:24:58 UTC ---
On Thu, Sep 6, 2012 at 10:18 PM, Teresa Johnson <tejohn...@google.com> wrote:
> On Thu, Sep 6, 2012 at 1:49 PM, hjl.tools at gmail dot com
> <gcc-bugzi...@gcc.gnu.org> wrote:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>>
>> --- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-06 
>> 20:49:02 UTC ---
>> It works for me now after syncing with revision 191037.
>
> Unfortunately, I have now hit this myself a couple times in further
> testing, so I am still digging...

After digging into this for several days now, I am convinced that the
gcda file is changing out from underneath the profile-use compile
during the read. I just don't understand how, since fcntl is being
used to lock the file. Details below. I am hoping someone has some
ideas or advice on where to go.

First, some info on the compile step and gcda file locking, then I
will explain why I belive the gcda file is being written mid-read.

The failure is occurring sporadically during the profile-use step of a
parallel profiledbootstrap. In this step, not only are the gcc
profiles being read in by the profile-use compilation, but the gcc
binaries being used in the compile are instrumented, so the gcda files
are also being written by the parallel build processes. However, the
fcntl file locking should prevent interference between the reads and
the writes of the same gcda file. According to the fcntl
documentation, the lock is lost on any fclose on the file by the same
process, even if it uses a different file descriptor. But for a given
profile use compile, the read step should be complete before the write
step when the gcc exits and updates the gcda files with libgcov. So I
don't think the lock should be lost this way.

The failure is not reproducible manually. Therefore, I have been
adding instrumentation to the compiler to spit out information at both
the point of failure ("error: corrupted profile info: edge from 54 to
55 exceeds maximal count" from read_profile_edge_counts()), and then
at the point when the gcda counts are being read in by
read_counts_file(). This is the loop that looks like:

          for (ix = 0; ix != n_counts; ix++)
            entry->counts[ix] += gcov_read_counter ();

I modified the above loop to check each count as it is read in against
the sum_max, and if it exceeds sum_max to print the summary and
counters at that point. I then compared this to the counters obtained
for that function from a gcov-dump of the gcda file afterwards. The
counters in the gcda file from gcov-dump are slightly higher because
of subsequent merges into it by other parallel builds before
compilation aborts, but it is still fairly easy to correlate the
counter values between them.

What I found is that the counter values being read by read_counts_file
look good up to a certain point, and then they go bad. Looking at the
huge bad values in hex, they are actually valid values from the gcda
file, but it looks like we suddenly jumped back or ahead several
locations. So for example, in some cases where it looks like we
suddenly jumped ahead in the gcda file, I see that some of the last
counter values being read into the array as counters are actually the
tag values from just after the counter array in the gcda file. Or in
some cases, the counter values are being read mis-aligned by one word,
so they look huge because instead of having 0x00000000 in the most
significant of the 2 counter words, the 0x00000000 is in the least
significant of the 2 counter words. In other words, we jumped some odd
number of words ahead (or behind) in the gcda file.

If another process is writing into the gcda file at the same time,
this could happen. Specifically, since the histogram now included in
the gcda file summaries with my patch only write non-zero values,
after a merge the size of the histogram, and therefore the size of the
summaries, could change. This will cause the starting offsets of
different tags/sections throughout the rest of the gcda file to shift.
This shouldn't be a problem if the file locking is working though. But
if the file locking is not working, then that would explain why
suddenly during the read we suddenly seem to jump to a different spot.

A couple of other notes:
- I added some code after each counter read in the above loop to seek
back to the offset where we read the tag, re-read it, and compare it
to the tag we read earlier (then re-seek back to the current location
in the counter array before continuing the read). Normally this check
succeeds, but in the cases where I am hitting the above errors, the
"tag" at that location has changed to something that doesn't look like
a tag.
- Jumping to a different spot should corrupt the reads of the rest of
the file. But the main loop in read_counts_file will simply ignore any
tag it doesn't understand, and exits when it reads a '0' tag. That's
why there was no error being flagged during the read when this was
happening.
- If something like this was happening on trunk before my patch, it
most likely would have been hidden because the size of the summaries
was static, and therefore so were the offsets within the file, except
in the case when a new summary was being added during the merge (which
is not very often). So it is possible that this problem existed prior
but was hidden.

Any suggestions or ideas?

Thanks,
Teresa

>
> Teresa
>
>>
>> --
>> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
>> ------- You are receiving this mail because: -------
>> You are on the CC list for the bug.
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413

[Bug gcov-profile/54487] [4.8 Regression] profiledbootstrap broken by r190952

Reply via email to