Hi Leif,

On 3 Nov, 06:19, leif <not.rea...@online.de> wrote:
> On 3 Nov., 06:29, Bill Hart <goodwillh...@googlemail.com> wrote:
>
> > [...]
> > Firstly, thank you to all the people who took the time to work on
> > putting the new MPIR and Pari into Sage.
>
> > (By the way, I don't understand why MPIR has been updated to 2.1.2 and
> > not 2.1.3, which fixes a serious bug in the mpf functions. Nor do I
> > understand why MPIR has been updated and the thread for this hasn't
> > been closed. Also, FLINT hasn't been updated, even though I explicitly
> > stated it isn't safe to build the old FLINT against the new MPIR.)
>
> Em, we haven't yet upgraded MPIR in Sage (see #8664); it's still
> 1.2.2.
I just noticed this myself. I misread the version number of the spkg in
Sage 4.6 as 2.1.2 rather than 1.2.2. This explains why the ticket hasn't
been closed. It also explains why FLINT will build against the MPIR in
Sage.

It's blowing my mind that Sage is still using an 18-month-old MPIR which
is almost uniformly half the speed!! I predict the doctest times for some
modules will drop noticeably when you put the new MPIR in.

Hasn't David Harvey been maintaining an optional GMP 5.0.1 spkg? I can't
believe anyone has still been using Sage with the old MPIR spkg when that
is available. It is also uniformly twice as fast!! Didn't there use to be
something about Sage wanting to be a viable alternative to Magma....

> (I recently sent you and thempirteam an e-mail regarding the [rather
> trivial] exec stack problem of MPIR 2.1.x with Fedora 14. I couldn't
> find it on mpir-devel, and MPIR's trac was down.)

I forwarded it just now. That may save a day or two before the
thempirteam email address gets read (I am no longer in charge of MPIR;
Jason Moxham is).

At any rate, I'm pretty sure MPIR 1.2.2 is no different with regard to
this issue than the latest version. A while back we were asked by some
distro to add some code to the bottom of a large number of files in MPIR
for some "security issue", but I didn't know what to make of the request.
I guess maybe nothing happened?

> Hopefully we'll get the new MPIR into an early alpha of 4.6.1, but
> there's still work to do to make /upgrading Sage/ work with that,
> since currently not all dependent parts of the /Sage library/ get
> automatically properly rebuilt. I think we made a step forward with
> the 4.6 release, since now at least dependent /spkgs/ get rebuilt.

I see.

> W.r.t. 2.1.3, somebody else said we're currently not using any of the
> mpf functions in Sage.

cddlib, ecl, mpmath, mpfr, singular and sympy all, as far as I can see,
make extensive use of the mpf functions.
> > Anyhow, whilst reading the long Pari trac ticket, and associated
> > tickets, a few things stood out to me (a C programmer) that just might
> > not be obvious to everyone. Apologies if this is already known to
> > everyone here.
>
> > At some point the new Pari + new MPIR caused a segfault in one of the
> > doctests. Now, segfaults are in some ways the easiest types of bugs to
> > track down. Here's how:
>
> > You simply compile the relevant C libraries with gcc -g (this adds
> > symbol information and line numbers to the compiled libraries). Next,
> > you run the program valgrind. You don't need to do anything special to
> > run it; it just works.
>
> > If you normally type "blah" at the command line to run your program,
> > just type "valgrind blah" instead. It will take much longer to run
> > (usually 25-100 times longer), but it will tell you precisely which
> > lines of the C code caused the segfault, and whether it was reading or
> > writing to an invalid memory address at the time! Its output is a bit
> > like a stack trace in Python.
>
> > Note you can actually do all this with a Sage doctest, because after
> > all, Sage is just a program you run from the command line.
>
> > Once you find out which lines of C code the segfault occurs at, you
> > can put a trace in to see whether the data being fed to the relevant
> > function is valid or not. This tells you whether the library is at
> > fault or your higher-level Python/Cython code is somehow responsible
> > for feeding it invalid data (e.g. some C object wasn't initialised).
>
> > Once upon a time, Michael Abshoff used to valgrind the entire Sage
> > test suite and fix all the myriad bugs that showed up!
>
> > So valgrind is the bug hunter's friend.
>
> > A second observation, made by Leif I think, is spot on. This all quite
> > possibly shows up a problem with insufficient doctesting in Sage.
>
> > Now the MPIR test code is pretty extensive and really ought to have
> > picked up this bug.
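[Editor's note: the quoted workflow can be sketched as shell commands. This is a minimal illustration, not from the original post; the program name `./myprog` and the doctested file name are hypothetical placeholders.]

```shell
# Rebuild the C library with debugging symbols; -O0 keeps the reported
# line numbers accurate (a hypothetical autotools-style build is assumed).
export CFLAGS="-g -O0"
./configure && make

# Run the binary under valgrind instead of directly. The program itself
# needs no changes; valgrind instruments it at run time and, on a
# segfault, prints the offending C source lines and whether the bad
# access was a read or a write.
valgrind --track-origins=yes ./myprog

# Since Sage is just a command-line program, a doctest run can be
# wrapped the same way (invocation illustrative):
valgrind sage -t relevant_file.py
```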
> > We put a lot of time into the test code for that MPIR release, so this
> > is unfortunate.
>
> > However, the entire Pari test suite and the entire Sage test suite
> > (with an older version of Pari) passed without picking up this pretty
> > serious bug in the MPIR division code!
>
> > I think this underscores something I have been saying for a long time:
> > Sage doesn't test the C libraries it uses well enough. As a result, it
> > is taking inordinate amounts of developers' time to track down bugs
> > turned up by Sage doctests when spkgs are updated. In some cases there
> > is actually woefully inadequate test code in the C library itself. But
> > even when this is not the case, it makes sense for Sage to do some
> > serious testing before assuming the library is bug-free. This is
> > particularly easy to do in Python, and much harder to do at the level
> > of the C library itself, by the way.
>
> > I have been saying this for a very long time, to many people. *ALL*
> > mathematical libraries are broken and contain bugs. If you don't test
> > the code you are using, it *is* broken. The right ratio of test code
> > to code is really pretty close to 50/50. And if you think I don't do
> > this myself when I write code (even Sage code), well, you'd be wrong.
>
> > One solution would be for everyone to test more widely. If you write
> > code that depends on feature Y of module X, and module X doesn't
> > properly test feature Y, assume it is broken and write doctests for
> > that code as well as for the code you are writing yourself.
>
> > To give an example, Andy Novocin and I have been working on new
> > polynomial factoring code in FLINT for a couple of years now. Around 6
> > months ago we had a long test of some 100,000 or so polynomials
> > factoring correctly. We also had a long test of some 20-odd very
> > difficult polynomials factoring correctly.
> > Thus there was no reason at all to suppose there were *ANY* bugs in
> > the polynomial factoring code or any of the functions it made use of.
> > By Sage standards I think this is an insane level of testing.
>
> > But I insisted that every function we have written have its own test
> > code. This has meant 6 months more work (there was something like
> > 40,000 lines of new code to test). But I cannot tell you how many
> > serious new bugs (and performance problems too) we turned up. There
> > must be dozens of serious bugs we've fixed, many of which would have
> > led to incorrect factorisations of whole classes of polynomials.
>
> > The lesson for me was: just because my very extensive 5 or 6 doctests
> > passed for the very complex new functionality I added does not mean
> > there aren't incredibly serious bugs in the underlying modules I used,
> > nor does it mean that my new code is worth printing out and using as
> > toilet paper.
>
> > Detecting bugs in Sage won't make Sage a viable alternative to the
> > MA*'s (that's a whole other thread). After all, testing standards in
> > those other packages are quite possibly much worse. But testing more
> > thoroughly will mean less time wasted trying to track down bugs in an
> > ad hoc manner, and eventually much more time available for addressing
> > those issues that are relevant to becoming a viable alternative.
>
> A long way to go... ;-)
>
> I don't think people would like a complete feature (and perhaps
> component upgrade) freeze for e.g. 6 months.

This seems to me to be a self-compounding issue. The longer these
package upgrades are put off, the more difficult the problems become.
We're talking nearly 18 months since some spkgs have been updated.
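[Editor's note: the "write doctests even for the features you merely depend on" advice quoted above can be illustrated with a minimal Python sketch. The function and values here are hypothetical, not from FLINT or Sage.]

```python
import doctest

def factor_squarefree(n):
    """
    Return the sorted prime factors of a squarefree positive integer n.

    Even though the arithmetic this relies on is "trusted", we still pin
    down the exact cases our own code depends on, including the easy-to-
    forget edge cases:

    >>> factor_squarefree(30)
    [2, 3, 5]
    >>> factor_squarefree(1)
    []
    >>> factor_squarefree(7)
    [7]
    """
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

if __name__ == "__main__":
    # Running the module executes the doctests; a failure is reported
    # just like any other test failure.
    doctest.testmod()
```

The point is not the trial division itself but the habit: every function, however obviously correct, carries its own executable examples, so upgrading a dependency immediately exercises the behaviour you assumed.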
I'm glad we're getting around to it, but what I think I'm seeing is the
following work flow:

1) Try to build the whole of Sage on top of package X (without
   necessarily having updated the packages X relies on, or testing
   package X in isolation on the supported platforms)
2) Bizarre doctest failures result
3) Find some bug, report it upstream and pull package X back out
4) Upstream fixes the bug (if upstream even exists)
5) The package doesn't get put in again, because new work on package X
   in the meantime might potentially cause more failures
6) Long delay whilst everyone involved licks their wounds
7) Rinse, lather, repeat

I don't feel like that is a sustainable model. I don't wish to appear
mean, but the (fairly humorous) image that comes to mind is that we're
taping a whole pile of old vacuum tubes and radio parts together and
telling people it's a Mercedes-Benz. There doesn't seem to be a coherent
plan to rationalise the core of Sage, modularise, or meet the objectives
Sage has. All I see in the future is more taping together of ever more
broken bits and pieces and trying to make it all cooperate.

> But there's work in progress to at least better support more automatic
> testing on a wide(r) variety of platforms and systems. If we get new
> weird doctest or build errors with every (pre-)release, there remains
> little time to solve problems a long time in.

This seems to have happened since the very first Sage releases. I don't
remember a time when there weren't new, unusual bugs showing up on
platform X.Y.Z. To me, this speaks of a need to break the process down a
little.

Most supported platforms surely don't cause us regular problems, because
developers are testing on those platforms anyway. So initially release a
Sage which is fine on those "easy" and most widely used platforms. Then,
for each other supported platform X, have a group (not the same one doing
the initial release) tasked with getting it running on platform X.
They maintain their own repo, apply their own patches and release their
own "platform X certified Sage" when they are good and ready. Eventually
they feed their patches back into the "main" Sage, so as to keep the
number of patches they have to regularly apply to a minimum. This model
would prevent all the pointless toing and froing I see because something
or other doesn't pass on Solaris, or whatever. You can't seriously expect
to do all this on the day of release! Lots of people are just gonna get
burned out if we keep demanding that of them.

When GCC releases a new compiler, it isn't instantly available on
Ubuntu, and certainly not on Solaris. I don't see why Sage should be any
different.

And we need to put aside the naive notion that if Sage passes its
doctests on all supported platforms then somehow it is substantially
more reliable. There are probably 100,000 bugs in Sage. If three
doctests fail, big whoop. If they all pass on platform X, that should be
enough for platform X. (You will be able to tell when Sage doesn't have
100,000 bugs, because you'll be able to afford to offer a $3300 bounty
for fixing critical bugs. :-)

Another big issue is modularity. Different groups should be responsible
for different parts of Sage. There should be a "core Sage" community, a
graph theory community, a symbolics community, a number theory
community, etc. (with some overlap of membership, obviously). Each
community should be able to test, update and release their part of Sage
independently of the other parts.

I've argued for this for years. No one has ever agreed with me. No large
project can ultimately succeed without modularity. This is why module
systems were added to every serious programming language: it was
realised that modularity was essential to the success of large projects
(small projects didn't need it, and neither do toy languages). Sage has
been showing the strain for over 2 years now.

Modularity would also help things like porting to new platforms.
E.g. how could anyone think Sage will ever become a viable alternative
to the MA*'s if it never runs natively on Windows? But we'll never
achieve that if there isn't something smaller than Sage to port before
you port everything else. Modularity is the key to this.

Finally, testing each individual spkg (against its dependencies) on all
supported platforms *before* having to download and build the whole of
the latest Sage seems to me a logical first step. I'm not seeing even
that happen at the moment. This again is a kind of modularity. If the
new Pari doesn't even build on Solaris, what is the point of spending a
whole day building and testing the whole of Sage on Solaris?

And if Pari doesn't even have a comprehensive test suite and a new
stable release, I'm not getting why we are even using it the way we are.
We surely need to be much more sceptical about it and test the hell out
of it before trying to put it into Sage. OK, it's in now, but is it
really worth doing it that way again in the future?

And thank goodness ticket #4000 finally got closed. I haven't even sat
down to try and analyse what went on there. There *has* to be a lesson
or two to learn from that process.

> -Leif

--
To post to this group, send an email to sage-devel@googlegroups.com
To unsubscribe from this group, send an email to
sage-devel+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/sage-devel
URL: http://www.sagemath.org