softworkz . (2025-04-22):
> At the implementation level, I sensed that "you" ("FFmpeg")
> are following some principles which are somewhat contradictive to 
> those that I'm usually adhering to (e.g. "parameter validation
> being a responsibility of the call site, crashing otherwise 
> being acceptable"). Nonetheless, I'm the one who has to adapt,
> and I'm not going to question that.

I am feeling didactic, so I will write what I have to say on the topic
in full detail, so that it can also serve as a future reference.

Crashing immediately when some unexpected condition happens is not a
quirk of a few FFmpeg developers, it is very often the only way to do
things, and often better than any available alternative. Knowing what to
do when crashing is only one option among others is not always obvious,
but this is FFmpeg, not some management software project developed by
engineers fresh from an average school.

Consider the simplest mathematical operation that can fail, a division:

        a = b / c;

What happens if c is 0?

It depends: on the language, the CPU, the build options, possibly the
phases of the Moon. But mostly, what happens lives between two extremes:

- it crashes immediately, sometimes even with a hardware exception
  generated by the CPU;

- it returns some kind of cheesy not-a-number value that will
  contaminate all computations done with a, leading to silently
  corrupted output.

Intermediate cases can be: return a not-a-number but something later
tests it and crashes; or: return not a number and be later conditionally
discarded to produce corrupt output only once in a while. For the sake
of simplicity, let us focus on the extreme cases.
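To make the two extremes concrete, here is a minimal C sketch (the
function name is mine, and it assumes IEEE 754 doubles):

```c
#include <math.h>

/* With floating point, C picks the second extreme: no crash, just a
 * cheesy not-a-number value that poisons every later computation. */
static double poisoned_average(double sum, double count)
{
    double a = sum / count;      /* count == 0.0 gives +inf, silently */
    return (a + 1.0) / 2.0;      /* the bogus value contaminates the result */
}
/* An integer division sum_i / count_i with count_i == 0 is the first
 * extreme instead: undefined behavior, typically a SIGFPE on x86. */
```

The corrupted value then flows through every arithmetic operation done
with it, which is exactly the silent-corruption case.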

There are two things to say here:

- both are very bad;

- silently corrupting the output is the worst.

(There are languages where division by zero has a precise and
mathematically consistent semantic that a very careful programmer could
make use of. That does not invalidate my point.)

Both are bad: that means that whenever we write a division, we must be
sure that the divisor is not zero. Sometimes, it is very easy:

        unsigned n = 0;
        do { n++; } while (*(p++));
        a = b / n;

Loops of the do…while kind are always executed at least once, therefore
we know that n ≠ 0 at the end. (Did you notice the flaw in that
reasoning? I will come back to it later.)

Sometimes, it is just a matter of testing:

        if (nb_notes == 0)
            die("Why are you asking me to compute an average when you "
                "have not yet entered any note, you idiot?");

And sometimes it relies on a 10-pages mathematical proof that the
algorithm we are implementing is correct.

But hard or easy, it is always the responsibility of the programmer to
know that the divisor is not zero before coding a division.

This is not unique to divisions:

We must know a pointer points to a valid object before dereferencing it.

We must know a pointer points to writable memory before writing to it.

We must know an index is within the boundaries of an array before
accessing it.

And this is not limited to C and the other languages FFmpeg is fond of:
write into a cell beyond the end of an array in C, and it will let you
modify something else entirely, which might let the spooks or
cryptolockers in; do it in Java or OCaml, and it will trigger an
exception that we probably neglected to catch, because if we had thought
of catching it we would probably have written the code more elegantly.

The conclusion of this section is this:

Any programmer worth their salt must know a lot of invariants about each
line of their code, invariants that are not written into the code but
that they would be able to express and prove if asked to. “At this
point, this number is not zero”; “at this point, this pointer comes from
malloc() or realloc()”, “at this point there are no other references to
this object”, “at this point this block of memory has a 0 before the
end” and so on.

Failure to account for the invariants required by the constructs used in
the program will result in the program misbehaving in more or less
harmful ways.
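Those unwritten invariants can of course be spelled out as asserts at
the places where they hold. A small sketch, with a hypothetical helper
of my own invention:

```c
#include <assert.h>
#include <string.h>

/* The invariants described above, made explicit where they hold. */
static double average_byte(const char *s)
{
    assert(s != NULL);       /* "this pointer points to a valid object" */
    size_t len = strlen(s);  /* "this block of memory has a 0 before the end" */
    assert(len > 0);         /* "at this point, this number is not zero" */
    double sum = 0.0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)s[i];
    return sum / (double)len;  /* divisor proved non-zero just above */
}
```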


Now about function calls and API.

As I said, constructs in the program have invariant requirements: if we
divide, it must not be by zero. These requirements propagate backwards
along the flow of the program: if a length l must not be zero, then the
string it measures must not be empty, therefore the list from which that
string is built must not itself be empty, and so on.

A program behaves correctly if, at all places in the code flow, the
invariants guaranteed by the earlier parts are at least as strict as the
invariant requirements imposed by the later parts.

In a complete program, functions are just roundabouts in the program
flow. As such, like any place in the program, they have invariant
requirements. But if we think of functions as separating different parts
of a program that are under the responsibility of different people
(or the same person at different times), these invariant requirements
must be documented.

Sometimes, it is very subtle: “The strlen function computes the length
of the string pointed to by s”: where does it say s must not be a NULL
pointer? Not explicitly, but it is implied by “string pointed to”,
because NULL does not point to anything, let alone a string.

What happens if we call strlen() on a NULL pointer? That depends. On
Unix systems, it usually results in a SIGSEGV; something similar will
happen with other memory-protected OSes. But on a microcontroller, it
will just compute how many non-0 octets there are at this place in the
memory. Whatever the effect, it misbehaves.


What about mistakes?

Even excellent programmers, programmers worth rose Himalayan salt, can
make mistakes. For example, in the code I gave above as an example where
it is easy to be sure that n is not zero, what happens on x86_64 if p
points to four giga-octets of non-zero values and then a 0? Oops.
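The flaw: on x86_64, unsigned is 32 bits, so after four giga-octets of
non-zero values n wraps around to 0 and the division blows up after all.
One possible repair, sketched with a hypothetical helper name of mine:
count with size_t, which cannot wrap before the string itself outgrows
the address space.

```c
#include <stddef.h>

/* Same loop as above, but n has the same width as the address space. */
static double per_octet(double b, const char *p)
{
    size_t n = 0;
    do { n++; } while (*(p++));
    return b / (double)n;    /* n >= 1: the do…while ran at least once */
}
```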

In the ideal world, that would be impossible. In the ideal world, the
compiler would refuse to compile a program unless it can prove the
required invariants are met, with the help of annotations helpfully
provided by the programmer.

This is what Rust tried to do and lamentably failed at.

(That would leave out programs used to explore unprovable mathematical
properties. I am perfectly fine letting them use some kind of
“ignore-invariant” pragma, as long as it is not used in a program that
might crash a space probe, explode a nuclear plant or, worse, make a
game unwinnable.)

In the real world, we are not there yet, by far. Yet we must decide what
to do about all these invariants. In most of the places, we do nothing,
because the road is straight and if we are speeding, the speed bump a
hundred meters from now will trash us. But it is still useful to add our
own speed bumps and traffic lights at strategic places.

The worst thing we could do is pump the brakes before the speed bump. It
is called “defensive programming”: “at this point, this value should not
be negative, let us check if it is and make it 0 if necessary, the show
must go on!” Well, it is not a show, it is a program producing a
result, and if it goes on after breaking one of its invariant
requirements it will almost certainly produce an invalid result. That is
evil.
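A sketch of that anti-pattern, with hypothetical names of my own,
against the alternative:

```c
#include <assert.h>

/* Defensive programming: clamp the broken value and keep going. */
static int area_defensive(int w, int h)
{
    if (w < 0) w = 0;        /* "the show must go on" */
    if (h < 0) h = 0;
    return w * h;            /* a negative w now yields 0: corrupt, silent */
}

/* Preferred: crash at the bug site, not two hundred lines later. */
static int area_strict(int w, int h)
{
    assert(w >= 0 && h >= 0);
    return w * h;
}
```

The defensive version turns a programming error into quietly wrong
output; the strict one makes the mistake visible where it happened.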


But do we add traffic lights or a speed bump? My traffic analogy breaks
down here, so I will stop using it. Do we do nothing, do we add an
assert, or do we return an error?

But first, a short point about asserts. Logically, an assert and doing
nothing are the same, because asserts are often removed at build time
(by the way, I hope all FFmpeg developers remember to always work with
the maximum assert level). And because “crashing right now” is one case
of “anything can happen”. But asserts, when they are enabled, make
mistakes easier to debug, because they make their consequences more
visible earlier.

So: do nothing / assert, or return an error? This is the point where
there is no longer an easy answer. It requires weighing the pros and the
cons. The only thing I can do is make a list of a few considerations
that affect the weighing of those pros and cons.

Mostly, I will talk about code wrapped inside a function, with the
caller possibly under the responsibility of a different programmer.

One of the most important ones is: Is there already a channel to return
an error?

If the caller is already expected to check for an error return due to
conditions outside of the control of any programmer (the hard drive
died!), then using it to signal a programming error is an easy option.
On the other hand, if the function has been of type void for ten years,
changing it to return an error now is almost certainly not a good idea,
because a lot of code already calls it without checking for an error.

It is even worse if the function used to return a positive value that
was not supposed to be an error, and we now want to return an error with
a distinct set of values (negative AVERROR codes): we are in the “cheesy
not-a-number value” case and risk contaminating the output.
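A sketch of that trap, with a hypothetical function of my own invention
that has always returned a non-negative count:

```c
#include <errno.h>

/* Retrofitting negative error codes onto a count-returning function
 * turns those codes into "cheesy not-a-number values" for every caller
 * that does arithmetic on the result without checking. */
static int count_positive(const int *v, int n)
{
    if (!v)
        return -EINVAL;      /* new error path that old callers never test */
    int count = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > 0)
            count++;
    return count;
}
```

An old caller doing total += count_positive(...) now silently subtracts
EINVAL from its total.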

Another important one is: How easy is it to ensure the requirement is
met?

“Pointer must not be null” is one of the easiest to ensure, and as a
consequence it is one that is very often completely neglected, relying
on the OS sending a SIGSEGV.

“Flags must be a binary OR of the following constants” is another very
easy one: just do not do any fancy arithmetic with the value that will
end up in the flags. If we only use the designated constants and binary
arithmetic, we are guaranteed to be good. If not, what are we doing
exactly?
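For instance, with hypothetical flag constants of mine:

```c
/* The invariant stays true by construction if the value is only ever
 * built by ORing the designated constants. */
#define DEMO_FLAG_A   0x01
#define DEMO_FLAG_B   0x02
#define DEMO_FLAG_C   0x04
#define DEMO_FLAG_ALL (DEMO_FLAG_A | DEMO_FLAG_B | DEMO_FLAG_C)

static int flags_valid(int flags)
{
    return !(flags & ~DEMO_FLAG_ALL);  /* no stray bits outside the set */
}
```

The check itself is a single AND against the complement of the union of
all valid bits, cheap enough for an assert.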

On the other hand, “expr must point to a string with a syntactically
valid expression”: I cannot check the syntax of expr without duplicating
a large part of the code I am about to call, and sometimes I might want
to use an expression that comes from the outside.

Useful rule of thumb: What would the C language do?

When writing a new function, we can try to imagine how it would work if
it were a standard C function, and for that look at similar cases in the
libc. For example, many many C functions say “undefined behavior” if
NULL is passed where NULL is not expected. Whenever the answer to “what
would the C language do” contains the words “undefined behavior”, then
the best choice is to do nothing, possibly with an assert.

System calls, on the other hand, are not a very good rule of thumb: they
are heavily skewed towards “everything returns an error code, it is a
grave mistake to ignore it”; they are heavily burdened by the need for
compatibility with existing code and other systems; and the consequences
of doing nothing and just letting the dice fall would be catastrophic.


To summarize: Deciding what to do when invalid values are met requires
properly understanding the invariant requirements of the program and
doing a cost-analysis of the options, including whether the rest of the
code can easily enforce those invariants and whether it can do something
graceful with an error.

Advice addressed at more beginner programmers tends to be more
one-size-fits-all, and as such inefficient. And often more dogmatic.

Regards,

-- 
  Nicolas George
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
