Hello, I have a proof-of-concept implementation of floating-point printf() which is 4 to 6 times faster than the glibc printf() in common use cases while producing exactly the same output (including rounding, etc.). I would like to contribute it to libstdc++ as an implementation of the floating-point num_put facet.
Contributing this directly to glibc may look like a better idea. However, in light of this comment in locale_facets.tcc [1], and since my implementation happens to be in C++, I decided to ask here first. If you are interested, please read on :-)

The printf() implementation in glibc is comparatively inefficient because its internal scaling computations are always lossless, even though in many common cases limited precision would be enough. Also, since an arbitrary-precision library is used and the precision is not known beforehand, there are very few inlining opportunities.

The approach I've taken is to perform the computation in a fixed precision that is sufficient to represent a usable number of digits (e.g. 96 bits for double and 64 bits for float). The computation is imprecise, which makes correct rounding impossible in general. Fortunately, we can estimate an upper bound on the error in the final value from which the digits are extracted. If that value turns out to be so close to a rounding boundary that the error could affect the rounding decision, we abort and use the regular, exact printf() instead.

It is easy to see that this approach is only valid if the requested precision* is sufficiently small. This is handled by checking the precision up front and using the libc printf() if it is too large. Fortunately, most of the time the requested precision is not much larger than DBL_DECIMAL_DIG, so the faster formatter is still usable.

* - here 'precision' means the total number of requested significant digits.

Unfortunately, the faster formatter can actually be slower in the cases where it detects only in the rounding step that it doesn't have enough precision, because the work done up to that point is wasted. This is mitigated by accepting fewer digits in the initial precision check than the internal precision could in principle support. For example, the current implementation accepts requests of at most 21 digits while the internal precision corresponds to roughly 24 digits. As a result, libc printf() is called from the rounding step in only about 0.1% of cases, which doesn't impact the overall performance much.

Here is the actual performance comparison on an Intel Wolfdale CPU (64-bit mode). The numbers are seconds that the program spent in user mode. In both cases 20 million doubles, roughly exponentially distributed in the quoted ranges, were formatted. The null_* cases are identical to the regular ones except that the call to the formatting function is omitted.

* 1e-30 .. 1e30, -1e-30 .. -1e30

  cf time:        4.01
  libc time:     16.93
  null_cf time:   0.58
  null_libc time: 1.06

* 1e-300 .. 1e300, -1e-300 .. -1e300

  cf time:        4.37
  libc time:     26.71
  null_cf time:   0.59
  null_libc time: 1.02

I've tested 1 billion conversions of positive and negative numbers in the 1e-300 .. 1e300 range to get a rough confirmation that the conversion is correct; there were no failing cases. As far as I can see there are no fundamental problems with the approach, so even if some failures are detected in the future, it should be possible to fix them by adjusting the internal parameters or increasing the precision. A thorough mathematical investigation should still be done to verify correctness, though.

The code is here: https://github.com/p12tic/floatfmt

The Makefile contains some documentation:

* 'make test' produces speed results
* 'make testout' produces several files for output comparison

I would like to stress that all of this is a proof of concept. The code is not pretty or well documented, and there are a lot of quirks and bugs that I haven't bothered to fix yet. I've tested only double-precision numbers, only the format "%.17e", and only on a 64-bit x86 host.
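To make the structure of the fast path more concrete, here is a minimal sketch. It is not the code from the repository: the names (fast_format_e, kMaxFastDigits, kRelError, format_double_e) are made up, long double stands in for the 96-bit fixed-point arithmetic, and the error bound is deliberately crude. What it does show is the dispatch shape described above: try the imprecise fast path, and punt to the exact libc conversion whenever the requested precision is too large or the rounding decision is in doubt.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstring>

namespace {

// With long double (64-bit significand on x86) and the naive scaling loop
// below, only about a dozen digits leave a comfortable error margin; the
// real code uses 96-bit fixed point and precomputed powers of ten to get
// to 21 digits.
const int kMaxFastDigits = 12;          // larger requests go straight to libc
const long double kRelError = 1e-17L;   // assumed bound on the relative error
                                        // accumulated in the working value

// Try to produce the same output as snprintf(buf, size, "%.*e", prec, v).
// Returns false whenever the rounding decision cannot be made safely, in
// which case the caller must fall back to the exact libc path.
bool fast_format_e(char* buf, std::size_t size, int prec, double v)
{
    if (prec < 0 || prec > kMaxFastDigits || !std::isfinite(v) || v == 0.0)
        return false;                   // keep the sketch simple: punt

    long double x = std::fabs((long double)v);

    // Normalize into [1, 10) and record the decimal exponent.  A real
    // implementation replaces this loop with table-driven fixed-point
    // multiplications.
    int exp10 = 0;
    while (x >= 10.0L) { x /= 10.0L; ++exp10; }
    while (x < 1.0L)   { x *= 10.0L; --exp10; }

    // Extract prec+1 leading digits; x ends up holding the tail in [0, 10).
    char digits[kMaxFastDigits + 2];
    for (int i = 0; i <= prec; ++i) {
        int d = (int)x;
        digits[i] = (char)('0' + d);
        x = (x - d) * 10.0L;
    }

    // If the tail is so close to 5 (the round-to-nearest boundary) that the
    // accumulated error could flip the decision, give up.  This bound is a
    // rough estimate with a safety factor; the real thing needs a rigorously
    // derived one.
    if (std::fabs(x - 5.0L) < 10.0L * kRelError * std::pow(10.0L, prec + 1))
        return false;

    // Exact ties always hit the fallback above, so a plain comparison is
    // enough here.
    if (x >= 5.0L) {                    // round up, propagating carries
        int i = prec;
        while (i >= 0 && digits[i] == '9')
            digits[i--] = '0';
        if (i < 0) { digits[0] = '1'; ++exp10; }  // e.g. 9.99... -> 1.00e+01
        else ++digits[i];
    }

    int n;
    if (prec == 0)
        n = std::snprintf(buf, size, "%s%ce%+03d",
                          v < 0 ? "-" : "", digits[0], exp10);
    else
        n = std::snprintf(buf, size, "%s%c.%.*se%+03d",
                          v < 0 ? "-" : "", digits[0], prec, digits + 1, exp10);
    return n > 0 && (std::size_t)n < size;
}

} // namespace

// Public entry point: fast path first, exact libc formatting as a fallback.
int format_double_e(char* buf, std::size_t size, int prec, double v)
{
    if (fast_format_e(buf, size, prec, v))
        return (int)std::strlen(buf);
    return std::snprintf(buf, size, "%.*e", prec, v);
}

The real formatter differs in every detail that matters for performance (fixed-point arithmetic, precomputed powers of ten, buffer handling), but the fast-path/fallback split is the same.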
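For completeness, this is roughly how benchmark inputs like the ones above can be generated (again just a sketch, with an arbitrary function name and seed): doubles whose magnitudes are log-uniformly distributed between the two bounds, with random sign.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n doubles whose magnitudes are roughly exponentially
// (log-uniformly) distributed between 10^lo_exp and 10^hi_exp, half negated.
std::vector<double> make_test_values(std::size_t n, double lo_exp, double hi_exp)
{
    std::mt19937_64 rng(12345);                        // fixed seed, arbitrary
    std::uniform_real_distribution<double> exp_dist(lo_exp, hi_exp);
    std::bernoulli_distribution negate(0.5);

    std::vector<double> values;
    values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        double x = std::pow(10.0, exp_dist(rng));      // pick the exponent uniformly
        values.push_back(negate(rng) ? -x : x);
    }
    return values;
}

// e.g. make_test_values(20000000, -300.0, 300.0) for the second range above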
Are the libstdc++ maintainers interested in the idea? The implementation could be adapted to include a version that uses an arbitrary-precision arithmetic library, so that the dependency on printf() could be removed entirely, along with the support code.

If the answer to the above question is yes, should I prepare a patch myself, or would one of the GCC developers prefer to adapt my idea? Please bear in mind that I'm new to GCC standard library development, so it may take some time until I have a patch that satisfies the library's internal standards. It may also take some of the developers' time to explain the "accepted way of doing things" to me :-)

Regards,
Povilas

[1]: libstdc++-v3/include/bits/locale_facets.tcc:

// The following code uses vsnprintf (or vsprintf(), when
// _GLIBCXX_USE_C99 is not defined) to convert floating point values
// for insertion into a stream. An optimization would be to replace
// them with code that works directly on a wide buffer and then use
// __pad to do the padding. It would be good to replace them anyway
// to gain back the efficiency that C++ provides by knowing up front
// the type of the values to insert. Also, sprintf is dangerous
// since may lead to accidental buffer overruns. This
// implementation follows the C++ standard fairly directly as
// outlined in 22.2.2.2 [lib.locale.num.put]