Hello, I have a proof-of-concept implementation of floating-point printf() which is 4 to 6 times faster than the glibc printf() in common use cases while producing exactly the same output (including rounding, etc.). I would like to contribute it to libstdc++ as an implementation of the floating-point num_put facet.
Contributing this directly to glibc may look like a better idea. However, in light of this comment in locale_facets.tcc [1], and since my implementation happens to be in C++, I decided to ask here first. If you are interested, please read on :-)

The printf() implementation in glibc is comparatively inefficient because its internal scaling computations are always lossless, even though in many common cases limited precision would be enough. Also, since an arbitrary-precision library is used and the precision is not known beforehand, there are very few inlining opportunities.

The approach I've taken is to perform the computation in a fixed precision that is sufficient to represent a usable number of digits (e.g. 96 bits for double and 64 bits for float). The computation is imprecise, which makes correct rounding impossible in general. Fortunately, we can estimate an upper bound on the error in the final value from which the digits are extracted. If that value turns out to be so close to a rounding boundary that the error could affect the rounding decision, we abort and use the regular, exact printf() instead.

It is easy to see that this approach is only valid if the requested precision* is sufficiently small. This is handled by checking the precision up front and using the libc printf() if it is too large. Fortunately, most of the time the requested precision is not much larger than DBL_DECIMAL_DIG, so the faster formatter is still usable.

* - here 'precision' means the total number of requested significant digits.

Unfortunately, the faster formatter can actually be slower in the cases where it detects only in the rounding step that it doesn't have enough precision, because the work done up to that point is wasted. This is mitigated by accepting fewer digits in the initial precision check than the internal precision could in principle support. For example, the current implementation accepts requests of at most 21 digits while the internal precision corresponds to roughly 24 digits. As a result, libc printf() is called from the rounding step in only about 0.1% of cases, which doesn't impact the overall performance much.

Here is the actual performance comparison on an Intel Wolfdale CPU (64-bit mode). The numbers are seconds that the program spent in user mode. In both cases 20 million doubles, roughly exponentially distributed in the quoted ranges, were formatted. The null_* cases are identical to the regular ones except that the call to the formatting function is omitted.

* 1e-30 .. 1e30, -1e-30 .. -1e30

  cf time:        4.01
  libc time:     16.93
  null_cf time:   0.58
  null_libc time: 1.06

* 1e-300 .. 1e300, -1e-300 .. -1e300

  cf time:        4.37
  libc time:     26.71
  null_cf time:   0.59
  null_libc time: 1.02

I've tested 1 billion conversions of positive and negative numbers in the 1e-300 .. 1e300 range to get a rough confirmation that the conversion is correct; there were no failing cases. As far as I can see there are no fundamental problems with the approach, so even if some failures are detected in the future, it should be possible to fix them by adjusting the internal parameters or increasing the precision. A thorough mathematical investigation should still be done to verify correctness, though.

The code is here: https://github.com/p12tic/floatfmt

The Makefile contains some documentation:

* 'make test' produces speed results
* 'make testout' produces several files for output comparison

I would like to stress that all of this is a proof of concept. The code is not pretty or well documented, and there are a lot of quirks and bugs that I haven't bothered to fix yet. I've tested only double-precision numbers, only the format "%.17e", and only on a 64-bit x86 host.
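To make the structure of the fast path more concrete, here is a minimal sketch. It is not the code from the repository: the names (fast_format_e, kMaxFastDigits, kRelError, format_double_e) are made up, long double stands in for the 96-bit fixed-point arithmetic, and the error bound is deliberately crude. What it does show is the dispatch shape described above: try the imprecise fast path, and punt to the exact libc conversion whenever the requested precision is too large or the rounding decision is in doubt.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstring>

namespace {

// With long double (64-bit significand on x86) and the naive scaling loop
// below, only about a dozen digits leave a comfortable error margin; the
// real code uses 96-bit fixed point and precomputed powers of ten to get
// to 21 digits.
const int kMaxFastDigits = 12;          // larger requests go straight to libc
const long double kRelError = 1e-17L;   // assumed bound on the relative error
                                        // accumulated in the working value

// Try to produce the same output as snprintf(buf, size, "%.*e", prec, v).
// Returns false whenever the rounding decision cannot be made safely, in
// which case the caller must fall back to the exact libc path.
bool fast_format_e(char* buf, std::size_t size, int prec, double v)
{
    if (prec < 0 || prec > kMaxFastDigits || !std::isfinite(v) || v == 0.0)
        return false;                   // keep the sketch simple: punt

    long double x = std::fabs((long double)v);

    // Normalize into [1, 10) and record the decimal exponent.  A real
    // implementation replaces this loop with table-driven fixed-point
    // multiplications.
    int exp10 = 0;
    while (x >= 10.0L) { x /= 10.0L; ++exp10; }
    while (x < 1.0L)   { x *= 10.0L; --exp10; }

    // Extract prec+1 leading digits; x ends up holding the tail in [0, 10).
    char digits[kMaxFastDigits + 2];
    for (int i = 0; i <= prec; ++i) {
        int d = (int)x;
        digits[i] = (char)('0' + d);
        x = (x - d) * 10.0L;
    }

    // If the tail is so close to 5 (the round-to-nearest boundary) that the
    // accumulated error could flip the decision, give up.  This bound is a
    // rough estimate with a safety factor; the real thing needs a rigorously
    // derived one.
    if (std::fabs(x - 5.0L) < 10.0L * kRelError * std::pow(10.0L, prec + 1))
        return false;

    // Exact ties always hit the fallback above, so a plain comparison is
    // enough here.
    if (x >= 5.0L) {                    // round up, propagating carries
        int i = prec;
        while (i >= 0 && digits[i] == '9')
            digits[i--] = '0';
        if (i < 0) { digits[0] = '1'; ++exp10; }  // e.g. 9.99... -> 1.00e+01
        else ++digits[i];
    }

    int n;
    if (prec == 0)
        n = std::snprintf(buf, size, "%s%ce%+03d",
                          v < 0 ? "-" : "", digits[0], exp10);
    else
        n = std::snprintf(buf, size, "%s%c.%.*se%+03d",
                          v < 0 ? "-" : "", digits[0], prec, digits + 1, exp10);
    return n > 0 && (std::size_t)n < size;
}

} // namespace

// Public entry point: fast path first, exact libc formatting as a fallback.
int format_double_e(char* buf, std::size_t size, int prec, double v)
{
    if (fast_format_e(buf, size, prec, v))
        return (int)std::strlen(buf);
    return std::snprintf(buf, size, "%.*e", prec, v);
}

The real formatter differs in every detail that matters for performance (fixed-point arithmetic, precomputed powers of ten, buffer handling), but the fast-path/fallback split is the same.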
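For completeness, this is roughly how benchmark inputs like the ones above can be generated (again just a sketch, with an arbitrary function name and seed): doubles whose magnitudes are log-uniformly distributed between the two bounds, with random sign.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n doubles whose magnitudes are roughly exponentially
// (log-uniformly) distributed between 10^lo_exp and 10^hi_exp, half negated.
std::vector<double> make_test_values(std::size_t n, double lo_exp, double hi_exp)
{
    std::mt19937_64 rng(12345);                        // fixed seed, arbitrary
    std::uniform_real_distribution<double> exp_dist(lo_exp, hi_exp);
    std::bernoulli_distribution negate(0.5);

    std::vector<double> values;
    values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        double x = std::pow(10.0, exp_dist(rng));      // pick the exponent uniformly
        values.push_back(negate(rng) ? -x : x);
    }
    return values;
}

// e.g. make_test_values(20000000, -300.0, 300.0) for the second range above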
Are the libstdc++ maintainers interested in the idea? The implementation could be adapted to include a version that uses an arbitrary-precision arithmetic library, so that the dependency on printf() could be removed entirely, along with the support code.

If the answer to the above question is yes, should I prepare a patch myself, or would one of the GCC developers prefer to adapt my idea? Please bear in mind that I'm new to GCC standard library development, so it may take some time until I have a patch that satisfies the library's internal standards. It may also take some of the developers' time to explain the "accepted way of doing things" to me :-)

Regards,
Povilas

[1]: libstdc++-v3/include/bits/locale_facets.tcc:

// The following code uses vsnprintf (or vsprintf(), when
// _GLIBCXX_USE_C99 is not defined) to convert floating point values
// for insertion into a stream. An optimization would be to replace
// them with code that works directly on a wide buffer and then use
// __pad to do the padding. It would be good to replace them anyway
// to gain back the efficiency that C++ provides by knowing up front
// the type of the values to insert. Also, sprintf is dangerous
// since may lead to accidental buffer overruns. This
// implementation follows the C++ standard fairly directly as
// outlined in 22.2.2.2 [lib.locale.num.put]