Skip to content

More principled handling of float<->string #1785

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Oct 12, 2017

This refactor comes after a belated realization that our numeric conversion routines (principally from_string<>) are "locale-dependent" via their use of strtod, as well as atof() scattered through the code.

In particular, this means that when software using this code is running with a locale in which the decimal mark is something other than '.' (e.g., many European locales that use ',' for decimal mark), these functions will incorrectly parse text floating point numbers that use the standard period as decimal mark. For example, if the locale is "fr_FR", atof("123.45") will return 123.0, rather than the correct 123.45.

At the core of our fix is the introduction of Strutil::strtof/strtod which by default is "locale-independent" (i.e. always uses '.') and also has varieties that take a std::locale& explicitly. The implementation uses strtod_l/strtof_l (nonstandard, but available on Linux, OSX, BSD, and a differently-named but equivalent function on Windows), with a gross but simple fallback for any straggler platforms (MINGW?) that copies the string and replaces the '.' with the native locale's mark.

This is pretty inexpensive: on Linux, the locale-independent version of our new Strutil::strtof has the same cost as the system locale-dependent atof/strtof, and the explicit locale version is only about 20% slower. On OSX, atof is remarkably fast, so our version that uses strtof_l is about 2x slower, but that seems acceptable given that it gives us all the control we want, and in comparison, another strategy involving istringstream was 10x slower. Unless this is shown to be an important bottleneck, I'm rejecting the alternative of embedding a full custom strtod implementation as being unnecessarily inviting maintenance costs and potential bugs. The approach we chose is minimally invasive and relies on system libraries for the heavy lifting.

Additionally:

  • Added a new stof(), stoi(), stoul() to replace the clunkier template syntax of from_string<>. The stof(), like our new strtof(), has a locale-independent as well as an explicit-local versions.

  • A handful of places that still had old calls to std atof and strtod were changed to Strutil::stof.

  • Strutil::format and ustring::format (and our copy of tinyformat underneath) were touched up so that the versions of format() that return strings will use the "C" locale, versions added that return strings and use an explicitly-passed locale, and the versions that append their results to existing streams continue to operate as before by honoring the existing locale conventions of the streams they are using.

  • Several places where we assembled strings using std::ostringstream, I forced the stream to use classic "C" locale in cases where it was likely to have any floating-point data output.

TIP going forward: For any I/O that must be persistent (files saved or read), or always be formatted identically regardless of the computer being used, always take care to use locale-independent (i.e. classic "C" locale) formatting functions and also making sure to initialize any std::ostringstream with stream.imbue(std::locale::classic()). For ephemeral I/O or UI elements that you want to display correctly internationalized for the user's country, then use the versions with explicit locale std::locale() and use the default initialization of output streams.

As an aside, I do wish that C/C++ had adopted the convention of default using the classic (globally uniform) locale for all functions that don't take an explicit locale, and leave the explicit locale functions just for programs that are intentionally trying to do country-by-country localization. This is all just a huge mess, and I hate that our library can have subtle I/O bugs when used by someone in another country because of a particular global locale setting that we have no control over and can't modify simply without potentially breaking the surrounding app.

Also, for anybody who has bothered to read this far, it's not unreasonable for you all to tell me that I've totally over-engineered this, and that as far as you're concerned, anything OIIO reads and writes should always and only use '.' decimal point and omit any locale flexibility, and that I should simplify all of this accordingly. But I think what I've done is made the default modes of everything behave this way at nominal cost, and only additionally (and at no extra expense if you don't use it) provided flexibility to use our functions to internationalize for nice UI and such for those who really want it.

@lgritz lgritz force-pushed the lg-locale branch 3 times, most recently from 8cd64b5 to d11ac6e Compare October 12, 2017 06:16
lgritz added a commit to lgritz/OpenShadingLanguage that referenced this pull request Oct 12, 2017
I got quite the education this week, realizing just how much of C++'s
I/O functionality implicitly takes the system's "locale" into
consideration.  Of particular concern is the conversion of text to
floating point numeric values and vice versa, for which much C++
functionality will happily swap the role of '.' and, say, ',' (comma)
for the decimal mark, depending on the user's country and whether they
are savvy enough to set their locale to the classic "C" (essentially
US-style) version. On the other hand, who can blame somebody for not
blindly accepting US hegemony these days?

But dammit, OSL is a programming language, and so '.' has a specific
role in numeric constants and ',' does something quite different (comma
operator!), and if you start mixing up the two when parsing programs,
it will all end in tears. All other major programming languages just
adopt this convention as well.

Now, if you were writing a full app, you could set the locale to the
classic one globally (to that process) upon startup, and then never
worry about it. Which we can do for oslc itself. But much of OSL is
actually a library that can be embedded into other libraries or apps,
so globally setting the locale can mess up UI internationalization or
the intentional localization of other app I/O. So we have to handle
this deftly throughout.

The fix comes in three parts:

1. I've submitted a separate review to OIIO here:
AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten
up that library, and in particular the various Strutil and ustring
utilities that we liberally use throughout OSL. And patrol some of the
places where we directly call the locale-sensitive C++ functions like
atof and strtod, and replace them with the now-safe OIIO::Strutil
equivalents. We'll be able to count on all this working for OIIO >= 1.8.6.

2. In a variety of places that don't rely on OIIO (for example, various
outputs via std::ostringstream), we need to take care to explicitly set
the stream's locale to produce the output we expect.

3. For people building OSL against a too-old version of OIIO to contain
the fixes described above, when compiling a buffer of OSL source code,
do the gross but necessary safeguard of saving the prior locale,
globally changing to "C" locale while we compile the OSL, then restoring
the locale.  And *HOPE* that the surrounding application isn't
simultaneously counting on the original locale-based
internationalization still being in effect for something it is doing in
another thread that's unaware that we've temporarily altered a global
setting. That's inherently unsafe, but remember it only affects OSL builds
against pre-1.8 OIIO (once that patch is applied on the OIIO side), and
only will be symptomatic for systems using a non-. locale (which is,
empirically speaking, rare enough that nobody noticed this issue until
this week).
@lgritz
Copy link
Collaborator Author

lgritz commented Oct 12, 2017

Note: I've also submitted the couple of changes I made internal to tinyformat.h to the tinyformat project: c42f/tinyformat#44 We're negotiating over the right way to do that, so some minor things may change along the way and we'll propagate it back here.

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 12, 2017

There is one decision I made in this patch that I'm not 100% sure of:

As submitted, Strutil::fprintf() (appends formatted output to a stream) honors the existing locale associated with the stream.

An alternate design would be for Strutil::fprintf() to ALWAYS print "C" locale style even to a stream, and have a second version that takes an explicit locale, to which could could pass mystream.getloc() if you wanted to preserve the stream locale.

I don't know which is better. The alternative has the advantage of being 100% consistent across all the Strutil functions -- default behavior always is C locale and you have to override to get something localized. But the stream overall will appear to have inconsistent behavior if you output to it with a mix of direct output (mystream << myfloat) and Strutil-formatted string (mystream << Strutil::format("%g",myfloat).

I'm leaning toward preferring the "alternate", but seeking advice.

@lgritz lgritz force-pushed the lg-locale branch 2 times, most recently from 609553f to e36dfee Compare October 16, 2017 02:52
lgritz added a commit to lgritz/OpenShadingLanguage that referenced this pull request Oct 16, 2017
I got quite the education this week, realizing just how much of C++'s
I/O functionality implicitly takes the system's "locale" into
consideration.  Of particular concern is the conversion of text to
floating point numeric values and vice versa, for which much C++
functionality will happily swap the role of '.' and, say, ',' (comma)
for the decimal mark, depending on the user's country and whether they
are savvy enough to set their locale to the classic "C" (essentially
US-style) version. On the other hand, who can blame somebody for not
blindly accepting US hegemony these days?

But dammit, OSL is a programming language, and so '.' has a specific
role in numeric constants and ',' does something quite different (comma
operator!), and if you start mixing up the two when parsing programs,
it will all end in tears. All other major programming languages just
adopt this convention as well.

Now, if you were writing a full app, you could set the locale to the
classic one globally (to that process) upon startup, and then never
worry about it. Which we can do for oslc itself. But much of OSL is
actually a library that can be embedded into other libraries or apps,
so globally setting the locale can mess up UI internationalization or
the intentional localization of other app I/O. So we have to handle
this deftly throughout.

The fix comes in three parts:

1. I've submitted a separate review to OIIO here:
AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten
up that library, and in particular the various Strutil and ustring
utilities that we liberally use throughout OSL. And patrol some of the
places where we directly call the locale-sensitive C++ functions like
atof and strtod, and replace them with the now-safe OIIO::Strutil
equivalents. We'll be able to count on all this working for OIIO >= 1.8.6.

2. In a variety of places that don't rely on OIIO (for example, various
outputs via std::ostringstream), we need to take care to explicitly set
the stream's locale to produce the output we expect.

3. For people building OSL against a too-old version of OIIO to contain
the fixes described above, when compiling a buffer of OSL source code,
do the gross but necessary safeguard of saving the prior locale,
globally changing to "C" locale while we compile the OSL, then restoring
the locale.  And *HOPE* that the surrounding application isn't
simultaneously counting on the original locale-based
internationalization still being in effect for something it is doing in
another thread that's unaware that we've temporarily altered a global
setting. That's inherently unsafe, but remember it only affects OSL builds
against pre-1.8 OIIO (once that patch is applied on the OIIO side), and
only will be symptomatic for systems using a non-. locale (which is,
empirically speaking, rare enough that nobody noticed this issue until
this week).
This refactor comes after a belated realization that our numeric
conversion routines (principally from_string<>) are "locale-dependent"
via their use of strtod, as well as atof() scattered through the code.

In particular, this means that when software using this code is running
with a locale in which the decimal mark is something other than '.'
(e.g., many European locales that use ',' for decimal mark), these
functions will incorrectly parse text floating point numbers that use
the standard period as decimal mark.  For example, if the locale is
"fr_FR", atof("123.45") will return 123.0, rather than the correct
123.45.

At the core of our fix is the introduction of Strutil::strtof/strtod
which by default is "locale-independent" (i.e. always uses '.') and also
has varieties that take a std::locale& explicitly. The implementation
uses strtod_l/strtof_l (nonstandard, but available on Linux, OSX, BSD,
and a differently-named but equivalent function on Windows), with a
gross but simple fallback for any straggler platforms (MINGW?) that
copies the string and replaces the '.' with the native locale's mark.

This is pretty inexpensive: on Linux, the locale-independent version of
our new Strutil::strtof has the same cost as the system locale-dependent
atof/strtof, and the explicit locale version is only about 20% slower.
On OSX, atof is remarkably fast, so our version that uses strtof_l is
about 2x slower, but that seems acceptable given that it gives us all
the control we want, and in comparison, another strategy involving
istringstream was 10x slower. Unless this is shown to be an important
bottleneck, I'm rejecting the alternative of embedding a full custom
strtod implementation as being unnecessarily inviting maintenance costs
and potential bugs. The approach we chose is minimally invasive and
relies on system libraries for the heavy lifting.

Additionally:

* Added a new stof(), stoi(), stoul() to replace the clunkier template
syntax of from_string<>. The stof(), like our new strtof(), has a
locale-independent as well as an explicit-local versions.

* A handful of places that still had old calls to std atof and strtod
were changed to Strutil::stof.

* Strutil::format and ustring::format (and our copy of tinyformat
underneath) were touched up so that the versions of format() that return
strings will use the "C" locale, versions added that return strings and
use an explicitly-passed locale, and the versions that append their
results to existing streams continue to operate as before by honoring
the existing locale conventions of the streams they are using.

* Several places where we assembled strings using std::ostringstream,
I forced the stream to use classic "C" locale in cases where it was
likely to have any floating-point data output.

* For maketx and oiiotool, globally set the locale to classic. In these
cases, we control the whole app, so this is safe to do.

TIP going forward: For any I/O that must be persistent (files saved or
read), or always be formatted identically regardless of the computer
being used, always take care to use locale-independent (i.e. classic "C"
locale) formatting functions and also making sure to initialize any
std::ostringstream with stream.imbue(std::locale::classic()). For
ephemeral I/O or UI elements that you want to display correctly
internationalized for the user's country, then use the versions with
explicit locale std::locale() and use the default initialization of
output streams.

As an aside, I do wish that C/C++ had adopted the convention of default
using the classic (globally uniform) locale for all functions that don't
take an explicit locale, and leave the explicit locale functions just
for programs that are intentionally trying to do country-by-country
localization. This is all just a huge mess, and I hate that our library
can have subtle I/O bugs when used by someone in another country because
of a particular global locale setting that we have no control over and
can't modify simply without potentially breaking the surrounding app.
…tring_view

More fixes and refactoring after it was pointed out that there were some
additional latent bugs related to places where we implicitly assumed
that a string_view was null terminated or within a null-terminated
string. Also fixed important bugs in Strutil::parse_int and parse_float
where it was definitely not safe to assume that the string_view passed
was the null-terminated edge of the string, and so, for example if a
string_view pointed to the characters "12" but the very next characters
were "345" parse_int would return 12345 instead of the correct 12.
@lgritz
Copy link
Collaborator Author

lgritz commented Oct 26, 2017

Pushed more fixes and refactoring after it was pointed out that there were some additional latent bugs related to places where we implicitly assumed that a string_view was null terminated or within a null-terminated string. Also fixed important bugs in Strutil::parse_int and parse_float where it was definitely not safe to assume that the string_view passed was the null-terminated edge of the string, and so, for example if a string_view pointed to the characters "12" but the very next characters were "345" parse_int would return 12345 instead of the correct 12.

@fpsunflower
Copy link
Contributor

The cases you are fixing here sound good to me, but I'm still worried that this is more complexity than if we just grabbed a known good atoi/atof and used that all the time. OIIO is not a localization library. I think it would be fine to document its parsing/format functions to be intended for predictable serialization and forget locales exist altogether.

Chances are if you really are designing some international UI application you are using Qt which has much more comprehensive solutions for all this stuff (AFAIK).

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 26, 2017

I think that's an extremely valid position -- just declaring that because at its heart, OIIO is fundamentally about persistent and predictable I/O (duh), even its generic utility components ought to use "C" locale only and be done with it, and if you want something else you should seek another library.

The counter-argument is simply the observation that it's only 10% more code to also have (optional) entry points that would let you conform to C++'s behavior. It's actually not a lot of extra complexity (I would consider it a larger addition of unnecessary complexity to have our own string conversion functions -- at least the way I've done it, we rely on std for all the actual character handling).

So you think I should just drop the loc-aware calls entirely, leave France to puzzle over its commas without us, and I guess we could always add them back later if a need comes up?

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 26, 2017

On the other hand, if we took an existing high-quality atoi/atof, we could adapt it to work on a string_view directly -- without it needing null termination and without the need to make a temporary std::string.

Hmmm...

@fpsunflower
Copy link
Contributor

That's what I was thinking yes - a directly string_view enabled parser.

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 26, 2017

ok, ok, let me take a look

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 27, 2017

Counter-proposal in #1796

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 27, 2017

We decided to go with the alternate solution of #1796, so I'm going to close this PR. We can always revisit if something comes up that necessitates OIIO exposing locale-aware formatting or parsing.

@lgritz lgritz closed this Oct 27, 2017
@lgritz lgritz deleted the lg-locale branch October 27, 2017 22:37
lgritz added a commit to lgritz/OpenShadingLanguage that referenced this pull request Oct 28, 2017
I got quite the education this week, realizing just how much of C++'s
I/O functionality implicitly takes the system's "locale" into
consideration.  Of particular concern is the conversion of text to
floating point numeric values and vice versa, for which much C++
functionality will happily swap the role of '.' and, say, ',' (comma)
for the decimal mark, depending on the user's country and whether they
are savvy enough to set their locale to the classic "C" (essentially
US-style) version. On the other hand, who can blame somebody for not
blindly accepting US hegemony these days?

But dammit, OSL is a programming language, and so '.' has a specific
role in numeric constants and ',' does something quite different (comma
operator!), and if you start mixing up the two when parsing programs,
it will all end in tears. All other major programming languages just
adopt this convention as well.

Now, if you were writing a full app, you could set the locale to the
classic one globally (to that process) upon startup, and then never
worry about it. Which we can do for oslc itself. But much of OSL is
actually a library that can be embedded into other libraries or apps,
so globally setting the locale can mess up UI internationalization or
the intentional localization of other app I/O. So we have to handle
this deftly throughout.

The fix comes in three parts:

1. I've submitted a separate review to OIIO here:
AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten
up that library, and in particular the various Strutil and ustring
utilities that we liberally use throughout OSL. And patrol some of the
places where we directly call the locale-sensitive C++ functions like
atof and strtod, and replace them with the now-safe OIIO::Strutil
equivalents. We'll be able to count on all this working for OIIO >= 1.8.6.

2. In a variety of places that don't rely on OIIO (for example, various
outputs via std::ostringstream), we need to take care to explicitly set
the stream's locale to produce the output we expect.

3. For people building OSL against a too-old version of OIIO to contain
the fixes described above, when compiling a buffer of OSL source code,
do the gross but necessary safeguard of saving the prior locale,
globally changing to "C" locale while we compile the OSL, then restoring
the locale.  And *HOPE* that the surrounding application isn't
simultaneously counting on the original locale-based
internationalization still being in effect for something it is doing in
another thread that's unaware that we've temporarily altered a global
setting. That's inherently unsafe, but remember it only affects OSL builds
against pre-1.8 OIIO (once that patch is applied on the OIIO side), and
only will be symptomatic for systems using a non-. locale (which is,
empirically speaking, rare enough that nobody noticed this issue until
this week).
lgritz added a commit to AcademySoftwareFoundation/OpenShadingLanguage that referenced this pull request Oct 28, 2017
I got quite the education this week, realizing just how much of C++'s
I/O functionality implicitly takes the system's "locale" into
consideration.  Of particular concern is the conversion of text to
floating point numeric values and vice versa, for which much C++
functionality will happily swap the role of '.' and, say, ',' (comma)
for the decimal mark, depending on the user's country and whether they
are savvy enough to set their locale to the classic "C" (essentially
US-style) version. On the other hand, who can blame somebody for not
blindly accepting US hegemony these days?

But dammit, OSL is a programming language, and so '.' has a specific
role in numeric constants and ',' does something quite different (comma
operator!), and if you start mixing up the two when parsing programs,
it will all end in tears. All other major programming languages just
adopt this convention as well.

Now, if you were writing a full app, you could set the locale to the
classic one globally (to that process) upon startup, and then never
worry about it. Which we can do for oslc itself. But much of OSL is
actually a library that can be embedded into other libraries or apps,
so globally setting the locale can mess up UI internationalization or
the intentional localization of other app I/O. So we have to handle
this deftly throughout.

The fix comes in three parts:

1. I've submitted a separate review to OIIO here:
AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten
up that library, and in particular the various Strutil and ustring
utilities that we liberally use throughout OSL. And patrol some of the
places where we directly call the locale-sensitive C++ functions like
atof and strtod, and replace them with the now-safe OIIO::Strutil
equivalents. We'll be able to count on all this working for OIIO >= 1.8.6.

2. In a variety of places that don't rely on OIIO (for example, various
outputs via std::ostringstream), we need to take care to explicitly set
the stream's locale to produce the output we expect.

3. For people building OSL against a too-old version of OIIO to contain
the fixes described above, when compiling a buffer of OSL source code,
do the gross but necessary safeguard of saving the prior locale,
globally changing to "C" locale while we compile the OSL, then restoring
the locale.  And *HOPE* that the surrounding application isn't
simultaneously counting on the original locale-based
internationalization still being in effect for something it is doing in
another thread that's unaware that we've temporarily altered a global
setting. That's inherently unsafe, but remember it only affects OSL builds
against pre-1.8 OIIO (once that patch is applied on the OIIO side), and
only will be symptomatic for systems using a non-. locale (which is,
empirically speaking, rare enough that nobody noticed this issue until
this week).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants