-
Notifications
You must be signed in to change notification settings - Fork 626
More principled handling of float<->string #1785
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8cd64b5
to
d11ac6e
Compare
I got quite the education this week, realizing just how much of C++'s I/O functionality implicitly takes the system's "locale" into consideration. Of particular concern is the conversion of text to floating point numeric values and vice versa, for which much C++ functionality will happily swap the role of '.' and, say, ',' (comma) for the decimal mark, depending on the user's country and whether they are savvy enough to set their locale to the classic "C" (essentially US-style) version. On the other hand, who can blame somebody for not blindly accepting US hegemony these days? But dammit, OSL is a programming language, and so '.' has a specific role in numeric constants and ',' does something quite different (comma operator!), and if you start mixing up the two when parsing programs, it will all end in tears. All other major programming languages just adopt this convention as well. Now, if you were writing a full app, you could set the locale to the classic one globally (to that process) upon startup, and then never worry about it. Which we can do for oslc itself. But much of OSL is actually a library that can be embedded into other libraries or apps, so globally setting the locale can mess up UI internationalization or the intentional localization of other app I/O. So we have to handle this deftly throughout. The fix comes in three parts: 1. I've submitted a separate review to OIIO here: AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten up that library, and in particular the various Strutil and ustring utilities that we liberally use throughout OSL. And patrol some of the places where we directly call the locale-sensitive C++ functions like atof and strtod, and replace them with the now-safe OIIO::Strutil equivalents. We'll be able to count on all this working for OIIO >= 1.8.6. 2. In a variety of places that don't rely on OIIO (for example, various outputs via std::ostringstream), we need to take care to explicitly set the stream's locale to produce the output we expect. 3. For people building OSL against a too-old version of OIIO to contain the fixes described above, when compiling a buffer of OSL source code, do the gross but necessary safeguard of saving the prior locale, globally changing to "C" locale while we compile the OSL, then restoring the locale. And *HOPE* that the surrounding application isn't simultaneously counting on the original locale-based internationalization still being in effect for something it is doing in another thread that's unaware that we've temporarily altered a global setting. That's inherently unsafe, but remember it only affects OSL builds against pre-1.8 OIIO (once that patch is applied on the OIIO side), and only will be symptomatic for systems using a non-. locale (which is, empirically speaking, rare enough that nobody noticed this issue until this week).
Note: I've also submitted the couple of changes I made internal to tinyformat.h to the tinyformat project: c42f/tinyformat#44 We're negotiating over the right way to do that, so some minor things may change along the way and we'll propagate it back here. |
There is one decision I made in this patch that I'm not 100% sure of: As submitted, An alternate design would be for I don't know which is better. The alternative has the advantage of being 100% consistent across all the Strutil functions -- default behavior always is C locale and you have to override to get something localized. But the stream overall will appear to have inconsistent behavior if you output to it with a mix of direct output ( I'm leaning toward preferring the "alternate", but seeking advice. |
609553f
to
e36dfee
Compare
I got quite the education this week, realizing just how much of C++'s I/O functionality implicitly takes the system's "locale" into consideration. Of particular concern is the conversion of text to floating point numeric values and vice versa, for which much C++ functionality will happily swap the role of '.' and, say, ',' (comma) for the decimal mark, depending on the user's country and whether they are savvy enough to set their locale to the classic "C" (essentially US-style) version. On the other hand, who can blame somebody for not blindly accepting US hegemony these days? But dammit, OSL is a programming language, and so '.' has a specific role in numeric constants and ',' does something quite different (comma operator!), and if you start mixing up the two when parsing programs, it will all end in tears. All other major programming languages just adopt this convention as well. Now, if you were writing a full app, you could set the locale to the classic one globally (to that process) upon startup, and then never worry about it. Which we can do for oslc itself. But much of OSL is actually a library that can be embedded into other libraries or apps, so globally setting the locale can mess up UI internationalization or the intentional localization of other app I/O. So we have to handle this deftly throughout. The fix comes in three parts: 1. I've submitted a separate review to OIIO here: AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten up that library, and in particular the various Strutil and ustring utilities that we liberally use throughout OSL. And patrol some of the places where we directly call the locale-sensitive C++ functions like atof and strtod, and replace them with the now-safe OIIO::Strutil equivalents. We'll be able to count on all this working for OIIO >= 1.8.6. 2. In a variety of places that don't rely on OIIO (for example, various outputs via std::ostringstream), we need to take care to explicitly set the stream's locale to produce the output we expect. 3. For people building OSL against a too-old version of OIIO to contain the fixes described above, when compiling a buffer of OSL source code, do the gross but necessary safeguard of saving the prior locale, globally changing to "C" locale while we compile the OSL, then restoring the locale. And *HOPE* that the surrounding application isn't simultaneously counting on the original locale-based internationalization still being in effect for something it is doing in another thread that's unaware that we've temporarily altered a global setting. That's inherently unsafe, but remember it only affects OSL builds against pre-1.8 OIIO (once that patch is applied on the OIIO side), and only will be symptomatic for systems using a non-. locale (which is, empirically speaking, rare enough that nobody noticed this issue until this week).
This refactor comes after a belated realization that our numeric conversion routines (principally from_string<>) are "locale-dependent" via their use of strtod, as well as atof() scattered through the code. In particular, this means that when software using this code is running with a locale in which the decimal mark is something other than '.' (e.g., many European locales that use ',' for decimal mark), these functions will incorrectly parse text floating point numbers that use the standard period as decimal mark. For example, if the locale is "fr_FR", atof("123.45") will return 123.0, rather than the correct 123.45. At the core of our fix is the introduction of Strutil::strtof/strtod which by default is "locale-independent" (i.e. always uses '.') and also has varieties that take a std::locale& explicitly. The implementation uses strtod_l/strtof_l (nonstandard, but available on Linux, OSX, BSD, and a differently-named but equivalent function on Windows), with a gross but simple fallback for any straggler platforms (MINGW?) that copies the string and replaces the '.' with the native locale's mark. This is pretty inexpensive: on Linux, the locale-independent version of our new Strutil::strtof has the same cost as the system locale-dependent atof/strtof, and the explicit locale version is only about 20% slower. On OSX, atof is remarkably fast, so our version that uses strtof_l is about 2x slower, but that seems acceptable given that it gives us all the control we want, and in comparison, another strategy involving istringstream was 10x slower. Unless this is shown to be an important bottleneck, I'm rejecting the alternative of embedding a full custom strtod implementation as being unnecessarily inviting maintenance costs and potential bugs. The approach we chose is minimally invasive and relies on system libraries for the heavy lifting. Additionally: * Added a new stof(), stoi(), stoul() to replace the clunkier template syntax of from_string<>. The stof(), like our new strtof(), has a locale-independent as well as an explicit-local versions. * A handful of places that still had old calls to std atof and strtod were changed to Strutil::stof. * Strutil::format and ustring::format (and our copy of tinyformat underneath) were touched up so that the versions of format() that return strings will use the "C" locale, versions added that return strings and use an explicitly-passed locale, and the versions that append their results to existing streams continue to operate as before by honoring the existing locale conventions of the streams they are using. * Several places where we assembled strings using std::ostringstream, I forced the stream to use classic "C" locale in cases where it was likely to have any floating-point data output. * For maketx and oiiotool, globally set the locale to classic. In these cases, we control the whole app, so this is safe to do. TIP going forward: For any I/O that must be persistent (files saved or read), or always be formatted identically regardless of the computer being used, always take care to use locale-independent (i.e. classic "C" locale) formatting functions and also making sure to initialize any std::ostringstream with stream.imbue(std::locale::classic()). For ephemeral I/O or UI elements that you want to display correctly internationalized for the user's country, then use the versions with explicit locale std::locale() and use the default initialization of output streams. As an aside, I do wish that C/C++ had adopted the convention of default using the classic (globally uniform) locale for all functions that don't take an explicit locale, and leave the explicit locale functions just for programs that are intentionally trying to do country-by-country localization. This is all just a huge mess, and I hate that our library can have subtle I/O bugs when used by someone in another country because of a particular global locale setting that we have no control over and can't modify simply without potentially breaking the surrounding app.
…tring_view More fixes and refactoring after it was pointed out that there were some additional latent bugs related to places where we implicitly assumed that a string_view was null terminated or within a null-terminated string. Also fixed important bugs in Strutil::parse_int and parse_float where it was definitely not safe to assume that the string_view passed was the null-terminated edge of the string, and so, for example if a string_view pointed to the characters "12" but the very next characters were "345" parse_int would return 12345 instead of the correct 12.
Pushed more fixes and refactoring after it was pointed out that there were some additional latent bugs related to places where we implicitly assumed that a string_view was null terminated or within a null-terminated string. Also fixed important bugs in Strutil::parse_int and parse_float where it was definitely not safe to assume that the string_view passed was the null-terminated edge of the string, and so, for example if a string_view pointed to the characters "12" but the very next characters were "345" parse_int would return 12345 instead of the correct 12. |
The cases you are fixing here sound good to me, but I'm still worried that this is more complexity than if we just grabbed a known good atoi/atof and used that all the time. OIIO is not a localization library. I think it would be fine to document its parsing/format functions to be intended for predictable serialization and forget locales exist altogether. Chances are if you really are designing some international UI application you are using Qt which has much more comprehensive solutions for all this stuff (AFAIK). |
I think that's an extremely valid position -- just declaring that because at its heart, OIIO is fundamentally about persistent and predictable I/O (duh), even its generic utility components ought to use "C" locale only and be done with it, and if you want something else you should seek another library. The counter-argument is simply the observation that it's only 10% more code to also have (optional) entry points that would let you conform to C++'s behavior. It's actually not a lot of extra complexity (I would consider it a larger addition of unnecessary complexity to have our own string conversion functions -- at least the way I've done it, we rely on std for all the actual character handling). So you think I should just drop the loc-aware calls entirely, leave France to puzzle over its commas without us, and I guess we could always add them back later if a need comes up? |
On the other hand, if we took an existing high-quality atoi/atof, we could adapt it to work on a string_view directly -- without it needing null termination and without the need to make a temporary std::string. Hmmm... |
That's what I was thinking yes - a directly |
ok, ok, let me take a look |
Counter-proposal in #1796 |
We decided to go with the alternate solution of #1796, so I'm going to close this PR. We can always revisit if something comes up that necessitates OIIO exposing locale-aware formatting or parsing. |
I got quite the education this week, realizing just how much of C++'s I/O functionality implicitly takes the system's "locale" into consideration. Of particular concern is the conversion of text to floating point numeric values and vice versa, for which much C++ functionality will happily swap the role of '.' and, say, ',' (comma) for the decimal mark, depending on the user's country and whether they are savvy enough to set their locale to the classic "C" (essentially US-style) version. On the other hand, who can blame somebody for not blindly accepting US hegemony these days? But dammit, OSL is a programming language, and so '.' has a specific role in numeric constants and ',' does something quite different (comma operator!), and if you start mixing up the two when parsing programs, it will all end in tears. All other major programming languages just adopt this convention as well. Now, if you were writing a full app, you could set the locale to the classic one globally (to that process) upon startup, and then never worry about it. Which we can do for oslc itself. But much of OSL is actually a library that can be embedded into other libraries or apps, so globally setting the locale can mess up UI internationalization or the intentional localization of other app I/O. So we have to handle this deftly throughout. The fix comes in three parts: 1. I've submitted a separate review to OIIO here: AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten up that library, and in particular the various Strutil and ustring utilities that we liberally use throughout OSL. And patrol some of the places where we directly call the locale-sensitive C++ functions like atof and strtod, and replace them with the now-safe OIIO::Strutil equivalents. We'll be able to count on all this working for OIIO >= 1.8.6. 2. In a variety of places that don't rely on OIIO (for example, various outputs via std::ostringstream), we need to take care to explicitly set the stream's locale to produce the output we expect. 3. For people building OSL against a too-old version of OIIO to contain the fixes described above, when compiling a buffer of OSL source code, do the gross but necessary safeguard of saving the prior locale, globally changing to "C" locale while we compile the OSL, then restoring the locale. And *HOPE* that the surrounding application isn't simultaneously counting on the original locale-based internationalization still being in effect for something it is doing in another thread that's unaware that we've temporarily altered a global setting. That's inherently unsafe, but remember it only affects OSL builds against pre-1.8 OIIO (once that patch is applied on the OIIO side), and only will be symptomatic for systems using a non-. locale (which is, empirically speaking, rare enough that nobody noticed this issue until this week).
I got quite the education this week, realizing just how much of C++'s I/O functionality implicitly takes the system's "locale" into consideration. Of particular concern is the conversion of text to floating point numeric values and vice versa, for which much C++ functionality will happily swap the role of '.' and, say, ',' (comma) for the decimal mark, depending on the user's country and whether they are savvy enough to set their locale to the classic "C" (essentially US-style) version. On the other hand, who can blame somebody for not blindly accepting US hegemony these days? But dammit, OSL is a programming language, and so '.' has a specific role in numeric constants and ',' does something quite different (comma operator!), and if you start mixing up the two when parsing programs, it will all end in tears. All other major programming languages just adopt this convention as well. Now, if you were writing a full app, you could set the locale to the classic one globally (to that process) upon startup, and then never worry about it. Which we can do for oslc itself. But much of OSL is actually a library that can be embedded into other libraries or apps, so globally setting the locale can mess up UI internationalization or the intentional localization of other app I/O. So we have to handle this deftly throughout. The fix comes in three parts: 1. I've submitted a separate review to OIIO here: AcademySoftwareFoundation/OpenImageIO#1785 which tries to straighten up that library, and in particular the various Strutil and ustring utilities that we liberally use throughout OSL. And patrol some of the places where we directly call the locale-sensitive C++ functions like atof and strtod, and replace them with the now-safe OIIO::Strutil equivalents. We'll be able to count on all this working for OIIO >= 1.8.6. 2. In a variety of places that don't rely on OIIO (for example, various outputs via std::ostringstream), we need to take care to explicitly set the stream's locale to produce the output we expect. 3. For people building OSL against a too-old version of OIIO to contain the fixes described above, when compiling a buffer of OSL source code, do the gross but necessary safeguard of saving the prior locale, globally changing to "C" locale while we compile the OSL, then restoring the locale. And *HOPE* that the surrounding application isn't simultaneously counting on the original locale-based internationalization still being in effect for something it is doing in another thread that's unaware that we've temporarily altered a global setting. That's inherently unsafe, but remember it only affects OSL builds against pre-1.8 OIIO (once that patch is applied on the OIIO side), and only will be symptomatic for systems using a non-. locale (which is, empirically speaking, rare enough that nobody noticed this issue until this week).
This refactor comes after a belated realization that our numeric conversion routines (principally
from_string<>
) are "locale-dependent" via their use of strtod, as well as atof() scattered through the code.In particular, this means that when software using this code is running with a locale in which the decimal mark is something other than '.' (e.g., many European locales that use ',' for decimal mark), these functions will incorrectly parse text floating point numbers that use the standard period as decimal mark. For example, if the locale is "fr_FR", atof("123.45") will return 123.0, rather than the correct 123.45.
At the core of our fix is the introduction of Strutil::strtof/strtod which by default is "locale-independent" (i.e. always uses '.') and also has varieties that take a std::locale& explicitly. The implementation uses strtod_l/strtof_l (nonstandard, but available on Linux, OSX, BSD, and a differently-named but equivalent function on Windows), with a gross but simple fallback for any straggler platforms (MINGW?) that copies the string and replaces the '.' with the native locale's mark.
This is pretty inexpensive: on Linux, the locale-independent version of our new Strutil::strtof has the same cost as the system locale-dependent atof/strtof, and the explicit locale version is only about 20% slower. On OSX, atof is remarkably fast, so our version that uses strtof_l is about 2x slower, but that seems acceptable given that it gives us all the control we want, and in comparison, another strategy involving istringstream was 10x slower. Unless this is shown to be an important bottleneck, I'm rejecting the alternative of embedding a full custom strtod implementation as being unnecessarily inviting maintenance costs and potential bugs. The approach we chose is minimally invasive and relies on system libraries for the heavy lifting.
Additionally:
Added a new stof(), stoi(), stoul() to replace the clunkier template syntax of from_string<>. The stof(), like our new strtof(), has a locale-independent as well as an explicit-local versions.
A handful of places that still had old calls to std atof and strtod were changed to Strutil::stof.
Strutil::format and ustring::format (and our copy of tinyformat underneath) were touched up so that the versions of format() that return strings will use the "C" locale, versions added that return strings and use an explicitly-passed locale, and the versions that append their results to existing streams continue to operate as before by honoring the existing locale conventions of the streams they are using.
Several places where we assembled strings using std::ostringstream, I forced the stream to use classic "C" locale in cases where it was likely to have any floating-point data output.
TIP going forward: For any I/O that must be persistent (files saved or read), or always be formatted identically regardless of the computer being used, always take care to use locale-independent (i.e. classic "C" locale) formatting functions and also making sure to initialize any std::ostringstream with
stream.imbue(std::locale::classic())
. For ephemeral I/O or UI elements that you want to display correctly internationalized for the user's country, then use the versions with explicit localestd::locale()
and use the default initialization of output streams.As an aside, I do wish that C/C++ had adopted the convention of default using the classic (globally uniform) locale for all functions that don't take an explicit locale, and leave the explicit locale functions just for programs that are intentionally trying to do country-by-country localization. This is all just a huge mess, and I hate that our library can have subtle I/O bugs when used by someone in another country because of a particular global locale setting that we have no control over and can't modify simply without potentially breaking the surrounding app.
Also, for anybody who has bothered to read this far, it's not unreasonable for you all to tell me that I've totally over-engineered this, and that as far as you're concerned, anything OIIO reads and writes should always and only use '.' decimal point and omit any locale flexibility, and that I should simplify all of this accordingly. But I think what I've done is made the default modes of everything behave this way at nominal cost, and only additionally (and at no extra expense if you don't use it) provided flexibility to use our functions to internationalize for nice UI and such for those who really want it.