International number formats

KMyMoney, as a financial application, deals with numbers a lot. As a KDE application, it has supported internationalization (i18n for short) from the very beginning. For accuracy reasons, it has internal checks to verify the numbers a user enters.

The validation routine has a long history (I think it goes back to the KDE3 days) and we recently streamlined it a bit as part of the journey to use more and more Qt standard widgets instead of our own.

This led to the replacement of the KMyMoneyEdit widget with the newer AmountEdit widget. Everything worked great for me (using a German locale) until we received notifications that users could only enter integer numbers but no fractional part. This of course is not what we wanted. But why is that?

The important piece of information was that the user reporting the issue uses the Swedish (Finland) locale, sv_FI, on his system. So I set my development system to use that locale for numbers and currencies, and it failed for me as well. So it was pretty clear that the validation logic had a flaw.

Checking the AmountValidator object, which is an extension of QDoubleValidator, I found that it did not work as expected with said locale. So it was time to set up some testcases for the validator to see how it performs with other locales. I still saw it failing, which made me curious, so I dug into the Qt source code one more time, specifically QDoubleValidator. Well, it looks like most of the logic we added in former times has become superfluous with the Qt5 version. But one little difference remains: QDoubleValidator works on the symbols of the LC_NUMERIC category of a locale, whereas we want to use the LC_MONETARY ones. So what to do? Simply ignore the fact? This could bite us later.
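To make the two symbol sets visible side by side, here is a minimal Python sketch (not KMyMoney code) that reads them through the C library's localeconv(). The helper name `separators` is made up for illustration, and the locale passed in is assumed to be installed on the system.

```python
import locale


def separators(name):
    """Return (dp, mon_dp, sep, mon_sep) for the named locale.

    Hypothetical helper for illustration only. Raises locale.Error
    if the requested locale is not installed.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)
        conv = locale.localeconv()
        return (
            conv["decimal_point"],      # LC_NUMERIC
            conv["mon_decimal_point"],  # LC_MONETARY
            conv["thousands_sep"],      # LC_NUMERIC
            conv["mon_thousands_sep"],  # LC_MONETARY
        )
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


if __name__ == "__main__":
    print(separators("C"))
```

Calling `separators("sv_FI.utf8")` on a box that has this locale installed would show whether the numeric and monetary symbols agree there.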

So the next step was to check what happens within the testcases: using the current code base, they fail. OK, good news: I caught the problem. Using the plain QDoubleValidator and running the testcases again showed that all of them passed. So no problem? What if a locale uses different symbols for the decimal point and the thousands separator in its numeric and monetary categories? Does such a locale exist? Looking at all of them seemed a bit tedious, but tedious work is something for a computer. Here’s the little script I wrote (for simplicity in Perl):

checklocales Download
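The Perl script itself is not reproduced in this text. As a rough idea of the kind of check involved, here is a Python sketch that assumes the comparison is based on the output of `locale -k` for each locale. All function names are made up, and this is not the original checklocales.

```python
import os
import subprocess
import sys

KEYS = ("decimal_point", "mon_decimal_point",
        "thousands_sep", "mon_thousands_sep")


def parse_locale_k(text):
    """Turn lines such as  decimal_point=","  into a dict."""
    values = {}
    for line in text.splitlines():
        key, eq, value = line.partition("=")
        if not eq:
            continue  # skip anything that is not key=value
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # drop the surrounding quotes
        values[key.strip()] = value
    return values


def differs(values):
    """True if the numeric and monetary symbols are not identical."""
    return (values["decimal_point"] != values["mon_decimal_point"]
            or values["thousands_sep"] != values["mon_thousands_sep"])


def query(name):
    """Run `locale -k` with LC_ALL set to `name` (needs a POSIX system)."""
    result = subprocess.run(
        ["locale", "-k", *KEYS],
        env={**os.environ, "LC_ALL": name},
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return parse_locale_k(result.stdout)


if __name__ == "__main__":
    # Locale names are taken from the command line, e.g. de_AT.utf8
    for name in sys.argv[1:]:
        values = query(name)
        if differs(values):
            print(name, *(values[k] or "<>" for k in KEYS), sep="\t")
```

Decoding as UTF-8 is a simplification that matches the restriction to the utf8 locales; locales using other encodings would need their own decoding.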

So, they do exist. Bummer, we need to do something about it. On my box, I collected 484 locales, of which 110 show differences. That’s about 23% of the locales, too many to simply ignore the problem. Since some of the locales were present in multiple formats (e.g. br_FR, br_FR@euro and br_FR.utf8), I concentrated on the utf8 versions and the ones that contain an at-sign in their name. This changed the figures to 37 with differing values out of 182 locales checked. Still 20%. Here’s that list. <> denotes an empty string, and _ denotes a blank in the locale. Had I not replaced those values, one could not see the difference in a terminal session.

 Locale          dp      mon_dp  sep     mon_sep
 aa_DJ.utf8      .       .       <>      _
 aa_ER@saaho     .       .       <>      ,
 be_BY@latin     ,       .       .       _
 be_BY.utf8      ,       .       .       _
 bg_BG.utf8      ,       ,       <>       
 bs_BA.utf8      ,       ,       <>      _
 ca_AD.utf8      ,       ,       <>      .
 ca_ES@euro      ,       ,       <>      .
 ca_ES.utf8      ,       ,       <>      .
 ca_FR.utf8      ,       ,       <>      .
 ca_IT.utf8      ,       ,       <>      .
 de_AT@euro      ,       ,       .       _
 de_AT.utf8      ,       ,       .       _
 es_MX.utf8      .       .               ,
 es_PE.utf8      ,       .       .       ,
 eu_ES@euro      ,       ,       <>      .
 eu_ES.utf8      ,       ,       <>      .
 gez_ER@abegede  .       .       <>      ,
 gl_ES@euro      ,       ,       <>      .
 gl_ES.utf8      ,       ,       <>      .
 hr_HR.utf8      ,       ,       <>      _
 it_CH.utf8      ,       .       '       '
 it_IT@euro      ,       ,       <>      .
 it_IT.utf8      ,       ,       <>      .
 mg_MG.utf8      ,       ,       <>      _
 nl_BE.utf8      ,       ,       .       _
 nl_NL@euro      ,       ,       <>      _
 nl_NL.utf8      ,       ,       <>      _
 pl_PL.utf8      ,       ,       <>      .
 pt_PT@euro      ,       ,       <>      .
 pt_PT.utf8      ,       ,       <>      .
 ru_RU.utf8      ,       .                
 ru_UA.utf8      ,       .       .       _
 so_DJ.utf8      .       .       <>      _
 sr_RS@latin     ,       ,       <>      .
 tg_TJ.utf8      ,       .       .       _
 tt_RU@iqtelif   ,       .       .        
 182 locales processed, 37 locales differ.

Next I tried to figure out which distinct combinations exist. A bit of UNIX foo (i.e. combining the tools of your toolbox):

./checklocales | tr '\t' ':' | cut -d: -f2- | tr ':' '\t' | sort | uniq | grep -v "mon_dp" | grep -v "process"

reveals the following result (I added the header here manually):

 dp      mon_dp  sep     mon_sep
 ,       ,       <>      _
 ,       ,       <>      .
 ,       ,       .       _
 ,       .       .       _
 ,       .       .       ,
 ,       .       .        
 ,       .       '       '
 .       .       <>      _
 .       .       <>      ,
 .       .               ,
 ,       ,       <>       
 ,       .

Running this additionally through “wc -l” shows 12 different combinations. The last one is strange, because it does not show anything in the last two columns. Looking at the full table shows that this is the entry produced by ru_RU.utf8. Why is that? Running

LC_ALL=ru_RU.utf8 locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep

we get

 decimal_point=","
 mon_decimal_point="."
 thousands_sep=" "
 mon_thousands_sep=" "

which looks just fine, but why did we not see the underscore for the blanks in the thousands separators? Running the above output through our friend od shows why:

LC_ALL=ru_RU.utf8 locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep | od -c
0000000   d   e   c   i   m   a   l   _   p   o   i   n   t   =   "   ,
0000020   "  \n   m   o   n   _   d   e   c   i   m   a   l   _   p   o
0000040   i   n   t   =   "   .   "  \n   t   h   o   u   s   a   n   d
0000060   s   _   s   e   p   =   " 302 240   "  \n   m   o   n   _   t
0000100   h   o   u   s   a   n   d   s   _   s   e   p   =   " 302 240
0000120   "  \n
0000122

Ah, “302 240” octal: that is a UTF-8 encoded character (U+00A0, the no-break space) presented as a blank in my default locale. OK, since it is the same for both separators, this is no big deal. But there are two more:

 ,       .       .        
 .       .               ,

Going into the long list we see that these are the results of tt_RU@iqtelif and es_MX.utf8. Looking at od’s output we see:

LC_ALL=tt_RU@iqtelif locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep | od -c
0000000   d   e   c   i   m   a   l   _   p   o   i   n   t   =   "   ,
0000020   "  \n   m   o   n   _   d   e   c   i   m   a   l   _   p   o
0000040   i   n   t   =   "   .   "  \n   t   h   o   u   s   a   n   d
0000060   s   _   s   e   p   =   "   .   "  \n   m   o   n   _   t   h
0000100   o   u   s   a   n   d   s   _   s   e   p   =   " 342 200 202
0000120   "  \n
0000122
LC_ALL=es_MX.utf8 locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep | od -c
0000000   d   e   c   i   m   a   l   _   p   o   i   n   t   =   "   .
0000020   "  \n   m   o   n   _   d   e   c   i   m   a   l   _   p   o
0000040   i   n   t   =   "   .   "  \n   t   h   o   u   s   a   n   d
0000060   s   _   s   e   p   =   " 342 200 211   "  \n   m   o   n   _
0000100   t   h   o   u   s   a   n   d   s   _   s   e   p   =   "   ,
0000120   "  \n
0000122

so again just some special characters that are shown as blanks (U+2002, an en space, for tt_RU@iqtelif and U+2009, a thin space, for es_MX.utf8), but here they differ between the numeric and the currency representation in said locales.
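To identify such invisible characters without guessing, the octal byte values printed by od can be decoded as UTF-8 and looked up in the Unicode database. A small Python sketch (the function name `decode_octal` is made up for illustration):

```python
import unicodedata


def decode_octal(dump):
    """Decode space-separated octal byte values, as printed by od -c, as UTF-8."""
    raw = bytes(int(token, 8) for token in dump.split())
    return raw.decode("utf-8")


# The three byte sequences seen in the od output above
for dump in ("302 240", "342 200 202", "342 200 211"):
    ch = decode_octal(dump)
    print(f"{dump:12} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This reports NO-BREAK SPACE, EN SPACE and THIN SPACE for the three sequences, in that order.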

While trying a few things I stumbled across the following output. See for yourself:

LC_ALL=C.utf8 locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep
decimal_point="."
mon_decimal_point="."
thousands_sep=""
mon_thousands_sep=""
LC_ALL=C locale -k decimal_point mon_decimal_point thousands_sep mon_thousands_sep
decimal_point="."
mon_decimal_point=""
thousands_sep=""
mon_thousands_sep=""

What? The C locale (non UTF-8 version) has no monetary decimal point defined? This may end in strange results, but since nowadays we talk about UTF-8 I simply put it aside.

Back to the original problem: the AmountValidator. Looking at the results above, it does not seem to make sense to simply use the plain QDoubleValidator. Would it make sense to allow both versions? To use the QDoubleValidator, we would need to replace the currency versions of the two characters with their numeric versions. The question is: would that lead to misinterpreted strings in any of the locales? Looking at the long list, we find the following locales where both characters differ:

 Locale          dp      mon_dp  sep     mon_sep
 be_BY@latin     ,       .       .       _ 
 be_BY.utf8      ,       .       .       _ 
 es_PE.utf8      ,       .       .       , 
 ru_UA.utf8      ,       .       .       _ 
 tg_TJ.utf8      ,       .       .       _ 
 tt_RU@iqtelif   ,       .       .

es_PE.utf8 seems to be the worst candidate: the meaning of comma and period is simply reversed between the numeric and the monetary symbols. Let’s add this locale to the testcases and see how it performs.
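The replacement idea mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code committed to KMyMoney. The function name is made up, and it assumes that every non-empty symbol is a single character.

```python
def monetary_to_numeric(text, dp, mon_dp, sep, mon_sep):
    """Rewrite a string typed with monetary symbols into numeric symbols.

    Hypothetical helper. str.translate replaces all characters in a
    single pass, so swapped symbols (as in es_PE) do not clobber each other.
    """
    table = {}
    if mon_sep:
        # An empty numeric separator means the character is simply dropped
        table[ord(mon_sep)] = sep or None
    if mon_dp:
        table[ord(mon_dp)] = dp
    return text.translate(table)


if __name__ == "__main__":
    # es_PE.utf8 from the table: dp ",", mon_dp ".", sep ".", mon_sep ","
    print(monetary_to_numeric("1,234.56", ",", ".", ".", ","))  # 1.234,56
    # ca_ES.utf8: no numeric thousands separator, monetary one is "."
    print(monetary_to_numeric("1.234,56", ",", ",", "", "."))   # 1234,56
```

The es_PE case also shows why accepting both notations at once is risky: a string like 1.234 reads as one thousand two hundred thirty-four with the numeric symbols, but as a value just above one with the monetary ones.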

Looks like it works out of the box. Strange, I expected a different result. Anyway, the changes to KMyMoney are now committed to the master branch and I can continue with the next problem.
