Since I wrote a simple LogFile class for my software, I decided I ought to make sure it works with Unicode so that I'm prepared for multilingual support in the future. Hence, I've been reading up on Unicode and how to support it. Of course, character encoding is closely tied to internationalization and localization, so I've been reading up on that, too. Along the way I've found a few interesting links that I thought I would share.
The Very Basics
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. The article is exactly what it claims to be, and is a decent introduction to just what Unicode and character sets are.
Multi-byte characters and Unicode on Windows
The first article I came across was also my introduction to the Flipcode Archive. Advanced String Techniques in C++ by Fredrik Andersson describes techniques for dealing with multi-byte characters and Unicode on Windows. It seems a bit complex compared to what I'm seeing on Linux, but this is an old article, so hopefully things have improved over the years.
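To give a flavor of the Windows side, here's a minimal sketch of the TCHAR/wide-string style the article covers. This is my own illustration, not code from the article; TCHAR, _T(), and WideCharToMultiByte() are the standard Win32 names, and the snippet assumes <windows.h> and <tchar.h>.

```cpp
// Sketch of the Win32 "generic text" approach: TCHAR and _T() expand to
// narrow (char) or wide (wchar_t) text depending on whether UNICODE/_UNICODE
// is defined when the project is built.
#include <windows.h>
#include <tchar.h>
#include <string>

int main()
{
    // Compiles as either a narrow or a wide string, matching the build settings.
    std::basic_string<TCHAR> message = _T("Hello, Unicode");

    // Explicit conversion from a wide string to UTF-8 with WideCharToMultiByte().
    const wchar_t* wide = L"na\u00EFve";
    int size = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    std::string utf8(size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], size, NULL, NULL);
    utf8.resize(size - 1);  // drop the trailing null terminator

    return 0;
}
```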
Unicode and Linux
One older but useful article I’ve found is A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX by Ed Trager. There’s a little technical information at the top, but mostly this article talks about the practicalities of using Unicode in Linux: the software you need and settings you might have to adjust.
The Locale in C
Over at IBM developerWorks, I found an article titled Linux Unicode programming by Thomas Burger. This covers the basics of detecting and setting locale information in C in a Linux environment. Again, this is an older article, and I probably ought to have learned this information before now, but in the past I was programming system tools that generally had little, if any, user interface, so localization was never an issue. Now it is.
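As a quick sketch of the basics that article covers (my own example, not taken from it): setlocale() with an empty string picks up the locale from the environment, and the POSIX nl_langinfo() call reports the character encoding that locale uses.

```cpp
// Pick up the user's locale from the environment and check its encoding.
// setlocale() is standard C; nl_langinfo() is POSIX, so this assumes Linux/UNIX.
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    // An empty string asks for the locale named by LANG/LC_* in the environment;
    // without this call the program stays in the minimal "C" locale.
    if (std::setlocale(LC_ALL, "") == NULL)
    {
        std::fprintf(stderr, "Locale from environment is not usable\n");
        return 1;
    }

    // CODESET reports the character encoding of the active locale,
    // e.g. "UTF-8" for en_US.UTF-8.
    std::printf("Locale:  %s\n", std::setlocale(LC_ALL, NULL));
    std::printf("Codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}
```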
GCC beats VC++?
A lot of what I’ve been seeing indicates that working with UTF-8 on Windows (and in particular writing it to a file) requires you to explicitly convert your strings using WideCharToMultiByte() or set up some form of automatic conversion that hooks into the C++ streams library. For example, see Writing UTF-8 files in C++ by Marius Bancila. This is information I’m going to keep in mind, but my testing with GCC 4.5 on Windows and Linux makes me really appreciate that compiler. The first “naïve” example that Mr. Bancila gives of fstream failing to write UTF-8 actually works just fine for me: on both Windows and Linux it writes a UTF-8 encoded file with no special effort on my part.
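For reference, this is roughly what I mean by the naïve approach (my own reconstruction, not Mr. Bancila's exact code): UTF-8 text kept in an ordinary std::string and written through a plain std::ofstream. Since UTF-8 is just a sequence of bytes, the stream has nothing to translate, and with GCC this gives me a UTF-8 encoded file on both Windows and Linux.

```cpp
// Write UTF-8 text with an ordinary ofstream, no conversion machinery.
#include <fstream>
#include <string>

int main()
{
    // "naïve", with the ï given as explicit UTF-8 bytes (0xC3 0xAF) so the
    // example doesn't depend on the source file's own encoding.
    std::string text = "na\xC3\xAFve\n";

    // Binary mode keeps the bytes exactly as they are in the string.
    std::ofstream out("utf8-test.txt", std::ios::binary);
    out << text;
    return 0;
}
```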
I’m still absorbing everything, but the more I read about this, the more it seems that as long as I stick with GCC, I essentially get UTF-8 support for free.