Re: wcin, wstring, and encoding
- Subject: Re: wcin, wstring, and encoding
- From: Andreas Grosam <email@hidden>
- Date: Mon, 10 Jan 2011 13:02:39 +0100
On Jan 9, 2011, at 11:35 PM, Todd Heberlein wrote:
> I have some code that basically looks like this:
>
> wstring filename;
> getline(wcin, filename);
>
> When running this from the Terminal application it looks like the characters encoded in filename are encoded in UTF-8 even though wchar_t is 4-bytes wide. For example, if I use the Character Viewer to enter ARABIC LETTER FARSI YEH (Unicode: 06CC, UTF8: D8 8C)
>
> filename[0] = 0xD8
> filename[1] = 0x8C
>
> instead of being encoded as a single character
>
> filename[0] = 0x06CC
>
> In other words, the single unicode point is encoded as two wchar_t characters in the wstring filename.
>
> Is there any rule (or even rule of thumb) that lets me know the character encoding of a wstring? And in particular, a wstring read in from a wistream?
>
Well, first you need to know that the type wchar_t (C99) is a compiler-defined type used for the *internal representation* of characters. The standard says that "it is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among all supported locales". This means that the mapping of characters to specific values and the bit-size of wchar_t are defined by the compiler, and its size actually varies across platforms.
The consequence is that the encoding of a wstring is defined by the compiler.
In your example, the UTF-8 encoded input stream is simply extracted to wide characters - rather than converted from UTF-8 to UTF-32.
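To see what actually ends up in the wstring on your system, a quick check like the following might help (a minimal sketch; it just prints sizeof(wchar_t) and a hex dump of the code units read from wcin):

#include <iostream>
#include <string>

int main()
{
    std::wcout << L"sizeof(wchar_t) = " << sizeof(wchar_t) << std::endl; // 4 with gcc on Mac OS X

    std::wstring filename;
    std::getline(std::wcin, filename);

    // Dump each wchar_t code unit in hex. If the UTF-8 bytes were merely
    // widened one by one, a two-byte UTF-8 character shows up as two code
    // units instead of a single code point.
    for (std::wstring::size_type i = 0; i != filename.size(); ++i)
        std::wcout << std::hex << std::showbase
                   << static_cast<unsigned long>(filename[i]) << L" ";
    std::wcout << std::endl;
    return 0;
}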
In C++, the conversion from an external stream (say a file) to the internal wide character representation and vice versa is controlled by the locale of the stream. You can set the locale of a stream via the imbue member function:
(example taken from http://www.cplusplus.com/)
// imbue example
#include <iostream>
#include <locale>
using namespace std;

int main()
{
    // Construct a locale object with the user's default preferences
    locale mylocale("");
    // Imbue that locale into cout
    cout.imbue(mylocale);
    cout << (double) 3.14159 << endl;
    return 0;
}
This code writes a floating point number using the user's preferred locale. For example, on a system configured with a Spanish locale as default, this should write the number using a comma as decimal separator:
3,14159
A locale consists of a set of culture-specific features. It also contains information for classifying and converting characters, including how to convert between different character encodings.
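These features are exposed as facets of std::locale. A small sketch querying one of them (the names here are only illustrative, and constructing a named locale may throw where locale support is missing, as noted next):

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    try {
        std::locale loc("");  // the user's preferred locale
        // Each "feature" is a facet; here we ask the numeric punctuation
        // facet which decimal separator this culture uses.
        char dp = std::use_facet<std::numpunct<char> >(loc).decimal_point();
        std::cout << "decimal point: " << dp << std::endl;
    }
    catch (std::runtime_error& e) {
        // Constructing a named locale throws if the runtime doesn't support
        // it (which, as noted below, is common with gcc on Mac OS X).
        std::cout << "locale not supported: " << e.what() << std::endl;
    }
    return 0;
}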
Unfortunately, on Mac OS X the locale support in C++ with gcc is severely broken: you can effectively use only the "C" locale. Nonetheless, as far as I know, you can still write and read UTF-8 encoded streams.
(For more info on how you might work around your problem, search for "boost io::code_converter locale".)
In C99, you may change the locale using setlocale() (see man setlocale(3)), which applies to the whole process, as opposed to C++, where the locale is set per stream. However, you can only select a locale which is already supported. Unfortunately, as far as I know, Mac OS X has no locale that reads and writes files encoded in UTF-32.
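A minimal sketch of the C99 route (the locale name "en_US.UTF-8" is only an assumption here; it has to be one of the names listed by `locale -a`):

#include <clocale>
#include <cstdio>

int main()
{
    // setlocale() affects the whole process, unlike imbue(), which is per stream.
    if (std::setlocale(LC_ALL, "en_US.UTF-8") == NULL)
        std::fprintf(stderr, "requested locale is not supported\n");
    return 0;
}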
To get a list of supported locales on your system, type the following in the terminal:
$ locale -a
Well, finally, in order to solve your problem, you might read and write the external stream representation encoded in UTF-8 and manually convert it to a sequence of wide characters using the various extended multibyte/wide character utility functions declared in <wchar.h> and <xlocale.h>:
mbsrtowcs(), wcsrtombs(), mbsrtowcs_l(), wcsrtombs_l()
Note, though, that the internal encoding (what ends up in your wchar_t buffer) doesn't have to be UTF-32 - as stated above, it is defined by the compiler. For GNU gcc, wchar_t is always 32 bits wide and capable of representing all UCS-4 values.
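Here is a minimal sketch of such a manual conversion using mbsrtowcs() (it assumes a UTF-8 locale has been selected with setlocale(); the helper name widen() and the error handling are mine):

#include <clocale>
#include <cstring>
#include <cwchar>
#include <iostream>
#include <string>
#include <vector>

// Convert a UTF-8 (multibyte) string to a wstring with mbsrtowcs(). The
// result's encoding is whatever the compiler uses for wchar_t (UCS-4 with gcc).
static std::wstring widen(const std::string& mb)
{
    std::mbstate_t state;
    std::memset(&state, 0, sizeof state);

    const char* src = mb.c_str();
    // A first pass with a null destination only counts the wide characters.
    std::size_t n = std::mbsrtowcs(NULL, &src, 0, &state);
    if (n == (std::size_t)-1)
        return std::wstring();              // invalid multibyte sequence

    std::vector<wchar_t> buf(n + 1);
    src = mb.c_str();
    std::memset(&state, 0, sizeof state);
    std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    return std::wstring(&buf[0], n);
}

int main()
{
    // The locale name is an assumption; any UTF-8 locale from `locale -a` will do.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    std::string line;
    std::getline(std::cin, line);           // read the raw UTF-8 bytes

    std::wstring wide = widen(line);
    std::wcout << L"characters: " << wide.size() << std::endl;
    return 0;
}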
Regarding the Terminal app:
On my system, the current locale is "C", using UTF-8 encoding. You get this information by typing
$ locale
in a Terminal window. You can set the encoding in Terminal's preferences under the "Encodings" section; UTF-32 is also supported there. It would be interesting to see what your program gets from the input stream when Terminal's encoding is set to UTF-32.
Regards,
Andreas
> Thanks,
>
> Todd