(unicode -> shift-jis) encoding conversion bug?
- Subject: (unicode -> shift-jis) encoding conversion bug?
- From: email@hidden (Jody Fairchild)
- Date: Thu, 24 Jan 2002 23:22:59 +0900
i've run into a situation in which converting characters from unicode to
shift-jis seems to produce incorrect (or at least counterintuitive) results.
e.g., let's run the following code snippetoid:
    NSString *example;   // filled in from an NSTextField
    unichar in, out;
    NSString *s;
    NSData *d;
    unsigned int i;

    for (i = 0; i < [example length]; i++)
    {
        in = [example characterAtIndex:i];
        s = [[NSString alloc] initWithCharacters:&in length:1];
        d = [s dataUsingEncoding:NSShiftJISStringEncoding
              allowLossyConversion:NO];
        [d getBytes:&out];   // copy the converted bytes into out
        NSLog(@"unicode = %X, sjis = %X", in, out);
        [s release];         // balance the alloc above
    }
for an example string containing two characters (from unicode input via an
NSTextField), namely plain lowercase "a" and hiragana "あ" (the japanese
phonetic character representing the "ah" sound), we get something like the
following output:
unicode = 61, sjis = 6114 (for regular "a")
unicode = 3042, sjis = 82A0 (for hiragana "a")
the problem is that regular "a" should be 0x61 in both unicode _and_
shift-jis, but the converted char gets a garbage byte tacked onto the end
of it. this garbage byte is essentially random, and tends to change each
time the code is run. note that the first byte of the unichar holds the
correct value ... note also that the conversion works for a regular
double-byte character; hiragana "a" is indeed 0x3042 in unicode and 0x82A0
in shift-jis.
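one possible reading of the output above, offered as a sketch rather than a confirmed diagnosis: shift-jis encodes plain "a" as the single byte 0x61, so the converted data for "a" is presumably one byte long, and copying it into a two-byte unichar would leave the second byte holding whatever was already in memory. the plain c below imitates that copy with a two-byte buffer. raw_copy_value, sjis_value and the stale parameter are hypothetical names made up for illustration, not part of any cocoa api, and the high-byte-first readout assumes a big-endian machine, which matches the 6114 shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulates copying `len` encoded bytes into a 2-byte buffer that already
   holds stale data (`stale` stands in for uninitialized memory), then reads
   the buffer back high byte first (assumption: big-endian layout). */
static uint16_t raw_copy_value(const unsigned char *bytes, size_t len,
                               unsigned char stale)
{
    unsigned char out[2] = { stale, stale };
    assert(len >= 1 && len <= 2);
    memcpy(out, bytes, len);          /* only `len` bytes are overwritten */
    return (uint16_t)((out[0] << 8) | out[1]);
}

/* Hypothetical helper: builds the value from exactly the bytes present,
   so a 1-byte encoding yields 0x00NN and never picks up stale memory. */
static uint16_t sjis_value(const unsigned char *bytes, size_t len)
{
    uint16_t v = 0;
    assert(len >= 1 && len <= 2);
    for (size_t i = 0; i < len; i++)
        v = (uint16_t)((v << 8) | bytes[i]);
    return v;
}
```

with the bytes {0x61} and a stale value of 0x14, raw_copy_value gives 0x6114 while sjis_value gives 0x0061; for {0x82, 0xA0} both give 0x82A0. if this reading is right, checking the data's length before reading the buffer (or zeroing out first) would be the thing to try.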
should not the conversion be returning 0x0061 for regular "a"? i thought
part of the beauty of this unicode stuff was that we wouldn't have to worry
about when something should be treated as one byte or two ...
is this a bug in the conversion stuff or am i missing something?
opinions? any encoding gurus out there care to point out some fatal flaw
in my approach?
thanks,
-jf