Re: Unicode canonical decomposed form and text encoding
- Subject: Re: Unicode canonical decomposed form and text encoding
- From: Renaud Boisjoly <email@hidden>
- Date: Tue, 14 Jan 2003 20:43:40 -0500
I may be onto something here...
I just discovered that I can get from a string to something which
looks like:
\\000A\\003\\000\\000.\\000r\\000t\\000f\\000d\\000\\000\\000
with more null characters at the end. I guess I can figure out how to
get rid of the extra characters, or find a way to define my UniChar
buffer with the right size to start with (no idea yet how this is
done...).
But once I get it down to
\\000A\\003\\000\\000.\\000r\\000t\\000f\\000d
only (the name of my file is À.rtfd), perhaps it will be the same as:
A\\u0300.rtfd
which is what I want it to look like, since I do know that:
\\300.rtfd
doesn't work...
All this because NSDictionary does not store Unicode keys the same way
HFS+ does... I know I should be doing this completely differently, and
I will eventually, but that means a major rewrite of my file format and
I'd like to get this working, at least for normal use, until I can
rewrite the format.
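
One possible workaround for the key mismatch described above, sketched in Python rather than Cocoa so that it can be run anywhere: normalize every name to a single form before it is stored or looked up. The helper name `canonical_key` is made up for this illustration, and standard NFD is only assumed to match what HFS+ returns for this particular name (HFS+ applies its own variant of canonical decomposition).

```python
import unicodedata

def canonical_key(name):
    """Normalize a file name to canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", name)

precomposed = "\u00c0.rtfd"   # one code point for the letter: U+00C0
decomposed = "A\u0300.rtfd"   # two code points: U+0041 U+0300

# Plain string keys compare code point by code point, so these differ.
raw_index = {precomposed: "entry"}
raw_hit = decomposed in raw_index

# Normalizing on both insert and lookup makes the two spellings agree.
index = {canonical_key(precomposed): "entry"}
normalized_hit = canonical_key(decomposed) in index

print(raw_hit, normalized_hit)
```

The same idea should carry over to an NSDictionary: pick one form and convert every key to it at the boundary, instead of trying to match whatever form happens to arrive.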
On Tuesday, January 14, 2003, at 08:26 PM, Renaud Boisjoly wrote:
Well, when I run your code, I'm getting
\\000A\\003\\000
after decoding...
but I would need:
A\\u0300
Otherwise the rest of my code doesn't work because of the way Unicode
is used in dictionary keys...
Now, perhaps they are both the same, but if that is the case, why
wouldn't they look the same in an NSLog?
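
On whether the two are the same: the dump reads naturally as the raw bytes of 16-bit code units, most significant byte first. The Python sketch below assumes exactly that (big-endian UTF-16) and assumes the number of valid bytes is known; the function name and the `byte_count` parameter are illustrative only. Why NSLog printed the bytes individually is not established in this thread; one guess is that the buffer was turned back into a string through an 8-bit route.

```python
def decode_utf16_buffer(raw, byte_count):
    """Decode only the first byte_count bytes as big-endian UTF-16."""
    return raw[:byte_count].decode("utf-16-be")

# The bytes from the dump, followed by unused zero padding from the buffer.
raw = b"\x00A\x03\x00\x00.\x00r\x00t\x00f\x00d" + b"\x00" * 6

text = decode_utf16_buffer(raw, 14)  # 7 code units * 2 bytes each
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
```

Read this way, the first two code units are U+0041 and U+0300, which is the decomposed sequence being asked for, and cutting the buffer at the reported output length removes the trailing nulls.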
On Tuesday, January 14, 2003, at 07:39 PM, Aki Inoue wrote:
Renaud,
I think we're talking along the same lines.
\\300 is the octal escape for 0x00C0, which is "A grave".
It is usually called the precomposed form.
And "A \U0300" is the decomposed form.
> So I used getCharacters but something isn't working still. I think I
> may have asked part of my question backwards. Boy, Unicode is not
> too simple! Perhaps with an example.
Exactly what's not working ?
Aki
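
To make the two terms concrete, here is a short check in Python (not Cocoa; it relies only on the standard `unicodedata` module, and the `describe` helper is invented for this note). It starts from octal 300 and converts between the precomposed and decomposed forms.

```python
import unicodedata

precomposed = chr(0o300)  # octal 300 is 0xC0, i.e. U+00C0
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", decomposed)

def describe(text):
    """List each code point as (hex value, Unicode character name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(precomposed))
print(describe(decomposed))
```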
On 2003.1.14, at 04:11 PM, Renaud Boisjoly wrote:
Hi again
So I used getCharacters but something isn't working still. I think I
may have asked part of my question backwards. Boy, Unicode is not
too simple! Perhaps with an example.
Say the string I need to convert is "A grave". It first looks like:
\\300
But what I need is:
A\\u0300
I'm not sure yet what each form is supposed to be called.
I get the feeling that the routine you so kindly put together
actually does the opposite... is this possible? If so, I tried
inverting some of the parameters in CreateTextConverter, but it
fails to convert anything... any clues?
Thanks again to all for helping out!
Renaud
On Tuesday, January 14, 2003, at 05:44 PM, Aki Inoue wrote:
Renaud,
You can use getCharacters: to bulk-get characters from NSString.
One thing to note if you're using a stack buffer in a loop, as in
your original example: depending on what you need from the
decomposed format, you have to be a little bit more careful at the
end of each buffer run.
For example, let's assume your source NSString contains the
following character sequence: "U00C0 U0328", LATIN CAPITAL LETTER A
WITH GRAVE and COMBINING OGONEK, "Ą̀". (This should display
correctly in Mail.app.)
When decomposed, they can be either "U0041 U0300 U0328" or "U0041
U0328 U0300". They are both perfectly legal Unicode character
sequences, but only the latter is in canonically decomposed format,
because COMBINING OGONEK (combining class 202) must sort before
COMBINING GRAVE ACCENT (combining class 230).
Back to the NSString with these character sequences: you won't get
the canonical format if your working buffer ends between U00C0 and
U0328, since TEC cannot know the next character in that case.
So, if you want canonically decomposed format (not just
decomposed), you need to make sure your working buffer ends BEFORE
a base character (![[NSCharacterSet nonBaseCharacterSet]
characterIsMember:theChar]). You don't have to worry about
surrogates since pre-Jaguar TEC doesn't recognize them.
Aki
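
The buffer-boundary problem can be reproduced outside TEC. The sketch below is Python, not the Cocoa code under discussion, and makes several assumptions: standard NFD stands in for TEC's canonical decomposition, a non-zero canonical combining class stands in for membership in nonBaseCharacterSet (the two are not identical), and the chunk is grown rather than shortened because a Python string has no fixed buffer. It uses U+00C0 followed by U+0328, where the ogonek (class 202) has to sort before the grave accent (class 230). Both function names are made up, and rare characters whose decomposition begins with a combining mark are ignored.

```python
import unicodedata

def nfd_in_chunks(text, chunk_size):
    """Decompose fixed-size chunks independently (the risky approach)."""
    return "".join(
        unicodedata.normalize("NFD", text[i:i + chunk_size])
        for i in range(0, len(text), chunk_size)
    )

def nfd_in_safe_chunks(text, chunk_size):
    """Decompose chunks that always end just before a base character."""
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Keep trailing combining marks in the same chunk as their base.
        while end < len(text) and unicodedata.combining(text[end]) != 0:
            end += 1
        pieces.append(unicodedata.normalize("NFD", text[start:end]))
        start = end
    return "".join(pieces)

source = "\u00c0\u0328"  # A WITH GRAVE followed by COMBINING OGONEK
whole = unicodedata.normalize("NFD", source)
naive = nfd_in_chunks(source, 1)       # boundary falls between the two
safe = nfd_in_safe_chunks(source, 1)   # boundary moved past the ogonek

print([hex(ord(ch)) for ch in naive])
print([hex(ord(ch)) for ch in safe])
```

With the boundary between the two characters, the naive version leaves the grave accent ahead of the ogonek, which is decomposed but not canonically ordered; the boundary-aware version matches the result of normalizing the whole string at once.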
On 2003.1.14, at 01:08 PM, Renaud Boisjoly wrote:
Hi again
Ok, I think it will work, but I do have a last newbie question to
ask if I can...
I've managed to convert from the UniChar result to an NSString,
but I'm not clear on how to efficiently do the reverse. My
original string is in an NSString and I guess I need to convert it
to UniChar... but being pretty inexperienced, this looks like a
mystery to me. Do I need to iterate through each character using
characterAtIndex: and add them to characters[] one by one? Should I
use an NSScanner? Is there an immensely obvious way to do this that
I'm just not seeing (probably)? I know it's probably something I
should know, but considering I've only been programming for a year
or so, except for stuff like AppleScript, I miss a lot of things.
My current idea is a for loop using characterAtIndex to add each
character...
Thanks for your time if you can afford it.
Renaud
On Tuesday, January 14, 2003, at 02:39 PM, Aki Inoue wrote:
#import <Foundation/Foundation.h>
#import <CoreServices/CoreServices.h> // Unicode converter (TEC) declarations

// LATIN CAPITAL LETTER A WITH GRAVE
static UniChar characters[] = {0x00C0};
#define MAX_BUFFER_LENGTH (100)

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    UnicodeToTextInfo textInfo;
    // Source: default Unicode. Target: canonically decomposed variant.
    UnicodeMapping mapping = {
        CreateTextEncoding(kTextEncodingUnicodeDefault,
            kTextEncodingDefaultVariant, kUnicode16BitFormat),
        CreateTextEncoding(kTextEncodingUnicodeDefault,
            kUnicodeCanonicalDecompVariant, kUnicode16BitFormat),
        kUnicodeUseLatestMapping};
    UniChar buffer[MAX_BUFFER_LENGTH];
    ByteCount inputRead, outputLen; // byte counts, not UniChar counts
    OSStatus status;

    status = CreateUnicodeToTextInfo(&mapping, &textInfo);
    if (noErr != status) {
        NSLog(@"Failed to create UnicodeToTextInfo");
        exit(1);
    }
    status = ConvertFromUnicodeToText(textInfo,
        sizeof(characters), characters, kTECKeepInfoFixMask, 0, NULL,
        NULL, NULL, MAX_BUFFER_LENGTH * sizeof(UniChar), &inputRead,
        &outputLen, buffer);
    if (noErr != status) {
        NSLog(@"Failed to convert string");
        exit(1);
    }
    // buffer now holds outputLen / sizeof(UniChar) decomposed characters.
    DisposeUnicodeToTextInfo(&textInfo);
    [pool release];
    return 0;
}
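
As a cross-check on what the listing above should produce for its single input character, here is a small Python sketch. It is not the TEC API: standard NFD is assumed to agree with the canonical decomposition variant for U+00C0, and the function name `decompose_units` is invented for this note. It takes 16-bit code units, decomposes them, and reports the output length in bytes, the same unit as outputLen.

```python
import unicodedata

def decompose_units(units):
    """Take UTF-16 code units; return (decomposed units, length in bytes)."""
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    text = raw.decode("utf-16-be")
    out = unicodedata.normalize("NFD", text).encode("utf-16-be")
    out_units = [int.from_bytes(out[i:i + 2], "big")
                 for i in range(0, len(out), 2)]
    return out_units, len(out)

units, output_len = decompose_units([0x00C0])
print([hex(unit) for unit in units], output_len)
```

The byte length has to be divided by two to get the number of characters, which is worth keeping in mind when building a string from the output buffer.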
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.