Re: POSIX paths and UTF-8 on Mac OS X...

23 Nov 2005

Delivered-To: darwin-dev@lists.apple.com

I am in the process of porting some software that runs on Windows to
Mac OS X, and I want to validate what I believe to be true on Mac OS X
and also understand any special requirements that exist.
I am fairly sure (and the API docs support this) that functions like
open and fopen accept UTF-8 (ASCII is a subset of UTF-8, so of course
they accept traditional ASCII as well). Is this correct?
I do see in the Core Foundation code that the new 10.4 function that
returns the file system representation of a CFString uses UTF-8, but
it also attempts some specific type of string
decomposition[1] (I assume this deals with combining accents and the
like). So my question is: what exactly is expected and/or required in a
UTF-8 string handed to something like fopen?
Also, can a BOM exist at the head of the UTF-8 string, or must I ensure
that it doesn't exist?
I tried to find a good document that calls things out fully, but the
best I could find was that UTF-8 is used.

--
Adam Nohejl
Loki Software
mailto:adam@lokisw.com
http://lokisw.com

On 2005/11/23 at 17:37, Shawn Erickson wrote:

http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html
"All BSD system functions expect their string parameters to be in
UTF-8 encoding and nothing else. Code that calls BSD system routines
should ensure that the contents of all const *char parameters are in
canonical UTF-8 encoding. In a canonical UTF-8 string, all
decomposable characters are decomposed; for example, é (0x00E9) is
represented as e (0x0065) + ´ (0x0301). To put things into a
canonical UTF-8 encoding, use the “file-system representation”
interfaces defined in Cocoa and Carbon (including Core Foundation)."
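To make the decomposition in that quote concrete, here is a minimal sketch in Python. It assumes that standard Unicode NFD normalization is close enough for illustration; the file-system representation produced by the Cocoa, Carbon, and Core Foundation interfaces may differ from plain NFD for some characters, so this is not a substitute for those APIs. The helper name to_decomposed_utf8 is hypothetical.

```python
import unicodedata


def to_decomposed_utf8(path: str) -> bytes:
    """Return `path` as UTF-8 bytes with decomposable characters decomposed.

    Assumption: standard NFD stands in for the "canonical UTF-8" form
    described in the quoted documentation.
    """
    return unicodedata.normalize("NFD", path).encode("utf-8")


# "é" as the single precomposed code point U+00E9.
precomposed = "caf\u00e9"

# After decomposition, "é" becomes "e" (U+0065) followed by the
# combining acute accent (U+0301), which encodes as 0xCC 0x81 in UTF-8.
decomposed = to_decomposed_utf8(precomposed)
print(decomposed)
```

The precomposed form would encode "é" as the two bytes 0xC3 0xA9, so the same visible name can have two different byte sequences, which is why the documentation asks for one canonical form.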
I think that you should use the mentioned APIs in the first place.
Otherwise, getting rid of BOMs seems reasonable: the interfaces
never return them, and a BOM serves no purpose if you work
only with UTF-8 (UTF-8 has a defined byte order independent of
endianness). UTF-8 BOMs are actually quite rare.
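If you do end up with path strings from a source that may prepend a BOM (for example, text read from a file written on Windows), stripping it before the path reaches open or fopen could look roughly like this. It is only a sketch in Python; the helper name strip_utf8_bom is hypothetical, and it handles only the UTF-8 BOM (bytes EF BB BF).

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) from `raw`, if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# A path name that arrived with a BOM in front of it.
with_bom = codecs.BOM_UTF8 + b"notes.txt"
clean = strip_utf8_bom(with_bom)
print(clean)
```

Input without a BOM is returned unchanged, so the helper is safe to apply unconditionally.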