Re: Re[2]: gcc and extended character source code
- Subject: Re: Re[2]: gcc and extended character source code
- From: Alastair Houghton <email@hidden>
- Date: Mon, 11 Feb 2008 15:48:28 +0000
On 11 Feb 2008, at 09:47, Peter Mulholland wrote:
Monday, February 11, 2008, 6:48:25 AM, you wrote:
Variable names that don't use ASCII are illegal -- doesn't matter what
encoding the source code file has, the compiler won't parse the
A-umlaut.
That's true for C89 (sort of... the standard doesn't talk about ASCII,
but rather about the basic source character set; source code might be
expressed in EBCDIC or some other exotic encoding), but not for C99
(see below).
If you don't want to change it, have the original author do so,
because it's not standard.
Lame. Typical of *nix to still have its head in the sand.
This isn't anything to do with UNIX.
The ISO C standard (C99 in this case), which is what says what is
valid C and what is not, says that
  6.4.2.1.3 ...An implementation may allow multibyte characters that
  are not part of the basic source character set to appear in
  identifiers; which characters and their correspondence to universal
  character names is implementation-defined.
For "implementation-defined", read "not guaranteed to be portable".
This whole area is rather more complicated than your off-the-cuff
dismissal implies. There is no portable way to specify the character
encoding of a C source file. Using identifiers with characters
outside the basic set requires support from a variety of tools besides
just the compiler (e.g. the assembler, the linker, the dynamic linker,
debuggers, and potentially other tools besides). GCC often sits on top
of the system assembler and linker and has no control over which
dynamic linker or debugger you might be using, so for GCC it's
potentially a tricky problem. Moreover, a mistake in
this area could break binary compatibility, which is a very disruptive
thing to do.
There are also some nasty gotchas. For instance, the mapping from some
source character sets to Unicode might not be what people expect, and
some scripts include characters that look just like those in other
scripts (e.g. people tend to confuse the German Eszett, ß, with the
Greek beta, β, and some of the Cyrillic and Greek characters look
exactly like their Latin counterparts).
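To make the lookalike problem concrete, here is a minimal sketch, assuming
UTF-8 is the encoding in use. The helper name same_symbol is made up for
illustration; it compares names byte for byte, which is roughly what a
linker does with symbols. The bytes are written as hex escapes so the
example doesn't depend on the encoding of the file it lives in.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if two NUL-terminated names are
   byte-for-byte identical, 0 otherwise. */
static int same_symbol(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* UTF-8 encodings of characters that look alike in many fonts. */
static const char latin_A[]     = "A";        /* U+0041 LATIN CAPITAL LETTER A */
static const char greek_Alpha[] = "\xCE\x91"; /* U+0391 GREEK CAPITAL LETTER ALPHA */
static const char cyrillic_A[]  = "\xD0\x90"; /* U+0410 CYRILLIC CAPITAL LETTER A */
static const char eszett[]      = "\xC3\x9F"; /* U+00DF LATIN SMALL LETTER SHARP S */
static const char greek_beta[]  = "\xCE\xB2"; /* U+03B2 GREEK SMALL LETTER BETA */
```

None of these compare equal to one another, so two identifiers that look
the same on screen can still be entirely different names to the tools.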
Microsoft's compiler may presently allow the use of characters outside
the basic set, but because there is no standard behaviour here, doing
so is not likely to be portable. Furthermore, it looks to me like the
GCC team has been working on this problem for some time, so to say
that they have their heads in the sand is rather unfair.
The original author doesn't care - he's only concerned about the code
compiling on Windows, as far as he's concerned it's my job to port it.
Then it's your job to either fix his code or to get him to fix it
himself, right? The best fix is to do what Christian Demmer suggested
(i.e. replace umlauts and Eszetts with their two-character
equivalents, such as ae, oe, ue and ss). Your German programmer
shouldn't find this too objectionable, since it doesn't change the
meanings of any of the things that he wrote.
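If there are a lot of names to fix, the replacement could be mechanised.
The following is only a sketch, assuming the text is encoded in UTF-8
(where each of these characters is the byte 0xC3 followed by one more
byte); the function name transliterate_german is invented for this
example. Run over a whole source file it would also rewrite string
literals and comments, so the result would need checking by hand.

```c
#include <stddef.h>

/* Hypothetical helper: copy `in` to `out`, replacing UTF-8 encoded German
   umlauts and Eszett with two-character ASCII equivalents. Returns the
   number of bytes written (excluding the NUL), or (size_t)-1 if `out` is
   too small. */
size_t transliterate_german(const char *in, char *out, size_t out_size)
{
    /* Second UTF-8 byte (after 0xC3) and its ASCII replacement. */
    static const struct { unsigned char second; const char *ascii; } map[] = {
        { 0x84, "Ae" }, { 0x96, "Oe" }, { 0x9C, "Ue" },
        { 0xA4, "ae" }, { 0xB6, "oe" }, { 0xBC, "ue" },
        { 0x9F, "ss" },
    };
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p != '\0') {
        const char *rep = NULL;
        if (p[0] == 0xC3) {
            /* p[1] is at worst the terminating NUL, so this read is safe. */
            for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
                if (p[1] == map[i].second) { rep = map[i].ascii; break; }
            }
        }
        if (rep != NULL) {
            if (n + 2 >= out_size) return (size_t)-1; /* keep room for NUL */
            out[n++] = rep[0];
            out[n++] = rep[1];
            p += 2;
        } else {
            if (n + 1 >= out_size) return (size_t)-1;
            out[n++] = (char)*p++;
        }
    }
    if (out_size == 0) return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Any byte that isn't part of one of the seven listed sequences is copied
through unchanged, so plain ASCII identifiers are left alone.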
I wouldn't bother messing around with universal character names (UCNs)
right now, since I think they're a work in progress and anyway they
hamper readability. I notice, for instance, that newer FSF GCC versions
require the -fextended-identifiers switch if you want to use them in
identifiers. Apple's GCC 4.0.1 doesn't seem to (it seems to accept them
with just --std=c99).
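For reference, this is roughly what UCNs in identifiers look like. It is
a sketch only: the names are invented for illustration, and whether it
compiles depends on the compiler version and on switches such as the
ones mentioned above. Both names are static, so they never have to
survive a trip through the linker as external symbols.

```c
/* \u00C4 is U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS); the
   identifier reads as "Anderung" with an A-umlaut. */
static int \u00C4nderung = 0;

/* Hypothetical function; \u00E4 is U+00E4 (LATIN SMALL LETTER A WITH
   DIAERESIS). Increments the counter above and returns the new value. */
static int z\u00E4hle(void)
{
    return ++\u00C4nderung;
}
```

Compare that with plain Aenderung and zaehle, and the readability cost
is fairly obvious.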
Kind regards,
Alastair.
--
http://alastairs-place.net