Re: Best guess at expressing a string as a variable
- Subject: Re: Best guess at expressing a string as a variable
- From: Uli Kusterer <email@hidden>
- Date: Wed, 23 Jan 2013 14:07:43 +0100
On Jan 23, 2013, at 2:18 AM, email@hidden wrote:
> Hmm. Maybe not. I want to keep the generated variable name legible.
You need to nail down the languages you want to deploy to, and then find out what their criteria for identifiers are. Then you can decide either to generate identical lowest-common-denominator names for all of them (which is [a-zA-Z_][a-zA-Z_0-9]* in the case of C, i.e. the name may not start with a digit either), or to adjust which characters you permit based on the target programming language.
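For illustration, here's a minimal sketch of that lowest-common-denominator approach (the function name and the underscore-prefix fallback are inventions of this example, not anything from Foundation): keep only [a-zA-Z0-9_], and prepend an underscore if the result would otherwise be empty or start with a digit.

#import <Foundation/Foundation.h>

// Reduce an arbitrary string to something matching [a-zA-Z_][a-zA-Z_0-9]*
// by dropping every character outside an explicit whitelist.
// Hypothetical helper; names are made up for this example.
NSString *UKCIdentifierFromString(NSString *inputString)
{
    NSCharacterSet *valid = [NSCharacterSet characterSetWithCharactersInString:
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"];
    NSMutableString *result = [NSMutableString string];
    for (NSUInteger x = 0; x < inputString.length; x++)
    {
        unichar currChar = [inputString characterAtIndex:x];
        if ([valid characterIsMember:currChar])
            [result appendFormat:@"%C", currChar];
    }
    // C identifiers may not start with a digit, and must not be empty:
    if (result.length == 0
        || ([result characterAtIndex:0] >= '0' && [result characterAtIndex:0] <= '9'))
        [result insertString:@"_" atIndex:0];
    return result;
}

Note this silently drops information, which is exactly why the collision problem below comes up.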
Apart from the character set, you may also have to be aware of length limits etc. Early C compilers, for instance, only used the first 8 characters of an identifier, so "ExceptionalHouse" and "ExceptionalCow" both ended up as the same identifier, "Exceptio". I'm hard-pressed to think of a language with such a limit today, but I don't know which languages you're targeting. Maybe one has such a limit.
If you have a case where you can't express a character in a particular character set, you have several options:
1) Transcribe it to an equivalent in the target character set. E.g. u-umlaut (ü) is usually written as "ue". However, you will then have to deal with collisions: what if one user enters the word "Frauen", but another makes up a new word "Fraün"? The latter would transcribe to the former, and you might get unexpected side effects. You might have to maintain a look-up table and, if you find a collision like that, make the name unique again, e.g. by naming one "frauen" and the other "frauen2" (see the sketch after this list). IIRC there are official transcriptions for many languages, e.g. Romaji for Japanese characters.
2) Fail and tell the user what the valid characters are, and only let them enter valid characters.
3) Transcribe in some other way, e.g. by base64-encoding, or using a hex representation of the given byte sequence, or whatever. This way you could keep ASCII sentences valid but rewrite everything else. Even then you could have collisions, though: if you replace spaces with underscores, what about a name that already contains a literal underscore in that position?
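Here's the uniquing step mentioned in option 1, as a rough sketch (again, the function name and the idea of passing in a set of already-used names are assumptions of this example, not an established API):

// Append 2, 3, ... to a transcribed base name until it no longer collides
// with a name we've already handed out, then remember it as used.
NSString *UKUniqueNameForName(NSString *baseName, NSMutableSet *usedNames)
{
    NSString *candidate = baseName;
    NSUInteger counter = 2;
    while ([usedNames containsObject:candidate])
        candidate = [NSString stringWithFormat:@"%@%lu",
                        baseName, (unsigned long)counter++];
    [usedNames addObject:candidate];
    return candidate;
}

With that, transcribing "Frauen" and then "Fraün" would give you "frauen" and "frauen2" (assuming you lowercase during transcription, as in the example above).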
> Is + (id)letterCharacterSet the best choice here?
According to the docs (https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html): "An NSCharacterSet object represents a set of Unicode-compliant characters." The +letterCharacterSet documentation says it "Returns a character set containing the characters in the categories Letters and Marks." A quick Google search turns up http://www.fileformat.info/info/unicode/category/index.htm, where the categories mentioning "Letter" include Greek characters, accented characters, Hiragana and Cyrillic characters among others (most of which are invalid in C identifier names). Oddly, the "Mark" categories seem to include some kinds of punctuation. I couldn't find a section that is obviously just "letters and marks", or two separate "letters" and "marks" sections.
Anyway, I think building your own custom character set from a string including the characters you *know* are valid identifiers in your target programming language(s) is probably the route of least surprise.
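E.g. something like this (a sketch; the function name is made up, and you'd adjust the whitelist string to your target language's identifier rules):

// Validate a proposed identifier against an explicit whitelist of
// characters we *know* are legal in a C identifier.
BOOL UKIsValidCIdentifier(NSString *proposedName)
{
    if (proposedName.length == 0)
        return NO;
    unichar first = [proposedName characterAtIndex:0];
    if (first >= '0' && first <= '9')
        return NO;    // C identifiers may not start with a digit.
    NSCharacterSet *valid = [NSCharacterSet characterSetWithCharactersInString:
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"];
    NSRange badRange = [proposedName rangeOfCharacterFromSet:[valid invertedSet]];
    return (badRange.location == NSNotFound);
}

That gives you option 2 from above (reject invalid input); combine it with the stripping sketch earlier in this mail if you'd rather silently fix names up.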
Cheers,
-- Uli Kusterer
"The Witnesses of TeachText are everywhere..."
http://hammer-language.com