Re: Best guess at expressing a string as a variable
- Subject: Re: Best guess at expressing a string as a variable
- From: "email@hidden" <email@hidden>
- Date: Wed, 23 Jan 2013 13:34:02 +0000
On 23 Jan 2013, at 13:07, Uli Kusterer <email@hidden> wrote:
> On Jan 23, 2013, at 2:18 AM, email@hidden wrote:
>> Hmm. Maybe not. I want to keep the generated variable name legible.
>
> You need to nail down the languages you want to deploy to, and then find out what their criteria for identifiers are. Then you can decide to either generate identical lowest-common-denominator-names for all of them (which is [a-zA-Z_]([a-zA-Z_0-9]*) in the case of C, i.e. it may not start with a number either), or adjust what characters to permit based on the target programming language.
The target languages are known: http://www.mugginsoft.com/kosmictask/help/languages
The app uses a plugin-architecture so more may appear.
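For reference, the lowest-common-denominator rule quoted above can be checked in a few lines. This is only a sketch, written in Python for brevity rather than in Cocoa; the function name is made up for illustration, and it does not reject reserved words such as "int" or "for".

```python
import re

# C-style identifier pattern from the quoted message:
# a letter or underscore first, then letters, digits or underscores.
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def is_c_identifier(name: str) -> bool:
    """Return True if the whole string matches the C-style pattern."""
    return C_IDENTIFIER.fullmatch(name) is not None
```

A plugin for a more permissive language would supply a different pattern.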
>
> Apart from character set, you may also have to be aware of length limits etc. Early C compilers, for instance, only used the first 8 characters of an identifier. So "ExceptionalHouse" and "ExceptionalCow" both ended up as the same identifier, "Exceptio". I'm hard pressed to think of a language with such a limit today, but I don't know what languages you're targeting. Maybe one has such a limit.
The plugin defines the language properties so length constraints can be included.
Some experimentation will determine the limits.
>
> If you have a case where you can't express a character in a particular character set, you have several options:
>
> 1) Transcribe it to an equivalent character set. E.g. U-Umlaut (ü) is usually written as "ue". However, you will then have to deal with collisions. E.g. what if one user enters the word "Frauen", but another makes up a new word "Fraün". The latter would transcribe to the former, and you might get unexpected side effects. You might have to generate a look-up-table, and if you find a collision like that, make the name unique again, e.g. by naming one "frauen" and the other "frauen2". IIRC there are official transcriptions for many languages, e.g. Romaji for Japanese characters.
>
> 2) Fail and tell the user what the valid characters are, and only let them enter valid characters.
>
> 3) Transcribe in some other way, e.g. by base64-encoding, or using a hex-representation of the given byte sequence, or whatever. This way you could keep ASCII sentences valid, but modify everything else. But even then you could have collisions. E.g. if you replace spaces with underscores, what if there's a second version with the underscore?
I was intending to decompose U-Umlaut (ü) to u + Umlaut and then discard the umlaut if possible. Or perhaps an API exists to decompose the likes of U-Umlaut (ü) to ue.
I already have collision detection code that appends integers for uniqueness.
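A rough sketch of both ideas, again in Python for illustration only. Unicode canonical decomposition (NFD) turns ü into u plus a combining diaeresis, so dropping the mark gives "u"; it never gives "ue", which needs a language-specific table. GERMAN_TRANSCRIPTION and all function names here are hypothetical, and make_unique only approximates the kind of integer-suffix collision handling mentioned above, not my actual code.

```python
import unicodedata

# Hypothetical hand-written table; Unicode itself does not define these.
GERMAN_TRANSCRIPTION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def transcribe(text: str, table: dict) -> str:
    """Replace precomposed characters using a per-language table."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in composed)


def make_unique(name: str, taken: set) -> str:
    """Append 2, 3, ... until the name is unused, then record it."""
    candidate = name
    suffix = 2
    while candidate in taken:
        candidate = f"{name}{suffix}"
        suffix += 1
    taken.add(candidate)
    return candidate
```

With these, "Fraün" becomes "Fraun" via strip_marks but "Frauen" via the table, so the second route is the one that collides with a user-entered "Frauen" and needs the suffix.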
>
>> Is + (id)letterCharacterSet the best choice here?
>
> According to the docs (https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html): "An NSCharacterSet object represents a set of Unicode-compliant characters." The +letterCharacterSet documentation says "Returns a character set containing the characters in the categories Letters and Marks." So a Google later, here http://www.fileformat.info/info/unicode/category/index.htm
Thanks for the link. I didn't know that the categories were specified by Unicode. I had assumed they were arbitrarily defined by Apple.
> the categories mentioning "Letters" include greek characters, accented characters, hiragana and cyrillic characters among others (most of which are invalid as C identifier names). Oddly, "marks" seem to include some kind of punctuation. I couldn't find a section that is obviously only "letters and marks" or two separate "letters" and "marks" sections.
I see that. Anyhow, I can have a look at the likes of +nonBaseCharacterSet and see how they correlate exactly with the Unicode categories.
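As a starting point for that comparison, the Unicode general category of any character can be queried directly. The sketch below (Python, illustrative only) tests for the Letter (L*) and Mark (M*) categories that the +letterCharacterSet documentation describes; whether Apple's set matches this exactly is an assumption I have not verified.

```python
import unicodedata


def is_letter_or_mark(ch: str) -> bool:
    """True if the character's general category starts with L or M."""
    return unicodedata.category(ch)[0] in ("L", "M")
```

For example, "a", "ü", "あ" and the combining diaeresis U+0308 all pass, while "1", "_" and a space do not.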
>
> Anyway, I think building your own custom character set from a string including the characters you *know* are valid identifiers in your target programming language(s) is probably the route of least surprise.
>
Agree.
I want to get a sensible wide base and restrict it on a per language basis.
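Roughly what I have in mind, sketched in Python rather than Cocoa: each language plugin would supply the characters it allows in first and subsequent positions, plus an optional length limit. IdentifierRules, C_LIKE and sanitize are hypothetical names, and replacing disallowed characters with an underscore is just one possible policy.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdentifierRules:
    first: frozenset                  # characters allowed in position 0
    rest: frozenset                   # characters allowed after that
    max_length: Optional[int] = None  # None means no known limit


_LETTERS = frozenset(string.ascii_letters + "_")

# Assumed rules for a C-like language.
C_LIKE = IdentifierRules(first=_LETTERS,
                         rest=_LETTERS | frozenset(string.digits))


def sanitize(name: str, rules: IdentifierRules, replacement: str = "_") -> str:
    """Replace disallowed characters, then apply the length limit."""
    out = []
    for i, ch in enumerate(name):
        allowed = rules.first if i == 0 else rules.rest
        out.append(ch if ch in allowed else replacement)
    result = "".join(out)
    if rules.max_length is not None:
        result = result[:rules.max_length]
    return result
```

An empty input still yields an empty string, and truncation can reintroduce collisions, so the result would still go through the uniqueness check.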
Thanks for such a detailed reply.
Jonathan