Michael Hall wrote:
>Second I was thinking each String getBytes() would be a direct hex
>translation of the Unicode value, which I have verified it does not
>seem to be. I'm still not completely sure that makes what I was
>suggesting complete nonsense.
I don't know what you mean by "a direct hex translation" of a Unicode value.
>Does java String.getBytes("\u0439") always produce the same hex value
>- whatever it might be?
It doesn't: what it returns depends on what the default encoding is. Read
The Fine Manual section describing String.getBytes().
And I suspect you meant this:
"\u0439".getBytes()
as the only other possible interpretation is that there is an encoding
whose name is Cyrillic-small-letter-short-i, which seems unlikely.
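Here is a minimal sketch of the difference between the no-arg call and
one that names its encoding. The class name GetBytesDemo and the hex()
helper are made up for illustration, and StandardCharsets assumes Java 7
or later. The first line printed will vary from machine to machine; the
others should not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Format bytes as space-separated uppercase hex, e.g. "D0 B9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u0439"; // CYRILLIC SMALL LETTER SHORT I

        // No-arg getBytes(): uses the platform default, so this line varies.
        System.out.println(Charset.defaultCharset() + " -> " + hex(s.getBytes()));

        // Explicit charsets: the same bytes on every machine.
        System.out.println("UTF-8 -> " + hex(s.getBytes(StandardCharsets.UTF_8)));           // D0 B9
        System.out.println("UTF-16BE -> " + hex(s.getBytes(StandardCharsets.UTF_16BE)));     // 04 39
        System.out.println("ISO-8859-1 -> " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 3F, a '?'
    }
}
```

ISO-8859-1 has no Cyrillic letters, so getBytes() quietly substitutes a
question mark rather than failing.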
>If it doesn't then what I assumed was in fact complete nonsense and
>every Unicode string has to correspond to one and only one character
>encoding that knows how to handle it correctly.
False, if I understand what you're saying correctly.
A single Unicode string can be encoded as many different byte-sequences.
Conversely, many different byte-sequences can decode to the same Unicode
string. I mentioned ASCII and EBCDIC earlier. Clearly, the byte-strings are
different, but both can represent the same Unicode string.
Going the other direction, a constant byte sequence can be decoded into
very different Unicode strings, depending on what you decode the bytes as.
EBCDIC will get you one thing, and UTF-8 quite another.
The question is context, and that context is "What is the byte-encoding?"
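To make the "constant byte sequence" case concrete, here is a small
sketch that decodes the same two bytes twice. The class and method names
are hypothetical, and I use UTF-8 and ISO-8859-1 rather than EBCDIC only
because every Java runtime is required to support those two.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContextDemo {
    // The same two bytes are used for every decode below.
    static final byte[] BYTES = { (byte) 0xD0, (byte) 0xB9 };

    // Decode the fixed bytes under the given charset.
    static String decodeAs(Charset cs) {
        return new String(BYTES, cs);
    }

    public static void main(String[] args) {
        String utf8 = decodeAs(StandardCharsets.UTF_8);        // one char, U+0439
        String latin1 = decodeAs(StandardCharsets.ISO_8859_1); // two chars, U+00D0 U+00B9

        System.out.println("as UTF-8:      " + utf8.length() + " char(s)");
        System.out.println("as ISO-8859-1: " + latin1.length() + " char(s)");
        System.out.println("same string?   " + utf8.equals(latin1)); // false
    }
}
```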
>Bad assumption on the direct mapping. However, not complete nonsense
>in that then...
>new String("cyrllic","SuperHybridLatinCyrllicCharSet"); // might be
>possible and not complete nonsense.
True. But you'd have to define SuperHybridLatinCyrllicCharSet as the context.
There is nothing magical about charsets or encodings. Unicode is an
encoding. Even binary is an encoding. Contrast 2's-complement,
1's-complement, sign-magnitude, and BCD; they are all "binary", but
arithmetic is different in each one.
A bit-pattern only has meaning given an interpretation, aka an encoding
that tells you what the patterns mean.
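As a rough illustration, here is one 8-bit pattern, 0x81, read under the
four interpretations named above. The class and method names are mine,
and the BCD reading assumes packed BCD with a valid decimal digit in each
nibble.

```java
class BitPatternDemo {
    // Two's complement is how Java itself reads a byte.
    static int twosComplement(byte b) {
        return b;
    }

    // One's complement: a set top bit means "negate by inverting all bits".
    static int onesComplement(byte b) {
        int u = b & 0xFF;
        return (u & 0x80) == 0 ? u : -(~u & 0xFF);
    }

    // Sign-magnitude: top bit is the sign, low seven bits are the magnitude.
    static int signMagnitude(byte b) {
        int u = b & 0xFF;
        int magnitude = u & 0x7F;
        return (u & 0x80) == 0 ? magnitude : -magnitude;
    }

    // Packed BCD: each nibble is one decimal digit (assumed to be 0..9).
    static int packedBcd(byte b) {
        int u = b & 0xFF;
        return (u >> 4) * 10 + (u & 0x0F);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x81; // bit pattern 1000 0001
        System.out.println("two's complement: " + twosComplement(b)); // -127
        System.out.println("one's complement: " + onesComplement(b)); // -126
        System.out.println("sign-magnitude:   " + signMagnitude(b));  // -1
        System.out.println("packed BCD:       " + packedBcd(b));      // 81
    }
}
```

Same eight bits, four different numbers. Nothing in the bits says which
reading is the right one.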
>However if it always produces the same hex bytes then you could vary
>the encoding and it _might_ sense.
I try not to write code that only _might_ make sense.
>You could have
>String mac_cyrllic = new String("\u0439".getBytes
>(),"MacCryllic")); // and it would work
>or
>String other_cyrllic_charset = new String("\u0439".getBytes
>(),"OtherCryllicCharSet")); // and it would also work
The plain no-arg getBytes() calls use a default encoding. The default
encoding is context-dependent. Therefore, the results of this code
fragment are context-dependent. In other words, whether its output makes
sense or not depends on context.
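One way to see the context-dependence is to pin down both halves of the
round trip explicitly. The sketch below uses a hypothetical roundTrip()
helper and sticks to charsets every Java runtime must provide; your
MacCyrillic case would behave analogously, but I have not assumed that
charset is installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with one charset, then decode the resulting bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u0439";

        // Matching charsets that can represent the character: survives.
        System.out.println(s.equals(
                roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true

        // Mismatched charsets: two unrelated Latin-1 characters come back.
        System.out.println(s.equals(
                roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false

        // Matching charsets that cannot represent the character: a '?' comes back.
        System.out.println(
                roundTrip(s, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)); // ?
    }
}
```

With the no-arg getBytes(), the first argument to roundTrip() is in
effect chosen by whatever machine the code happens to run on.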
>Basically rightly or wrongly you are claiming the encoding handles
>the string.
No I'm not, because it doesn't. Read The Fine Source and see for yourself.
>Specifying the encoding does not itself do any byte
>conversions in the String constructor.
False. The encoding, either a named one or the default, leads to a
converter object. That object converts the bytes you give it into Java
chars, a primitive type that is 16 bits wide and holds Unicode (UTF-16
code units). The constructor then builds the String from the resulting
sequence of 16-bit chars.
A String always holds an array of 16-bit Java chars, and the bit-patterns
in that array always represent Unicode characters.
The object that converts your bytes into Java chars is a decoder, and its
implementation and class name determine which encoding-name it will be
associated with. A decoder is most definitely doing byte conversions.
If you get bytes from a String, the inverse occurs: an encoder object
translates Java chars (16-bit data, Unicode encoding) into a series of
bytes. That's the inverse "byte conversion", and it is precisely what
encoders do.
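The public counterparts of these converter objects are CharsetDecoder and
CharsetEncoder in java.nio.charset. The sketch below drives them by hand;
it is not a claim about which classes String uses internally, and the
CodecDemo name and its helpers are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodecDemo {
    // Decoder: bytes in, Java chars out.
    static String decode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("bytes are not valid " + cs, e);
        }
    }

    // Encoder: Java chars in, bytes out.
    static byte[] encode(String s, Charset cs) {
        try {
            ByteBuffer buf = cs.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("cannot encode as " + cs, e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xD0, (byte) 0xB9 };
        String s = decode(bytes, StandardCharsets.UTF_8);
        System.out.println(s.equals("\u0439"));                          // true
        System.out.println(encode(s, StandardCharsets.UTF_16BE).length); // 2
    }
}
```

Unlike String.getBytes(), a freshly created encoder reports characters it
cannot map instead of replacing them, so encode("\u0439",
StandardCharsets.ISO_8859_1) throws here.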
>I guess I was thinking maybe MacRoman was sort of filling this
>SuperHybrid function on the OS X platform. Again I missed the switch
>to MacCryllic so I was in fact wrong. But being mistaken does not
>make the idea nonsense. Although there may very well be other
>considerations that make composite Latin+Cyrllic character sets
>_nonsense_.
There is nothing intrinsically nonsensical about composite charsets of ANY
combination of alphabets (or even non-alphabets). Unicode is just such a
composite charset.
I couldn't comment on the IDEA of what your code was trying to do, because
I couldn't tell what that idea was. All I could comment on was whether it
was doing the correct things, given the context. And the answer to that
was "No."
It was doing something that wasn't entirely random, but was certainly not
correct, and appeared to be based on some flawed assumptions. It SEEMED to
make sense in a "not quite right" kind of way, and that is what I call
nonsense.
"Nonsense" isn't necessarily pejorative. Lewis Carroll's "Jabberwocky" is
fine nonsense, an example of an entire category called Nonsense Verse.
Hacker Lore is filled with nonsense: programming languages intentionally
created to be as abstruse as possible, contests that reward obscure programs
that perform useful functions, etc.
>Not sure on the PrintStream stuff. You're probably right that it all
>depends on file.encoding.
Again, RTFM, or even RTFS: the source is included with Apple's Java
Developer downloads. Or disassemble it by running 'javap -c' on
java.io.PrintStream.
>However, I was thinking the String
>constructor encoding might set a field for the String instance that
>OutputStream's/gui components could use eventually in displaying a
>glyph.
Strings always and only have one encoding: Unicode. The way Strings get
Unicode is by having decoder objects convert bytes to Unicode.
Java isn't C (or C++ or ObjC), where you have to track which encoding your
bytes represent, or whether they're even bytes (could be wchar_t).
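A quick way to convince yourself that a String keeps no record of the
encoding it was built from: decode two different byte sequences under two
different charsets and compare the results. The class and method names
below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class NoEncodingFieldDemo {
    // Built from the UTF-8 bytes for U+0439.
    static String fromUtf8() {
        return new String(new byte[] { (byte) 0xD0, (byte) 0xB9 }, StandardCharsets.UTF_8);
    }

    // Built from the UTF-16BE bytes for U+0439.
    static String fromUtf16be() {
        return new String(new byte[] { 0x04, 0x39 }, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String a = fromUtf8();
        String b = fromUtf16be();
        // Different bytes, different charsets, indistinguishable Strings.
        System.out.println(a.equals(b));                 // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```

Once the decoder has done its work, the original bytes and the charset
name are gone; all that is left is the Unicode.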
-- GG