On Oct 17, 2005, at 6:43 AM, Alexey Proskuryakov wrote:
Well, there are several quite popular encodings you could try before falling back to MacRoman. Shift-JIS springs to mind right away :)
MacCyrillic springs next ;-)
Well, I thought of Shift-JIS because it's easy to reject malformed text: if it parses as Shift-JIS without error, it's probably Shift-JIS, so you won't get a ton of false positives. MacCyrillic, OTOH, isn't a double-byte character set, so every byte sequence ought to be valid (much like MacRoman). That makes false positives a lot harder to reject; you need to use the statistical analysis method, which takes a lot more effort to develop.
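
To make the "try strict encodings first, then fall back" idea concrete, here is a rough sketch in Python. It is only an illustration, not what any particular application does: the function name guess_decode and the candidate list are made up for this example, and it relies on Python's built-in shift_jis and mac_roman codecs.

```python
def guess_decode(data, candidates=("shift_jis",), fallback="mac_roman"):
    """Return (encoding_name, text) for the first encoding that fits.

    Hypothetical helper: strict multi-byte encodings are tried first
    because malformed input makes them fail loudly. The single-byte
    fallback accepts nearly anything, so it must come last.
    """
    for name in candidates:
        try:
            # Strict decoding raises on invalid or truncated sequences.
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    # Assumption: the fallback codec maps every byte value, so this
    # decode is not expected to fail.
    return fallback, data.decode(fallback)


if __name__ == "__main__":
    print(guess_decode("日本語".encode("shift_jis")))
    # 0x8E is a Shift-JIS lead byte with no trail byte here, so the
    # strict decode fails and the MacRoman fallback is used.
    print(guess_decode(b"caf\x8e"))
```

Note that this can still guess wrong: MacRoman text whose high bytes happen to pair up as valid Shift-JIS will be accepted as Shift-JIS. Adding MacCyrillic to the candidate list would not help, since it would accept essentially any input; that case needs the statistical approach.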