Re: Encoding
- Subject: Re: Encoding
- From: Shane Stanley <email@hidden>
- Date: Sun, 31 Aug 2014 22:41:26 +1000
On 31 Aug 2014, at 7:14 pm, koenig.yvan <email@hidden> wrote:
> To be complete, when I read the file with a hexadecimal editor,
> the first occurrence is displayed as:
>
> 63 65 69 63 E3 6F 20 which, according to character ID values, means ceicão
>
> the second is displayed as:
>
> 63 65 69 63 C3 E3 6F 20 which, according to UTF-8 values, also means ceicão
I think you've made an error in transcribing. That last case should be:
63 65 69 63 C3 A3 6F 20
That's what's showing in the screenshot you sent me off-list, and that's what I get when reading the file "as data". So there's no conflict.
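To make the byte-level point concrete, here is a minimal sketch in Python (not AppleScript) that decodes the three byte sequences quoted above. It assumes Latin-1 as a stand-in for the single-byte encoding of the first occurrence, since Python ships no PDFDocEncoding codec; the two agree on byte E3. The helper name is_valid_utf8 is made up for illustration.

```python
# The three byte sequences from the thread, written as hex.
first_bytes = bytes.fromhex("63 65 69 63 E3 6F 20")       # single-byte encoding
second_bytes = bytes.fromhex("63 65 69 63 C3 A3 6F 20")   # UTF-8, as corrected
mistyped_bytes = bytes.fromhex("63 65 69 63 C3 E3 6F 20") # as transcribed

# Assumption: Latin-1 stands in for the single-byte encoding here.
first_text = first_bytes.decode("latin-1")
second_text = second_bytes.decode("utf-8")


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Both correct sequences yield the same text, just via different encodings.
print(first_text == second_text)
# C3 must be followed by a continuation byte (80-BF); E3 is not one.
print(is_valid_utf8(mistyped_bytes))
```

The mistyped sequence fails to decode because C3 opens a two-byte UTF-8 sequence and E3 cannot complete it, which is consistent with it being a transcription slip rather than what is in the file.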
The important missing piece of information is that it is a PDF file. A PDF is made up of several sections, and they are not all encoded the same way. The last case is from the XMP metadata part of the file, which is XML, presumably encoded as UTF-8, and the earlier example looks to be in PDFDocEncoding, which I gather is a variation on ISO-8859-1.
There's no way you can sensibly read a full PDF file using a single encoding -- it has to be treated as a binary file.
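As a rough illustration of treating the file as binary and decoding only one part of it, the Python sketch below reads raw bytes and pulls out just the XMP packet before decoding it as UTF-8. It assumes the packet is stored uncompressed, is delimited by the usual `<?xpacket begin` and `<?xpacket end` markers, and really is UTF-8; none of that is guaranteed for a given PDF. The function name extract_xmp_packet and the synthetic sample bytes are invented for the example.

```python
def extract_xmp_packet(pdf_bytes):
    """Return the XMP packet as text, or None if no packet is found.

    Hypothetical helper: assumes an uncompressed, UTF-8 encoded packet.
    """
    start = pdf_bytes.find(b"<?xpacket begin")
    if start == -1:
        return None
    end_marker = pdf_bytes.find(b"<?xpacket end", start)
    if end_marker == -1:
        return None
    end = pdf_bytes.find(b"?>", end_marker)
    if end == -1:
        return None
    # Decode only this slice; the rest of the file stays as raw bytes.
    return pdf_bytes[start:end + 2].decode("utf-8")


# A tiny synthetic stand-in for a PDF (not a valid file): one single-byte
# string near the top and one UTF-8 XMP packet further down.
sample = (
    b"%PDF-1.4\n/Title (ceic\xe3o )\n"
    + '<?xpacket begin="" id="hypothetical"?><x>ceicão</x><?xpacket end="w"?>'.encode("utf-8")
    + b"\n%%EOF\n"
)

xmp_text = extract_xmp_packet(sample)
print(xmp_text)
```

For a real file the bytes would come from something like `open(path, "rb").read()`. Decoding the whole sample as UTF-8 raises an error because of the lone E3 byte in the first string, which is the practical reason for slicing first and decoding second.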
I think we canvassed the options for reading the text in a PDF here a while back.
--
Shane Stanley <email@hidden>
<www.macosxautomation.com/applescript/apps/>