Re: about pdf content parsing (Chinese or Korea or Japanese)
- Subject: Re: about pdf content parsing (Chinese or Korea or Japanese)
- From: Wim Lewis <email@hidden>
- Date: Wed, 10 Aug 2011 12:08:06 -0700
On 9 Aug 2011, at 9:30 PM, Jerry Huang wrote:
> Has anyone resolved the issue of parsing Chinese/Korean/Japanese PDF content? We encounter the issue and need advice.
>
> It was originally described in the following thread,
>
> http://lists.apple.com/archives/quartz-dev/2010/Jul/msg00018.html
>
> [Quote from the thread]
>
> Dear All:
>
> I am trying to extract text from a CGPDF page by catching the PDF operators Tj / TJ in a callback function.
> I can now extract text from PDF files, but only from English ones (and for some English PDF files I still cannot get the correct text; I don't know why).
> If the PDF file is Chinese, Korean, or Japanese, the extracted text is all garbled.
There are several difficulties with extracting text from a PDF file (some of which were brought up in the previous thread):
1. Font encoding. It's possible for a font embedded in a PDF document to arbitrarily change the mapping between bytes and glyphs (or characters). This is common if the PDF producer embedded a font subset. To deal with this, you will need to look at the Font object and its /Encoding key.
2. String encoding. By default, strings are encoded using PDFDocEncoding, which is a superset of ASCII but is not the same as ISO 8859-1 (Latin-1). (It's closer to WinANSI but not quite the same.) In some contexts, strings can begin with a 2-byte prefix (the byte order mark 0xFE 0xFF) which indicates that they're in UTF-16[1].
3. Logical layout. PDF files indicate where characters should be drawn on the page, but do not necessarily indicate how those characters are logically related to each other. You'll need to look at the locations where text is drawn and guess whether each fragment is part of the same word, or same line, as another, and how to assemble lines into logical flows (extra difficult if there is multicolumn text, callouts, or other format changes). There are optional structures that can appear in a PDF file to help you with this[2] but not all producers emit them.
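
As a rough illustration of point 2, here is a minimal Python sketch of decoding a PDF text string. It checks for the 0xFE 0xFF prefix and otherwise falls back to a byte-by-byte mapping. The names decode_pdf_text_string and PDFDOC_OVERRIDES are made up for this example, and the override table is only a small subset of the places where PDFDocEncoding differs from Latin-1; the complete table is in the PDF reference. Note that this applies to text strings such as document metadata. The bytes passed to Tj / TJ must be interpreted through the current font's encoding (point 1), so this alone will not fix garbled CJK page content.

```python
# Sketch only: decode a PDF "text string" (e.g. a metadata value).
# Strings shown with Tj/TJ depend on the font's encoding instead.

UTF16_BE_BOM = b"\xfe\xff"

# Partial, illustrative subset of PDFDocEncoding codes that differ from
# Latin-1. The full table is in the PDF reference; this is an assumption
# made to keep the example short.
PDFDOC_OVERRIDES = {
    0x80: "\u2022",  # bullet
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0xA0: "\u20ac",  # euro sign
}


def decode_pdf_text_string(raw: bytes) -> str:
    """Return the Unicode text for the raw bytes of a PDF text string."""
    if raw.startswith(UTF16_BE_BOM):
        # The 2-byte prefix marks the rest of the string as big-endian UTF-16.
        return raw[2:].decode("utf-16-be")
    # Otherwise approximate PDFDocEncoding: use the override table where it
    # has an entry and treat every other byte as its Latin-1 code point.
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


if __name__ == "__main__":
    print(decode_pdf_text_string(b"\xfe\xff\x4e\x2d\x65\x87"))
    print(decode_pdf_text_string(b"plain ASCII"))
```

Treating the two cases separately matters because decoding a prefixed string as a single-byte encoding is one common way to end up with garbled output.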
[1] See PDF reference version 1.5 section 3.8.1
[2] See PDF reference version 1.5 section 10.6 and following
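
For point 3, the guessing can be sketched roughly as follows, again in Python. This assumes you have already computed a position and width for each decoded text fragment from the text matrix; the Fragment type, group_into_lines, and the y_tolerance and gap_factor thresholds are all hypothetical names and values chosen for illustration. It only handles a single column of horizontal text, so multicolumn pages, rotated text, and vertical CJK writing would need more work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    x: float      # left edge of the fragment, in page units
    y: float      # baseline position, in page units
    width: float  # advance width of the whole fragment
    text: str     # already-decoded text


def group_into_lines(fragments: List[Fragment],
                     y_tolerance: float = 2.0,
                     gap_factor: float = 0.25) -> List[str]:
    """Group fragments into lines by baseline, then join them left to right."""
    lines = []  # each entry is (baseline_y, [fragments on that line])
    # PDF page coordinates grow upward, so sort from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        parts = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            gap = cur.x - (prev.x + prev.width)
            avg_char = prev.width / max(len(prev.text), 1)
            # Guess that a noticeable horizontal gap is a word break.
            if gap > gap_factor * avg_char:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


if __name__ == "__main__":
    page = [
        Fragment(10, 700, 20, "Hel"),
        Fragment(30, 700.5, 10, "lo"),
        Fragment(45, 700, 30, "world"),
        Fragment(10, 680, 25, "next"),
    ]
    print(group_into_lines(page))
```

The thresholds are the fragile part: values that work for one producer's output may split or merge words in another's, which is why the optional structure information is preferable when it is present.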
_______________________________________________
Xcode-users mailing list (email@hidden)