Re: NSLinguisticTagger
- Subject: Re: NSLinguisticTagger
- From: "Gerriet M. Denkmann" <email@hidden>
- Date: Wed, 24 Sep 2014 12:52:01 +0700
On 24 Sep 2014, at 12:23, Roland King <email@hidden> wrote:
>
>> On 24 Sep 2014, at 1:02 pm, Gerriet M. Denkmann <email@hidden> wrote:
>>
>>
>> On 24 Sep 2014, at 11:46, Roland King <email@hidden> wrote:
>>
>>>
>>>> On 24 Sep 2014, at 12:31 pm, Gerriet M. Denkmann <email@hidden> wrote:
>>>>
>>>> I have a problem with NSLinguisticTagger / CFStringTokenizer on iOS 8.0: it treats "สีเหลือง" as a single word.
>>>>
>>>> OS X 10.9.5 (and iOS 7 and earlier) parses "สีเหลือง" quite rightly as two words: "สี" = colour and "เหลือง" = yellow.
>>>>
>>>> No dictionary will ever contain "yellow colour". Every dictionary will contain "yellow" and "colour".
>>>> There are hundreds, if not thousands of these expressions, which are wrongly classified as one word.
>>>> Might have something to do with the new predictive keyboard.
>>>>
>>>> I am not writing this to complain, though, but to ask for a favour: could anybody on 10.10 double-click anywhere in "สีเหลือง" and tell me whether all of it gets highlighted, or just a part (as in 10.9.5)?
>>>
>>>
>>> If I double click anywhere on the right of that I get the second part (all bar the first character) highlighted. Clicking on the first character I get just that character. So 10.10 (beta 8) splits that sequence into two ‘words’.
>> This is a big relief. Thanks a lot.
>>
>>>
>>> Why do you suspect the predictive keyboard? Certainly wouldn’t be the first thing I thought of seeing that issue. I would probably instead assume I’d written myself a bug.
>>
>> Well, here is the code; maybe you can find a bug:
>>
>> let text = "สีเหลือง"
>> let opts: Int = 0
>> let schemes = [ NSLinguisticTagSchemeTokenType, NSLinguisticTagSchemeNameTypeOrLexicalClass ]
>> let tagger = NSLinguisticTagger(tagSchemes: schemes, options: opts )
>>
>> let nsText = text as NSString
>> let length = nsText.length
>> tagger.string = nsText
>> let range = NSMakeRange(0,length)
>> let theScheme = NSLinguisticTagSchemeTokenType
>> let ops = NSLinguisticTaggerOptions(0)
>> tagger.enumerateTagsInRange(
>>     range,
>>     scheme: theScheme,
>>     options: ops,
>>     usingBlock:
>>     { ( tag: String!,
>>         tokenRange: NSRange,
>>         sentenceRange: NSRange,
>>         stop: UnsafeMutablePointer<ObjCBool>
>>       ) -> Void in
>>
>>         let word = nsText.substringWithRange(tokenRange)
>>         println("\(tag) = \(word) ")
>>     }
>> )
>>
>> Gerriet.
>>
>
>
>
> Here’s my version I was just writing. I ran it in an iOS playground AND in an OS X playground and got the same ‘single word’ result both times, so I’m not entirely sure that the click test on OS X proved anything. If you comment out the Thai string and uncomment the Chinese one, it works better and splits stuff up, although the last two words are wrong there as well; they should be “去” and “健身房”. It’s the same in an OS X playground and an iOS one, but then again iOS playgrounds are emulated, so ..
>
> I also compiled it as an OS X command line tool and it does the same thing for my phrase AND yours. So whatever is doing the highlighting when you ‘click’ isn’t the same thing NSLinguisticTagger is doing.
>
> The click test works on my Chinese phrase too; it gets the last two words correct. Something sure ain’t right.
>
> I should write the ObjC version to eliminate any possibility that it’s Swift.
I have an app in ObjC using NSLinguisticTagger, which on 10.9.5 prints:
"我" = Word
"今天" = Word
"还" = Word
"没有" = Word
"去健" = Word <-- wrong
"身房" = Word <-- wrong
But when I click on "去" I just get "to go",
and when I click on "健身房" I get "gym".
So, you are right: the clicking algorithm seems NOT to be using NSLinguisticTagger. And I didn't go to the gym either.
Investigating further (again ObjC on 10.9.5):
CFStringTokenizer is as wrong as NSLinguisticTagger.
ICU 51.1 is correct:
token[1] {0, 1} = "我" -- UnKnown Word --
token[2] {1, 2} = "今天" -- UnKnown Word --
token[3] {3, 1} = "还" -- UnKnown Word --
token[4] {4, 2} = "没有" -- UnKnown Word --
token[5] {6, 1} = "去" -- UnKnown Word --
token[6] {7, 3} = "健身房" -- UnKnown Word --
NSTextView (selectionRangeForProposedRange:granularity: with NSSelectByWord) and NSAttributedString (doubleClickAtIndex:) are as correct as ICU.
I thought that all of these were based on ICU, but this proves that I am wrong.
I should probably use doubleClickAtIndex:, now that iOS has attributed strings.
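
Editorial sketch of the double-click approach mentioned above, in later Swift syntax rather than the Swift 1.0 used in this thread. It assumes macOS with AppKit, where doubleClickAtIndex: is spelled doubleClick(at:); the helper name doubleClickRanges is hypothetical. Whether the resulting ranges differ from what NSLinguisticTagger reports on a given OS release is exactly what the thread is trying to find out, so no expected output is shown.

```swift
import AppKit

/// Walks the string from left to right and collects the ranges that a
/// double-click would select. Ranges are in UTF-16 units, like the
/// {location, length} pairs printed earlier in this message.
/// (Hypothetical helper, not part of any Apple API.)
func doubleClickRanges(in text: String) -> [NSRange] {
    let attributed = NSAttributedString(string: text)
    let length = attributed.length
    var ranges: [NSRange] = []
    var index = 0
    while index < length {
        let range = attributed.doubleClick(at: index)
        ranges.append(range)
        // Always move forward, even if an unexpected range comes back.
        index = max(NSMaxRange(range), index + 1)
    }
    return ranges
}

let sample = "我今天还没有去健身房"
for range in doubleClickRanges(in: sample) {
    let word = (sample as NSString).substring(with: range)
    print("\(NSStringFromRange(range)) = \"\(word)\"")
}
```

Running the same string through NSLinguisticTagger and comparing the two lists would show where the two mechanisms disagree.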
> let str = "สีเหลือง"
> //let str = "我今天还没有去健身房"
> let str2 = str as NSString
>
> let tagger = NSLinguisticTagger(tagSchemes: [NSLinguisticTagSchemeTokenType], options: 0 )
>
>
> let range = NSMakeRange( 0, str2.length )
>
> tagger.string = str2
>
> var ranges : NSArray?
> let things = tagger.tagsInRange( range, scheme: NSLinguisticTagSchemeTokenType, options: NSLinguisticTaggerOptions.allZeros, tokenRanges: &ranges )
> things.count
>
> ranges
>
> for ( index, type ) in enumerate( things )
> {
>     // tokenRanges come back as NSValue objects wrapping an NSRange
>     let type_range : NSValue? = ranges?[ index ] as NSValue?
>     print( "Type: '\(type)' at \(type_range!) ")
>     println( str2.substringWithRange( type_range!.rangeValue ) )
>
> }
>
>
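
Editorial sketch for readers on a newer toolchain: both snippets in this thread use Swift 1.0 spellings (println, enumerate, NSLinguisticTagSchemeTokenType). A rough equivalent with the later Foundation names might look like the following. It assumes Swift 4 or later on an Apple platform; the helper name tokenTypes is hypothetical. No output is shown because, as the thread demonstrates, the token boundaries depend on the OS release.

```swift
import Foundation

/// Returns (tag, token) pairs for the token-type scheme.
/// (Hypothetical helper, not part of any Apple API.)
func tokenTypes(in text: String) -> [(tag: String, token: String)] {
    let nsText = text as NSString
    let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
    tagger.string = text

    // NSRange counts UTF-16 units, so take the length from the NSString view.
    let fullRange = NSRange(location: 0, length: nsText.length)
    var result: [(tag: String, token: String)] = []

    // Empty options: whitespace and punctuation tokens are reported too.
    tagger.enumerateTags(in: fullRange, scheme: .tokenType, options: []) { tag, tokenRange, _, _ in
        let token = nsText.substring(with: tokenRange)
        result.append((tag: tag?.rawValue ?? "nil", token: token))
    }
    return result
}

for str in ["สีเหลือง", "我今天还没有去健身房"] {
    for (tag, token) in tokenTypes(in: str) {
        print("\(tag) = \"\(token)\"")
    }
}
```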
_______________________________________________
Cocoa-dev mailing list (email@hidden)