sort | uniq -c
It’s a lot faster, too, because it only goes through the text twice (once for sort(1), once for uniq(1)), instead of once for every unique word. uniq(1) prints the count first and then the word; if you want it the other way around, do this:
sort | uniq -c | awk '{ print $2 " " $1 }' | column -t
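For instance, with one word per line on stdin, the pipeline behaves like this (the sample words here are made up for illustration):

```shell
# Each unique word followed by its count, aligned into columns.
printf 'pear\napple\npear\n' | sort | uniq -c | awk '{ print $2 " " $1 }' | column -t
```

The `awk` stage swaps uniq's "count word" order into "word count"; `column -t` then pads the columns so the counts line up.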
(I actually didn’t know about using “column -t” to clean up the columns; that’s a useful trick. Another handy trick is to put the script output on the clipboard by piping to pbcopy(1), like this: “echo hello world | pbcopy”.) For those determined to use grep(1) as part of the solution, you should know about three options:
-w Only match entire words. This cures the “apple as a substring” problem.
-i Match case-insensitively. Depending on your application, you may or may not want this.
-o Only print the matching part of the line.
grep(1) normally prints the entire line that the match occurs in. For the originator, this didn’t matter, because every line had only one word, but when searching arbitrary text, it makes a big difference: this is probably why Omar got 75 for his “Hedges” example — 75 is the number of words in the line where “Hedges” occurs.
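Combining -o and -w sidesteps that problem when counting occurrences in running text. A quick illustration (the sample sentence is mine, not from the thread):

```shell
# -o prints each whole-word match on its own line, so wc -l counts
# occurrences rather than the words on every matching line.
# "Hedges" appears twice as a whole word here; lowercase "hedges"
# is not matched because -i was not given.
printf 'Hedges trimmed the hedges near the Hedges estate\n' | grep -ow 'Hedges' | wc -l
```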
—Chris Nebel
AppleScript Engineering
On Sep 8, 2014, at 10:43 AM, Christopher Stone <email@hidden> wrote:
On Sep 08, 2014, at 10:22, Christopher Stone <email@hidden> wrote:
When run from a 10,000-word file or BBEdit window using your example words, the run-time is less than 2/10 of a second on my system.
______________________________________________________________________
Hey Guido,
Oh, yeah. If you want to run a TextWrangler text filter, you can do this. It's just about instantaneous on the same 10K-word test file.
Remember that text filters are destructive, so if you want to keep the original, be sure to run on a copy.
--
Best Regards,
Chris
#! /usr/bin/env bash
# TextWrangler text filter: print each unique word with its frequency.
T=$(tr '\r' '\n')                      # convert classic-Mac CR line endings to LF (one word per line)
S=$(sort -u <<< "$T")                  # one copy of each word
A=""
for i in $S
do
    # Count exact whole-line matches; -x -F avoids the
    # "apple as a substring of pineapple" problem noted above.
    X=$(grep -cxF -- "$i" <<< "$T")
    A="$A$i $X\n"
done
echo -e "$A" | column -t