On Jan 27, 2009, at 9:22 AM, Chris Williams wrote:
Virtually all applications I’ve seen that are “slow” are slow not because of CPU time, but because of other operations (such as disk and network accesses) that are orders of magnitude more expensive. Or because they needlessly do operations many times where once would do.
This is true for many traditional desktop applications (word processing, spreadsheets, and so on), but not for scientific applications, where the working set often fits the cache and matches the prefetcher's access patterns well; there the CPU is the bottleneck simply because of how long the computation itself takes.
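For concreteness, here is a toy sketch (mine, not taken from any real code) of the kind of kernel I mean. If the array is a few thousand floats it fits comfortably in L1/L2, so after the first pass the run time is essentially pure arithmetic:

#include <stddef.h>

/* Toy smoothing kernel: repeatedly sweeps a small array.
   Once the array is resident in cache, memory traffic is negligible
   and the CPU's arithmetic throughput is the bottleneck. */
void smooth(float *x, size_t n, int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        for (size_t i = 1; i + 1 < n; ++i) {
            x[i] = 0.25f * x[i - 1] + 0.5f * x[i] + 0.25f * x[i + 1];
        }
    }
}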
Certainly you know more about the specifics of your application and its larger objectives than a compiler can know, and therefore you CAN write code that is faster in an individual case than what a compiler produces. That is quite true.
But I have said this before: in 25+ years in the software business I've seen precisely one case where someone made huge performance gains in an application by hand-coding or tricking the compiler, and that was almost 20 years ago, when the person wrote code that just fit in the 80286 prefetch cache.
You'd be surprised at how often that same trick is used today. It's also why things like cache affinity in the scheduler have become important. Not only is code being written to fit in the cache, but data is being partitioned so that cooperating threads can work on it together, with cores that share a cache reusing the same lines rather than each refetching them from memory.
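As a rough sketch of what I mean by partitioning (illustrative only; it assumes pthreads and a 64-byte cache line, so adjust for your hardware):

#include <pthread.h>
#include <stddef.h>

#define NTHREADS   4
#define CACHE_LINE 64   /* assumed cache-line size */

/* Each thread gets its own accumulator, padded out to a full cache line
   so the threads never write to the same line (no false sharing). */
struct partial {
    double sum;
    char   pad[CACHE_LINE - sizeof(double)];
};

struct job {
    const double  *data;
    size_t         begin, end;   /* contiguous slice owned by this thread */
    struct partial *out;
};

static void *sum_slice(void *arg)
{
    struct job *j = arg;
    double s = 0.0;
    /* Contiguous, unit-stride access: friendly to the prefetcher and to
       whatever cache this core shares with its neighbors. */
    for (size_t i = j->begin; i < j->end; ++i)
        s += j->data[i];
    j->out->sum = s;
    return NULL;
}

double parallel_sum(const double *data, size_t n)
{
    pthread_t      tid[NTHREADS];
    struct job     jobs[NTHREADS];
    struct partial parts[NTHREADS];
    size_t         chunk = n / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        jobs[t].data  = data;
        jobs[t].begin = (size_t)t * chunk;
        jobs[t].end   = (t == NTHREADS - 1) ? n : (size_t)(t + 1) * chunk;
        jobs[t].out   = &parts[t];
        pthread_create(&tid[t], NULL, sum_slice, &jobs[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(tid[t], NULL);
        total += parts[t].sum;
    }
    return total;
}

The point is that each thread walks its own contiguous slice and writes its own cache line, so the cores cooperate through the shared cache instead of fighting over it.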
In every other case, the return on time investment for this kind of stuff is tiny (or negative); it serves merely to amuse the inner geek in the coder and does little for the larger goal of making the application faster.
Look at any large scientific application that has no user interface and does nothing but number crunching. An improvement in the computational kernel translates directly into a significantly shorter run time. Most of the time that improvement comes from hand-scheduling the instructions or from using processor features the compiler won't use on its own. This is why many math libraries are written in assembly and ship variants tuned for each processor microarchitecture.
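By way of illustration, here is a hand-vectorized dot product next to the naive loop. This is just a sketch using SSE intrinsics; it assumes the arrays are 16-byte aligned and the length is a multiple of 4, and real library kernels go much further than this:

#include <stddef.h>
#include <xmmintrin.h>

/* Naive scalar dot product: the compiler may or may not vectorize this. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-vectorized SSE version: four multiply-adds per iteration.
   Assumes n is a multiple of 4 and a/b are 16-byte aligned. */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}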
In any case, profile first. Optimizing a function that accounts for 1% of an action's execution time can improve overall performance by at most 1%. For performance work, intuition is frequently wrong. Use the tools available to you (Shark, Instruments, DTrace), collect data showing where the bottleneck actually is, and then tackle it.
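To put numbers on that 1% point, here is Amdahl's law in a few lines of C (my sketch, nothing to do with the tools themselves):

#include <stdio.h>

/* Amdahl's law: if a fraction p of the run time is sped up by a factor s,
   the overall speedup is 1 / ((1 - p) + p / s). */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void)
{
    /* Even an infinite speedup of a 1% hotspot caps the overall gain at ~1.01x. */
    printf("1%%  of time, 10x faster: %.4f\n", amdahl_speedup(0.01, 10.0));
    printf("80%% of time, 10x faster: %.4f\n", amdahl_speedup(0.80, 10.0));
    return 0;
}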
</soapbox> :)