I've been trying to find information about the SSE implementation on
the new Intel Core Solo / Duo (Yonah) architecture. Specifically, I'm
looking for something that lists the execution units and issue ports.
I've had a look at Intel's Optimization Reference Manual (the recent
April version) but it seems to have been only partially updated. The
closest thing I can find is Figure 1-4, but this seems to detail the
Pentium 4 architecture.
The diagram implies that ports 2, 3 and 4 are for load / store and
that ports 0 and 1 are for computation. It shows that ports 0 and 1
each have both a floating-point unit (x87 scalar? SSE scalar?) and an
SSE unit. Does this mean that Core can dispatch two independent SSE
instructions per cycle (although presumably still with a 2-cycle
latency)? Can it execute an independent (scalar) floating-point
multiply and add in the same cycle?
Some information or links to appropriate documentation would be
greatly appreciated.
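
To make the multiply / add question concrete, here is a minimal
sketch in C of the kind of loop I have in mind. The names chains_t
and run_chains are hypothetical, made up for this example. It only
sets up two dependency chains that never read each other's results.
Whether they actually overlap on ports 0 and 1 would have to be
measured by timing the loop, and the compiler is free to transform
it, so the generated assembly needs checking.

```c
/* Hypothetical result type: final values of the two chains. */
typedef struct {
    float prod;
    float sum;
} chains_t;

/* Runs two independent dependency chains for `iters` iterations:
   one of scalar multiplies and one of scalar adds. Neither chain
   reads the other's result, so in principle a multiply and an add
   could be in flight at the same time. */
chains_t run_chains(int iters, float m, float a)
{
    float prod = 1.0f;
    float sum = 0.0f;
    int i;

    for (i = 0; i < iters; ++i) {
        prod *= m;  /* depends only on the previous prod */
        sum += a;   /* depends only on the previous sum */
    }

    chains_t result = { prod, sum };
    return result;
}
```

For example, run_chains(3, 2.0f, 0.5f) performs three multiplies by
2 and three adds of 0.5.
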
Additionally, a lot of the optimisation literature says to avoid
instructions which generate multiple micro-ops. However, I haven't
been able to find anything which details which instructions decode to
how many micro-ops. I assume this mostly applies to the more complex
instructions (e.g. those for operating system support), but it would
still be nice to know.
Also, are there plans to release documentation about optimising for
the Intel architecture? By which I mean something comparable to the
current copious (and excellent) AltiVec documentation.