The diagram implies that ports 2, 3 & 4 are for load / store and
that ports 0 & 1 are for computation. It shows that ports 0 & 1 each
have both a floating point (x87 scalar? SSE scalar?) and an SSE unit. Does
this mean that Core can dispatch two independent SSE instructions
per cycle (although presumably still with a 2 cycle latency)? Can
it execute independent (scalar) floating point multiply and add per
cycle?
Yes, in principle, at least for short periods of time. Other things
like loop overhead, limitations on the number of register file read
ports, decode bandwidth, floating point operand edge case stalls,
etc. can get in the way of achieving a sustained 2 flops/cycle in scalar
arithmetic in real world problems.
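
To make the point a bit more concrete, here is a minimal C sketch of the sort of scalar loop that at least gives the hardware a chance to issue a multiply and an add in the same cycle. The function name dot4 and the choice of four accumulators are assumptions made for illustration only; the right number depends on the actual add latency, and the compiler is free to schedule (or vectorize) the loop differently.

```c
#include <stddef.h>

/* Hypothetical example: scalar dot product with four independent
 * accumulators. Each accumulator forms its own add dependency chain,
 * so a multiply for one chain can issue while an add for another is
 * still in flight. Four chains is an assumed figure, not a measured one. */
double dot4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent multiply-then-add pairs per pass. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    /* Combine the partial sums once at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum this way changes the order of the additions, so the result can differ in the last bits from a single-accumulator loop. Whether it gets anywhere near 2 flops/cycle still depends on the loop overhead, decode and register read port limits mentioned above.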
Additionally, a lot of the optimisation literature says to avoid
instructions which generate multiple micro-ops. However, I haven't
been able to find anything which details which instructions decode
to how many micro-ops. I assume that this will probably only be
applicable to the more convoluted instructions (e.g. those for operating
system support), but it would still be nice to know.
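That is not a good assumption. If you play around with Shark a bit,
you'll notice that there are PMCs for instructions retired and µops
retired. Comparing the two counts for a stretch of code tells you how
many µops it generates per instruction on average.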
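
As a rough sketch of how the two counter readings might be used: dividing µops retired by instructions retired gives the average number of µops per instruction for the region you profiled. The helper below is hypothetical and is not part of Shark or any Apple or Intel API; it assumes you have already copied the two counts out of the profiler by hand.

```c
#include <stdint.h>

/* Hypothetical helper: average micro-ops per retired instruction,
 * computed from two performance counter readings taken over the same
 * region of code. Returns 0.0 when no instructions were counted, to
 * avoid dividing by zero. */
double uops_per_instruction(uint64_t uops_retired,
                            uint64_t instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)uops_retired / (double)instructions_retired;
}
```

A ratio close to 1.0 would suggest the code is mostly single-µop instructions; a noticeably higher ratio means multi-µop instructions are common in that region, though the average alone does not say which ones.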
Also, are there plans to release documentation about optimising for
the Intel architecture? By which I mean something comparable to the
current copious (and excellent) Altivec documentation.
Hmmm... I think maybe that falls under the area of questions about
future products. I can certainly tell you that we understand the
value of such documentation, needing that sort of info ourselves to
vectorize/optimize our own code! There is some stuff up already...