On this topic I was considering packing two sets of 4 bits into a
single byte, to keep memory usage to the minimum possible, but I'm
not sure if the operation overhead this would cause is worth it..
That's rather hard to guess, because it depends on the actual cache
hit rates, which in turn depend on the size of the data and the size
of the machine's caches.
However, assuming that you do any serious amount of computation at
all, and that the size of the data is larger than the level 1 cache
(otherwise you probably would not worry about speed), one or two
additional machine instructions will likely be lost in the noise.
Boolean operations and bit shifts by invariant amounts are typically
among the fastest instructions. The loads and stores saved are not
(but they come close).
Well, G3/G4 both had 64kb L1, I seem to remember intel cores have less
than that, so 32kb unless
they have made them bigger in recent chips.. So you would need to stay
in 32kb for intel..?
Cache misses, on the other hand, are terribly expensive. Hundreds or
even thousands of instructions might have been executed in the time
it takes to handle a single cache miss.
If you cannot measure cache hit rates, or if the actual data sets
used in practice vary too widely, you are probably better of packing
the 4-bit quantities into bytes. You won't be wasting many cycles in
the case of 100% cache hits, but you have a lot to gain in the
presence of cache misses.
Are there a good tool for measuring this? I'v not really explored
shark very deeply..
One caveat, though: the addressing of sub-byte quantities will be
more work. If you iterate over them with stride 1, then this is no
problem at all (you just unroll the loop and pick the right sub-
field in every phase). But if you access data in more random ways,
you might have to do full read-modify-write sequences to store only
one half of a byte.
The data would need random reads, but would only ever be written to
sequentially.
Write performance is secondary to read performance though, reads are
where the
most speed is required, even 3x read to write would be acceptable..