On this topic I was considering packing two sets of 4 bits into a single
byte, to keep memory usage to the minimum possible, but I'm not sure if
the operation overhead this would cause is worth it..
That's rather hard to guess, because it depends on the actual cache hit
rates, which in turn depend on the size of the data and the size of the
machine's caches.
However, assuming that you do any serious amount of computation at all,
and that the size of the data is larger than the level 1 cache
(otherwise you probably would not worry about speed), one or two
additional machine instructions will likely be lost in the noise. Boolean
operations and bit shifts by invariant amounts are typically among the
fastest instructions. The loads and stores saved are not (but they come
close).
Cache misses, on the other hand, are terribly expensive. Hundreds or even
thousands of instructions might have been executed in the time it takes to
handle a single cache miss.
If you cannot measure cache hit rates, or if the actual data sets used in
practice vary too widely, you are probably better of packing the 4-bit
quantities into bytes. You won't be wasting many cycles in the case of
100% cache hits, but you have a lot to gain in the presence of cache
misses.
One caveat, though: the addressing of sub-byte quantities will be more
work. If you iterate over them with stride 1, then this is no problem at
all (you just unroll the loop and pick the right sub-field in every
phase). But if you access data in more random ways, you might have to do
full read-modify-write sequences to store only one half of a byte.