On Jan 15, 2008, at 7:36 AM, Jerahmy Pocott wrote:
> The data would need random reads, but would only ever be written to
> sequentially. Write performance is secondary to read performance,
> though; reads are where the most speed is required. Even 3x read to
> write would be acceptable.

Sometimes the random reads themselves are the problem. To get at 4 bits,
the processor has to pull the whole cache line containing them into its
caches (if it isn't there already). If your data set is too large to fit
in the L2 cache, that is likely your bottleneck. In that case, packing
the data into 4 bits instead of 8 might not help much, because either
way you still have to load the full 32-, 64- or 128-byte cache line to
get those 4 bits. So it won't save you much time on any single random
read from memory.

However, compressing to 4 bits does a couple of other things. It cuts
your cache usage in half, so twice as many values fit in the cache. If
that is the difference between fitting the whole data set in the cache
and fitting only half of it, it could be a huge win (perhaps as much as
20x). Even if it doesn't all fit, it could still be a partial win,
depending on how much more of your data set now fits in the cache.
Likewise, it cuts your memory usage in half. So if you are actually
paging to and from disk to service your data set (because it doesn't fit
in RAM), you could see similar savings there, and since disk is far
slower than RAM, the win might be even larger.

Sometimes a better approach is to change your data layout or access
pattern so that the reads are more sequential. A lot depends on your
data size and access patterns, so it is difficult to offer a general
solution to this problem. Perhaps if you told us more about how the data
is stored and in what order it is accessed, we could devise something
suitably diabolical.