1) Do you have any evidence to suggest that changing data in structs
is a performance problem in your application? Unless such evidence
emerges, it seems highly unlikely to me this is worth worrying about.
2) Check the disassembly. If it is doing multiplication in the inner
loop, you might look at ways to change those to adds. So, for
example, inside vImage, we rarely do ptr + y*width + x, we do this:
    uint8_t *row = vImage_Buffer->data;
    for( y = 0; y < height; y++ )
    {
        uint8_t *pixel = row;
        for( x = 0; x < width; x++ )
        {
            /* Do something with *pixel here */
            pixel++;
        }
        row += vImage_Buffer->rowBytes;
    }
...except, like, vectorized. Often the compiler can do this kind of
transformation for you, but not always.
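As a rough sketch of the same idea applied to the red-pixel spread from the question (this is not vImage's actual code): walk both buffers by rowBytes and step x by 2, so the inner loop needs neither a multiply nor a modulo. The Buffer struct below is only a stand-in for vImage_Buffer so the snippet compiles without Accelerate, spread_red is a made-up name, and the code assumes both buffers have the same dimensions and that the destination rows are 2-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for vImage_Buffer (assumption: same field names as the real
   struct) so the sketch builds without the Accelerate framework. */
typedef struct {
    void   *data;
    size_t  height;
    size_t  width;
    size_t  rowBytes;
} Buffer;

/* Hypothetical helper: copy 8-bit samples from src into the even (x, y)
   positions of a 16-bit dst and set every other position to 0. */
static void spread_red(const Buffer *src, const Buffer *dst)
{
    assert(src->width == dst->width && src->height == dst->height);

    /* Read the struct members once, outside the loops */
    const uint8_t *srcRow = (const uint8_t *)src->data;
    uint8_t       *dstRow = (uint8_t *)dst->data;
    const size_t   width  = dst->width;
    const size_t   height = dst->height;

    for (size_t y = 0; y < height; y++) {
        uint16_t *out = (uint16_t *)dstRow;

        /* Clear the whole row first; odd rows stay all zero */
        memset(out, 0, width * sizeof(uint16_t));

        if ((y & 1) == 0) {
            /* Even row: fill only the even columns, widening 8 -> 16 bits */
            for (size_t x = 0; x < width; x += 2)
                out[x] = srcRow[x];
        }

        /* Advance by rowBytes instead of recomputing y * width */
        srcRow += src->rowBytes;
        dstRow += dst->rowBytes;
    }
}
```

Clearing each row with memset and then overwriting the even columns keeps the sketch short; whether it beats the original loop is something only a profile can tell you.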
Note that vImage does do some limited data reorganization. Look at
the vImagePermute functions. However, I don't think these are
vectorized on Intel at the moment, for obvious reasons. We are
looking at special case vectorization for some common cases. If you
think you have one, file a bug at bugreporter.apple.com against
component Accelerate/X asking for the special case.
3) Those aren't functions so nothing to inline there. Technically,
all you really need there is to use the bitwise & operation to look
at the 1's bit, which is a lot cheaper than a real mod operation.
However, it is quite possible the compiler knows that and is doing
that optimization for you.
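For illustration, here is a minimal sketch of the two forms of the test side by side. The helper names are made up, and the coordinates are assumed to be unsigned, as loop counters over an image normally are.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper names, for illustration only. */

/* The test as written in the question */
static bool both_even_mod(unsigned x, unsigned y)
{
    return x % 2 == 0 && y % 2 == 0;
}

/* Same test using the 1's bit: OR the coordinates together,
   then check the low bit once */
static bool both_even_and(unsigned x, unsigned y)
{
    return ((x | y) & 1u) == 0;
}

/* Quick self-check that the two forms agree on a small grid */
static void check_both_even(void)
{
    for (unsigned y = 0; y < 8; y++)
        for (unsigned x = 0; x < 8; x++)
            assert(both_even_mod(x, y) == both_even_and(x, y));
}
```

With optimization turned on, a compiler will usually emit the same instructions for both, so check the disassembly before rewriting anything.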
4) Uhh... only if you are good. :-) vImage functions are plain C
functions.
5) Well, if it is the same kernel every time, you could make that
array static const. I doubt this is worth worrying about.
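A minimal sketch of what that could look like. The accessor name is hypothetical, and the 1/9 box-blur weights are placeholder values, not the kernel from the original question.

```c
/* Hypothetical accessor; the weights are placeholders for illustration. */
static const float *blur_kernel(void)
{
    /* static const: the array has static storage and is initialized once,
       rather than being set up again on every call */
    static const float kernel[9] = {
        1.0f / 9, 1.0f / 9, 1.0f / 9,
        1.0f / 9, 1.0f / 9, 1.0f / 9,
        1.0f / 9, 1.0f / 9, 1.0f / 9
    };
    return kernel;
}
```

Every call hands back the same pointer, so nothing is copied per frame.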
Overall, performance-wise, I suspect you are asking the wrong
questions. I suggest running Shark to see where the time is going,
and then figure out how to accelerate those things that are taking up
the time. Except for maybe #2, it is unlikely that these other things
are going to impact your speed much, because comparatively speaking,
they don't happen very often. An operation on half a million pixels
is vastly more expensive than the ObjC overhead for one function call
or the cost of copying a 9 element array. Just think of how much data
needs to be touched for each task and you'll get the idea.
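To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a half-million-pixel frame of 16-bit samples (the pixel format from the question) and a 9-element float kernel; the names are made up.

```c
#include <stddef.h>

/* Assumed sizes, for illustration only */
enum { PIXELS = 500000, KERNEL_ELEMS = 9 };

/* Bytes touched by one pass over a 16-bit frame */
static size_t frame_bytes(void)
{
    return (size_t)PIXELS * sizeof(unsigned short);
}

/* Bytes copied when the 3x3 float kernel is set up */
static size_t kernel_bytes(void)
{
    return (size_t)KERNEL_ELEMS * sizeof(float);
}
```

On a typical platform with 2-byte shorts and 4-byte floats, that is about a megabyte per pass over the frame against 36 bytes for the kernel.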
Depending on what you are doing, in certain cases, OpenGL/CoreVideo
might be faster.
Ian
On Apr 20, 2006, at 2:59 PM, Juan P. Pertierra wrote:
Hello,
My software processes very large raw video files. There is a
separate processing thread which loads raw
data from the large file, then repeatedly calls a cocoa instance
method which renders each frame from a
pointer to the raw data. Everything runs fine and does what it is
supposed to do, I'm just trying to make
sure it runs as fast as it can because it is a lengthy process even
on the fastest macs.
Here are some questions I have about possibly making this render
function run faster:
1.) I'm passing a large number of arguments to the render instance
method, mostly are pointers to
temporary image buffers. This prevents allocating/freeing memory
between renders of each frame.
However, the pointers are to vImage_Buffers, which are structs. In
the render method I am frequently
accessing (in loops) members of these structs directly...so for
example using vBuffer_ptr->data to access
the image data.
Since I am accessing these struct members repeatedly, should I be
instead assigning the values to local
variables and using those instead? I'm thinking perhaps the
members are being fetched from memory
each time, it would be much faster if I could force the processor
to keep it in a register...I think?
2.) I'm using vImage functions for as many things as I can because I
am under the impression that is the
fastest way to do those image operations. There are however a
couple of things that vImage doesn't do.
For example I need to take raw data pixels and arrange them in a
larger image (like building a Bayer
mosaic) which I implemented with loops and scalar code.
For example, I have to take the red pixels from the raw data and
spread them out in an image such that
the pixels sit only at locations where x and y coordinates are
even. I do this in this manner:
    for(y = 0; y < HEIGHT; y++)
    {
        for(x = 0; x < WIDTH; x++)
        {
            if(x%2 == 0 && y%2 == 0)
            {
                *((unsigned short int *)vBufferRed->data + (y * WIDTH) + x) =
                    (unsigned short int) *((unsigned char *)vBufferRed0->data + (y * WIDTH) + x);
            }
            else {...similar code to set these locations to 0...}
        }
    }
Is there anything fishy or perhaps a better way to do this? As you
can see I am also converting from the
8-bit raw data to a 16-bit image, it is my understanding that as
long as I'm not converting to/from a float
this is OK to do.
3.) Also, I'm not using any inline functions. Should I be? For
example, would it benefit to make an inline
function for the x%2 == 0 && y%2 == 0 statement?
4.) The instance method which renders each frame only uses vImage
functions and NSBitmapRep/NSImage to save each image...is it faster
to use a plain C/C++ function for this kind of repeated call for
processing?
5.) Finally, at the beginning of the render method I am defining
some convolution kernels as:
const float kernel1[] = {..3x3 values...}
since this is done on every frame, should I be defining these
kernels outside and passing them to the method,
or is this really as efficient as it gets?
Thanks for any input and taking the time to read through this.
Cheers,
Juan
_______________________________________________
PerfOptimization-dev mailing list