Image processing performance on the T30

Hi!

I am building a (video) image processing application on Windows Embedded 2013 using the T30, but I find that low-level image processing is rather slow. For example, simply subtracting two 8-bit grayscale images (640x480) takes more than 20 ms per image. That is far too slow for our application (there is more to be done). I tried several things to improve the performance. I parallelized the code in three different ways: using std::thread, using a thread pool, and using OpenMP. All three run correctly but are slightly slower than the sequential version. I also vectorized the loop with NEON intrinsics, which works as well but gives only a very slight performance gain.

Do you have any tips, or should I switch to a board with a faster processor with SSE2? The same code runs more than 20x faster on a regular desktop…

Thanks in advance,

Rob Ottenhoff
P.S. Using a function with floats is dramatically slower still.
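For reference, the operation described in the question can be sketched roughly as below. This is only a minimal sketch under assumptions: the function name subtract_u8 is hypothetical, saturating subtraction (negative results clamp to 0) is assumed because the question does not say how underflow is handled, and the preprocessor check for NEON may need adjusting for the actual toolchain. On a non-ARM build it falls back to the scalar loop.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: these macros identify a NEON-capable ARM build; adjust for your compiler.
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// Saturating per-pixel subtraction of two 8-bit images stored as flat
// buffers of n bytes: dst[i] = a[i] - b[i], clamped to 0 on underflow.
void subtract_u8(const std::uint8_t* a, const std::uint8_t* b,
                 std::uint8_t* dst, std::size_t n)
{
    std::size_t i = 0;
#if HAVE_NEON
    // Process 16 pixels per iteration with a saturating vector subtract.
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t va = vld1q_u8(a + i);
        const uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqsubq_u8(va, vb));
    }
#endif
    // Scalar tail, and the whole image when NEON is not available.
    for (; i < n; ++i) {
        dst[i] = a[i] > b[i] ? static_cast<std::uint8_t>(a[i] - b[i]) : 0;
    }
}
```

For a 640x480 image, n would be 640 * 480 = 307200. It is worth timing this in an optimized (release) build, since intrinsics in an unoptimized build often show little gain.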

I see two ways to accelerate this task:

1. Make use of all the cores

As you wrote, you have already tried that. Have you checked the CPU usage of the four cores? You can use the Colibri Monitor for that. Are all of them loaded, or only one? If you want to force a thread to run on a specific core, you can use CeSetThreadAffinity.
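To make the multi-core idea concrete, here is a minimal sketch that splits the image into horizontal bands, one per thread. The names subtract_rows and subtract_parallel are hypothetical, and saturating subtraction is an assumption about the intended operation. The affinity call is only indicated in a comment, because the exact handle and processor numbering should be checked against the platform documentation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Subtract rows [row_begin, row_end) of two 8-bit images, clamping at 0.
static void subtract_rows(const std::uint8_t* a, const std::uint8_t* b,
                          std::uint8_t* dst, std::size_t width,
                          std::size_t row_begin, std::size_t row_end)
{
    for (std::size_t i = row_begin * width; i < row_end * width; ++i) {
        dst[i] = a[i] > b[i] ? static_cast<std::uint8_t>(a[i] - b[i]) : 0;
    }
}

// Split the image into bands of rows and process each band on its own thread.
void subtract_parallel(const std::uint8_t* a, const std::uint8_t* b,
                       std::uint8_t* dst, std::size_t width,
                       std::size_t height, unsigned num_threads)
{
    num_threads = std::max(1u, num_threads);
    const std::size_t rows_per_thread =
        (height + num_threads - 1) / num_threads;  // round up

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(height, t * rows_per_thread);
        const std::size_t end = std::min(height, begin + rows_per_thread);
        if (begin == end) break;  // no rows left for this thread
        workers.emplace_back(subtract_rows, a, b, dst, width, begin, end);
        // Platform-specific (not compiled here): the new thread could be
        // pinned to a core at this point, e.g. with CeSetThreadAffinity.
    }
    for (auto& w : workers) w.join();
}
```

A call such as subtract_parallel(a, b, dst, 640, 480, 4) would use one band of 120 rows per core. Note that this sketch starts new threads on every call; for an image this small that start-up cost may outweigh the gain, so keeping the worker threads alive across frames is probably the better design.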

2. Use GPU for hardware acceleration

I am not entirely sure what the result of your operation should be. Do you see any way to do the same thing with some kind of blit function or DirectDraw? That way you can make use of the GPU and get the full power out of the T30. For example, a BitBlt with XOR. Or are you already doing that?
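One caveat on the XOR idea: an XOR blit tells you where two images differ, but the value it produces is not the arithmetic difference. Whether that is good enough depends on what the result is used for. The small sketch below (plain C++, with hypothetical function names, no GPU involved) just illustrates the per-pixel difference between the two operations.

```cpp
#include <cstdint>

// What an XOR raster operation computes for one 8-bit pixel.
inline std::uint8_t xor_pixel(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a ^ b);
}

// Absolute arithmetic difference of the same two pixels.
inline std::uint8_t absdiff_pixel(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a > b ? a - b : b - a);
}
```

For equal pixels both functions return 0, so XOR works as a "changed / unchanged" test. For the neighbouring gray values 128 and 127, however, the absolute difference is 1 while the XOR is 255, so XOR cannot stand in for a subtraction whose magnitude matters.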