在 Apalis iMX6Q 上测试 OpenCL

2016年12月2日星期五
Apalis
简介

相比曾经,如今科技设备对处理性能和速度要求越来越高。为了应对这种技术需求,许多公司发明了不少方法来获得更好的处理性能。例如苹果公司,发明了 Open Computing Language (OpenCL)。2008年6月,苹果公司向 Khronos Group 提交了 OpenCL 工作建议。历经五个月的研发,OpenCL 1.0 于 2008 年 11 月发布。

OpenCL 是为个人电脑、服务器、移动设备以及嵌入式设备的多核系统提供并行编程开发的底层 API。OpenCL 的编程语言类似于 C 语言。其可以用于包含 CPU、GPU 以及来自主流制造商如 NXP®、NVIDIA®、Intel®、AMD、IBM 等的处理器的异构平台。OpenCL 旨在提高应用软件如游戏、娱乐以及科研和医疗软件的运行速度和响应。

在本博文中,我们使用 Apalis iMX6Q 系统模块测试 OpenCL,对比两个应用 - 一个运行在 GPU 上,另一个则在 CPU。最后我们将分享本次测试的结果。

OpenCL on Apalis iMX6Q
测试硬件

Toradex 的 Apalis iMX6Q 计算机模块采用 NXP 的 iMX6 四核处理器,其提供的处理性能特别适合于多媒体应用。该处理器具有 4 个 Arm® Cortex®-A9 核,最高主频为 800MHz。除了处理器,Apalis 系统模块还具有高达 2GB DDR3 RAM(64bit)和 4GB eMMC Flash。

除了具备出色的图形和多媒体处理能力,该处理器还具有 Vivante GC2000 3D GPU,其能够支持 OpenCL EP (Embedded Profile)。因此,我们能充分够利用 i.MX6Q GPU 处理能力。

Toradex Embedded Linux 镜像中添加 OpenCL

我们假设你已经具有能够编译 Apalis iMX6 镜像的 OpenEmbedded 编译环境。你可以参考我们 OpenEmbedded (core) 文章。

为编译支持 OpenCL 以及相关库文件的嵌入式 Linux 镜像,需要采取以下步骤:

首先,修改下面目录中的文件。

/meta-toradex/recipes-fsl/packagegroups/packagegroup-fsl-tools-gpu.bbappend

添加如下内容:

SOC_TOOLS_GPU_append_mx6 = " \
    libopencl-mx6 \
    libgles-mx6 \
"


并在 local.conf 文件中添加 imx-gpu-viv

IMAGE_INSTALL_append = "imx-gpu-viv"


现在就可以编译镜像:

bitbake angstrom-lxde-image
GPU 和 CPU 代码

所有的代码可以从 GitHub 上下载。

我们使用数列求和应用作为基本的演示例程。第一部分代码运行在 GPU 上,第二部分则在 CPU 上。应用执行完毕后打印其所消耗的时间。使用 OpenCL 所需的头文件是 cl.h,位于文件系统的 /usr/include/CL 目录。链接程序所需的库文件是 libGAL.so 和 libOpenCL.so,位于 /usr/lib 目录。

为了计算消耗的时间,我们创建带分析功能的队列,在结束的时候获取分析的结果。

下面是 OpenCL 代码:

//************************************************************
// Demo OpenCL application to compute a simple vector addition
// computation between 2 arrays on the GPU
// ************************************************************
#include 
#include 
#include 
#include <CL/cl.h>
//
// OpenCL source code
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the nth element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
// Some interesting data for the vectors
Int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
// Number of elements in the vectors to be added
#define SIZE 600000
// Main function
// ************************************************************
int main(int argc, char **argv)
{ 
     // Two integer source vectors in Host memory
     int HostVector1[SIZE], HostVector2[SIZE];
     //Output Vector
     int HostOutputVector[SIZE];
     // Initialize with some interesting repeating data
     for(int c = 0; c < SIZE; c++)
     {
          HostVector1[c] = InitialData1[c%20];
          HostVector2[c] = InitialData2[c%20];
          HostOutputVector[c] = 0;
     }
     //Get an OpenCL platform
     cl_platform_id cpPlatform;
     clGetPlatformIDs(1, &amp;cpPlatform, NULL);
     // Get a GPU device
     cl_device_id cdDevice;
     clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &amp;cdDevice, NULL);
     char cBuffer[1024];
     clGetDeviceInfo(cdDevice, CL_DEVICE_NAME, sizeof(cBuffer), &amp;cBuffer, NULL);
     printf("CL_DEVICE_NAME: %s\n", cBuffer);
     clGetDeviceInfo(cdDevice, CL_DRIVER_VERSION, sizeof(cBuffer), &amp;cBuffer, NULL);
     printf("CL_DRIVER_VERSION: %s\n\n", cBuffer);
     // Create a context to run OpenCL enabled GPU
     cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);     
     // Create a command-queue on the GPU device
     cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, CL_QUEUE_PROFILING_ENABLE, NULL);
     // Allocate GPU memory for source vectors AND initialize from CPU memory
     cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
     CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
     cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
     CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
     // Allocate output memory on GPU
     cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
     sizeof(int) * SIZE, NULL, NULL);
     // Create OpenCL program with source code
     cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
     // Build the program (OpenCL JIT compilation)
     clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
     // Create a handle to the compiled OpenCL function (Kernel)
     cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
     // In the next step we associate the GPU memory with the Kernel arguments
     clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&amp;GPUOutputVector);
     clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&amp;GPUVector1);
     clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&amp;GPUVector2);
 
     //create event
     cl_event event = clCreateUserEvent(GPUContext, NULL);
 
     // Launch the Kernel on the GPU
     // This kernel only uses global data
     size_t WorkSize[1] = {SIZE}; // one dimensional Range
     clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, &amp;event);
     // Copy the output in GPU memory back to CPU memory
     clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0,
     SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
     // Cleanup
     clReleaseKernel(OpenCLVectorAdd);
     clReleaseProgram(OpenCLProgram);
     clReleaseCommandQueue(cqCommandQueue);
     clReleaseContext(GPUContext);
     clReleaseMemObject(GPUVector1);
     clReleaseMemObject(GPUVector2);
     clReleaseMemObject(GPUOutputVector);    
 
     clWaitForEvents(1, &amp;event);
     cl_ulong start = 0, end = 0;
     double total_time;     
 
     clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &amp;start, NULL);
     clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &amp;end, NULL);
 
     total_time = end - start;     
 
     printf("\nExecution time in milliseconds = %0.3f ms", (total_time / 1000000.0) );
     printf("\nExecution time in seconds = %0.3f s\n\n", ((total_time / 1000000.0))/1000 );          
 
     return 0;
}


CPU 代码是简单的 C 程序,和上面一样计算同样的队列求和。为了计算消耗的时间,我们使用 time.h中的库。代码如下:

#include 
#include 
#include  
 
int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
 
#define SIZE 600000
 
int main(int argc, char **argv)
{
time_t start, stop;
clock_t ticks;
 
time(&amp;start);    
// Two integer source vectors in Host memory
int HostVector1[SIZE], HostVector2[SIZE];
//Output Vector
int HostOutputVector[SIZE];
// Initialize with some interesting repeating data
//int n;
for(int c = 0; c < SIZE; c++)
{
HostVector1[c] = InitialData1[c%20];
HostVector2[c] = InitialData2[c%20];
HostOutputVector[c] = 0;
}
 
for(int i = 0; i < SIZE; i++)
{
        HostOutputVector[i] = HostVector1[i] + HostVector2[i];
        ticks = clock();
}     
 
time(&amp;stop);
 
printf("\nExecution time in miliseconds = %0.3f ms",((double)ticks/CLOCKS_PER_SEC)*1000);
 
printf("\nExecution time in seconds = %0.3f s\n\n", (double)ticks/CLOCKS_PER_SEC);
 
return 0;
}
交叉编译应用

同一个 Makefile 可以用于交叉编译 GPU 和 CPU 应用。你需要注意下面的三个变量。根据你的系统做相应的调整:

export ARCH=arm
export ROOTFS_DIR=/usr/local/toradex-linux-v2.5/oe-core/build/out-glibc/sysroots/apalis-imx6
 
APPNAME = proc_sample
TOOLCHAIN = /home/prjs/toolchain/gcc-linaro
 
CROSS_COMPILER = $(TOOLCHAIN)/bin/arm-linux-gnueabihf-
CC= $(CROSS_COMPILER)gcc
DEL_FILE = rm -rf
CP_FILE = cp -rf
TARGET_PATH_LIB = $(ROOTFS_DIR)/usr/lib
TARGET_PATH_INCLUDE = $(ROOTFS_DIR)/usr/include
CFLAGS = -DLINUX -DUSE_SOC_MX6 -Wall -std=c99 -O2 -fsigned-char -march=armv7-a -mfpu=neon -DEGL_API_FB -DGPU_TYPE_VIV -DGL_GLEXT_PROTOTYPES -DENABLE_GPU_RENDER_20 -I../include -I$(TARGET_PATH_INCLUDE)
LFLAGS = -Wl,--library-path=$(TARGET_PATH_LIB),-rpath-link=$(TARGET_PATH_LIB) -lm -lglib-2.0 -lOpenCL -lCLC -ldl -lpthread
OBJECTS = $(APPNAME).o
first: all
all: $(APPNAME)
$(APPNAME): $(OBJECTS)
$(CC) $(LFLAGS) -o $(APPNAME) $(OBJECTS)
$(APPNAME).o: $(APPNAME).c
$(CC) $(CFLAGS) -c -o $(APPNAME).o $(APPNAME).c
clean:
$(DEL_FILE) $(APPNAME)


在应用所在的目录中保持 Makefile 文件,然后运行 make。
将编译生成的文件复制到 Apalis iMX6 开发板上。

测试结果

在执行两个应用程序后,我们得到以下结果:

### Processor time
Execution time in miliseconds = 778.999 ms
Execution time in seconds = 0.779 s 
 
### GPU time 
Execution time in milliseconds = 12.324 ms
Execution time in seconds = 0.012 s


根据以上结果,我们可以很清楚地看到在 Apalis iMX6Q GPU 上使用 OpenCL 能够加速队列求和运算。

总结

用户想要使用 Apalis iMX6Q GPU ,除了其他的方法,还可以使用 OpenCL 提高计算性能。正如本博文所描述,借助 OpenCL,可以在不同设备从图形显卡到超级计算机以及嵌入式设备,运行代码。用户还可以进一步结合,例如在 OpenCV 中使用 OpenCL 提高计算机视觉的性能。这个演示可以作为开发无图形界面应用的例程。

参考

https://www.khronos.org/opencl/
https://en.wikipedia.org/wiki/OpenCL
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854
http://developer.toradex.com/products/apalis-imx6
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro


本文最初以葡萄牙语在 Embarcados.com 上发表,请参考这里

#Embedded Linux #NXP® i.MX6 #OpenCL #Yocto
Author Giovanni Bauermeister, Toradex Brasil

1 comments

Gunasekaran - 1 year 10 months | Reply

Are you sure that OpenCL can used with OpenCV in imx6 boards? As far as I know, OpenCV needs OpenCL full profile but imx6 supports only Embedded profile. Could you clarify that?

Toradex - 1 year 10 months | Reply

Indeed, to take advantage of OpenCL support provided by OpenCV, you need the Full Profile, but since OpenCV is a set of libraries, you could write your own OpenCL Embedded Profile functions and use them along with OpenCV functions. It's similar to the approach described in this NXP Application Note (http://www.nxp.com/docs/en/application-note/AN4629.pdf), the difference being that here they have used OpenGL, and not OpenCL, with OpenCV.

Leave a comment

Your email ID will be kept confidential. Required fields are marked *


请填写上面所示的字符。不区分大小写。



* Your comment will be reviewed and then added. Thank you.