ARM processors have taken the mobile and embedded world by storm. The next frontier for the ARM architecture is the cloud server market. There are several ARM licensees out there trying to improve their microarchitecture implementations to extract the best IPC at the lowest possible power drain.
This article is about a nice little coprocessor called NEON. The NEON coprocessor has its own separate instruction pipeline aimed at working efficiently with vectors and matrices. If you like MATLAB coding, you will love working with NEON. NEON is a SIMD (Single Instruction Multiple Data) processor: SIMD operates on more data bits per instruction, theoretically improving data throughput. A typical ARMv8 processor works on 64 bits (8 bytes) of data per instruction, while the NEON unit can work on up to 256 bits (32 bytes) at a time (a single vld1, for example, can load four 64-bit D registers). So in theory there is a 4x improvement in data throughput simply by using NEON over ARM, on top of the added benefit of using the CPU's (prefetcher-backed) D-cache and I-cache more efficiently through NEON loads (vld) and stores (vst).
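To put the throughput claim in perspective, here is a minimal sketch (the function names are mine, not from any library) of a byte-wise array addition written first as an ordinary scalar loop and then with NEON intrinsics processing 16 bytes per iteration:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar version: one byte per loop iteration. */
void add_bytes_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* NEON version: 16 bytes per loop iteration (assumes n is a multiple of 16). */
void add_bytes_neon(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);      /* load 16 bytes */
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));  /* 16 adds in a single instruction */
    }
}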
In the initial days of my NEON optimization work I used to submit patches written in NEON assembly. But over time I have realized that it is much easier and quicker to transform ordinary C/C++ looping constructs using NEON intrinsics than using NEON assembly. With NEON intrinsics you as a developer need not worry about register allocation. ARM compilers and assemblers have evolved so much over the years that they can generate better NEON assembly (more efficient NEON register usage) than straight hand-written NEON assembly. Another advantage of NEON intrinsics over NEON assembly is that your code is future-proof against the instruction set changes that have gone into the NEON architecture between ARM generations, such as between ARMv7 and ARMv8.
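To illustrate the register allocation point, here is a minimal sketch (the function names and the GNU inline assembly form are mine, assuming an ARMv7 GCC toolchain) of the same 16-byte add written both ways:

#include <arm_neon.h>
#include <stdint.h>

/* Hand-written NEON assembly: you pick the registers (q0, q1 here)
   and must track their usage yourself. */
void add16_asm(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    asm volatile(
        "vld1.8  {d0, d1}, [%1] \n\t"
        "vld1.8  {d2, d3}, [%2] \n\t"
        "vadd.i8 q0, q0, q1     \n\t"
        "vst1.8  {d0, d1}, [%0] \n\t"
        :
        : "r"(dst), "r"(a), "r"(b)
        : "d0", "d1", "d2", "d3", "memory");
}

/* The same operation with intrinsics: the compiler allocates the
   registers and schedules the instructions for the target pipeline. */
void add16_intrinsics(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    vst1q_u8(dst, vaddq_u8(vld1q_u8(a), vld1q_u8(b)));
}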
This video gives a very good introduction to the ARMv8 NEON architecture and potential applications.
My first line of attack on any optimization problem is to study CPU usage using the perf tool supported by the Linux kernel on Android. Other tools of interest are top, DDMS, etc. For example, consider an Android application comprising 2 threads: one thread used for drawing (painting) and the other used for rendering with hardware acceleration (GPU) enabled. Based on the top functions that show up in the perf report, the next step is to look at the source code of each function and understand whether the code can be translated to NEON. Sometimes the CPU utilization numbers in a perf report don't reflect the actual performance bottlenecks. I have seen cases wherein a Skia function [Skia is the popular 2D graphics library used for 2D drawing in the Android OS and in Chrome and other WebKit-based browsers] slows things down by lying in the critical path of the code flow, thereby limiting the frames/sec (fps) of the rendering application. Whenever a Skia function shows up in the perf report, it is always beneficial to double-check the source code and understand what the code is doing.
In this article I am sharing a NEON optimization I implemented in the Skia 2D library (libskia.so) in Android for a mathematical visualization problem. Please note the optimization is for the ARMv7 architecture, which has 32 64-bit D registers, aliased as 16 128-bit Q registers. D and Q are different views of the same register file, which is in turn shared with the floating point unit (VFP). So by writing the optimization with Q registers, you are implementing an algorithm that works on 16 bytes in 1 instruction versus 4 bytes in 1 instruction (i.e., 4 instructions to accomplish the same task) when written in ARM assembly. I am using simple instruction counting here, ignoring pipeline details.
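At the intrinsics level, the D/Q aliasing shows up as the vget_low/vget_high/vcombine operations, which the optimized code further below also relies on. A minimal sketch (the names are mine):

#include <arm_neon.h>
#include <stdint.h>

void dq_views(const uint32_t *src) {
    /* One Q register view: 4 x 32-bit lanes (16 bytes). */
    uint32x4_t q = vld1q_u32(src);

    /* Its halves correspond to the two aliased D registers (8 bytes each). */
    uint32x2_t d_lo = vget_low_u32(q);   /* lanes 0-1 */
    uint32x2_t d_hi = vget_high_u32(q);  /* lanes 2-3 */

    /* Recombining the two D views yields the full Q view again. */
    uint32x4_t q2 = vcombine_u32(d_lo, d_hi);
    (void)q2;
}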
After understanding the C/C++ looping constructs, the next step is to draw and solve the problem on a piece of paper using vectors and matrices. Translating the code into NEON intrinsics can be accomplished using this document.
Below is the bottleneck blitting function (the function name is not explicitly shown) in Skia, before and after optimization. Note that every pixel in Skia is represented as ARGB8888 (32 bits or 4 bytes of data, with A=Alpha, R=Red, G=Green, B=Blue) in the code below.
Before Optimization
static void SomeSkFunction(void* SK_RESTRICT dst, size_t dstRB,
                           const void* SK_RESTRICT maskPtr, size_t maskRB,
                           SkColor color, int width, int height) {
    SkPMColor pmc = SkPreMultiplyColor(color);
    SkPMColor* SK_RESTRICT device = (SkPMColor*)dst;
    const uint8_t* SK_RESTRICT mask = (const uint8_t*)maskPtr;

    maskRB -= width;
    dstRB -= (width << 2);
    do {
        int w = width;
        do {
            unsigned aa = *mask++;
            /* Blend the premultiplied color into the destination pixel,
               weighted by the coverage value aa. */
            *device = SkAlphaMulQ(pmc, SkAlpha255To256(aa)) +
                      SkAlphaMulQ(*device, SkAlpha255To256(255 - aa));
            device += 1;
        } while (--w != 0);
        device = (uint32_t*)((char*)device + dstRB);
        mask += maskRB;
    } while (--height != 0);
}
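The inner loop leans on two small Skia helpers; per channel, the blend computes device = (pmc*(aa+1) + device*(256-aa)) / 256. For reference, here is a sketch of those helpers as they appear in Skia sources of that era (paraphrased, so treat the exact form as approximate):

/* Maps alpha in [0, 255] to a scale in [1, 256], so that 255 becomes
   a full-weight 256/256 multiplier. */
static inline unsigned SkAlpha255To256(unsigned a) {
    return a + 1;
}

/* Scales all four 8-bit channels of a packed ARGB8888 pixel by scale/256,
   handling the R/B and A/G channel pairs in parallel inside one 32-bit word. */
static inline SkPMColor SkAlphaMulQ(SkPMColor c, unsigned scale) {
    uint32_t mask = 0x00FF00FF;
    uint32_t rb = ((c & mask) * scale) >> 8;
    uint32_t ag = ((c >> 8) & mask) * scale;
    return (rb & mask) | (ag & ~mask);
}

The vectorized code below reproduces exactly this rb/ag masking trick, but on four pixels per Q register instead of one pixel per 32-bit word.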
After Optimization Using NEON Intrinsics
In order to include NEON intrinsics in Skia, the following macros and header file should be included.
#if defined(__ARM_HAVE_NEON) && defined(SK_CPU_LENDIAN)
#define SK_USE_NEON
#include <arm_neon.h>
#endif
The code below is broken into 3 segments: UNROLL (to work on 8 pixels at a time), UNROLL_BY_2 (to work on 4 pixels) and UNROLL_BY_4 (to work on 2 pixels). The existing residual C++ code handles any remaining pixels one at a time. To put things in perspective, imagine the benefit when this function is working on a 720p (1280x720 pixels) or 1080p (1920x1080 pixels) screen.
#ifdef SK_USE_NEON
#define UNROLL 8
#define UNROLL_BY_2 4
#define UNROLL_BY_4 2
do {
    int w = width;
    int count = 0;
    /* Warm up the caches; a cache line is 64 bytes. */
    __builtin_prefetch(mask);
    __builtin_prefetch(device);
    if (w >= UNROLL) {
        /* mask or alpha */
        uint16x8_t alpha_full;
        uint32x4_t alpha_lo, alpha_hi;
        uint32x4_t alpha_slo, alpha_shi; /* saturated and scaled */
        uint32x4_t rb, ag;               /* for device premultiply */
        uint32x4_t rbp, agp;             /* for pmc premultiply */
        uint32x4_t dev_lo, dev_hi;
        uint32x4_t pmc_dup = vdupq_n_u32(pmc);
        uint32x4_t pmc_lo, pmc_hi;
        do {
            /* Prefetch the next cache line of mask/device data. */
            if (count % 64 == 0) {
                __builtin_prefetch(mask);
            }
            if (count % 16 == 0) {
                __builtin_prefetch(device);
            }
            /* Widen 8 coverage bytes to 8x16, then to two 4x32 vectors. */
            alpha_full = vmovl_u8(vld1_u8(mask));
            alpha_lo = vmovl_u16(vget_low_u16(alpha_full));
            alpha_hi = vmovl_u16(vget_high_u16(alpha_full));
            dev_lo = vld1q_u32(device);
            dev_hi = vld1q_u32(device + 4);
            /* SkAlpha255To256(255 - aa) == 256 - aa */
            alpha_slo = vsubq_u32(vdupq_n_u32(0x00000100), alpha_lo);
            alpha_shi = vsubq_u32(vdupq_n_u32(0x00000100), alpha_hi);
            /* SkAlpha255To256(aa) == aa + 1 */
            alpha_lo = vaddq_u32(alpha_lo, vdupq_n_u32(0x00000001));
            alpha_hi = vaddq_u32(alpha_hi, vdupq_n_u32(0x00000001));
            /* SkAlphaMulQ(*device, 256 - aa) for the low 4 pixels */
            rb = vshrq_n_u32(vmulq_u32(vandq_u32(dev_lo, vdupq_n_u32(0x00FF00FF)), alpha_slo), 8);
            ag = vmulq_u32(vandq_u32(vshrq_n_u32(dev_lo, 8), vdupq_n_u32(0x00FF00FF)), alpha_slo);
            dev_lo = vorrq_u32(vandq_u32(rb, vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
            /* SkAlphaMulQ(pmc, aa + 1) for the low 4 pixels */
            rbp = vshrq_n_u32(vmulq_u32(vandq_u32(pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_lo), 8);
            agp = vmulq_u32(vandq_u32(vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_lo);
            pmc_lo = vorrq_u32(vandq_u32(rbp, vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));
            dev_lo = vaddq_u32(pmc_lo, dev_lo);
            /* Same two steps for the high 4 pixels */
            rb = vshrq_n_u32(vmulq_u32(vandq_u32(dev_hi, vdupq_n_u32(0x00FF00FF)), alpha_shi), 8);
            ag = vmulq_u32(vandq_u32(vshrq_n_u32(dev_hi, 8), vdupq_n_u32(0x00FF00FF)), alpha_shi);
            dev_hi = vorrq_u32(vandq_u32(rb, vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
            rbp = vshrq_n_u32(vmulq_u32(vandq_u32(pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_hi), 8);
            agp = vmulq_u32(vandq_u32(vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_hi);
            pmc_hi = vorrq_u32(vandq_u32(rbp, vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));
            dev_hi = vaddq_u32(pmc_hi, dev_hi);
            vst1q_u32(device, dev_lo);
            vst1q_u32(device + 4, dev_hi);
            device += UNROLL;
            mask += UNROLL;
            count += UNROLL;
            w -= UNROLL;
        } while (w >= UNROLL);
    } else if (w >= UNROLL_BY_4) {
        if (w >= UNROLL_BY_2) {
            /* mask or alpha: vld1_u8 reads 8 coverage bytes even though only
               4 are consumed here (assumes the mask row stays readable). */
            uint16x8_t alpha_full = vmovl_u8(vld1_u8(mask));
            uint32x4_t alpha_lo;
            uint32x4_t alpha_slo;            /* saturated and scaled */
            uint32x4_t rb, ag;               /* for device premultiply */
            uint32x4_t rbp, agp;             /* for pmc premultiply */
            uint32x4_t dev_lo;
            uint32x4_t pmc_lo;
            uint32x4_t pmc_dup = vdupq_n_u32(pmc);
            dev_lo = vld1q_u32(device);
            alpha_lo = vmovl_u16(vget_low_u16(alpha_full));
            /* SkAlpha255To256(255 - aa) == 256 - aa */
            alpha_slo = vsubq_u32(vdupq_n_u32(0x00000100), alpha_lo);
            /* SkAlpha255To256(aa) == aa + 1 */
            alpha_lo = vaddq_u32(alpha_lo, vdupq_n_u32(0x00000001));
            rb = vshrq_n_u32(vmulq_u32(vandq_u32(dev_lo, vdupq_n_u32(0x00FF00FF)), alpha_slo), 8);
            ag = vmulq_u32(vandq_u32(vshrq_n_u32(dev_lo, 8), vdupq_n_u32(0x00FF00FF)), alpha_slo);
            dev_lo = vorrq_u32(vandq_u32(rb, vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
            rbp = vshrq_n_u32(vmulq_u32(vandq_u32(pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_lo), 8);
            agp = vmulq_u32(vandq_u32(vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_lo);
            pmc_lo = vorrq_u32(vandq_u32(rbp, vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));
            dev_lo = vaddq_u32(pmc_lo, dev_lo);
            vst1q_u32(device, dev_lo);
            device += UNROLL_BY_2;
            mask += UNROLL_BY_2;
            w -= UNROLL_BY_2;
        }
        if (w >= UNROLL_BY_4) {
            /* mask or alpha: load at the current mask position so the correct
               coverage bytes are used whether or not the 4-pixel pass above ran
               (vld1_u8 again reads 8 bytes; only the first 2 are consumed). */
            uint32x2_t alpha_lo = vget_low_u32(vmovl_u16(vget_low_u16(vmovl_u8(vld1_u8(mask)))));
            uint32x2_t alpha_slo;            /* saturated and scaled */
            uint32x2_t rb, ag;               /* for device premultiply */
            uint32x2_t rbp, agp;             /* for pmc premultiply */
            uint32x2_t dev_lo;
            uint32x2_t pmc_lo;
            uint32x2_t pmc_dup = vdup_n_u32(pmc);
            dev_lo = vld1_u32(device);
            /* SkAlpha255To256(255 - aa) == 256 - aa */
            alpha_slo = vsub_u32(vdup_n_u32(0x00000100), alpha_lo);
            /* SkAlpha255To256(aa) == aa + 1 */
            alpha_lo = vadd_u32(alpha_lo, vdup_n_u32(0x00000001));
            rb = vshr_n_u32(vmul_u32(vand_u32(dev_lo, vdup_n_u32(0x00FF00FF)), alpha_slo), 8);
            ag = vmul_u32(vand_u32(vshr_n_u32(dev_lo, 8), vdup_n_u32(0x00FF00FF)), alpha_slo);
            dev_lo = vorr_u32(vand_u32(rb, vdup_n_u32(0x00FF00FF)), vand_u32(ag, vdup_n_u32(0xFF00FF00)));
            rbp = vshr_n_u32(vmul_u32(vand_u32(pmc_dup, vdup_n_u32(0x00FF00FF)), alpha_lo), 8);
            agp = vmul_u32(vand_u32(vshr_n_u32(pmc_dup, 8), vdup_n_u32(0x00FF00FF)), alpha_lo);
            pmc_lo = vorr_u32(vand_u32(rbp, vdup_n_u32(0x00FF00FF)), vand_u32(agp, vdup_n_u32(0xFF00FF00)));
            dev_lo = vadd_u32(pmc_lo, dev_lo);
            vst1_u32(device, dev_lo);
            device += UNROLL_BY_4;
            mask += UNROLL_BY_4;
            w -= UNROLL_BY_4;
        }
    }
    /* residuals (everything that cannot be handled by NEON) */
    while (w > 0) {
        unsigned aa = *mask++;
        if (aa != 0) { /* aa == 0 leaves the destination pixel unchanged */
            *device = SkAlphaMulQ(pmc, SkAlpha255To256(aa)) +
                      SkAlphaMulQ(*device, SkAlpha255To256(255 - aa));
        }
        device += 1;
        --w;
    }
    device = (uint32_t*)((char*)device + dstRB);
    mask += maskRB;
} while (--height != 0);
#endif