Monday, August 15, 2016

SIMD Optimization

ARM processors have taken the mobile and embedded world by storm. The next frontier for the ARM architecture is the cloud server market. Several ARM licensees are refining their microarchitecture implementations to extract the best possible IPC at the lowest possible power drain.

This article is about a nice little coprocessor called NEON. The NEON coprocessor has its own separate instruction pipeline aimed at working efficiently with vectors and matrices; if you like MATLAB coding, you will love working with NEON. NEON is a SIMD (Single Instruction, Multiple Data) engine: it operates on more bits per instruction and thereby, in theory, improves data throughput. A typical ARMv8 processor works on 64 bits (8 bytes) of data per instruction, while a NEON coprocessor can work on at most 256 bits (32 bytes) at a time. In theory that is a 4x improvement in data throughput simply by using NEON over plain ARM code, apart from the added benefit of efficiently using the (prefetcher-backed) D-cache and I-cache through NEON loads (vld) and stores (vst).

In the initial days of my NEON optimization work I used to submit patches written in NEON assembly. Over time I have realized that it is much easier and quicker to transform ordinary C/C++ looping constructs using NEON intrinsics than to write NEON assembly. With NEON intrinsics you, as a developer, need not worry about register allocation. ARM compilers/assemblers have evolved so much over the years that they can generate better NEON assembly (more efficient NEON register usage) than hand-written NEON assembly. Another advantage of NEON intrinsics over NEON assembly is that your code is future-proof against instruction-set changes in the NEON architecture between ARM generations, for example between ARMv7 and ARMv8.
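As a toy illustration of what this transformation looks like (this is not from the Skia patch, and the function names here are made up), below is an ordinary saturating byte-add loop and a NEON-intrinsics sketch of the same loop. The intrinsics map almost one-to-one onto vld1.8/vqadd.u8/vst1.8, but the compiler decides which Q registers to use:

#include <arm_neon.h>
#include <stdint.h>

/* Scalar reference: per-byte saturating add. */
static void add_sat_u8_c(uint8_t* dst, const uint8_t* a, const uint8_t* b, int n) {
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}

/* NEON intrinsics sketch: 16 bytes per iteration, residual bytes in plain C.
   Note there is no explicit register allocation anywhere. */
static void add_sat_u8_neon(uint8_t* dst, const uint8_t* a, const uint8_t* b, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);          /* load 16 bytes (vld)  */
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));     /* saturating add + vst */
    }
    for (; i < n; i++) {                          /* residual bytes       */
        int s = a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}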

This video gives a very good introduction to the ARMv8 NEON architecture and its potential applications.

My first line of attack on any optimization problem is to study CPU usage using the perf tool supported by the Android Linux kernel. Other tools of interest are top, DDMS, etc. For example, consider an Android application comprising two threads: one thread used for drawing (painting) and the other used for rendering with hardware acceleration (GPU) enabled. Based on the top functions that show up in the perf report, the next step is to look at the source code of each function and understand whether it can be translated to NEON. Sometimes the CPU utilization numbers in the perf report don't reflect the actual performance bottlenecks. I have seen cases where a Skia function [Skia is the popular 2D graphics library used for 2D drawing in the Android OS and in Chrome and other WebKit-based browsers] slows things down by lying in the critical path of the code flow, thereby limiting the frames per second (fps) of the rendering application. Whenever a Skia function shows up in the perf report, it is always beneficial to double-check the source code and understand what the code is doing.

In the article below I am sharing a NEON optimization I implemented in the Skia 2D library (libskia.so) in Android for a mathematical visualization problem. Please note the optimization is for the ARMv7 architecture, which has 32 x 64-bit D registers or 16 x 128-bit Q registers. D and Q are different views of the same register space, which in turn is shared with the floating point unit (VFP). So by writing an optimization using Q registers, you are implementing an algorithm that works on 16 bytes per instruction vs. 4 bytes per instruction (i.e., 4 instructions to accomplish the same task when written in ARM assembly). I am using simple instruction counting here, ignoring pipeline details.
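A minimal sketch of what the D/Q aliasing looks like at the intrinsics level (illustration only, not part of the patch): a 128-bit Q value can be split into its two 64-bit D halves and recombined, which is exactly what the vget_low/vget_high/vcombine intrinsics express.

#include <arm_neon.h>

/* A Q register holds four 32-bit lanes; its low and high halves are D registers. */
static uint32x4_t split_and_recombine(uint32x4_t q) {
    uint32x2_t d_lo = vget_low_u32(q);   /* lanes 0-1: one D register view   */
    uint32x2_t d_hi = vget_high_u32(q);  /* lanes 2-3: the other D register  */
    return vcombine_u32(d_lo, d_hi);     /* back to the full Q register view */
}

The optimized blit below uses the same idea when it peels the 8 loaded mask bytes apart into two groups of 4 alpha values.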

After understanding the C/C++ looping constructs, the next step is to draw and solve the problem on a piece of paper using vectors and matrices. Translating the code into NEON intrinsics can then be accomplished using this document.
Below is a bottleneck blitting function in Skia (the function name is not explicitly shown) before and after optimization. Note that every pixel in Skia is represented as ARGB8888 (32 bits or 4 bytes of data, with A=Alpha, R=Red, G=Green, B=Blue) in the code below.
Before Optimization
static void SomeSkFunction(void* SK_RESTRICT dst, size_t dstRB,
                           const void* SK_RESTRICT maskPtr, size_t maskRB,
                           SkColor color, int width, int height) {
    SkPMColor pmc = SkPreMultiplyColor(color);
    SkPMColor* SK_RESTRICT device = (SkPMColor*)dst;
    const uint8_t* SK_RESTRICT mask = (const uint8_t*)maskPtr;

    maskRB -= width;
    dstRB -= (width << 2);
    do {
        int w = width;
        do {
            unsigned aa = *mask++;
            *device = SkAlphaMulQ(pmc, SkAlpha255To256(aa)) + SkAlphaMulQ(*device, SkAlpha255To256(255 - aa));
            device += 1;
        } while (--w != 0);
        device = (uint32_t*)((char*)device + dstRB);
        mask += maskRB;
    } while (--height != 0);
}
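For reference, the two Skia helpers used above behave roughly as in the simplified sketch below (the real definitions live in Skia's SkColorPriv.h). The blend being computed per pixel is result = pmc*(aa+1)/256 + device*(256-aa)/256, carried out on the R/B and A/G channel pairs in parallel within a 32-bit word:

/* Simplified sketch of the Skia helpers (see SkColorPriv.h for the real code). */

/* Map a 0..255 alpha to a 1..256 scale so a shift by 8 can replace a divide by 255. */
static inline unsigned SkAlpha255To256(unsigned alpha) {
    return alpha + 1;
}

/* Scale all four channels of a premultiplied ARGB8888 pixel by scale/256.
   R and B are handled together in one pair of bytes, A and G in the other. */
static inline uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
    uint32_t mask = 0x00FF00FF;
    uint32_t rb = ((c & mask) * scale) >> 8;   /* red and blue    */
    uint32_t ag = ((c >> 8) & mask) * scale;   /* alpha and green */
    return (rb & mask) | (ag & ~mask);
}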
After Optimization Using NEON Intrinsics
In order to use NEON intrinsics in Skia, the following macro guard and header file should be included.
#if defined(__ARM_HAVE_NEON) && defined(SK_CPU_LENDIAN)
    #define SK_USE_NEON
    #include <arm_neon.h>
#endif
The code below is broken down into 3 segments: UNROLL (working on 8 pixels), UNROLL_BY_2 (working on 4 pixels) and UNROLL_BY_4 (working on 2 pixels). The existing residual C++ code handles anything smaller than 2 pixels. To put things in perspective, imagine the benefit when this function is working on a 720p (1280x720 pixels) or 1080p (1920x1080 pixels) screen.
#ifdef SK_USE_NEON
#define UNROLL 8
#define UNROLL_BY_2 4
#define UNROLL_BY_4 2
    do {
        int w = width;
        int count = 0;
        //cache line is 64 bytes
        __builtin_prefetch(mask);
        __builtin_prefetch(device);

        if( w>= UNROLL){
            /*mask or alpha*/
            uint16x8_t alpha_full;
            uint32x4_t alpha_lo, alpha_hi;
            uint32x4_t alpha_slo, alpha_shi; /*Saturated and scaled */
            uint32x4_t rb,ag;    /*For device premultiply */
            uint32x4_t rbp,agp;  /*For pmc premultiply */
            uint32x4_t dev_lo, dev_hi;
            uint32x4_t pmc_dup = vdupq_n_u32(pmc);
            uint32x4_t pmc_lo, pmc_hi;

            do{
                if( (int) count % 64 == 0){
                   __builtin_prefetch(mask);
                }
                if( (int) count % 16 == 0){
                   __builtin_prefetch(device);
                }

                alpha_full = vmovl_u8(vld1_u8(mask));
                alpha_lo = vmovl_u16(vget_low_u16(alpha_full));
                alpha_hi = vmovl_u16(vget_high_u16(alpha_full));

                dev_lo = vld1q_u32(device);
                dev_hi = vld1q_u32(device + 4);

                /*SkAlpha255To256(255-aa)*/
                //alpha_slo = vaddq_u32(vsubq_u32( vdupq_n_u32(0x000000FF), alpha_lo), vdupq_n_u32(0x00000001));
                alpha_slo = vsubq_u32( vdupq_n_u32(0x00000100), alpha_lo);
                //alpha_shi = vaddq_u32(vsubq_u32( vdupq_n_u32(0x000000FF), alpha_hi), vdupq_n_u32(0x00000001));
                alpha_shi = vsubq_u32( vdupq_n_u32(0x00000100), alpha_hi);
                /*SkAlpha255To256(aa)*/
                alpha_lo = vaddq_u32(alpha_lo, vdupq_n_u32(0x00000001));
                alpha_hi = vaddq_u32(alpha_hi, vdupq_n_u32(0x00000001));

                rb = vshrq_n_u32( vmulq_u32( vandq_u32( dev_lo, vdupq_n_u32(0x00FF00FF)), alpha_slo), 8);
                ag = vmulq_u32( vandq_u32( vshrq_n_u32(dev_lo, 8), vdupq_n_u32(0x00FF00FF)), alpha_slo);
                dev_lo = vorrq_u32( vandq_u32(rb,  vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
                rbp = vshrq_n_u32( vmulq_u32( vandq_u32( pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_lo), 8);
                agp = vmulq_u32( vandq_u32( vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_lo);
                pmc_lo = vorrq_u32( vandq_u32(rbp,  vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));

                dev_lo = vaddq_u32 ( pmc_lo, dev_lo);

                rb = vshrq_n_u32( vmulq_u32( vandq_u32( dev_hi, vdupq_n_u32(0x00FF00FF)), alpha_shi), 8);
                ag = vmulq_u32( vandq_u32( vshrq_n_u32(dev_hi, 8), vdupq_n_u32(0x00FF00FF)), alpha_shi);
                dev_hi = vorrq_u32( vandq_u32(rb, vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
                rbp = vshrq_n_u32( vmulq_u32( vandq_u32( pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_hi), 8);
                agp = vmulq_u32( vandq_u32( vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_hi);
                pmc_hi = vorrq_u32( vandq_u32(rbp,  vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));

                dev_hi = vaddq_u32 ( pmc_hi, dev_hi);

                vst1q_u32(device, dev_lo);
                vst1q_u32(device + 4, dev_hi);

                device += UNROLL;
                mask += UNROLL;
                count += UNROLL;
                w -= UNROLL;
            } while (w >= UNROLL);
        } else if (w >= UNROLL_BY_4) {
               /*mask or alpha*/
               uint16x8_t alpha_full = vmovl_u8(vld1_u8(mask));

               if(w >= UNROLL_BY_2){
                   uint32x4_t alpha_lo;
                   uint32x4_t alpha_slo; /*Saturated and scaled */
                   uint32x4_t rb,ag;     /*For device premultiply */
                   uint32x4_t rbp,agp;   /*For pmc premultiply */
                   uint32x4_t dev_lo;
                   uint32x4_t pmc_lo;
                   uint32x4_t pmc_dup = vdupq_n_u32(pmc);

                   dev_lo = vld1q_u32(device);
                   alpha_lo = vmovl_u16(vget_low_u16(alpha_full));

                   /*SkAlpha255To256(255-aa)*/
                   //alpha_slo = vaddq_u32(vsubq_u32( vdupq_n_u32(0x000000FF), alpha_lo), vdupq_n_u32(0x00000001));
                   alpha_slo = vsubq_u32( vdupq_n_u32(0x00000100), alpha_lo);
                   /*SkAlpha255To256(aa)*/
                   alpha_lo = vaddq_u32(alpha_lo, vdupq_n_u32(0x00000001));

                   rb = vshrq_n_u32( vmulq_u32( vandq_u32( dev_lo, vdupq_n_u32(0x00FF00FF)), alpha_slo), 8);
                   ag = vmulq_u32( vandq_u32( vshrq_n_u32(dev_lo, 8), vdupq_n_u32(0x00FF00FF)), alpha_slo);
                   dev_lo = vorrq_u32( vandq_u32(rb,  vdupq_n_u32(0x00FF00FF)), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));
                   rbp = vshrq_n_u32( vmulq_u32( vandq_u32( pmc_dup, vdupq_n_u32(0x00FF00FF)), alpha_lo), 8);
                   agp = vmulq_u32( vandq_u32( vshrq_n_u32(pmc_dup, 8), vdupq_n_u32(0x00FF00FF)), alpha_lo);
                   pmc_lo = vorrq_u32( vandq_u32(rbp,  vdupq_n_u32(0x00FF00FF)), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));

                   dev_lo = vaddq_u32 ( pmc_lo, dev_lo);

                   vst1q_u32(device, dev_lo);

                   device += UNROLL_BY_2;
                   mask += UNROLL_BY_2;
                   w -= UNROLL_BY_2;
               }

               if (w >= UNROLL_BY_4){
                   /*mask or alpha*/
                   uint32x2_t alpha_lo;
                   uint32x2_t alpha_slo; /*Saturated and scaled */
                   uint32x2_t rb,ag;
                   uint32x2_t rbp,agp;   /*For pmc premultiply */
                   uint32x2_t dev_lo;
                   uint32x2_t pmc_lo;
                   uint32x2_t pmc_dup = vdup_n_u32(pmc);

                   dev_lo = vld1_u32(device);
                   /* If the 4-pixel block above already consumed lanes 0-3 of
                      alpha_full, the remaining two mask values are in lanes 4-5;
                      otherwise (width of 2 or 3) they are still in lanes 0-1. */
                   alpha_lo = (width >= UNROLL_BY_2)
                              ? vget_low_u32(vmovl_u16(vget_high_u16(alpha_full)))
                              : vget_low_u32(vmovl_u16(vget_low_u16(alpha_full)));

                   /*SkAlpha255To256(255-aa)*/
                   //alpha_slo = vaddq_u32(vsubq_u32( vdupq_n_u32(0x000000FF), alpha_lo), vdupq_n_u32(0x00000001));
                   alpha_slo = vsub_u32( vdup_n_u32(0x00000100), alpha_lo);
                   /*SkAlpha255To256(aa)*/
                   alpha_lo = vadd_u32(alpha_lo, vdup_n_u32(0x00000001));

                   rb = vshr_n_u32( vmul_u32( vand_u32( dev_lo, vdup_n_u32(0x00FF00FF)), alpha_slo), 8);
                   ag = vmul_u32( vand_u32( vshr_n_u32(dev_lo, 8), vdup_n_u32(0x00FF00FF)), alpha_slo);
                   dev_lo = vorr_u32( vand_u32(rb,  vdup_n_u32(0x00FF00FF)), vand_u32(ag, vdup_n_u32(0xFF00FF00)));
                   rbp = vshr_n_u32( vmul_u32( vand_u32( pmc_dup, vdup_n_u32(0x00FF00FF)), alpha_lo), 8);
                   agp = vmul_u32( vand_u32( vshr_n_u32(pmc_dup, 8), vdup_n_u32(0x00FF00FF)), alpha_lo);
                   pmc_lo = vorr_u32( vand_u32(rbp,  vdup_n_u32(0x00FF00FF)), vand_u32(agp, vdup_n_u32(0xFF00FF00)));

                   dev_lo = vadd_u32 ( pmc_lo, dev_lo);

                   vst1_u32(device, dev_lo);

                   device += UNROLL_BY_4;
                   mask += UNROLL_BY_4;
                   w -= UNROLL_BY_4;
               }
        }

        /*residuals (which is everything that cannot be handled by neon) */
        while( w > 0){
            unsigned aa = *mask++;
            /* aa == 0 leaves the destination pixel unchanged, so skip the blend */
            if (aa != 0) {
                *device = SkAlphaMulQ(pmc, SkAlpha255To256(aa)) + SkAlphaMulQ(*device, SkAlpha255To256(255 - aa));
            }
            device += 1;
            --w;

        }
        device = (uint32_t*)((char*)device + dstRB);
        mask += maskRB;
    } while (--height != 0);
#endif
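A quick way to gain confidence in this kind of rewrite is to check the vector math against the scalar formula on a handful of pixels. The sketch below is illustrative only (it is not part of the Skia patch, and blend_scalar/blend_neon4 are made-up names): it applies the same Q-register transform as the 4-pixel block above and compares it with the scalar blend, lane by lane.

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

/* Scalar reference for one pixel: pmc*(aa+1)/256 + dst*(256-aa)/256. */
static uint32_t blend_scalar(uint32_t pmc, uint32_t dst, unsigned aa) {
    const uint32_t m = 0x00FF00FF;
    unsigned s = aa + 1, d = 256 - aa;
    uint32_t src = ((((pmc & m) * s) >> 8) & m) | ((((pmc >> 8) & m) * s) & ~m);
    uint32_t dev = ((((dst & m) * d) >> 8) & m) | ((((dst >> 8) & m) * d) & ~m);
    return src + dev;
}

/* The 4-pixel Q-register transform, as in the UNROLL_BY_2 block above. */
static void blend_neon4(uint32_t* dev, const uint8_t* mask, uint32_t pmc) {
    uint32x4_t m     = vdupq_n_u32(0x00FF00FF);
    uint32x4_t alpha = vmovl_u16(vget_low_u16(vmovl_u8(vld1_u8(mask))));
    uint32x4_t s_src = vaddq_u32(alpha, vdupq_n_u32(1));        /* SkAlpha255To256(aa)     */
    uint32x4_t s_dst = vsubq_u32(vdupq_n_u32(256), alpha);      /* SkAlpha255To256(255-aa) */
    uint32x4_t d     = vld1q_u32(dev);
    uint32x4_t p     = vdupq_n_u32(pmc);

    uint32x4_t rb  = vshrq_n_u32(vmulq_u32(vandq_u32(d, m), s_dst), 8);
    uint32x4_t ag  = vmulq_u32(vandq_u32(vshrq_n_u32(d, 8), m), s_dst);
    uint32x4_t ddv = vorrq_u32(vandq_u32(rb, m), vandq_u32(ag, vdupq_n_u32(0xFF00FF00)));

    uint32x4_t rbp = vshrq_n_u32(vmulq_u32(vandq_u32(p, m), s_src), 8);
    uint32x4_t agp = vmulq_u32(vandq_u32(vshrq_n_u32(p, 8), m), s_src);
    uint32x4_t dsr = vorrq_u32(vandq_u32(rbp, m), vandq_u32(agp, vdupq_n_u32(0xFF00FF00)));

    vst1q_u32(dev, vaddq_u32(dsr, ddv));
}

int main(void) {
    uint32_t pmc     = 0x80402010;   /* arbitrary premultiplied colour */
    uint32_t dev[4]  = { 0xFF112233, 0x80FF0000, 0x00000000, 0xFFFFFFFF };
    uint8_t  mask[8] = { 0, 37, 128, 255, 0, 0, 0, 0 };   /* only 4 lanes used */
    uint32_t ref[4];

    for (int i = 0; i < 4; i++)
        ref[i] = blend_scalar(pmc, dev[i], mask[i]);
    blend_neon4(dev, mask, pmc);
    for (int i = 0; i < 4; i++)
        printf("pixel %d: neon %08x scalar %08x %s\n",
               i, dev[i], ref[i], dev[i] == ref[i] ? "OK" : "MISMATCH");
    return 0;
}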

Wednesday, November 20, 2013

Taste of starting your own business

Recently I had the pleasure of being part of a "Startup Weekend" event. Our team, "LetsChipIn", won first place.
More details can be read at the sites below.
I wrote a post on the TSW blog highlighting our team composition and briefly describing our business model. Check this link: http://triangle.startupweekend.org/uncategorized/lets-chip-in-crowdsourced-community-gift-giving/.

In less than 54 hours, each team needs to pitch a business model, form a cohesive unit of business, design and development people, and come up with a minimum viable product (MVP). The winning team's final product was a web application. See this site [http://letschipin.co] for more information about the business. There are also a Facebook page [https://www.facebook.com/letschipin], a Twitter handle, an Instagram account, etc., created to bring in the social context.

Lessons learnt

Web application development skills are paramount for any business to thrive these days. A web application requires the front-end design (using HTML/JS/CSS) and the back-end database management/service framework to go hand in hand. The back end can be implemented by picking the right "Web Application Framework [http://en.wikipedia.org/wiki/Web_application_framework]" or "Content Management System [http://en.wikipedia.org/wiki/Content_management_system]/Content Management Framework [http://en.wikipedia.org/wiki/Content_management_framework]".

There is quite a good set of free developer tools available online that you could utilize.

A wireframe to model the website layout. This can either be drawn on paper or created using tools like http://balsamiq.com/ and http://moqups.com.

The website was designed using http://getbootstrap.com/2.3.2/

The website can be hosted on servers for a nominal fee. Check https://www.digitalocean.com/.

The payment system was implemented using https://www.braintreepayments.com/.

This link http://startupweekend.org/resources/ lists all the necessary tools to have your high tech business model up and running.

A good website requires good front-end developers ably supported by creative web designers, folks who are well versed in the Adobe Creative Suite and can design cool things.

On a closing note, entrepreneurship is all about being an optimist and being ready to take risks.

Thursday, April 25, 2013

Debugging Remote Applications With ECLIPSE


1.  You must have the Android SDK plugin for Eclipse installed. Instructions to install it are here:

2.  From the DDMS perspective or adb shell, find the port of the process that you want to debug.

3.  From the Debug perspective, go to Run, then Debug Configurations… Select Remote Java Application and click the New Launch Configuration button on the top left corner of the screen.


4.  Create a new configuration. Enter a name for it and change the Port to the corresponding port of the process that will be debugged.


5.  Click on the source tab, then click Add…
    

Select File System Directory and click OK.
              

Browse to the location of android’s framework and make sure that the Search subfolders box is checked. Then click OK. Repeat the process for any other paths that contain source files required for your application.
           

6.  Click Apply and then click Debug to start debugging the process previously selected.

7.  When the debugging starts, the Debug perspective should look similar to this:


After hitting a breakpoint, the debugger will show the file and line where the breakpoint is and the stack trace:


For subsequent debugging sessions of other processes, just go to STEP 3, change the port number to that of the new process you want to debug, and click Apply and then Debug; nothing else has to be reconfigured.

Wednesday, April 24, 2013

Embedded panels and its impact on application frame rate and UI fluidity


DISPLAY BASICS [1]
In general, the display subsystem of an embedded system is designed to transfer the data from a block of internal processor memory called a Frame Buffer (FB), which software running on the main Processor updates to change the image being displayed. The display data consists of some number of bits for each pixel of the displayed area. In most cases there are specific bits which describe the intensity of the red, green and blue (RGB) components of each pixel, although other formats such as YUV/YCrCb (luminance, red chrominance, blue chrominance) are occasionally used.
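As a rough feel for the numbers involved (my own back-of-the-envelope figures, not taken from the referenced paper): a 1080p frame buffer at 4 bytes per pixel is about 8 MB, and refreshing it 60 times a second moves roughly half a gigabyte per second.

#include <stdio.h>

int main(void) {
    const long width = 1920, height = 1080;  /* 1080p panel   */
    const long bytes_per_pixel = 4;          /* e.g. ARGB8888 */
    const long refresh_hz = 60;

    long frame_bytes   = width * height * bytes_per_pixel;
    long bytes_per_sec = frame_bytes * refresh_hz;

    printf("frame buffer size : %.1f MB\n", frame_bytes / (1024.0 * 1024.0));
    printf("refresh bandwidth : %.2f GB/s\n", bytes_per_sec / (1024.0 * 1024.0 * 1024.0));
    return 0;
}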
A logic block generally referred to as a Display Controller fetches data from the Frame Buffer, formats it according to the desired display interface, and transmits it to the Display. Figure 1 shows a basic system, where the Display Controller is internal to the Embedded Processor and communicates directly with the Display.
 
In many systems the Embedded Processor includes a Display Interface Controller, which creates an intermediate communication structure which is separate from the Display Interface. In this case the Display Controller is external to the Embedded processor, and is often included within the Display itself. This external Display Controller often includes its own Frame Buffer. Systems of this type are shown in Figure 2. Note that it is also possible to connect to an external Display Controller via a standard bus such as PCI.
 

The intensity of each pixel on a display, whether it is a CRT, LCD or other type, generally must be periodically refreshed. The architecture of this function was developed in the days of CRTs, but has remained quite consistent as shown in Figure 3. The refresh is usually done by "raster scanning", which starts at the first pixel (typically the upper left hand corner), generates an intensity for that pixel, and then moves horizontally through all pixels of the first scan line. At that point a "horizontal synchronization" or HSYNC signal occurs, which causes the refresh to move to the beginning of the second line, and so on. This process continues until all lines have been refreshed, at which point a "vertical synchronization" or VSYNC signal occurs. This causes the refresh to return to the first pixel, and the process repeats.

The physical nature of the display generally dictates that the response to HSYNC or VSYNC is not instantaneous, and thus nothing is refreshed for a time after each signal is received. These times are referred to as "blanking periods". They are shown as the dashed lines in Figure 3.
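To make the blanking periods concrete, here is a worked example using the common 1080p60 timing (the blanking figures are the standard CEA-861 values and are purely illustrative): the pixel clock has to cover the blanked pixels and lines as well as the visible ones.

#include <stdio.h>

int main(void) {
    /* 1080p60: 1920x1080 visible, plus horizontal and vertical blanking. */
    const long h_active = 1920, h_blank = 280;   /* 2200 pixel clocks per line */
    const long v_active = 1080, v_blank = 45;    /* 1125 lines per frame       */
    const long refresh_hz = 60;

    long pixel_clock = (h_active + h_blank) * (v_active + v_blank) * refresh_hz;
    printf("pixel clock: %.1f MHz\n", pixel_clock / 1e6);   /* ~148.5 MHz */
    return 0;
}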
 


Command Mode Panels (Generally used in high end smartphones)
        DSI (Display Serial Interface) in command mode expects the panel to have an internal frame buffer. The job of the DSS (Display Sub-System) is just to push a frame out to the panel. The panel has a controller inside it which refreshes the screen using the contents of that internal buffer, on the panel controller's own timings.
        In DSI command mode, you control the rate at which data is pushed out to the panel. So if the application generates 50 fps of data, data is pushed out at that rate. However, since the panel's controller refreshes itself at 60 Hz, you will see tearing, because the internal buffer is being filled more slowly than the 60 Hz scan-out.




Video Mode Panels (Generally used in Tablets)
        DSI in video mode requires us to push data out to the panel continuously. The DSS completely controls the panel's timings and hence the panel's refresh rate. This is needed because the panel has no internal buffer and needs the host to pump data to it continuously.
        In video mode, the DSS driver provides an API which tells an upper layer the right time to present a new frame to the DSS; this is the time when a vsync occurs. If the application waits on this API but pushes out data too slowly to be ready within one refresh period, it only has a frame ready on every alternate vsync from the panel, leading to half the frame rate.
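A quick illustration of the half-rate effect with made-up numbers: a 60 Hz video mode panel gives the application a vsync roughly every 16.7 ms, so a frame that takes, say, 20 ms to produce is only ever ready on every second vsync and the effective rate drops to 30 fps.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double refresh_hz = 60.0;                 /* video mode panel refresh */
    const double vsync_ms   = 1000.0 / refresh_hz;  /* ~16.7 ms between vsyncs  */
    const double render_ms  = 20.0;                 /* assumed app frame time   */

    /* A frame can only be presented on a vsync boundary, so a frame that takes
       longer than one vsync period has to wait for the next one. */
    double vsyncs_per_frame = ceil(render_ms / vsync_ms);
    printf("effective frame rate: %.0f fps\n", refresh_hz / vsyncs_per_frame);
    return 0;
}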



Conclusion
In general, command mode panels achieve better frame rates than video mode panels for browser use cases because of the additional internal frame buffer.

References
(1) http://www.kozio.com/view_files/Embedded_Display_Interfaces_WP_final.pdf