Friday, April 19, 2013

A Performance Study On Using Uncached Buffers In WLAN Drivers



This is a short study conducted on an SDIO WLAN driver (Linux) in August 2010. I am publishing it to gain some traction on the idea that marking some buffers as uncached can significantly reduce CPU load and improve embedded system performance.

Introduction
A short study on the benefits of using uncached buffers in a WLAN driver. I experimented with the WLAN driver to get rid of some cache coherency overhead on DMA buffers by making them uncached. The results are encouraging, and I would like to share them with other driver developers. Give it a try in your respective embedded components on a case-by-case basis.

In the code snippets below, uncached buffers are enabled when ENABLE_DMA_UNCACHED_BUFF is defined.
Follow the code accordingly.

Section 1: 
How to allocate uncached DMA buffers. Note that after allocating the 8 KB chunk I store its bus base address and call the virtual base physical_base_readaddress. I need both because each DMA transaction may use any portion of the 8 KB chunk; if I am transmitting something, the data may sit anywhere within the 8 KB. In that case I compute the actual dma_bus_address from dma_base_readaddress plus the offset of the data chunk within the buffer. (Refer to Section 3.)

/* Allocate a DMA-able buffer and provide it to the upper layer to be used for all read and write transactions */
    if (pDmaReadBufAddr == 0) /* allocate only once (in case this function is called multiple times) */
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF
        /* Cached buffer: a normal kmalloc'ed, DMA-capable allocation. */
        pDmaReadBufAddr = kmalloc (MAX_BUS_TXN_SIZE, GFP_ATOMIC | GFP_DMA);
#else
        /* Uncached (write-combined) buffer: returns the CPU virtual address
         * and fills dma_base_readaddress with the matching bus address. */
        pDmaReadBufAddr = dma_alloc_writecombine(g_drv.dev, 8192, &g_drv.dma_base_readaddress, GFP_KERNEL);
#endif
        if (pDmaReadBufAddr == 0) { return -1; }
    }

#ifndef ENABLE_DMA_UNCACHED_BUFF       
        *pRxDmaBufAddr = pDmaReadBufAddr;
        *pTxDmaBufAddr = pDmaWriteBufAddr;
#else
        g_drv.physical_base_readaddress = *pRxDmaBufAddr = pDmaReadBufAddr;
        g_drv.physical_base_writeaddress = *pTxDmaBufAddr = pDmaWriteBufAddr;
#endif   

Section 2:
Uncached DMA buffer clean up

if (pDmaReadBufAddr)
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF
        kfree (pDmaReadBufAddr);
#else
        dma_free_writecombine(g_drv.dev, 8192, pDmaReadBufAddr, g_drv.dma_base_readaddress);
#endif
        pDmaReadBufAddr = 0;
    }
   
Section 3:
 DMA Transaction (Tx buffer only). Note that I am avoiding the dma_map_single() call.

#ifndef ENABLE_DMA_UNCACHED_BUFF
    {
        /* Tx path: the CPU has written the data, so the mapping
         * direction is DMA_TO_DEVICE (clean the cache before DMA). */
        dma_bus_address = dma_map_single(g_drv.dev, pData, uLen, DMA_TO_DEVICE);
        if (!dma_bus_address) {
            PERR("sdioDrv_WriteAsync: dma_map_single failed\n");
            return -1;
        }
    }
#else
    {
        /* Uncached buffer: no cache maintenance needed, so the bus address
         * is simply the bus base plus the offset into the 8 KB chunk. */
        dma_bus_address = g_drv.dma_base_readaddress +
                          (dma_addr_t)(pData - g_drv.physical_base_readaddress);
    }
#endif

Section 4:
 DMA Callback. Note that I am avoiding the dma_unmap_single() call.

if (g_drv.dma_read_addr != 0) {
        //printk(KERN_INFO "in return sdioDrv_ReadAsync ret\n");
#ifndef ENABLE_DMA_UNCACHED_BUFF       
        dma_unmap_single(g_drv.dev, g_drv.dma_read_addr, g_drv.dma_read_size, DMA_FROM_DEVICE);
#endif        
        g_drv.dma_read_addr = 0;
        g_drv.dma_read_size = 0;
    }

For every dma_map_single() and dma_unmap_single() call, the appropriate cache operation (invalidate or clean/flush) is performed, based on the DMA direction (DMA_FROM_DEVICE / DMA_TO_DEVICE), to maintain cache coherency.
Apart from these cache operations, which consume CPU cycles, frequent cache maintenance can also cause heavy cache misses (pollution). This matters most in the web-browsing use case, where WebKit caches a lot of the data that makes up a given web page, and cache misses are costly there. Of course, the uncached approach comes with its own cost: in my case I was doing a memcpy between the uncached DMA buffers and my internal cached skb buffers, and took a hit on that copy.

From the internal tests I did, the following are the performance numbers I gathered.

HTTP Throughput
    Uncached DMA buffer:  18.67 Mb/sec
    Cached DMA buffer:    17.1 Mb/sec

Avg CPU Utilization
    Test                                        Uncached DMA buffer (%)   Cached DMA buffer (%)
    HTTP throughput test (8 MB file download)   User(50) + Sys(8)         User(63) + Sys(8)
    Static CNN web page load                    User(25) + Sys(33)        User(50) + Sys(31)


[* Apart from the reduction in user and system CPU utilization, the WLAN workqueue's CPU utilization dropped by around 10% because the dma_map_single() and dma_unmap_single() calls were eliminated, while the SDIO workqueue's utilization increased by around 6% because of the memcpy from uncached to cached buffers.]

After a bit of literature survey, I came across this interesting research paper [http://quning.org/self/pku_cache.pdf], which is worth reading. It discusses the advantages of using uncached buffers in certain use cases:
(1)   Making DMA buffers uncached within an Ethernet driver (similar to what I did in my experiment). The conclusion is that it helps significantly in increasing TCP Tx throughput and, to some extent, TCP Rx throughput.
(2)   Taking implementation (1) one step further by making the skb buffers (the descriptors that hold the actual TCP packets) uncached as well. I didn't try this.
(3)   Pushing the TLB page tables into an uncached memory region.
