This is a short study I conducted on an SDIO WLAN driver (Linux) in August 2010. I am publishing it to gain some traction for the idea that marking some buffers as uncached can significantly reduce CPU load and improve embedded system performance.
Introduction
This is a short study on the benefits of using uncached buffers in a WLAN driver. I experimented with the WLAN driver to get rid of some of the cache coherency overhead on its DMA buffers by making them uncached. The results are encouraging, and I would like to share them with other driver developers. Give it a try in your own embedded components on a case-by-case basis.
In the code snippets below, uncached buffers are enabled when ENABLE_DMA_UNCACHED_BUFF is defined. Read each snippet with that in mind: the #ifndef branch is the original cached path and the #else branch is the uncached path.
Section 1:
How to allocate uncached DMA buffers? Note that after allocating the 8 KB buffer I store its base address as physical_base_readaddress (despite the name, this is the CPU-side address returned by the allocator), while the allocator returns the matching bus address in dma_base_readaddress. I need both because a DMA transaction can use any portion of the 8 KB chunk: when I transmit something, the data can be located anywhere within the 8 KB. In that case I compute the actual dma_bus_address from dma_base_readaddress plus the offset of the data chunk from physical_base_readaddress. (Refer to Section 3.)
/* Allocate a DMA-able buffer and provide it to the upper
   layer to be used for all read and write transactions */
    if (pDmaReadBufAddr == 0) /* allocate only once (in case this function is called multiple times) */
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF
        pDmaReadBufAddr = kmalloc(MAX_BUS_TXN_SIZE, GFP_ATOMIC | GFP_DMA);
#else
        pDmaReadBufAddr = dma_alloc_writecombine(g_drv.dev, 8192, &g_drv.dma_base_readaddress, GFP_KERNEL);
#endif
        if (pDmaReadBufAddr == 0) { return -1; }
    }
#ifndef ENABLE_DMA_UNCACHED_BUFF
        *pRxDmaBufAddr = pDmaReadBufAddr;
        *pTxDmaBufAddr = pDmaWriteBufAddr;
#else
        g_drv.physical_base_readaddress = *pRxDmaBufAddr = pDmaReadBufAddr;
        g_drv.physical_base_writeaddress = *pTxDmaBufAddr = pDmaWriteBufAddr;
#endif
Section 2:
Uncached DMA buffer clean up
    if (pDmaReadBufAddr)
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF
        kfree(pDmaReadBufAddr);
#else
        dma_free_writecombine(g_drv.dev, 8192, pDmaReadBufAddr, g_drv.dma_base_readaddress);
#endif
        pDmaReadBufAddr = 0;
    }
Section 3:
DMA transaction (only for the Tx buffer). Note that the uncached path avoids the dma_map_single() call.
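#ifndef ENABLE_DMA_UNCACHED_BUFF
    {
        dma_bus_address = dma_map_single(g_drv.dev, pData, uLen, DMA_FROM_DEVICE);
        /* dma_mapping_error() is the portable failure check; 0 can be a valid bus address */
        if (dma_mapping_error(g_drv.dev, dma_bus_address)) {
            PERR("sdioDrv_WriteAsync: dma_map_single failed\n");
            return -1;
        }
    }
#else
    {
        /* no mapping call: bus base + offset of pData within the uncached buffer */
        dma_bus_address = g_drv.dma_base_readaddress + (dma_addr_t)(pData - g_drv.physical_base_readaddress);
    }
#endif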
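The offset arithmetic in the uncached branch can be tried outside the kernel. The following is a minimal user-space sketch, not the driver's code: `bus_addr_t`, `struct dma_region` and `region_to_bus()` are hypothetical names introduced here for illustration, and the bounds check is an addition that the snippet above does not perform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's dma_addr_t so the sketch builds in user space. */
typedef uint64_t bus_addr_t;

/* Hypothetical descriptor for one uncached region. */
struct dma_region {
    const uint8_t *cpu_base; /* CPU-side base (physical_base_readaddress in the post) */
    bus_addr_t     bus_base; /* bus-side base (dma_base_readaddress in the post) */
    size_t         size;     /* region size in bytes (8192 in the post) */
};

/*
 * Translate a CPU pointer inside the region into a bus address.
 * Returns 0 on success, or -1 if [p, p + len) is not fully inside the region.
 */
static int region_to_bus(const struct dma_region *r, const void *p,
                         size_t len, bus_addr_t *out)
{
    uintptr_t base = (uintptr_t)r->cpu_base;
    uintptr_t addr = (uintptr_t)p;

    if (addr < base)
        return -1;                      /* pointer lies before the region */

    size_t off = (size_t)(addr - base);
    if (off > r->size || len > r->size - off)
        return -1;                      /* transfer would run past the end */

    *out = r->bus_base + off;           /* same offset on the bus side */
    return 0;
}
```

Because the whole region is mapped once at allocation time, the translation is a subtraction and an addition per transfer, which is what lets the uncached path skip the per-transaction mapping call.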
Section 4:
DMA callback. Note that the uncached path avoids the dma_unmap_single() call.
    if (g_drv.dma_read_addr != 0) {
        //printk(KERN_INFO "in return sdioDrv_ReadAsync ret\n");
#ifndef ENABLE_DMA_UNCACHED_BUFF
        dma_unmap_single(g_drv.dev, g_drv.dma_read_addr, g_drv.dma_read_size, DMA_FROM_DEVICE);
#endif
        g_drv.dma_read_addr = 0;
        g_drv.dma_read_size = 0;
    }
Based on the DMA direction (DMA_FROM_DEVICE / DMA_TO_DEVICE), the appropriate cache invalidate or cache flush operations are performed to maintain cache coherency on every dma_map_single() and dma_unmap_single() call.
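As a rough illustration of that direction dependence, here is a simplified user-space model. It assumes a CPU without hardware cache coherency; the exact operations are architecture specific, so treat the table encoded below as an assumption rather than a description of any particular kernel port. The names `enum dma_dir`, `cache_op_on_map()` and `cache_op_on_unmap()` are hypothetical.

```c
/* Values mirror the kernel's enum dma_data_direction; the type itself is a stand-in. */
enum dma_dir {
    DIR_BIDIRECTIONAL = 0,
    DIR_TO_DEVICE     = 1,
    DIR_FROM_DEVICE   = 2
};

/* Cache maintenance assumed when a cached buffer is handed to the device. */
static const char *cache_op_on_map(enum dma_dir dir)
{
    switch (dir) {
    case DIR_TO_DEVICE:     return "clean";            /* write dirty lines back so the device sees them */
    case DIR_FROM_DEVICE:   return "invalidate";       /* drop lines the device is about to overwrite */
    case DIR_BIDIRECTIONAL: return "clean+invalidate"; /* both of the above */
    }
    return "unknown";
}

/* Cache maintenance assumed when the buffer is handed back to the CPU. */
static const char *cache_op_on_unmap(enum dma_dir dir)
{
    switch (dir) {
    case DIR_TO_DEVICE:     return "none";             /* the device only read the buffer */
    case DIR_FROM_DEVICE:
    case DIR_BIDIRECTIONAL: return "invalidate";       /* discard lines fetched while the DMA ran */
    }
    return "unknown";
}
```

Every cached transfer pays for at least one of these operations over the whole buffer, while an uncached buffer pays none of them and instead pays on each CPU access to the buffer.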
Apart from the CPU cycles that the above cache operations take, frequent cache operations can also cause heavy cache misses (pollution). This matters a lot for the web browsing use case, where WebKit caches much of the data that makes up a given web page, so cache misses are costly. Of course, uncached buffers come with a cost of their own. In my case, I was doing a memcpy between those DMA buffers (uncached) and my internal buffer (a cached skb buffer), and taking a hit there.
These are the performance numbers I gathered from my internal tests.
| HTTP throughput | Uncached DMA buffer | Cached DMA buffer |
| --- | --- | --- |
| Throughput | 18.67 Mb/sec | 17.1 Mb/sec |

| Avg CPU utilization (%) | Uncached DMA buffer | Cached DMA buffer |
| --- | --- | --- |
| HTTP throughput test (8MB file download) | User (50) + Sys (8) | User (63) + Sys (8) |
| Static CNN web page load | User (25) + Sys (33) | User (50) + Sys (31) |
[*Apart from the reduction in user and system CPU utilization, the wlan workqueue's CPU utilization dropped (by around 10%) because the dma_map_single() and dma_unmap_single() calls were eliminated, but at the same time the sdio workqueue's utilization increased (by around 6%) because of the memcpy from uncached to cached buffers.]
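To make the comparison easier to read, the reported figures can be reduced to totals and a relative gain. The sketch below is plain arithmetic on the numbers above; the helper names `percent_change()` and `total_cpu()` are illustrative only.

```c
/* Relative change of `after` against `before`, in percent. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Combined user + system CPU utilization, in percent. */
static int total_cpu(int user, int sys)
{
    return user + sys;
}
```

With the reported figures, `percent_change(17.1, 18.67)` is about 9.2, so throughput improved by roughly 9%. For the HTTP download, `total_cpu(50, 8)` is 58 with uncached buffers against `total_cpu(63, 8)` at 71 with cached buffers. For the web page load, `total_cpu(25, 33)` is 58 against `total_cpu(50, 31)` at 81.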
After a bit of literature survey, I came across this interesting research paper [http://quning.org/self/pku_cache.pdf], which is worth reading. It talks about the advantages of using uncached buffers in certain use cases.
(1) Talks about making DMA buffers uncached within an Ethernet driver (similar to what I did in my experiment). The conclusion is that it helps significantly in increasing throughput in TCP Tx and to some extent in TCP Rx.
(2) Takes implementation (1) one step further by making the skb buffers (descriptors that hold the actual TCP packet) uncached. I didn't try this.
(3) Talks about pushing the TLB page table to an uncached memory region.
 