Friday, April 19, 2013

What's slowing down Mobile Facebook?

Reasons cited by Facebook for switching from a WebView-based app to a native app on the iOS platform. In short, the issues can to some extent be resolved by exposing new JS APIs:


  • API to enable JS to query/manage available physical/image memory, total texture memory, # of tiles, etc.
  • API to query HW FPS (PVRTune like)
  • API to query JS heap size, object count
  • API to invoke JS GC manually
  • API to query display refresh rate and requestAnimationFrame trigger frequency
  • API to query browser composition info (# layers, # SW composited, # HW composited, etc.)
Excerpts from http://lists.w3.org/Archives/Public/public-coremob/2012Sep/0021.html.

Following the recent announcements[1] we (Facebook) made about rebuilding
our iOS app using more native technology, we have had a lot of requests to
provide detailed feedback on the performance issues we encountered while
building for the mobile Web. Here it is. Comments welcomed.

1. Tooling / Developer APIs
---------------------------

The lack of tooling in mobile browsers makes it very difficult to dig down
and find out what the real issues are. Hence tooling, or rather the lack
thereof, is a key issue.

The biggest issues we've been facing here are memory related. Given the
size of our content, it's not uncommon for our application to exhaust the
hardware capabilities of the device, causing crashes. Unfortunately, it's
difficult for us to understand exactly what's causing these issues. GPU
buffer exhaustion? Reaching resource limits? Something else? Hard to say.

### What's missing? ###

Mainly, dev tools on the device and/or easily accessible remotely.

Things we'd want to know more about as we develop:

#### Down memory lane ####

- Heap size,
- Object count,
- GC cycles,
- GPU buffer size,
- resource limits.

Some of those are very much useful outside of the development phase,
however. E.g., LinkedIn uses UA-string sniffing[2] to determine how many
pictures can be kept in memory before hitting the device's limit; an API
exposing the device's memory resources would be a much more appropriate
(albeit fingerprintable) way of doing that.

#### FPS ####

- The ability to measure fps at the hardware level. This is essential for
testing[3], but would also be very useful to determine whether to use
infinite scrolling or pagination, for example. Same fingerprinting caveat
as above.


2. Scrolling performance
------------------------
I've already started sharing some of this with the W3C WebPerf WG[4] and
will continue bringing it to other relevant WGs in the upcoming weeks.

This is one of our most important issues. It's typically a problem on the
newsfeed and on Timeline which use infinite scrolling (content is
prefetched as the user scrolls down the app and appended) and end up
containing large amounts of content (both text AND images). Currently, we
do all of the scrolling using JS, as other options were not fast enough
(because of implementation issues).

### QoI Issues

- Inconsistent framerates, UI thread lag (stuttering).
- GPU buffer exhaustion due to size of content and number of images.
- Native momentum scrolling has a different feel across operating systems.
JS implementations end up being tailored for one OS and feel wrong on the
others (uncanny valley).
- Perf issue with touch events on Android devices (latency, not enough
events) which makes JS implementations of scrolling more brittle there.

### Requirements:

- Scrolling must be fast and smooth (this is really important for
engagement).
- It must trigger a given set of events so content can be prefetched,
computed and appended as the user scrolls towards the bottom of the loaded
content.
- It must allow i/o and computation in the background (without affecting
the smoothness).
- It must allow appending fresh content to the main content while
scrolling (again, without affecting the smoothness).
- It must reliably handle scrolling through a lot of content, including
lots of images.
- It must be possible to capture touch events during scrolling.

### new API suggestions:

- A standardized way to enable momentum scrolling across browsers.
- `onscroll` events triggered *during* scrolling.
- Structured cloning of rootless document fragments (for building doc
fragments in workers and moving them back to the main thread to be
appended).
- Simple way to implement pull to refresh (via dedicated off-bound-scroll
events?). 


3. GPU
------

Currently, the GPU is a black box (which, from what I understand, is what
vendors would like to keep it as). In truth, however, developers rely on
tricks[5] to force given content to be hardware accelerated. So it's
basically a black box with a clunky API for adding things to it. Given the
size of GPU buffers relative to the size of content consumed on devices
nowadays, I doubt we'll get to a place where managing the GPU can be left
strictly to the browser in a reasonable amount of time.

I think there's value in at least discussing the pros and cons of providing
some form of API to the GPU, if that's possible at all.


4. Other
--------

- Better touch tracking support, especially on Android.
- Smoother animations are always an asset.
- Better caching.
- AppCache is soooooo busted we stopped using it[6].

Best,

--tobie

---
[1]: https://www.facebook.com/notes/facebook-engineering/under-the-hood-rebuilding-facebook-for-ios/10151036091753920
[2]: http://engineering.linkedin.com/linkedin-ipad-5-techniques-smooth-infinite-scrolling-html5
[3]: http://lists.w3.org/Archives/Public/public-coremob/2012Aug/0014.html
[4]: http://www.w3.org/2012/09/12-webperf-minutes.html
[5]: http://davidwalsh.name/translate3d
[6]: https://etherpad.mozilla.org/appcache-london

Good resources for Web App developers


(1) To get involved with the World Wide Web Consortium (W3C)
Now anyone can form a W3C community group that could lead to standardization.
It would be good to monitor these W3C-sponsored community groups and any new ones that are formed. I expect some of these to evolve into full W3C-level (or even ISO-level) standards. The Declarative 3D and Web Payments groups look interesting from our perspective.

(2) Graphics and OGL programming
(3) Benchmarks

(4) Testing
·        A neat tool from Google's GWT team that gives you a lot of insight into the desktop Chrome browser. Unlike Firebug on Firefox or the Web Inspector in Chrome, the Speed Tracer tool breaks down all of Chrome's web processing, including paint, layout, and style calculations. But it only works in a desktop Chrome browser on a dev-channel Chrome build. So the next time you want to compare a scenario on desktop Chrome, I suggest trying it with Speed Tracer.

http://code.google.com/webtoolkit/speedtracer/ (To monitor time spent on different WebKit activities like loading resources, parsing (HTML, CSS, & JS), layout, painting, and rendering)


(5) Good resources on HTML5 know-how

  • http://www.webplatform.org/


Bit Twiddling Hacks

Recently I came across this interesting article on bit twiddling hacks from Stanford University.


"When totaling the number of operations for algorithms here, any C operator is counted as one operation. Intermediate assignments, which need not be written to RAM, are not counted. Of course, this operation counting approach only serves as an approximation of the actual number of machine instructions and CPU time. All operations are assumed to take the same amount of time, which is not true in reality, but CPUs have been heading increasingly in this direction over time. There are many nuances that determine how fast a system will run a given sample of code, such as cache sizes, memory bandwidths, instruction sets, etc. In the end, benchmarking is the best way to determine whether one method is really faster than another, so consider the techniques below as possibilities to test on your target architecture" [Source: http://graphics.stanford.edu/~seander/bithacks.html]

Link: http://graphics.stanford.edu/~seander/bithacks.html

A Performance Study on Using Uncached Buffers in WLAN Drivers



This is a short study conducted on an SDIO WLAN driver (Linux) in August 2010. I am publishing it to gain some traction on the idea that marking some buffers as uncached can significantly reduce CPU load and improve embedded system performance.

Introduction
A short study on the benefits of using uncached buffers in a WLAN driver. I experimented with the WLAN driver to get rid of some cache-coherency overhead on DMA buffers by making them uncached. The results are encouraging, and I would like to share them with the rest of the driver developers. Give it a try in your respective embedded components on a case-by-case basis.

In the code snippets below, uncached buffers are enabled if you define ENABLE_DMA_UNCACHED_BUFF.
Follow the code accordingly.

Section 1:
How to allocate uncached DMA buffers. Note that I store the base address after the allocation of 8 KB and call it physical_base_readaddress. I need to do this since each DMA transaction may use any portion of the 8 KB chunk; i.e., if I am transmitting something, my data may be located anywhere within the 8 KB. In that case I need to compute the actual dma_bus_address from dma_base_readaddress and the offset of the data chunk I will be transmitting from the physical base address. (Refer to Section 3.)

/* Allocate a DMA-able buffer and provide it to the upper layer to be used for all read and write transactions */
    if (pDmaReadBufAddr == 0) /* allocate only once (in case this function is called multiple times) */
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF      
        pDmaReadBufAddr = kmalloc (MAX_BUS_TXN_SIZE, GFP_ATOMIC | GFP_DMA);
#else
        pDmaReadBufAddr = dma_alloc_writecombine(g_drv.dev, 8192, &g_drv.dma_base_readaddress, GFP_KERNEL);
#endif
        if (pDmaReadBufAddr == 0) { return -1; }
    }

#ifndef ENABLE_DMA_UNCACHED_BUFF       
        *pRxDmaBufAddr = pDmaReadBufAddr;
        *pTxDmaBufAddr = pDmaWriteBufAddr;
#else
        g_drv.physical_base_readaddress = *pRxDmaBufAddr = pDmaReadBufAddr;
        g_drv.physical_base_writeaddress = *pTxDmaBufAddr = pDmaWriteBufAddr;
#endif   

Section 2:
Uncached DMA buffer clean up

if (pDmaReadBufAddr)
    {
#ifndef ENABLE_DMA_UNCACHED_BUFF
        kfree (pDmaReadBufAddr);
#else
        dma_free_writecombine(g_drv.dev, 8192, pDmaReadBufAddr, g_drv.dma_base_readaddress);
#endif
        pDmaReadBufAddr = 0;
    }
   
Section 3:
 DMA transaction (Tx buffer only). Note that I am avoiding the dma_map_single() call.

#ifndef ENABLE_DMA_UNCACHED_BUFF
    {
        /* Tx path: the CPU writes and the device reads, so the direction is DMA_TO_DEVICE */
        dma_bus_address = dma_map_single(g_drv.dev, pData, uLen, DMA_TO_DEVICE);
        if (!dma_bus_address) {
            PERR("sdioDrv_WriteAsync: dma_map_single failed\n");
            return -1;
        }
    }
#else
    {
        dma_bus_address = g_drv.dma_base_readaddress + (dma_addr_t)(pData - g_drv.physical_base_readaddress);
    }
#endif

Section 4:
 DMA callback. Note that I am avoiding the dma_unmap_single() call.

if (g_drv.dma_read_addr != 0) {
        //printk(KERN_INFO "in return sdioDrv_ReadAsync ret\n");
#ifndef ENABLE_DMA_UNCACHED_BUFF       
        dma_unmap_single(g_drv.dev, g_drv.dma_read_addr, g_drv.dma_read_size, DMA_FROM_DEVICE);
#endif        
        g_drv.dma_read_addr = 0;
        g_drv.dma_read_size = 0;
    }

For every dma_map_single() and dma_unmap_single() call, the appropriate cache-invalidate or cache-flush operation is performed, based on the DMA direction (DMA_FROM_DEVICE / DMA_TO_DEVICE), to maintain cache coherency.
Apart from these cache operations, which take CPU cycles, there is a possibility of heavy cache misses (pollution) because of the frequent cache operations. This matters especially in the web-browsing use case, where WebKit caches a lot of the data that makes up a given web page, and cache misses are costly. Of course, this does come with a cost: in my case, I was doing a memcpy between the DMA buffers (uncached) and my internal buffer (a cached skb buffer), and thereby taking a hit.

From the internal tests I did, the following are the performance numbers I gathered.

HTTP throughput:

                              Uncached DMA buffer    Cached DMA buffer
  Throughput                  18.67 Mb/sec           17.1 Mb/sec

Average CPU utilization (%):

                              Uncached DMA buffer    Cached DMA buffer
  HTTP throughput test
  (8 MB file download)        User (50) + Sys (8)    User (63) + Sys (8)
  Static CNN web page load    User (25) + Sys (33)   User (50) + Sys (31)


[*Apart from the reduction in user and system CPU utilization, the WLAN workqueue's CPU utilization dropped (by around 10%) because the dma_map_single() and dma_unmap_single() calls were eliminated, but at the same time the SDIO workqueue's utilization increased (by around 6%) because of the memcpy from uncached to cached buffers.]

After a bit of literature survey, I came across an interesting research paper [http://quning.org/self/pku_cache.pdf] worth reading. It talks about the advantages of using uncached buffers in certain use cases:
(1)   It makes the DMA buffers uncached within an Ethernet driver (similar to what I did in my experiment). The conclusion is that this helps significantly in increasing TCP Tx throughput, and to some extent TCP Rx throughput.
(2)   It takes (1) one step further by also making the skb buffers (which hold the actual TCP packet data) uncached. I didn't try this.
(3)   It talks about pushing the TLB page table into an uncached memory region.

Friday, December 24, 2010

Game Theory and my personal dilemma

Recently I had the pleasure of applying game theory to one of the problems I worked on. Initially I was skeptical about the validity of the theory and felt it merely gave an explanation of an already-occurring social event (how individuals react to a situation). Later I realized the beauty of the theory in explaining why people adopt a specific strategy in a given situation. After all, we humans want to maximize our profit, or at least get as much as our opponent gets, in the game of life. In the movie "A Beautiful Mind", when the beautiful blonde girl comes to the party along with her four friends, the best strategy for John Nash's four friends is to approach her four friends rather than the blonde girl. There is a high probability of the blonde girl denying company to all four of them, as well as of her four friends being put off by the fact that all four approached the blonde girl in their group. [In a lighter vein, of course, once Nash's four friends have kept her four friends company, Nash always stands to gain by keeping the blonde girl company.]

The essence of game theory is captured by the famous Prisoner's Dilemma (PD) experiment. Two suspects (A and B) committed a crime and were arrested. Interrogated separately, each may confess (C) or deny (D); these are their "strategies." Payoffs are what they get by adopting a particular strategy: if both deny, they have the freedom of not going to jail for the main crime, while by confessing, a suspect can escape prison and even be rewarded for turning approver.

The "best" outcome of the PD is for both of them to deny: (D, D) is Pareto optimal (the sum of payoffs is maximized). Pareto optimality implies that it is not possible to make somebody better off without making somebody else worse off. This is the whole idea behind countries spending on defense and amassing weapons: it creates in the other country a fear of waging war. [As a kid I used to wonder why countries spend such a large percentage of GDP on defense. Now I partially know why; see my conclusion on my personal dilemma.]

The equilibrium outcome of the game need not be the Pareto-optimal outcome. In the PD there is always an incentive for a player to deviate from (D, D); in fact, choosing C always gets you a better payoff than choosing D (the black sheep, the one who turns approver, escapes the prison sentence). We say the players have a dominant strategy, and clearly they will use it in equilibrium. A Nash equilibrium (NE) is, by definition, a pair of strategies (SA, SB) such that if A sticks to his NE strategy SA, B has no profitable deviation (i.e., B will do no better using any other strategy), and the same holds for A. The NE of the PD is (C, C): no unilateral profitable deviation from the Nash equilibrium exists.

Now suppose the game at hand is the twice-repeated PD. What does a strategy look like? What about the strategy in the ten-times-repeated PD? In the finitely repeated PD, consider the last period: what would you do if you were A? Now go to the next-to-last period: what would you do? Conclusion: there is no collusion in the finitely repeated PD. Equilibrium strategies are history-independent here.

What if the game is repeated an infinite (or random) number of times? It can be shown that if players put enough weight on future streams of income, collusion may be enforced. In the PD, both players would be better off if they played (D, D) in each period; they choose to play C because of the opponent's incentive to "cheat" by playing C. Mr. A offers the following "plan" to Mr. B: he will start by playing D in the first round and will continue to play D as long as Mr. B plays D; if a deviation is detected, Mr. A will play C forever. This is a "grim" trigger strategy (any deviation from D triggers "punishment" of the opponent, forever!). This rarely occurs in reality, since as humans we are always suspicious of the opponent's commitment. It partially explains why Walmart and Kroger have different pricing strategies and why, fortunately, there is no collusion in their pricing. There is always an incentive to collude and set a fixed price, but this cannot last long in an infinitely repeated game, because there is always an incentive for, say, Walmart to lower the price and attract more customers. So both sellers are better off not colluding, and thus we shoppers are protected.

What about mixed strategies in games of chance like poker, or in the simple coin-toss experiment? These are zero-sum games: one person's gain is another person's loss. The only way to avoid being exploited in such a game is to randomize your strategy: choose H and T with equal probability, with no bias.

Having explained game theory, I have always felt it has some shortcomings, particularly in explaining one critical event in Indian history (my personal dilemma): the strategy adopted by Mahatma Gandhi in the freedom struggle. His strategy of non-violence, pitted against violence, proved to be a dominant strategy. By game theory, the best reply to violence is violence of a greater degree; yet by adopting a strategy of non-violence he was able to achieve what India badly needed: freedom.

Anyway, start analyzing how you react to situations with game theory in mind (a postmortem analysis). You will be surprised to discover your dominant strategy in a given situation.