Chapter 13. Optimizing Application Delivery

High-performance browser networking relies on a host of networking technologies (Figure 13-1), and the overall performance of our applications is determined by the sum of their parts.

We cannot control the network weather between the client and server, nor the client’s hardware or device configuration, but the rest is in our hands: TCP and TLS optimizations on the server, and dozens of application optimizations to account for the peculiarities of the different physical layers, the versions of the HTTP protocol in use, as well as general application best practices. Granted, getting it all right is not an easy task, but it is a rewarding one! Let’s pull it all together.

Figure 13-1. Optimization layers for web application delivery

The physical properties of the communication channel set hard performance limits on every application: the speed of light and the distance between client and server dictate the propagation latency, and the choice of medium (wired vs. wireless) determines the processing, transmission, queuing, and other delays incurred by each data packet. In fact, the performance of most web applications is limited by latency, not bandwidth, and while bandwidth speeds will continue to increase, unfortunately the same can’t be said for latency.

As a result, while we cannot make the bits travel any faster, it is crucial that we apply all the possible optimizations at the transport and application layers to eliminate unnecessary roundtrips and requests, and to minimize the distance traveled by each packet—i.e., position the servers closer to the client.

Every application can benefit from optimizing for the unique properties of the physical layer in wireless networks, where latencies are high and bandwidth is always at a premium. At the API layer, the differences between the wired and wireless networks are entirely transparent, but ignoring them is a recipe for poor performance. Simple optimizations in how and when we schedule resource downloads, beacons, and the rest can translate into significant improvements in experienced latency, battery life, and the overall user experience of our applications.

Moving up the stack from the physical layer, we must ensure that each and every server is configured to use the latest TCP and TLS best practices. Optimizing the underlying protocols ensures that each client is able to get the best performance—high throughput and low latency—when communicating with the server.

Finally, we arrive at the application layer. By all accounts and measures, HTTP is an incredibly successful protocol. After all, it is the common language between billions of clients and servers, enabling the modern Web. However, it is also an imperfect protocol, which means that we must take special care in how we architect our applications.

The secret to a successful and sustainable web performance strategy is simple: measure first, link business goals to performance metrics, apply optimizations, lather, rinse, and repeat. Developing and investing in appropriate measurement tools and application metrics is a top priority; see “Synthetic and Real-User Performance Measurement”.

Evergreen Performance Best Practices

Regardless of the type of network or the type or version of the networking protocols in use, all applications should always seek to eliminate or reduce unnecessary network latency and minimize the amount of transferred bytes. These two criteria are the evergreen performance best practices that serve as the foundation for dozens of familiar performance rules:

Reduce DNS lookups
Every hostname resolution requires a network roundtrip, which blocks the request and adds latency while the lookup is in progress.
Reuse TCP connections
Leverage connection keepalive whenever possible to eliminate the TCP handshake and slow-start latency overhead; see “Slow-Start”.
Minimize number of HTTP redirects
HTTP redirects can be extremely costly, especially when they redirect the client to a different hostname, which results in additional DNS lookup, TCP handshake latency, and so on. The optimal number of redirects is zero.
Use a Content Delivery Network (CDN)
Locating the data geographically closer to the client can significantly reduce the network latency of every TCP connection and improve throughput. This advice applies both to static and dynamic content; see “Uncached Origin Fetch”.
Eliminate unnecessary resources
No request is faster than a request not made.

By this point, all of these recommendations should require no explanation: latency is the bottleneck, and the fastest byte is a byte not sent. However, HTTP also provides a number of additional mechanisms, such as caching and compression, as well as its own set of version-specific performance quirks:

Cache resources on the client
Application resources should be cached to avoid re-requesting the same bytes each time the resources are required.
Compress assets during transfer
Application resources should be transferred with the minimum number of bytes: always apply the best compression method for each transferred asset.
Eliminate unnecessary request bytes
Reducing the transferred HTTP header data (e.g., HTTP cookies) can save entire roundtrips of network latency.
Parallelize request and response processing
Request and response queuing latency, both on the client and the server, often goes unnoticed, yet it contributes significant and unnecessary delays.
Apply protocol-specific optimizations
HTTP 1.x offers limited parallelism, which requires that we bundle resources, split delivery across domains, and more. By contrast, HTTP 2.0 performs best when a single connection is used and HTTP 1.x-specific optimizations are removed.

Each of these warrants closer examination. Let’s dive in.

Cache Resources on the Client

The fastest network request is a request not made. Maintaining a cache of previously downloaded data allows the client to use a local copy of the resource, thereby eliminating the request. For resources delivered over HTTP, make sure the appropriate cache headers are in place:

  • Cache-Control header can specify the cache lifetime (max-age) of the resource.
  • Last-Modified and ETag headers provide validation mechanisms.

Whenever possible, you should specify an explicit cache lifetime for each resource, which allows the client to use a local copy, instead of re-requesting the same object all the time. Similarly, specify a validation mechanism to allow the client to check if the expired resource has been updated: if the resource has not changed, we can eliminate the data transfer.

Finally, note that you need to specify both the cache lifetime and the validation method! A common mistake is to provide only one of the two, which results in either redundant transfers of resources that have not changed (i.e., missing validation), or redundant validation checks each time the resource is used (i.e., missing cache lifetime).
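
As a concrete illustration, here is a minimal sketch of a handler that sets both a cache lifetime and a validator. The asset name, one-day max-age, and use of Go’s net/http are illustrative assumptions, not a prescription:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "net/http"
        "os"
    )

    func serveAsset(w http.ResponseWriter, r *http.Request) {
        body, err := os.ReadFile("app.css") // illustrative asset
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }

        // Cache lifetime: the client may reuse its local copy for one day.
        w.Header().Set("Cache-Control", "public, max-age=86400")

        // Validator: a content-derived ETag lets the client revalidate an
        // expired copy with If-None-Match instead of re-downloading it.
        etag := fmt.Sprintf("\"%x\"", sha256.Sum256(body))
        w.Header().Set("ETag", etag)

        if r.Header.Get("If-None-Match") == etag {
            w.WriteHeader(http.StatusNotModified) // no body: transfer eliminated
            return
        }
        w.Header().Set("Content-Type", "text/css")
        w.Write(body)
    }

    func main() {
        http.HandleFunc("/app.css", serveAsset)
        http.ListenAndServe(":8080", nil)
    }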

Compress Transferred Data

Leveraging a local cache allows the client to avoid fetching duplicate content on each request. However, if and when the resource must be fetched, either because it has expired, it is new, or it cannot be cached, then it should be transferred with the minimum number of bytes. Always apply the best compression method for each asset.
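
As a hedged sketch of what that looks like in practice (in most deployments the web server or CDN handles this, and the Go middleware below is purely illustrative), text responses can be gzipped only when the client advertises support:

    package main

    import (
        "compress/gzip"
        "net/http"
        "strings"
    )

    // gzipWriter routes the response body through the gzip stream.
    type gzipWriter struct {
        http.ResponseWriter
        gz *gzip.Writer
    }

    func (g gzipWriter) Write(b []byte) (int, error) { return g.gz.Write(b) }

    // gzipHandler compresses the response only when the client advertises
    // gzip support via the Accept-Encoding request header.
    func gzipHandler(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
                next.ServeHTTP(w, r) // client cannot decode gzip; send plain bytes
                return
            }
            w.Header().Set("Content-Encoding", "gzip")
            w.Header().Set("Vary", "Accept-Encoding") // keep shared caches correct
            gz := gzip.NewWriter(w)
            defer gz.Close()
            next.ServeHTTP(gzipWriter{ResponseWriter: w, gz: gz}, r)
        })
    }

    func main() {
        page := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/html")
            w.Write([]byte("<html><body>compress me</body></html>"))
        })
        http.ListenAndServe(":8080", gzipHandler(page))
    }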

The size of text-based assets, such as HTML, CSS, and JavaScript, can be reduced by 60%–80% on average when compressed with Gzip. Images, on the other hand, require a more nuanced consideration:

  • Images account for over half the transferred bytes of an average page.
  • Image files can be made smaller by eliminating unnecessary metadata.
  • Images should be resized on the server to avoid shipping unnecessary bytes.
  • An optimal image format should be chosen based on type of image.
  • Lossy compression should be used whenever possible.

Different image formats can yield dramatically different compression ratios on the same image file, because different formats are optimized for different use cases. In fact, picking the wrong image format (e.g., using PNG for a photo instead of JPEG or WebP) can easily translate into hundreds and even thousands of unnecessary kilobytes of transferred data. Invest in tools and automation to help determine the optimal format!

Once the right image format is selected, ensure that the image dimensions are no larger than they need to be. Resizing an oversized image on the client increases the CPU, GPU, and memory requirements (see “Calculating Image Memory Requirements”), in addition to unnecessarily increasing the transfer size.

Finally, with the right format and image dimensions in place, investigate using a lossy image format, such as JPEG or WebP, with various compression levels: higher compression can yield significant byte savings with minimal or no perceptible change in image quality, especially on smaller (mobile) screens.

Eliminate Unnecessary Request Bytes

HTTP is a stateless protocol, which means that the server is not required to retain any information about the client between different requests. However, many applications require state for session management, personalization, analytics, and more. To enable this functionality, the HTTP State Management Mechanism (RFC 6265) extension allows any website to associate and update "cookie" metadata for its origin: the provided data is saved by the browser and is then automatically appended to every request to the origin within the Cookie header.

The standard does not specify a maximum limit on the size of a cookie, but in practice most browsers enforce a 4 KB limit. However, the standard also allows the site to associate many cookies per origin. As a result, it is possible to associate tens of kilobytes of arbitrary metadata, split across multiple cookies, for each origin! Needless to say, this can have significant performance implications for your application:

  • Associated cookie data is automatically sent by the browser on each request.
  • In HTTP 1.x, all HTTP headers, including cookies, are transferred uncompressed.
  • In HTTP 2.0, compression is applied, but the potential overhead is still high.
  • In the worst case, large HTTP cookies can add entire roundtrips of network latency by exceeding the initial TCP congestion window.

When using HTTP 1.x, a common best practice is to designate a dedicated "cookie-free" origin, which can be used to deliver static resources that do not require client-specific cookie data.
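
Keeping cookie growth in check is easier when it is measured. As a purely diagnostic sketch (the 1 KB threshold, the logging behavior, and the Go middleware shape are arbitrary choices for illustration), the following flags requests that carry heavyweight Cookie headers:

    package main

    import (
        "log"
        "net/http"
    )

    // warnLargeCookies logs requests whose Cookie header carries enough data
    // to add meaningful overhead on every single request to the origin.
    func warnLargeCookies(next http.Handler) http.Handler {
        const threshold = 1024 // illustrative: flag anything over ~1 KB of cookies
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            size := len(r.Header.Get("Cookie"))
            if size > threshold {
                log.Printf("large Cookie header (%d bytes) on %s", size, r.URL.Path)
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello"))
        })
        http.ListenAndServe(":8080", warnLargeCookies(app))
    }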

Parallelize Request and Response Processing

In order to achieve the fastest response times within your application, all resource requests should be dispatched as soon as possible. However, another important point to consider is how these requests, and their respective responses, will be processed on the server. After all, if all of our requests are serially queued by the server, then we are once again incurring unnecessary latency. Here’s how to get the best performance:

  • Use connection keepalive and upgrade from HTTP 1.0 to HTTP 1.1.
  • Leverage multiple HTTP 1.1 connections where necessary for parallel downloads.
  • Leverage HTTP 1.1 pipelining whenever possible.
  • Investigate upgrading to HTTP 2.0 to improve performance.
  • Ensure that the server has sufficient resources to process requests in parallel.

Without connection keepalive, a new TCP connection is required for each HTTP request, which incurs significant overhead due to the TCP handshake and slow-start. For best results, use HTTP 1.1, and reuse existing TCP connections whenever possible. Then, on rare occasions where HTTP pipelining can be used, do so, or even better, consider upgrading to HTTP 2.0 to get the best performance.
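
On the client side, connection reuse is mostly a matter of configuring, and not defeating, the HTTP library’s connection pool. A hedged Go sketch, with illustrative limits and a placeholder URL:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    func main() {
        // A shared Transport keeps idle TCP connections open, so subsequent
        // requests skip the TCP (and TLS) handshake and slow-start penalty.
        client := &http.Client{
            Transport: &http.Transport{
                MaxIdleConns:        100,              // total idle connections to keep
                MaxIdleConnsPerHost: 6,                // idle connections kept per origin
                IdleConnTimeout:     90 * time.Second, // how long an idle connection lives
            },
        }

        // Both requests reuse the same connection when the server honors
        // keepalive (the default in HTTP 1.1); draining the body is required
        // for the connection to be returned to the pool.
        for i := 0; i < 2; i++ {
            resp, err := client.Get("https://example.com/")
            if err != nil {
                fmt.Println(err)
                return
            }
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
        }
    }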

Identifying the sources of unnecessary client and server latency is both an art and a science: examine the client resource waterfall (see “Analyzing the Resource Waterfall”), as well as your server logs. Common pitfalls often include the following:

  • Underprovisioned servers, forcing unnecessary processing latency.
  • Underprovisioned proxy and load-balancer capacity, forcing delayed delivery of the request (queuing latency) to the application server.
  • Blocking resources on the client forcing delayed construction of the page; see “DOM, CSSOM, and JavaScript”.

Optimizing for HTTP 1.x

The order in which we optimize HTTP 1.x deployments is important: configure servers to deliver the best possible TCP and TLS performance, then carefully review and apply mobile and evergreen application best practices: measure, iterate.

With the evergreen optimizations in place, and with good performance instrumentation within the application, evaluate whether the application can benefit from applying HTTP 1.x specific optimizations (read, workarounds):

Leverage HTTP pipelining
If your application controls both the client and the server, then pipelining can help eliminate significant amounts of network latency.
Apply domain sharding
If your application performance is limited by the default six connections per origin limit, consider splitting resources across multiple origins.
Bundle resources to reduce HTTP requests
Techniques such as concatenation and spriting can both help minimize the protocol overhead and deliver pipelining-like performance benefits.
Inline small resources
Consider embedding small resources directly into the parent document to minimize the number of requests; a minimal inlining sketch follows this list.
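
For example, a small Go sketch of the inlining workaround (the file name and MIME type are placeholders); note that an inlined resource can no longer be cached independently of its parent document:

    package main

    import (
        "encoding/base64"
        "fmt"
        "os"
    )

    // dataURI embeds a small asset directly into markup, trading independent
    // cacheability for one fewer HTTP 1.x request.
    func dataURI(path, mimeType string) (string, error) {
        b, err := os.ReadFile(path)
        if err != nil {
            return "", err
        }
        encoded := base64.StdEncoding.EncodeToString(b)
        return fmt.Sprintf("data:%s;base64,%s", mimeType, encoded), nil
    }

    func main() {
        uri, err := dataURI("icon.png", "image/png")
        if err != nil {
            fmt.Println(err)
            return
        }
        // Emit the asset inline instead of referencing it by URL.
        fmt.Printf("<img src=%q alt=\"icon\">\n", uri)
    }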

Pipelining has limited support, and each of the remaining optimizations comes with its own set of benefits and trade-offs. In fact, it is often overlooked that each of these optimizations can hurt performance when applied too aggressively or incorrectly; review Chapter 11 for an in-depth discussion. Be pragmatic, instrument your application, measure impact carefully, and iterate. Distrust any one-size-fits-all advice.

And one last thing…consider upgrading to HTTP 2.0, as it eliminates the need for most of the HTTP 1.x-specific optimizations previously outlined! Not only will your application load faster with HTTP 2.0, but it will also be simpler and easier to work with.

Optimizing for HTTP 2.0

The primary focus of HTTP 2.0 is on improving transport performance and enabling lower latency and higher throughput between the client and server. Not surprisingly, getting the best possible performance out of TCP and TLS, as well as eliminating other unnecessary network latency, has never been as important. At a minimum:

  • Server should start with a TCP cwnd of 10 segments.
  • Server should support TLS with ALPN negotiation (NPN for SPDY).
  • Server should support TLS resumption to minimize handshake latency.

In short, review “Optimizing for TCP” and “Optimizing for TLS”. Getting the best performance out of HTTP 2.0, especially in light of the one-connection-per-origin recommendation, requires a well-tuned network stack.
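
As a minimal sketch of the TLS side of that checklist (certificate paths are placeholders, and the congestion window is an operating-system setting rather than something configured in application code), a Go server might advertise HTTP 2.0 via ALPN like so:

    package main

    import (
        "crypto/tls"
        "net/http"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("served over " + r.Proto))
        })

        srv := &http.Server{
            Addr:    ":443",
            Handler: mux,
            TLSConfig: &tls.Config{
                // Advertise HTTP/2 first, with HTTP/1.1 over TLS as the fallback.
                NextProtos: []string{"h2", "http/1.1"},
                // Session resumption (session tickets) is on by default; do not
                // set SessionTicketsDisabled unless you have a reason to.
            },
        }

        // Note: the initial TCP congestion window (cwnd of 10 segments) is
        // tuned in the kernel, not here.
        srv.ListenAndServeTLS("cert.pem", "key.pem")
    }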

Next up—surprise—apply the mobile and other evergreen application best practices: send fewer bytes, eliminate requests, and adapt resource scheduling for wireless networks. Reducing the amount of data transferred and eliminating unnecessary network latency are the best optimizations you can do for any application, web or native, regardless of the version of the transport protocol.

Finally, undo and unlearn the bad habits of domain sharding, concatenation, and image spriting; these workarounds are no longer required with HTTP 2.0. In fact, they will hurt performance rather than help! Instead, we can now rely on built-in multiplexing and new features such as server push.

Removing 1.x Optimizations

The optimization strategy for HTTP 2.0 diverges significantly from HTTP 1.x. Instead of having to worry about the various limitations of the HTTP 1.x protocol, we can now undo many of the previously necessary workarounds:

Use a single connection per origin
HTTP 2.0 improves performance by maximizing throughput of a single TCP connection. In fact, use of multiple connections (e.g., domain sharding) is a performance anti-pattern for HTTP 2.0, as it reduces the effectiveness of header compression and request prioritization provided by the protocol.
Remove unnecessary concatenation and image spriting
Resource bundling has many downsides, such as expensive cache invalidations, larger memory requirements, deferred execution, and increased application complexity. With HTTP 2.0, many small resources can be multiplexed in parallel, which means that the downsides of asset bundling will almost always outweigh its benefits; delivering more granular resources is the better default.
Leverage server push
The majority of resources that were previously inlined with HTTP 1.x can and should be delivered via server push. By doing so, each resource can be cached individually by the client and reused across different pages, instead of being embedded in each and every page.

For best performance, consolidate as many resources as possible on the same origin: domain sharding is a performance anti-pattern for HTTP 2.0 and will hurt the performance of the protocol, so eliminating it is a critical first step. From there, a more gradual migration can take place. Bundled assets do not affect the performance of the HTTP 2.0 protocol itself, but they can have a negative impact on cache performance and execution speed.

For a reminder of the costs of concatenation and spriting, see “Concatenation and Spriting” and “Calculating Image Memory Requirements”.

Similarly, inlined resources can be replaced with server push to further improve cache performance on the client, without incurring any extra network latency; see “Implementing HTTP 2.0 server push”. In fact, the use of server push may offer the most benefits for mobile clients, due to the high cost of network roundtrips on 3G and 4G networks.
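
As a sketch of what this looks like with a server that exposes a push API (Go’s http.Pusher interface is used here, and the pushed paths are illustrative):

    package main

    import "net/http"

    func index(w http.ResponseWriter, r *http.Request) {
        // Over HTTP/2, the ResponseWriter also implements http.Pusher; push
        // critical sub-resources before writing the page that references them.
        if pusher, ok := w.(http.Pusher); ok {
            // Push failures are safe to ignore: the client will simply
            // request the resources normally.
            pusher.Push("/styles.css", nil)
            pusher.Push("/app.js", nil)
        }
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><head>
    <link rel="stylesheet" href="/styles.css">
    <script src="/app.js"></script>
    </head><body>Hello</body></html>`))
    }

    func main() {
        http.HandleFunc("/", index)
        // HTTP/2 (and therefore push) requires TLS with the standard server.
        http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil)
    }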

Dual-Protocol Application Strategies

Unfortunately, the upgrade to HTTP 2.0 won’t happen overnight. As a result, many applications will have to carefully consider the trade-offs of dual-protocol deployment strategies: the same application code can be delivered over HTTP 1.x and HTTP 2.0, without any modifications. However, aggressive optimization for HTTP 1.x can hurt HTTP 2.0 performance and vice versa.

If the application controls both the server and the client, then it is in a position to dictate the protocol in use—that’s the simplest case. Most applications do not and cannot control the client and will have to use a hybrid or an automated strategy to accommodate both versions of the protocol. Let’s evaluate some options:

Same application code, dual-protocol deployment
The same application code can be delivered over HTTP 1.x and HTTP 2.0. As a result, you may not get the best performance out of either, but it may be the most pragmatic way to get good enough performance on both, where "good enough" should be carefully measured with respect to each individual application. With this strategy, a good first step is to eliminate domain sharding to enable efficient HTTP 2.0 delivery. From there, as more users migrate toward HTTP 2.0, you can also undo the resource bundling techniques and start to leverage server push where possible.
Split application code, dual-protocol deployment
Different versions of the application can be delivered based on the version of the protocol in use. This increases operational complexity but in practice may be a reasonable strategy for many applications—e.g., an edge server responsible for terminating the connection can direct the client request to an appropriate server based on the version of negotiated protocol.
Dynamic HTTP 1.x and HTTP 2.0 optimization
Some automated web optimization frameworks, and open source and commercial products, can perform dynamic rewriting (concatenation, spriting, sharding, and so on) of the delivered application code when the request is served. In that case, the server could also take into account the negotiated version of the protocol and dynamically apply the appropriate optimization strategy.
HTTP 2.0, single-protocol deployment
If the application controls both the server and the client, then there is no reason why HTTP 2.0 cannot be used exclusively. In fact, if such an option is available, then this should be the default strategy.

The route you choose will depend on the current infrastructure, the complexity of the application, and the demographics of your users. Ironically, it is the applications that have invested the most effort into HTTP 1.x optimization that will have the hardest time managing this migration. Alternatively, if you control the client, have an automated application optimization solution in place, or are not using any 1.x-specific optimizations in your existing application, then you can safely bet on HTTP 2.0 and not look back.
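
As a rough illustration of the split and dynamic strategies above (the asset names are hypothetical and the branch is deliberately simplistic), a front end can key its optimization strategy off the negotiated protocol version:

    package main

    import "net/http"

    func index(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/html")
        if r.ProtoMajor >= 2 {
            // HTTP 2.0 client: ship granular, individually cacheable resources.
            w.Write([]byte(`<link rel="stylesheet" href="/css/base.css">
    <link rel="stylesheet" href="/css/grid.css">`))
            return
        }
        // HTTP 1.x client: fall back to the concatenated bundle to limit requests.
        w.Write([]byte(`<link rel="stylesheet" href="/css/bundle.css">`))
    }

    func main() {
        http.HandleFunc("/", index)
        http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil)
    }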

Translating 1.x to 2.0 and Back

In addition to thinking about a dual-protocol application optimization strategy, many existing deployments may need an intermediate path for their application servers: an end-to-end HTTP 2.0 stack is the end goal for best performance, but a translation layer (Figure 13-2) can enable existing 1.x servers to take advantage of HTTP 2.0 as well.

Figure 13-2. HTTP 2.0 to 1.x translation: streams converted to 1.x requests

An intermediate server can accept an HTTP 2.0 session, process it, and dispatch 1.x formatted requests to existing infrastructure. Then, once it receives the response, it can convert it back to HTTP 2.0 and respond to the client. In many cases this is the simplest way to get started with HTTP 2.0, as it allows us to reuse our existing 1.x infrastructure with minimal or zero modification.

Most web servers with HTTP 2.0 support provide a 2.0 to 1.x translation mechanism by default: the 2.0 session is terminated at the server (e.g., Apache, Nginx), and if the server is configured as a reverse proxy, then 1.x requests are dispatched to individual application servers.
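
A minimal Go sketch of such a translation front end (the backend address and certificate paths are placeholders): terminate TLS and HTTP 2.0 at the edge, and forward plain HTTP 1.1 requests to the existing application server.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Existing HTTP 1.x application server, reachable only internally.
        backend, err := url.Parse("http://127.0.0.1:8080")
        if err != nil {
            log.Fatal(err)
        }

        // The reverse proxy speaks HTTP 1.1 to the backend regardless of the
        // protocol negotiated with the client.
        proxy := httputil.NewSingleHostReverseProxy(backend)

        // ListenAndServeTLS enables HTTP/2 via ALPN automatically, so clients
        // get a multiplexed 2.0 session while the backend stays unchanged.
        log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", proxy))
    }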

However, the 2.0 to 1.x convenience path should not be mistaken for a good long-term strategy; in many ways, this workflow is exactly backward. Instead of converting an optimized, multiplexed session into a series of 1.x requests, and thereby deoptimizing the session within our own infrastructure, we should be doing the opposite: converting inbound 1.x client requests to 2.0 streams, and standardizing our application infrastructure to speak 2.0 in all cases.

To get the best performance, and to enable the low-latency and real-time Web, we should demand that our internal infrastructure meet the following criteria:

  • Load balancer and proxy connections to application servers should be persistent.
  • Request and response streaming and multiplexing should be the default.
  • Communication with application servers should be message-oriented.
  • Communication between clients and application servers should be bidirectional.

An end-to-end HTTP 2.0 session meets all of these criteria and enables low latency delivery to the client, as well as within our own data centers: there is no longer a need for custom RPC layers and mechanisms to communicate between internal services to get the desired performance. In short, don’t downgrade 2.0 to 1.x; that’s not a good long-term strategy. Instead upgrade 1.x to 2.0 to get the best performance.

Evaluating Server Quality and Performance

The quality of implementation of the HTTP 2.0 server will have a significant impact on the performance of the client. A well-tuned HTTP server has always been important, but the performance benefits of prioritization, server push, and multiplexing are all closely tied to the quality of the implemented logic in the server:

  • HTTP 2.0 server must understand stream priorities.
  • HTTP 2.0 server must prioritize response processing and delivery.
  • HTTP 2.0 server must support server push.
  • HTTP 2.0 server should provide different push strategy implementations.

A naive implementation of an HTTP 2.0 server may "speak" the protocol, but without explicit awareness of request priorities and server push, it will deliver suboptimal performance: for example, it may saturate the bandwidth by sending large, static image files while the client is blocked on other critical resources, such as CSS and JavaScript.

To get the best possible performance, an HTTP 2.0 client has to be "optimistic": it should send all requests as soon as possible and defer to the server to optimize delivery. Hence, the performance of an HTTP 2.0 client is even more dependent on the server than before.
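
For example, a hedged client sketch in Go (the URLs are placeholders): every request is dispatched immediately, and over HTTP 2.0 they are multiplexed onto a single connection, leaving delivery order to the server’s prioritization logic.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    func main() {
        urls := []string{
            "https://example.com/styles.css",
            "https://example.com/app.js",
            "https://example.com/hero.jpg",
        }

        client := &http.Client{} // the default transport negotiates HTTP/2 over TLS
        var wg sync.WaitGroup

        // Send every request as early as possible; the server decides how to
        // interleave the responses based on stream priorities.
        for _, u := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                resp, err := client.Get(u)
                if err != nil {
                    fmt.Println(err)
                    return
                }
                defer resp.Body.Close()
                n, _ := io.Copy(io.Discard, resp.Body)
                fmt.Printf("%s: %d bytes over %s\n", u, n, resp.Proto)
            }(u)
        }
        wg.Wait()
    }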

Similarly, different servers may offer different mechanisms and strategies for leveraging server push; see “Implementing HTTP 2.0 server push”. It is no exaggeration to say that the performance of your application will be closely tied to the quality of your HTTP 2.0 server.

Given the fast-evolving nature of HTTP 2.0 and SPDY, different server implementations (Apache, Nginx, Jetty, etc.) are all at different stages in their HTTP 2.0 implementations. Check the appropriate documentation and release notes for supported features and latest news.

Speaking 2.0 with and without TLS

In practice, due to many incompatible intermediaries, early HTTP 2.0 deployments will have to be delivered over an encrypted channel, which leaves us with two options of where the ALPN negotiation and TLS termination can occur:

  • The TLS connection can be terminated at the HTTP 2.0 server.
  • The TLS connection can be terminated upstream (e.g., load balancer).

The first case requires that the HTTP 2.0 server is able to handle TLS, but otherwise is simple. The second case is far more interesting: the TLS+ALPN handshake can be terminated by an upstream proxy (Figure 13-3), at which point another encrypted tunnel can be established, or unencrypted HTTP 2.0 frames can be sent directly to the server.

Figure 13-3. TLS+ALPN aware load-balancer

The choice of using a secure or an unencrypted tunnel for communication between the proxy and the application server is up to the application: as long as we control the internal infrastructure, we can guarantee that the unencrypted frames won’t be modified or dropped. As a result, while most HTTP 2.0 servers should support TLS+ALPN negotiation, they should also be able to speak HTTP 2.0 without encryption.

Further, a smart load balancer can also use the TLS+ALPN negotiation mechanism to selectively route the different clients to different servers, based on the version of the negotiated protocol!

HAProxy, a popular open source load balancer, supports both NPN negotiation and routing based on the negotiated protocol. For a hands-on look, see "Simple SPDY and NPN Negotiation with HAProxy".
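
The same routing idea can be sketched in a few dozen lines of Go (backend addresses and certificate paths are placeholders, and the backends are assumed to accept unencrypted 1.x or 2.0 traffic from inside our own network):

    package main

    import (
        "crypto/tls"
        "io"
        "log"
        "net"
    )

    func main() {
        cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem")
        if err != nil {
            log.Fatal(err)
        }
        cfg := &tls.Config{
            Certificates: []tls.Certificate{cert},
            NextProtos:   []string{"h2", "http/1.1"}, // advertised via ALPN
        }

        ln, err := tls.Listen("tcp", ":443", cfg)
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            go route(conn.(*tls.Conn))
        }
    }

    // route terminates TLS, inspects the ALPN result, and relays the decrypted
    // byte stream to a protocol-appropriate backend pool.
    func route(c *tls.Conn) {
        defer c.Close()
        if err := c.Handshake(); err != nil {
            return
        }

        backend := "10.0.0.20:8080" // HTTP 1.x pool
        if c.ConnectionState().NegotiatedProtocol == "h2" {
            backend = "10.0.0.10:8080" // HTTP 2.0 (cleartext) pool
        }

        up, err := net.Dial("tcp", backend)
        if err != nil {
            return
        }
        defer up.Close()

        go io.Copy(up, c) // client -> backend
        io.Copy(c, up)    // backend -> client
    }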

Load Balancers, Proxies, and Application Servers

Depending on the existing infrastructure in place, as well as the complexity and scale of the application, your deployment may need one or more load balancers (Figure 13-4) or HTTP 2.0-aware proxies.

Figure 13-4. Load balancing and TLS termination strategies

In the simplest case, the HTTP 2.0 server is accessible directly by the client and is responsible for terminating the TLS connection, performing the ALPN negotiation, and servicing all inbound requests.

However, a single server is insufficient for larger applications, which require that we introduce a load balancer to split the inbound traffic. In this case, the load balancer could terminate the TLS connection (see preceding section), or it can be configured as a TCP proxy and pass the encrypted data directly to the application server.

Many cloud providers offer HTTP and TCP load balancers as a service. However, while most support TLS termination, they may not provide ALPN negotiation, which is a requirement for HTTP 2.0 over TLS. In these cases, the load balancer should be configured as a TCP proxy: pass the encrypted data to the application server and let it perform the TLS+ALPN negotiation.

In practice, the most important questions to answer are which component of your infrastructure will terminate the TLS connection and whether it is capable of performing the necessary ALPN negotiation:

  • To enable HTTP 2.0 over TLS, the termination server must support ALPN.
  • Terminate TLS as close to the user as possible; see “Early Termination”.
  • If ALPN support is unavailable, then use TCP load-balancing mode; a minimal pass-through sketch follows this list.
  • If ALPN support is unavailable and TCP load balancing is not possible, then you have to fall back to HTTP Upgrade flow over an unencrypted channel; see “Efficient HTTP 2.0 Upgrade and Discovery”.
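
Finally, a minimal sketch of the TCP pass-through mode (addresses are placeholders): the balancer relays encrypted bytes untouched, and the application server behind it performs TLS termination and ALPN negotiation itself.

    package main

    import (
        "io"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":443")
        if err != nil {
            log.Fatal(err)
        }
        for {
            client, err := ln.Accept()
            if err != nil {
                continue
            }
            go func(client net.Conn) {
                defer client.Close()
                // The TLS record stream is forwarded as-is; this process never
                // sees plaintext and needs no certificates or ALPN support.
                server, err := net.Dial("tcp", "10.0.0.10:443")
                if err != nil {
                    return
                }
                defer server.Close()
                go io.Copy(server, client) // client -> application server
                io.Copy(client, server)    // application server -> client
            }(client)
        }
    }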