Chapter 13. Optimizing Application Delivery

High-performance browser networking relies on a host of networking technologies (Figure 13-1), and the overall performance of our applications is the sum of their parts.

We cannot control the network weather between the client and server, nor the client’s hardware or device configuration, but the rest is in our hands: TCP and TLS optimizations on the server, and dozens of application optimizations to account for the peculiarities of the different physical layers, the versions of the HTTP protocol in use, and general application best practices. Granted, getting it all right is not an easy task, but it is a rewarding one! Let’s pull it all together.

Figure 13-1. Optimization layers for web application delivery

Optimizing Physical and Transport Layers

The physical properties of the communication channel set hard performance limits on every application: the speed of light and the distance between client and server dictate the propagation latency, and the choice of medium (wired vs. wireless) determines the processing, transmission, queuing, and other delays incurred by each data packet. In fact, the performance of most web applications is limited by latency, not bandwidth, and while bandwidth speeds will continue to increase, unfortunately the same can’t be said for latency.

As a result, while we cannot make the bits travel any faster, it is crucial that we apply all possible optimizations at the transport and application layers: eliminate unnecessary roundtrips and requests, and minimize the distance traveled by each packet by positioning servers closer to the client.

Every application can benefit from optimizing for the unique properties of the physical layer in wireless networks, where latencies are high and bandwidth is always at a premium. At the API layer, the differences between wired and wireless networks are entirely transparent, but ignoring them is a recipe for poor performance. Simple optimizations in how and when we schedule resource downloads, beacons, and the rest can translate into significant improvements in latency, battery life, and the overall user experience of our applications.

Moving up the stack from the physical layer, we must ensure that each and every server is configured to use the latest TCP and TLS best practices. Optimizing the underlying protocols ensures that each client is able to get the best performance, high throughput and low latency, when communicating with the server.

Finally, we arrive at the application layer. By all accounts and measures, HTTP is an incredibly successful protocol. After all, it is the common language between billions of clients and servers, enabling the modern Web. However, it is also an imperfect protocol, which means that we must take special care in how we architect our applications:

  • We must work around the limitations of HTTP/1.x.
  • We must learn how to leverage new performance enhancements in HTTP/2.
  • We must be vigilant about applying the evergreen performance best practices.

The secret to a successful and sustainable web performance strategy is simple: measure first, link business goals to performance metrics, apply optimizations, lather, rinse, and repeat. Developing and investing in appropriate measurement tools and application metrics is a top priority; see “Synthetic and Real-User Performance Measurement”.

Evergreen Performance Best Practices

Regardless of the type of network or the type or version of the networking protocols in use, all applications should always seek to eliminate or reduce unnecessary network latency and minimize the amount of transferred bytes. These two simple rules are the foundation for all of the evergreen performance best practices:

Reduce DNS lookups
Every hostname resolution requires a network roundtrip, imposing latency on the request and blocking the request while the lookup is in progress.
Reuse TCP connections
Leverage connection keepalive whenever possible to eliminate the TCP handshake and slow-start latency overhead; see “Slow-Start”.
Minimize number of HTTP redirects
HTTP redirects can be extremely costly, especially when they redirect the client to a different hostname, which results in additional DNS lookup, TCP handshake latency, and so on. The optimal number of redirects is zero.
Use a Content Delivery Network (CDN)
Locating the data geographically closer to the client can significantly reduce the network latency of every TCP connection and improve throughput. This advice applies both to static and dynamic content; see “Uncached Origin Fetch”.
Eliminate unnecessary resources
No request is faster than a request not made.

By this point, all of these recommendations should require no explanation: latency is the bottleneck, and the fastest byte is a byte not sent. However, HTTP provides a number of additional mechanisms, such as caching and compression, as well as its own set of version-specific performance quirks:

Cache resources on the client
Application resources should be cached to avoid re-requesting the same bytes each time the resources are required.
Compress assets during transfer
Application resources should be transferred with the minimum number of bytes: always apply the best compression method for each transferred asset.
Eliminate unnecessary request bytes
Reducing the transferred HTTP header data (e.g., HTTP cookies) can save entire roundtrips of network latency.
Parallelize request and response processing
Request and response queuing latency, both on the client and the server, often goes unnoticed, yet it adds significant and unnecessary delays.
Apply protocol-specific optimizations
HTTP/1.x offers limited parallelism, which requires that we bundle resources, split delivery across domains, and more. By contrast, HTTP/2 performs best when a single connection is used and HTTP/1.x specific optimizations are removed.

Each of these warrants closer examination. Let’s dive in.

Cache Resources on the Client

The fastest network request is a request not made. Maintaining a cache of previously downloaded data allows the client to use a local copy of the resource, thereby eliminating the request. For resources delivered over HTTP, make sure the appropriate cache headers are in place:

  • Cache-Control header can specify the cache lifetime (max-age) of the resource.
  • Last-Modified and ETag headers provide validation mechanisms.

Whenever possible, you should specify an explicit cache lifetime for each resource, which allows the client to use a local copy, instead of re-requesting the same object all the time. Similarly, specify a validation mechanism to allow the client to check if the expired resource has been updated: if the resource has not changed, we can eliminate the data transfer.

Finally, note that you need to specify both the cache lifetime and the validation method! A common mistake is to provide only one of the two, which results in either redundant transfers of resources that have not changed (i.e., missing validation), or redundant validation checks each time the resource is used (i.e., missing cache lifetime).
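
To make this concrete, here is a minimal sketch in Go (the asset path and handler are hypothetical) that sets both a cache lifetime and a validation token, and answers conditional revalidation requests with 304 Not Modified:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "net/http"
        "os"
    )

    // serveAsset delivers a stylesheet with both a cache lifetime
    // (Cache-Control: max-age) and a validation token (ETag).
    func serveAsset(w http.ResponseWriter, r *http.Request) {
        body, err := os.ReadFile("assets/app.css") // hypothetical asset path
        if err != nil {
            http.NotFound(w, r)
            return
        }

        sum := sha256.Sum256(body)
        etag := `"` + hex.EncodeToString(sum[:8]) + `"`

        // Cache lifetime: the client may reuse its local copy for a day
        // without contacting the server at all.
        w.Header().Set("Cache-Control", "public, max-age=86400")
        // Validation: once the copy expires, the client revalidates with
        // If-None-Match; if the ETag still matches, we skip the transfer.
        w.Header().Set("ETag", etag)

        if r.Header.Get("If-None-Match") == etag {
            w.WriteHeader(http.StatusNotModified)
            return
        }
        w.Header().Set("Content-Type", "text/css")
        w.Write(body)
    }

    func main() {
        http.HandleFunc("/app.css", serveAsset)
        http.ListenAndServe(":8080", nil)
    }

With both headers in place, a warm client skips the request entirely until max-age expires, and afterward pays only the cost of a header exchange as long as the ETag still matches.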

For hands-on advice on optimizing your caching strategy, see the "HTTP caching" section on Google’s Web Fundamentals: http://hpbn.co/wf-caching.

Compress Transferred Data

Leveraging a local cache allows the client to avoid fetching duplicate content on each request. However, if and when the resource must be fetched, either because it has expired, it is new, or it cannot be cached, then it should be transferred with the minimum number of bytes. Always apply the best compression method for each asset.
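
In practice, compression is usually a one-line server or CDN configuration flag; the simplified Go middleware below is only a sketch of the underlying content negotiation (production code must also handle Content-Length, already-compressed types, and flushing):

    package main

    import (
        "compress/gzip"
        "io"
        "net/http"
        "strings"
    )

    // gzipWriter routes the handler's output through a gzip compressor.
    type gzipWriter struct {
        http.ResponseWriter
        zw io.Writer
    }

    func (g gzipWriter) Write(p []byte) (int, error) { return g.zw.Write(p) }

    // withGzip compresses responses for clients that advertise gzip
    // support; other clients receive the identity encoding unchanged.
    func withGzip(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
                next.ServeHTTP(w, r)
                return
            }
            w.Header().Set("Content-Encoding", "gzip")
            w.Header().Add("Vary", "Accept-Encoding") // keep shared caches correct
            zw := gzip.NewWriter(w)
            defer zw.Close()
            next.ServeHTTP(gzipWriter{ResponseWriter: w, zw: zw}, r)
        })
    }

    func main() {
        http.Handle("/hello", withGzip(http.HandlerFunc(
            func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/plain")
                io.WriteString(w, "hello, compressed world")
            })))
        http.ListenAndServe(":8080", nil)
    }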

The size of text-based assets, such as HTML, CSS, and JavaScript, can be reduced by 60%–80% on average when compressed with Gzip. Images, on the other hand, require a more nuanced consideration:

  • Images account for over half the transferred bytes of an average page.
  • Image files can be made smaller by eliminating unnecessary metadata.
  • Images should be resized on the server to avoid shipping unnecessary bytes.
  • An optimal image format should be chosen based on type of image.
  • Lossy compression should be used whenever possible.

Different image formats can yield dramatically different compression ratios on the same image file, because different formats are optimized for different use cases. In fact, picking the wrong image format (e.g., using PNG for a photo instead of JPEG) can easily translate into hundreds or even thousands of unnecessary kilobytes of transferred data. Invest in tools and automation to help determine the optimal format!

Once the right image format is selected, ensure that the dimensions of each image are no larger than they need to be. Resizing an oversized image on the client negatively impacts CPU, GPU, and memory requirements (see “Calculating Image Memory Requirements”), in addition to unnecessarily increasing the transfer size.

Finally, with the right format and image dimensions in place, investigate using a lossy image format, such as JPEG or WebP, with various compression levels: higher compression can yield significant byte savings with minimal or no perceptible change in image quality, especially on smaller (mobile) screens.

For hands-on advice on reducing the transfer size of text, image, webfont, and other resources, see the "Optimizing Content Efficiency" section on Google’s Web Fundamentals: http://hpbn.co/wf-compression.

Eliminate Unnecessary Request Bytes

HTTP is a stateless protocol, which means that the server is not required to retain any information about the client between different requests. However, many applications require state for session management, personalization, analytics, and more. To enable this functionality, the HTTP State Management Mechanism (RFC 6265) extension allows any website to associate and update "cookie" metadata for its origin: the provided data is saved by the browser and is then automatically appended onto every request to the origin within the Cookie header.

The standard does not specify a maximum limit on the size of a cookie, but in practice most browsers enforce a 4 KB limit. However, the standard also allows the site to associate many cookies per origin. As a result, it is possible to associate tens of kilobytes of arbitrary metadata, split across multiple cookies, for each origin!

Pay close attention to the cookie overhead of your analytics scripts and other trackers. It is not uncommon for these requests to carry kilobytes of cookie metadata on each request, which adds up quickly.

Needless to say, this can have significant performance implications for your application. Associated cookie data is automatically sent by the browser on each request, which, in the worst case, can add entire roundtrips of network latency by exceeding the initial TCP congestion window, regardless of whether HTTP/1.x or HTTP/2 is used:

  • In HTTP/1.x, all HTTP headers, including cookies, are transferred uncompressed on each request.
  • In HTTP/2, headers are compressed with HPACK, but at a minimum the cookie value is transferred on the first request, which will affect the performance of your initial page load.

When using HTTP/1.x, a common best practice is to designate a dedicated "cookie-free" origin, which is used to deliver static resources that do not rely on cookie metadata.
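
As a starting point for auditing, a monitoring sketch along these lines (in Go, with a purely illustrative 1 KB threshold) can flag routes that attract oversized cookies:

    package main

    import (
        "log"
        "net/http"
    )

    // warnLargeCookies flags requests whose Cookie header is large enough
    // to risk pushing the request beyond the initial TCP congestion
    // window. The 1 KB limit is illustrative, not a standard value.
    func warnLargeCookies(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if n := len(r.Header.Get("Cookie")); n > 1024 {
                log.Printf("large Cookie header (%d bytes) on %s", n, r.URL.Path)
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", warnLargeCookies(mux)))
    }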

Parallelize Request and Response Processing

In order to achieve the fastest response times within your application, all resource requests should be dispatched as soon as possible. However, it is also important to consider how these requests, and their respective responses, will be processed on the server. After all, if all of our requests are serially queued by the server, we are once again incurring unnecessary latency. Here’s how to get the best performance:

  • Upgrade to HTTP/2 to enable multiplexing and best performance.
  • Use multiple HTTP/1.1 connections where necessary for parallel downloads.
  • Re-use TCP connections between requests by optimizing connection keepalive timeouts.
  • Ensure that the server has sufficient resources to process requests in parallel.

Without connection keepalive, a new TCP connection is required for each HTTP request, which incurs significant overhead due to the TCP handshake and slow-start. To get the best performance, use HTTP/2, which will allow the client and server to re-use the same TCP connection for all requests. For HTTP/1.x, you will need multiple TCP connections for request parallelism. In both cases, make sure to optimize your server and proxy timeouts to minimize the costly TCP connection overhead.
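
For illustration, on a Go server these knobs map onto a handful of http.Server fields; the values below are illustrative starting points, not recommendations:

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func main() {
        srv := &http.Server{
            Addr: ":443",
            // Keep idle connections open long enough for clients to re-use
            // them, amortizing the TCP handshake and slow-start costs.
            IdleTimeout: 90 * time.Second, // illustrative; tune to your traffic
            // Bound slow clients so they cannot hold server resources
            // indefinitely and inflate queuing latency for other requests.
            ReadHeaderTimeout: 10 * time.Second,
        }
        // With TLS enabled, Go's net/http negotiates HTTP/2 via ALPN
        // automatically and multiplexes requests over one connection;
        // HTTP/1.1 clients fall back to parallel keepalive connections.
        log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }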

Identifying the sources of unnecessary client and server latency is both an art and a science: examine the client resource waterfall (see “Analyzing the Resource Waterfall”), as well as your server logs. Common pitfalls often include the following:

  • Underprovisioned servers, forcing unnecessary processing latency.
  • Underprovisioned proxy and load-balancer capacity, forcing delayed delivery of the request (queuing latency) to the application server.
  • Blocking resources on the client forcing delayed construction of the page; see “DOM, CSSOM, and JavaScript”.

Optimizing for HTTP/1.x

The order in which we optimize HTTP/1.x deployments is important: configure servers to deliver the best possible TCP and TLS performance, then carefully review and apply mobile and evergreen application best practices: measure, iterate.

With the evergreen optimizations in place, and with good performance instrumentation within the application, evaluate whether the application can benefit from applying HTTP/1.x specific optimizations (read, protocol workarounds):

Leverage HTTP pipelining
If your application controls both the client and the server, then pipelining can help eliminate significant amounts of network latency; see “HTTP Pipelining”.
Apply domain sharding
If your application performance is limited by the default six connections per origin limit, consider splitting resources across multiple origins; see “Domain Sharding”.
Bundle resources to reduce HTTP requests
Techniques such as concatenation and spriting can both help minimize the protocol overhead and deliver pipelining-like performance benefits; see “Concatenation and Spriting”.
Inline small resources
Consider embedding small resources directly into the parent document to minimize the number of requests; see “Resource Inlining”.

Pipelining has limited support, and each of the remaining optimizations comes with its own set of benefits and trade-offs. In fact, it is often overlooked that each of these techniques can hurt performance when applied aggressively or incorrectly; review Chapter 11 for an in-depth discussion. Be pragmatic: instrument your application, measure impact carefully, and iterate.

HTTP/2 eliminates the need for all of the above HTTP/1.x workarounds, making our applications both simpler and more performant. Which is to say, the best optimization for HTTP/1.x is to deploy HTTP/2.

Optimizing for HTTP/2

The primary focus of HTTP/2 is on improving transport performance, enabling lower latency and higher throughput between the client and server. Not surprisingly, getting the best possible performance out of TCP and TLS, as well as eliminating other unnecessary network latency, has never been more important. At a minimum:

  • Server should start with a TCP cwnd of 10 segments.
  • Server should deliver 1-RTT TLS handshakes for new and resumed connections.
  • Server must support ALPN to negotiate HTTP/2 support.

Review “Optimizing for TCP” and “Optimizing for TLS” for an in-depth discussion of optimizing the transport layer. Getting the best performance out of HTTP/2, especially in light of the one-connection-per-origin model, requires a well-tuned network stack.
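
The ALPN requirement is easy to verify from a client. A quick sketch in Go (substitute your own origin for example.com):

    package main

    import (
        "crypto/tls"
        "fmt"
        "log"
    )

    func main() {
        // Offer h2 (HTTP/2) and http/1.1 via ALPN and report which
        // protocol the server selects; "h2" confirms HTTP/2 support.
        conn, err := tls.Dial("tcp", "example.com:443", &tls.Config{
            NextProtos: []string{"h2", "http/1.1"},
        })
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        fmt.Println("negotiated:", conn.ConnectionState().NegotiatedProtocol)
    }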

Next up—surprise—apply the mobile and other evergreen application best practices: send fewer bytes, eliminate requests, and adapt resource scheduling for wireless networks. Reducing the amount of data transferred and eliminating unnecessary network latency are the best optimizations for any application, web or native, regardless of the version of the transport and application protocols.

Finally, undo and unlearn the bad habits of domain sharding, concatenation, and image spriting. With HTTP/2 we are no longer constrained by limited parallelism: requests are cheap, and both requests and responses can be multiplexed efficiently. These workarounds are no longer necessary and, even better, omitting them should improve performance.

Eliminate Domain Sharding

HTTP/2 achieves the best performance by multiplexing requests over the same TCP connection, which enables effective request and response prioritization, flow control, and header compression. As a result, the optimal number of connections per origin is exactly one and domain sharding is an anti-pattern.

There are several strategies for eliminating sharding in an HTTP/1.x-friendly way. First, the server can inspect the ALPN-negotiated protocol and deliver alternate HTML markup: sharded asset references for legacy HTTP/1.x clients, and same-origin asset references for HTTP/2 clients. Alternatively, HTTP/2 connection coalescing allows the server to return the same markup to everyone and defer the selection of the optimal connection strategy to the user agent. Specifically, when HTTP/2 is in use, the user agent will re-use the same TCP connection for distinct origins if:

  • The origins are covered by the same TLS certificate, e.g., a wildcard certificate, or a certificate whose "Subject Alternative Names" cover both hostnames.
  • The origins resolve to the same server IP address.
  • The existing TCP connection to the origin uses HTTP/2.

For example, suppose example.com is served with a wildcard certificate valid for all of its subdomains (e.g., *.example.com) and references an asset on static.example.com, which, in turn, resolves to the same server IP. If HTTP/2 is in use, the client will re-use the same TCP connection to fetch resources from static.example.com; otherwise, it will open and use multiple TCP connections. As a result, we get the best of both worlds: connection re-use for HTTP/2 and sharding for HTTP/1.x, without any additional modifications in our applications.
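
Both coalescing conditions can be checked from a client as well. A sketch in Go, using the hostnames from the example above:

    package main

    import (
        "crypto/tls"
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Inspect the certificate served by example.com and check the two
        // coalescing conditions for static.example.com (hypothetical hosts).
        conn, err := tls.Dial("tcp", "example.com:443", nil)
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        cert := conn.ConnectionState().PeerCertificates[0]
        fmt.Println("certificate covers:", cert.DNSNames)
        fmt.Println("valid for static.example.com:",
            cert.VerifyHostname("static.example.com") == nil)

        // The second condition: both names must resolve to the same address.
        a, _ := net.LookupHost("example.com")
        b, _ := net.LookupHost("static.example.com")
        fmt.Println("resolved IPs:", a, b)
    }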

Due to third-party dependencies it may not be possible to fetch all resources over the same connection; that’s OK. Seek to minimize the number of origins in use regardless of the protocol, and when HTTP/2 is in use, also eliminate sharding to get the best performance.

Minimize Concatenation and Image Spriting

Bundling multiple assets into a single resource download was a critical optimization for HTTP/1.x, where limited parallelism and high protocol overhead typically outweighed all other concerns; see “Concatenation and Spriting”. However, with HTTP/2, multiplexing is no longer an issue, and header compression dramatically reduces the metadata overhead of each HTTP request. As a result, we need to reconsider the use of concatenation and spriting in light of their new pros and cons:

  • Bundled resources may result in unnecessary data transfers: the user might not need all the assets on a particular page, or at all.
  • Bundled resources may result in expensive cache invalidations: a single updated byte results in a full fetch of the entire bundle.
  • Bundled resources may delay execution: many content-types cannot be processed and applied until the entire file is transferred.
  • Bundled resources may require additional infrastructure at build or delivery time to generate the bundles.
  • Bundled resources may provide better compression if the resources are similar.

In practice, while HTTP/1.x allowed us to provide granular resources with optimized caching policies for each one, its limited parallelism forced us to bundle resources together. This, in turn, introduced more frequent cache invalidations that transferred a lot of redundant data (data that could have been served from cache had it been stored as a separate resource), and led to increased transfer times and costs and delayed execution of the assets.

With HTTP/2 these protocol limitations are no longer a concern, which means that our application can deliver granular resources without worrying about delayed downloads and request overhead. Further, each resource can define its own caching policy (expiry time and revalidation token) and be individually updated, which improves cache re-use and reduces download times and costs for the user visiting the site.

That said, concatenation is still a valid optimization strategy in some cases. For example, files that contain similar data may achieve a much higher compression ratio when bundled, which can help reduce the total transfer size. However, these savings should be balanced against the criteria discussed above: cache re-use, delayed execution, and others.
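
To quantify this trade-off for your own assets, a small measurement sketch in Go (the file names are hypothetical) compares the compressed size of two similar files delivered separately versus as one bundle:

    package main

    import (
        "bytes"
        "compress/gzip"
        "fmt"
        "os"
    )

    // gzipLen returns the compressed size of p at maximum compression.
    func gzipLen(p []byte) int {
        var buf bytes.Buffer
        zw, _ := gzip.NewWriterLevel(&buf, gzip.BestCompression)
        zw.Write(p)
        zw.Close()
        return buf.Len()
    }

    func main() {
        // Hypothetical assets: two scripts with overlapping content.
        a, _ := os.ReadFile("a.js")
        b, _ := os.ReadFile("b.js")

        separate := gzipLen(a) + gzipLen(b)
        bundled := gzipLen(append(a, b...))
        fmt.Printf("separate: %d bytes, bundled: %d bytes\n", separate, bundled)
    }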

For applications with many assets, the overhead of issuing many I/O requests may be another optimization criterion to consider: issuing many individual requests adds overhead compared with reading one large file. These costs are architecture- and browser-specific, so measure first and optimize accordingly.

Eliminate Roundtrips with Server Push

Server push is a powerful new feature of HTTP/2 that enables the server to send multiple responses for a single client request. That said, recall that the use of resource inlining (e.g., embedding an image into an HTML page via a data URI) is, in fact, a form of application-layer server push. As such, while this is not an entirely new capability for web developers, HTTP/2 server push offers significant performance benefits over inlining: pushed resources can be cached individually, reused across pages, canceled by the client, and more; see “Server Push”.

With HTTP/2 there is no longer a reason to inline resources just because they are small; we are no longer constrained by parallelism, and request overhead is very low. As a result, server push acts as a latency optimization that removes a full request-response roundtrip between the client and server. For example, if, after sending a particular response, we know that the client will always come back and request a specific subresource, we can eliminate that roundtrip by pushing the subresource to the client.

If the client does not support server push, or disables it, the browser simply falls back to initiating a request for the required resource as usual; nothing breaks.

Critical resources that block page construction and rendering (see “DOM, CSSOM, and JavaScript”) are prime candidates for the use of server push, as they are often known, or can be specified upfront, and can yield significant performance benefits. Eliminating a full roundtrip from the critical path can yield savings of tens to hundreds of milliseconds, especially for users on mobile networks where latencies are often both high and highly variable.
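
For example, Go’s net/http exposes push through the http.Pusher interface; the handler below (certificate paths and asset names are placeholders) pushes a render-blocking stylesheet ahead of the HTML response:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func handleIndex(w http.ResponseWriter, r *http.Request) {
        // If the connection is HTTP/2, push the render-blocking stylesheet
        // before the HTML response, eliminating one full roundtrip.
        if pusher, ok := w.(http.Pusher); ok {
            if err := pusher.Push("/app.css", nil); err != nil {
                log.Printf("push failed: %v", err) // e.g., client disabled push
            }
        }
        w.Header().Set("Content-Type", "text/html")
        fmt.Fprint(w, `<html><head><link rel="stylesheet" href="/app.css"></head></html>`)
    }

    func main() {
        http.HandleFunc("/", handleIndex)
        http.HandleFunc("/app.css", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/css")
            fmt.Fprint(w, "body { margin: 0; }")
        })
        // Server push requires HTTP/2, which net/http enables over TLS.
        log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil))
    }

Note the error check: if the connection does not support push, or the client has disabled it, the Push call fails and the page still loads through a regular request.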

With that in mind, let’s review the properties and best-practices for using server push:

  • Server push, as its name implies, is initiated by the server. However, the client can control how and where it is used by indicating to the server the maximum number of pushed streams that can be initiated in parallel, as well as the amount of data that can be sent on each stream before it is acknowledged by the client. This allows the client to limit, or outright disable, the use of server push; e.g., if the user is on an expensive network and wants to minimize the number of transferred bytes, they may be willing to disable the latency optimization in favor of explicit control over what is fetched.
  • Server push is subject to same-origin restrictions; the server initiating the push must be authoritative for the content and is not allowed to push arbitrary third-party content to the client. This is yet another reason why you should eliminate domain sharding and consolidate resources under the same origin for best performance.
  • Server push responses are processed in the same way as responses received in reply to browser-initiated requests, i.e., they can be cached and reused across multiple pages and navigations! Leverage this to avoid duplicating the same content across different pages (a downside of inlining), and to help minimize the number of transferred bytes.

Note that even the most naive server push strategy, one that pushes assets regardless of their caching policy, is, in effect, equivalent to inlining: the resource is duplicated on each page and transferred each time the parent resource is requested. However, even then, server push offers important performance benefits: the pushed response can be prioritized more effectively, it affords more control to the client, and it provides an upgrade path toward smarter strategies that leverage caching and other mechanisms to eliminate redundant transfers. In short, if your application is using inlining, you should consider replacing it with server push.

Test HTTP/2 Server Quality

A naive implementation of an HTTP/2 server, or proxy, may "speak" the protocol, but without well-implemented support for features such as request prioritization, flow control, and server push, it can easily yield less than optimal performance; e.g., it might saturate the client’s bandwidth by sending large, static image files while the client is blocked on more critical resources, such as HTML, CSS, or JavaScript.

Of course, a well-tuned HTTP server has always been important, but with HTTP/2 the client cedes much of its control over resource delivery to the server. To get the best performance, an HTTP/2 client has to be "optimistic": it annotates requests with priority data and dispatches them to the server as soon as possible, and it then relies on the server to use the communicated dependencies and weights to optimize the delivery of each response.

The performance of an HTTP/2 client is highly dependent on the implementation quality of the server. Implementing the HTTP/2 framing layer on the server is not sufficient; do your due diligence to ensure that your infrastructure supports all the necessary performance primitives.