Handling failures#

This blog is part of a series on robust system design. Check the other posts that address scalability, failures, latency and integration.

Introduction#

A large system consists of an end-user application (mobile, web) calling an exposed API. This API will translate into a number of internal and external API calls. We are considering the synchronous part of the system only: the services that are directly used by the API call.

sequenceDiagram
    participant APP as Application
    participant API as API Gateway
    participant S1 as Service 1
    participant S2 as Service 2
    participant S3 as Service 3
    APP->>API: POST /op1
    API->>S1: POST /op1
    S1->>S2: GET /info
    S2-->>S1: OK
    S1->>S3: POST /op2
    S3-->>S1: OK
    S1-->>API: OK
    API-->>APP: OK

Failures#

Services fail in multiple ways. Examples of failures:

The caller is not authorized to perform the request.
The body of the call is not compliant with the API. Some fields may be missing or wrong.
A service crashed and is not responding.
A connectivity issue occurred between two components and the response was never received.

Failures can be categorized as follows:

Permanent failures where retrying an operation will not change the outcome.
The reason can range from a bad request formatting to insufficient privileges. There is no point in retrying these failures, and they should just be sent back to the original caller. In a REST API, most of these errors are in the 4xx range.
Temporary failures where retrying an operation may succeed.
Some reasons of temporary failures are: service or machine crashes and load shedding (a circuit-breaker protecting the system by refusing load). In a REST API, most of these errors are in the 5xx range.

Timeouts are a kind of temporary failure where the outcome of the operation is unknown.
The system may have executed part or all of the operation before timing out. If a timeout happens between services, a 4xx error code is sometimes returned.

sequenceDiagram
  participant A as Client
  participant B as Service
  A-xB: POST /api/dosomething
  Note right of A:Request is lost, client times out.

sequenceDiagram
  participant A as Client
  participant B as Service
  A->>B: POST /api/dosomething
  B-xA:OK
  Note right of A:Request is executed<br/>Response is lost, client times out.

The API error codes should be documented clearly as there is no standard way of knowing if an error is temporary or permanent: 429 Too many requests and 504 Gateway Timeout should be retried while 401 Unauthorized should not.

This article explains HTTP error codes and where they should be produced. The latest HTTP Semantics specification RFC 9110 discusses how to handle errors in detail.

Once we have identified temporary failures and timeouts, the question is how and where to handle them.

Operation execution semantics#

Before deciding on how to handle failures, some definitions are required.

at-least-once: An operation is guaranteed to be executed. This is usually done with queuing systems where an operation is queued and attempted multiple times. The operation may be executed multiple times. This behavior is not desirable: a message may be sent to a person multiple times (a minor annoyance) or a funds transfer may be executed multiple times (not acceptable).
exactly-once: As the name implies, this is the ideal behavior. An operation is guaranteed to be executed exactly once no matter what fails. However, it is practically impossible to achieve. For example, Confluence, claim here that Kafka can achieve exactly-once semantics but then they add this disclaimer: “Note that exactly-once semantics is guaranteed within the scope of Kafka Streams’ internal processing only”
at-most-once: An operation is guaranteed to be executed at most once no matter how many times it is called. This is also called idempotency and it is the desired execution semantic across complex systems.

Idempotent Operations#

Operations can be executed using RPC type REST/gRPC calls, SQL Statements or object methods. This table shows the correspondence between these domains and which one is inherently idempotent. HTTP POST requests can be interpreted as either a new object creation (POST /api/employee) or an operation on an existing object (POST /api/employee/{id}/disable)

HTTP	SQL	Object	Idempotent
GET	SELECT	Get	Yes, no state change
POST	INSERT	New	No, create a new object
POST	CALL	Method call	No, side effects
PUT	UPSERT	Set	Yes
DELETE	DELETE	Delete	Yes

Idempotency implementation#

In order to implement idempotency in a service, all recent idempotency identifiers need to be stored along with the result. Upon reception of a new state-changing request, the database is checked for an existing result and simply returned if it is there. Otherwise, the request is executed, the result is stored and then returned.

flowchart LR
  S((Start)) --> A[Receive Request<br/>with ReqID]
  A --> B[Fetch object<br>with id=ReqID]
  B --> C{Existing<br/>ReqID?}
  C --> |Yes| D[Return object]
  D --> E((End))
  C --> |No| F[Execute Request<br>Store object]
  F --> D

Here is a sample exchange with a lost response and a client retry.

sequenceDiagram
    participant A as Client
    participant B as Service
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    Note over B:Fetch Object->Not Found
    Note over B:Execute Request<br/>Store object
    B--xA:OK
    Note over A:Client times out<br/>Resends request
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    Note over B:Fetch Object->Found<br/>Return object
    B-->>A:OK

Cascading services#

A single client API call may involve multiple services. All the cascaded API calls should also be idempotent. As In some cases, external services have a specific idempotence identifier that is not compatible with the one the client generated. In this case, a caching mechanism should be put in place to map between the two identifiers.

sequenceDiagram
    participant C as Client
    participant A as Service A
    participant B as Service B
    C->>A: POST /api/dosomething<br/>ReqID=XYZ
    Note over A:Fetch Object->Not Found
    Note over A:Generate ReqID for B<br/>Create object, State=InProgress, BreqID=ABC
    A->>B: POST /service-b/dosomething<br/>ReqID=ABC
    Note over B:Execute Request<br/>Store object
    B--xA: OK
    Note over C:Client times out<br/>Resends request
    C->>A: POST /api/dosomething<br/>ReqID=XYZ
    Note over A:Fetch object->Found in progress
    A->>B: POST /service-b/dosomething<br/>ReqID=ABC
    Note over B:Fetch object<br/>Return
    B-->>A:OK
    Note over A:Update object<br/>Return object
    A-->>C:OK

Timeouts#

In a synchronous system like the one discussed here, having sensible timeout values is critical. If the timeout value is too small, the operation might be executed but the result never received. The client will retry the operation this adding to the system load.

sequenceDiagram
    participant A as Client
    participant B as Service
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    Note over B:Fetch Object->Not Found
    Note over B:Execute Request<br/>Store object
    Note over A:Client times out<br/>Resends request
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    B--xA:OK<br/>Initial reply, dropped
    Note over B:Fetch Object->Found<br/>Return object
    B-->>A:OK

If the timeout is too large, resources may be tied for a long time slowing the system down. This slowdown manifests itself with very low system utilization as all sessions are just waiting for the timeout to expire.

sequenceDiagram
    participant A as Client
    participant B as Service
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    Note over B:Fetch Object->Not Found
    Note over B:Execute Request<br/>Crash
    Note over A:Client times out<br/>Resends request
    A->>B: POST /api/dosomething<br/>ReqID=XYZ
    Note over B:Fetch Object->Not Found
    Note over B:Execute Request<br/>Store object
    B-->>A:OK

So the timeout should be set to the lowest possible value to catch an actual crash but no less.

In addition, any system is usually configured with a maximum number of concurrent requests. The more requests are tied with a service timing out, the less are available to other requests. This will cause the system to refuse extra traffic although it has enough capacity to process it.

Suppose

N is the number concurrent requests
r is the request rate in requests/second
T is the timeout in seconds

$$ N = r T $$

As N is fixed, if T is high, the actual served requests r will be low.

To take a numerical example, if N = 5000 concurrent requests, T = 30 seconds, then: $$ r = N / T = 5000/30 = 167\ rps$$ If we are able to decrease the timeout to 10 seconds, the throughput becomes: $$ r = N / T = 5000/10 = 500\ rps$$

Timeouts for cascading requests#

The rule above will apply to cascading services with each timeout being slightly bigger that that of the downstream service. So if we have this scenario:

flowchart LR
  A[Application] --> API[API Gateway]
  API            --> S1[Service 1]
  S1             --> S2[Service 2]
  S2             --> S3[Service 3]

The timeouts should be the minimum such that:

$$ T_{Application} > T_{API} > T_{Service1} > T_{Service2} > T_{Service3} $$

Conclusion#

Failures across all components are bound to happen, the key is to handle them in a robust way. Follow these rules to maximize performance, correctness, and availability while handling failures:

H1. Do not perform retries if an API call fails. Let the final caller perform the retry.

H2. Make all requests idempotent using client-generated request identifiers.

H3. Ensure timeouts are the smallest possible values and in decreasing order from caller to final callee.