diff --git a/src/en/clean-copy/03-[Work in Progress] Section II. The API Patterns/09.md b/src/en/clean-copy/03-[Work in Progress] Section II. The API Patterns/09.md index 41b8814..1bb6e99 100644 --- a/src/en/clean-copy/03-[Work in Progress] Section II. The API Patterns/09.md +++ b/src/en/clean-copy/03-[Work in Progress] Section II. The API Patterns/09.md @@ -1,8 +1,8 @@ ### [Atomicity of Bulk Changes][api-patterns-atomicity] -Let's return from *webhooks* back to developing direct-call APIs. The `orders/bulk-status-change` endpoint design, described in the previous chapter, raises an interesting question: what should we do if some changes were successfully processed by our backend and some were not? +Let's transition from *webhooks* back to developing direct-call APIs. The design of the `orders/bulk-status-change` endpoint, as described in the previous chapter, raises an interesting question: what should we do if some changes were successfully processed by our backend while others were not? -Let's imagine the partner notifies us about status changes that happened with two orders: +Let's consider a scenario where the partner notifies us about status changes that have occurred for two orders: ``` POST /v1/orders/bulk-status-change @@ -11,7 +11,7 @@ POST /v1/orders/bulk-status-change "order_id": "1", "new_status": "accepted", // Other relevant data, - // let's say, estimated + // such as estimated // preparation time … }, { @@ -24,19 +24,199 @@ POST /v1/orders/bulk-status-change 500 Internal Server Error ``` -The question is how to organize this “umbrella” endpoint (which is in fact a proxy to process a list of nested sub-requests) if changing one of the two orders emits an error. We might propose at least four different options: - * A. Guarantee atomicity and idempotency. If any of the sub-requests is unsuccessful, other changes are not applied as well. - * B. Guarantee idempotency but not atomicity. If some sub-requests fail, repeating the call with the same idempotency key results in no action and leaves the system exactly in the same state (i.e., the unsuccessful calls will never be carried out, even if all the obstacles were removed, until a new call with a new idempotency key is performed). - * C. Guarantee neither idempotency nor atomicity and process the sub-requests fully independently. - * D. Do not guarantee atomicity and forbid retries completely by requiring passing the actual resource revision (see the “[Synchronization Strategies](#api-patterns-sync-strategies)” chapter). +In this case, if changing the status of one order results in an error, how should we organize this “umbrella” endpoint (which acts as a proxy to process a list of nested sub-requests)? We can propose at least four different options: -From general considerations, it looks like the first option suits public APIs best: if you can guarantee atomicity (which might be challenging from the scalability point of view), do it. In the first revision of this book, we recommended sticking to this solution unconditionally. + * A. Guarantee atomicity and idempotency. If any of the sub-requests fail, none of the changes are applied. + * B. Guarantee idempotency but not atomicity. If some sub-requests fail, repeating the call with the same idempotency key results in no action and leaves the system exactly in the same state (i.e., unsuccessful calls will never be executed, even if the obstacles are resolved, until a new call with a new idempotency key is made). + * C. 
Guarantee neither idempotency nor atomicity and process the sub-requests independently.
+  * D. Do not guarantee atomicity and completely prohibit retries by requiring the inclusion of the actual resource revision in the request (see the “[Synchronization Strategies](#api-patterns-sync-strategies)” chapter).
 
-However, if we take a look at the situation from the partner's perspective, we learn it is not as straightforward as one might decide at first glance. Let us imagine that the partner implemented the following functionality:
-  1. Partner's backend processes notifications about incoming orders through a *webhook*.
-  2. The backend makes inquiries to coffee shops whether they are ready to fulfill the orders.
-  3. Periodically, let's say once every 10 seconds, the partner collects all the status changes (i.e., all responses from the coffee shops) and calls the `bulk-status-change` endpoint with the list of the changes.
+From a general standpoint, it appears that the first option is most suitable for public APIs: if you can guarantee atomicity (even though it may pose scalability challenges), it is advisable to do so. In the first revision of this book, we unconditionally recommended adhering to this solution.
 
-Imagine that in the third step, the partner got an error from the API endpoint. What would developers do about it? Most probably, one of the following solutions might be realized in the partner's code:
+However, if we consider the situation from the partner's perspective, we realize that the decision is not as straightforward as one might initially think. Let's imagine that the partner has implemented the following functionality:
+  1. The partner's backend processes notifications about incoming orders through a *webhook*.
+  2. The backend makes inquiries to coffee shops regarding whether they can fulfill the orders.
+  3. Periodically, let's say once every 10 seconds, the partner collects all the status changes (i.e., responses from the coffee shops) and calls the `bulk-status-change` endpoint with the list of changes.
 
- 1. Unconditional retry of the request:
\ No newline at end of file
+Now, let's consider a scenario where the partner receives an error from the API endpoint during the third step. What would developers do in such a situation? Most probably, one of the following solutions might be implemented in the partner's code:
+
+ 1. Unconditional retry of the request:
+    ```
+    // Retrieve the ongoing orders
+    const pendingOrders = await api
+      .getPendingOrders();
+    // The partner checks the status
+    // of every order in its system
+    // and prepares the list of
+    // changes to perform
+    const changes =
+      await prepareStatusChanges(
+        pendingOrders
+      );
+
+    let result;
+    let tryNo = 0;
+    let timeout =
+      DEFAULT_RETRY_TIMEOUT;
+    // Repeat until the call
+    // succeeds or the retry
+    // limit is reached
+    while (
+      !result &&
+      tryNo++ < MAX_RETRIES
+    ) {
+      try {
+        // Send the list
+        // of changes
+        result = await api
+          .bulkStatusChange(
+            changes,
+            // Provide the newest
+            // known revision
+            pendingOrders.revision
+          );
+      } catch (e) {
+        // If there is an error,
+        // repeat the request
+        // with some delay
+        logger.error(e);
+        await wait(timeout);
+        timeout = Math.min(
+          timeout * 2,
+          MAX_TIMEOUT
+        );
+      }
+    }
+    ```
+
+    **NB**: in the code sample above, we provide the “right” retry policy with exponentially increasing delays and a total limit on the number of retries, as it should be implemented in SDKs. However, be warned that real partners' code may frequently lack such precautions.
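+
+    Purely as an illustration, here is a minimal sketch of how an SDK might package such a policy into a reusable helper. The `withRetries` name is hypothetical; the `wait` function and the constants are the same assumptions as in the sample above:
+    ```
+    // A hypothetical SDK helper
+    // that retries an asynchronous
+    // action with exponentially
+    // increasing delays
+    async function withRetries(
+      action,
+      maxRetries = MAX_RETRIES,
+      timeout = DEFAULT_RETRY_TIMEOUT
+    ) {
+      let lastError;
+      for (
+        let tryNo = 0;
+        tryNo < maxRetries;
+        tryNo++
+      ) {
+        try {
+          // Return as soon as
+          // the action succeeds
+          return await action();
+        } catch (e) {
+          lastError = e;
+          await wait(timeout);
+          timeout = Math.min(
+            timeout * 2, MAX_TIMEOUT
+          );
+        }
+      }
+      // All attempts are exhausted
+      throw lastError;
+    }
+    ```
+
+    With such a helper, the retry machinery collapses into a single call like `await withRetries(() => api.bulkStatusChange(changes, pendingOrders.revision))`.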
+    For the sake of readability, we will skip this bulky retry construct in the following code samples.
+
+ 2. Retrying only failed sub-requests:
+    ```
+    const pendingOrders = await api
+      .getPendingOrders();
+    let changes =
+      await prepareStatusChanges(
+        pendingOrders
+      );
+
+    let result;
+    while (changes.length) {
+      let failedChanges = [];
+      try {
+        result = await api
+          .bulkStatusChange(
+            changes,
+            pendingOrders.revision
+          );
+      } catch (e) {
+        // Assuming that the
+        // `e.changes` field
+        // contains the errors
+        // breakdown
+        for (
+          let i = 0;
+          i < e.changes.length;
+          i++
+        ) {
+          if (e.changes[i].status ==
+            'failed') {
+            failedChanges.push(
+              changes[i]
+            );
+          }
+        }
+      }
+      // Prepare a new request
+      // comprising only the failed
+      // sub-requests
+      changes = failedChanges;
+    }
+    ```
+
+ 3. Restarting the entire pipeline. In this case, the partner retrieves the list of pending orders anew and forms a new bulk change request:
+    ```
+    // The variable is declared
+    // outside the loop so that
+    // the `while` condition
+    // can access it
+    let pendingOrders;
+    do {
+      pendingOrders = await api
+        .getPendingOrders();
+      const changes =
+        await prepareStatusChanges(
+          pendingOrders
+        );
+      // Request changes,
+      // if there are any
+      if (changes.length) {
+        await api.bulkStatusChange(
+          changes,
+          pendingOrders.revision
+        );
+      }
+    } while (pendingOrders.length);
+    ```
+
+If we examine the possible combinations of client and server implementation options, we will discover that approaches (B) and (D) are incompatible with solution (1): retrying the same request after a partial failure will never succeed, and the client will keep repeating the knowingly failing request until it exhausts the retry limit.
+
+Now, let's introduce another crucial condition to the problem statement: imagine that certain issues with a sub-request cannot be resolved by retrying it; for example, the partner may attempt to confirm an order that has already been canceled by the customer. If a bulk status change request contains such a sub-request, the atomic server that implements paradigm (A) will immediately “penalize” the partner. Regardless of how many times and in what order the set of sub-requests is repeated, *valid sub-requests will never be executed if there is even a single invalid one*. On the other hand, a non-atomic server will at least continue processing the valid parts of bulk requests.
+
+This leads us to a seemingly paradoxical conclusion: in order to ensure the partners' code continues to function *somehow* and to allow them time to address their invalid sub-requests, we should adopt the least strict non-idempotent non-atomic approach to the design of the bulk state change endpoint. However, we consider this conclusion to be incorrect: the “zoo” of possible client and server implementations and the associated problems demonstrate that *bulk state change endpoints are inherently undesirable*. Such endpoints require maintaining an additional layer of logic in both server and client code, and the logic itself is quite non-obvious.
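+
+To illustrate the kind of logic we are talking about, here is a rough sketch of a non-atomic handler producing the per-sub-request error breakdown that the client code in option 2 above relies upon. The `applyChange` function and the exact response format are our assumptions, not a prescribed design:
+
+```
+// A sketch of non-atomic
+// processing: every sub-request
+// is applied independently,
+// and statuses are reported
+// per position in the list
+async function handleBulkChange(
+  changes
+) {
+  const results = [];
+  for (const change of changes) {
+    try {
+      // A hypothetical function
+      // that applies a single
+      // status change
+      await applyChange(change);
+      results.push({
+        status: 'success'
+      });
+    } catch (e) {
+      results.push({
+        status: 'failed',
+        reason: e.message
+      });
+    }
+  }
+  return { changes: results };
+}
+```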
+Moreover, non-atomic non-idempotent bulk state changes will very soon result in nasty issues:
+
+```
+// A partner issues a refund
+// and cancels the order
+POST /v1/bulk-status-change
+{
+  "changes": [{
+    "operation": "refund",
+    "order_id": <order id>
+  }, {
+    "operation": "cancel",
+    "order_id": <order id>
+  }]
+}
+→
+// During bulk change execution,
+// the user was able to walk in
+// and fetch the order
+{
+  "changes": [{
+    // The refund is successful…
+    "status": "success"
+  }, {
+    // …while canceling the order
+    // is not
+    "status": "fail",
+    "reason": "already_served"
+  }]
+}
+```
+
+If sub-operations in the list depend on each other (as in the example above: the partner needs *both* the refund and the cancellation to succeed, as fulfilling only one of them makes no sense) or the execution order is important, non-atomic endpoints will constantly lead to new problems. And if you think there are no such problems in your subject area, it might turn out at any moment that you have overlooked something.
+
+So, our recommendations for bulk modification endpoints are:
+  1. If you can avoid creating such endpoints — do it. In server-to-server integrations, the performance gain is marginal, and in modern networks that support [QUIC](https://datatracker.ietf.org/doc/html/rfc9000) and request multiplexing, it is dubious as well.
+  2. If you cannot, make the endpoint atomic and provide SDKs to help partners avoid typical mistakes.
+  3. If implementing an atomic endpoint is not possible, elaborate on the API design thoroughly, keeping in mind the caveats we discussed.
+
+One of the approaches that helps minimize potential issues is developing a “mixed” endpoint, in which the operations that can affect each other are grouped:
+
+```
+POST /v1/bulk-status-change
+{
+  "changes": [{
+    "order_id": <first order>,
+    // Operations related
+    // to a specific order
+    // are grouped in a single
+    // structure and executed
+    // atomically
+    "operations": [
+      "refund",
+      "cancel"
+    ]
+  }, {
+    // Operation sets for
+    // different orders might
+    // be executed in parallel
+    // and non-atomically
+    "order_id": <second order>,
+    …
+  }]
+}
+```
+
+Let us also stress that nested operations (or sets of operations) must be idempotent per se. If they are not, you need to somehow deterministically generate internal idempotency tokens for each operation. The simplest approach is to consider the internal token equal to the external one, if that is possible within the subject area. Otherwise, you will need to employ some constructed tokens — in our case, let's say, in the `<order_id>:<external_token>` form.
\ No newline at end of file
diff --git a/src/ru/clean-copy/03-[В разработке] Раздел II. Паттерны дизайна API/09.md b/src/ru/clean-copy/03-[В разработке] Раздел II. Паттерны дизайна API/09.md
index 309ade6..ecad2c6 100644
--- a/src/ru/clean-copy/03-[В разработке] Раздел II. Паттерны дизайна API/09.md
+++ b/src/ru/clean-copy/03-[В разработке] Раздел II. Паттерны дизайна API/09.md
@@ -83,9 +83,10 @@ POST /v1/orders/bulk-status-change
   }
 }
 ```
+    **NB**: в примере выше мы приводим «правильную» политику перезапросов (с экспоненциально растущим периодом ожидания и лимитом на количество попыток), как это следует реализовать в SDK. Следует, однако, иметь в виду, что в реальном коде партнёров с большой долей вероятности ничего подобного реализовано не будет. В дальнейших примерах эту громоздкую конструкцию мы также будем опускать, чтобы упростить чтение кода.
 
-      2. Повтор только неудавшихся подзапросов
+      2. 
Повтор только неудавшихся подзапросов: ``` const pendingOrders = await api .getPendingOrders(); @@ -148,11 +149,11 @@ POST /v1/orders/bulk-status-change } while (pendingOrders.length); ``` -Если предположить, что операция `bulkStatusChange` может с какой-то вероятностью окончиться частичной неудачей (ну или частичным успехом, если вы оптимист), который может быть исправлен перезапросом, то, как мы видим, варианты организации сервера (B) и (D) не годятся, т.к. безусловный повтор частично неудачного запроса никогда не будет успешным. +Если мы проанализируем комбинации возможных реализаций клиента и сервера, то увидим, что подходы (B) и (D) не работают с решением (1), поскольку клиент будет пытаться повторять заведомо неисполнимый запрос, пока не исчерпает лимит попыток. -Теперь добавим к постановке задачи ещё одно важное условие: предположим, что иногда ошибка подзапроса не может быть устранена его повторением. Например, партнёр пытается подтвердить заказ, который был отменён пользователем: если такой подзапрос есть (фактически — в коде партнёра есть какая-то не часто происходящая ошибка в логике, о которой он пока не осведомлён), то атомарный сервер, реализованный по схеме (A), моментально партнёра «накажет»: сколько бы он запрос ни повторял, *валидные подзапросы не будут выполнены, если есть хоть один невалидный*. В то время как неатомарный сервер, по крайней мере, продолжит подтверждать валидные запросы. +Теперь добавим к постановке задачи ещё одно важное условие: предположим, что иногда ошибка подзапроса не может быть устранена его повторением — например, партнёр пытается подтвердить заказ, который был отменён пользователем. Если в составе массового вызова есть такой подзапрос, то атомарный сервер, реализованный по схеме (A), моментально партнёра «накажет»: сколько бы он запрос ни повторял, *валидные подзапросы не будут выполнены, если есть хоть один невалидный*. В то время как неатомарный сервер, по крайней мере, продолжит подтверждать валидные запросы. -Это приводит нас к парадоксальному умозаключению: гарантировать, что партнёрский код будет *как-то* работать можно только реализовав максимально нестрогий неидемпотентный неатомарный подход к операции массовых изменений. Однако и этот вывод мы считаем ошибочным, и вот почему: описанный нами «зоопарк» возможных имплементаций клиента и сервера очень хорошо демонстрирует *нежелательность* эндпойнтов массовых изменений как таковых. Такие эндпойнты требуют реализации дополнительного уровня логики и на клиенте, и на сервере, причём логики весьма неочевидной. Функциональность неатомарных массовых изменений очень быстро приведёт нас к крайне неприятным ситуациям. +Это приводит нас к парадоксальному умозаключению: гарантировать, что партнёрский код будет *как-то* работать и давать партнёру время разобраться с ошибочными запросами, можно только реализовав максимально нестрогий неидемпотентный неатомарный подход к операции массовых изменений. Однако и этот вывод мы считаем ошибочным, и вот почему: описанный нами «зоопарк» возможных имплементаций клиента и сервера очень хорошо демонстрирует *нежелательность* эндпойнтов массовых изменений как таковых. Такие эндпойнты требуют реализации дополнительного уровня логики и на клиенте, и на сервере, причём логики весьма неочевидной. 
Функциональность неатомарных массовых изменений очень быстро приведёт нас к крайне неприятным ситуациям:
 
 ```
 // Партнёр делает рефанд
 // и отменяет заказ
 POST /v1/bulk-status-change
 {
   "changes": [{
     "operation": "refund",
     "order_id"
   }, {
     "operation": "cancel",
     "order_id"
   }]
 }
 →
 // Пока длилось исполнение запроса,
 // пользователь успел дойти
@@ -173,10 +174,10 @@ POST /v1/bulk-status-change
 // до кофейни и забрать заказ
 {
   "changes": [{
-    // Рефанд проведён успешно
+    // Рефанд проведён успешно…
     "status": "success"
   }, {
-    // А отмена заказа нет
+    // …а отмена заказа нет
     "status": "fail",
     "reason": "already_served"
   }]
@@ -206,7 +207,7 @@ POST /v1/bulk-status-change
       "cancel"
     ]
   }, {
-      // Операции по разным
+      // Группы операций по разным
       // заказам могут выполняться
       // параллельно и неатомарно
       "order_id": <второй заказ>