Batch vs single APIs
The question is whether a service should default to offering a batch/bulk API (scatter-gather) rather than asking clients to make parallel single-item calls.
Batch API pros:
- Performance gains from batching: fewer round trips reduce network bandwidth use and latency.
- Underlying data might be in the same block that the OS fetches, allowing you to exploit locality.
- In a typical RPC processing path, there are many technical layers before the actual application code is executed: protocol parsing, authentication, authorization, metrics initialization/tracking, logging, configuration, experimentation, etc. Each of these has setup, processing, and teardown within every request-processing lifecycle. They become significant if you have to invoke hundreds of single-record APIs to service one end-user request, and they end up consuming a large percentage of your CPU cycles when the business logic per call is very small.
- Gives the caller the choice of using this or sending items one at a time
- Micro-batching is a standard, well-accepted technique in both user-path APIs and non-user-path (data processing) APIs. The drawbacks around monitoring instrumentation, error handling, etc. are fairly easy to address once, so they are not an overhead paid again for each feature API.
- In any kind of content-serving application (product listings, search result listings, friends/contacts, merchants, orders/transactions) where there are many records to fetch/format/serve and the UX can tolerate a degraded result set (partial results, fewer attributes per result, etc.), a batch API makes a lot of sense.
- Most use cases are batch fetches, and there are far fewer use cases for single-item fetch APIs. Single-item fetches are usually used for self-profile scenarios (like my own profile page or my settings page), and those usually get far less traffic than pages that serve many records together.
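The per-request overhead argument above can be made concrete with a toy cost model. This is only a sketch: the two cost constants and both function names are hypothetical, picked for illustration rather than measured from any real RPC stack.

```python
import math

# Hypothetical costs in microseconds, chosen only for illustration.
FIXED_OVERHEAD_US = 200  # parsing, auth, metrics, logging, etc. per request
PER_ITEM_WORK_US = 20    # actual business logic per record


def total_cost_us(num_items: int, batch_size: int) -> int:
    """CPU cost of fetching num_items records using requests of batch_size."""
    if num_items < 1 or batch_size < 1:
        raise ValueError("num_items and batch_size must be >= 1")
    num_requests = math.ceil(num_items / batch_size)
    # The fixed overhead is paid once per request; the work is paid once per item.
    return num_requests * FIXED_OVERHEAD_US + num_items * PER_ITEM_WORK_US


def overhead_fraction(num_items: int, batch_size: int) -> float:
    """Share of the total cost that is per-request overhead, not business logic."""
    total = total_cost_us(num_items, batch_size)
    overhead = total - num_items * PER_ITEM_WORK_US
    return overhead / total


if __name__ == "__main__":
    for batch_size in (1, 10, 100):
        cost = total_cost_us(100, batch_size)
        share = overhead_fraction(100, batch_size)
        print(f"batch_size={batch_size}: {cost} us total, {share:.0%} overhead")
```

With these assumed constants, 100 single-record calls spend most of their CPU on overhead, while one batch of 100 spends most of it on business logic. The real ratio depends entirely on how heavy the request lifecycle is compared with the per-item work.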
Batch API cons:
- Batching complicates reporting partial failures, and the solutions tend to be ad hoc and do not play well with general-purpose mechanisms like automatic retries.
- Batching complicates monitoring, since you can’t interpret most built-in RPC metrics without also considering the batch size. This requires you to introduce a number of custom metrics where otherwise the defaults would have sufficed.
- Variable batch sizing introduces randomness into memory/CPU use, again complicating monitoring/provisioning.
- In the limit of large data, two random keys will probably not map to the same underlying partition anyway, so the locality benefit may be overstated in general. Non-random access (like range scans) could probably be better handled with purpose-built APIs.
- Batch APIs introduce synchronization points which could end up hurting performance overall. Consider this example: you need to fetch n items from service A and run them through service B. Each service has a mean latency of 50 ms and a tail latency of 200 ms, with the tail being driven by slow IO for specific keys regardless of batching. For moderate n, the batched mean latency would be ~400 ms: each batch call is only as fast as its slowest key, so both batch calls would take about 200 ms. For non-batched calls, where each key proceeds from A to B independently, it would be ~250 ms, since any given key is unlikely to be slow for both A and B.
- Puts more burden on the clients to handle partial failures correctly, and that can be error prone.
- Marginal gains in efficiency do not justify complicating the API surface.
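The synchronization-point example above (services A and B) can be checked with a small simulation. This is a sketch under several assumptions that are not in the original example: a key is either fast (50 ms) or slow (200 ms), each key is independently slow with a made-up probability `p_slow`, a batch call takes as long as its slowest key, and per-key calls run fully in parallel with no per-request overhead. The function name `simulate` is hypothetical.

```python
import random

FAST_MS = 50   # typical per-key latency from the example
SLOW_MS = 200  # tail per-key latency from the example


def simulate(n_keys: int, p_slow: float, trials: int, seed: int = 0):
    """Return (mean batched latency, mean per-key latency) in milliseconds."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    batched_total = 0.0
    per_key_total = 0.0
    for _ in range(trials):
        # Draw an independent latency for every key at each service.
        a = [SLOW_MS if rng.random() < p_slow else FAST_MS for _ in range(n_keys)]
        b = [SLOW_MS if rng.random() < p_slow else FAST_MS for _ in range(n_keys)]
        # Batched: B's batch cannot start until the slowest key in A's batch returns.
        batched_total += max(a) + max(b)
        # Per-key: each key flows A -> B on its own; the request ends with the slowest key.
        per_key_total += max(x + y for x, y in zip(a, b))
    return batched_total / trials, per_key_total / trials


if __name__ == "__main__":
    batched, per_key = simulate(n_keys=50, p_slow=0.05, trials=2000)
    print(f"batched: {batched:.0f} ms, per-key: {per_key:.0f} ms")
```

Under these assumptions the batched mean comes out well above the per-key mean, in line with the direction of the argument. The gap shrinks as `p_slow` approaches 0 or 1, and this model deliberately ignores the per-request overhead that batching saves.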