We have a workflow with 32 jobs, and each job uses 6 Docker images from AWS ECR, which means we are doing 32 × 6 = 192 logins at the same time for a single build. Multiply that by N concurrent builds, and it’s easy to see how we’d run into throttling limits.
Per the documentation (below):
The throttle for the GetAuthorizationToken action is 4 transactions per second (TPS), with up to a 200 TPS burst allowed.
The throttle on the GetAuthorizationToken operation cannot be increased on a per-account basis.
To handle throttling errors, implement a retry function with incremental backoff into your code.
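As a rough illustration of the retry advice quoted above, here is a minimal sketch using exponential backoff with random jitter, which is one common form of incremental backoff. `ThrottledError` and `retry_with_backoff` are hypothetical names invented for this example; in real use, `call` would wrap whatever performs the ECR login, and the exception would be whatever your client raises on a throttling response.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error (for example, an HTTP 429)."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Run call() until it succeeds, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, so surface the error to the caller
            # The delay ceiling doubles on every attempt, capped at max_delay.
            # Picking a random point below the ceiling (jitter) keeps parallel
            # jobs from all retrying at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The jitter matters here because the jobs in one build start together: without it, every throttled login would retry on the same schedule and hit the limit again.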
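To avoid needing to retry at all, it should be possible to reuse the token: an ECR authorization token is valid for 12 hours, so a single GetAuthorizationToken call could cover many pulls.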
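To make the token-reuse idea concrete, here is a small sketch of a cache that only asks for a new token when the current one is close to expiring. `TokenCache` and its `fetch` callable are hypothetical; `fetch` is assumed to return a `(token, expires_at)` pair with the expiry in epoch seconds, and in practice it would wrap a real GetAuthorizationToken call.

```python
import time


class TokenCache:
    """Keeps one token and refetches only once it is close to expiring."""

    def __init__(self, fetch, margin_seconds=300, clock=time.time):
        self._fetch = fetch            # assumed: returns (token, expires_at_epoch_seconds)
        self._margin = margin_seconds  # refresh this long before the real expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Fetch on first use, or when the token is inside the safety margin.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Note that this cache lives in one process. Sharing a token across separate jobs would need some shared store, which this sketch does not attempt.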
However, I admit I don’t know how compatible that is with Circle’s native way of instantiating containers, since you may not be able to set anything up prior to Circle doing its thing.
I have greater flexibility with my own (perhaps unusual) configuration, as I manually pull the images I use and then start them all with Docker Compose inside a single Docker image. This means I could, if I wished, park the images in a workspace with docker save and then docker load them when required. I can also put sleeps between pulls if the registry provider’s throttling is triggered.
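For anyone wanting to try the manual approach, here is a rough sketch of it: pull each image with a pause in between, write them all to one archive with docker save, and restore them later with docker load. The function names, the pause length, and the archive path are assumptions made for illustration. Unlike the description above, this version always sleeps between pulls rather than only when throttling is detected.

```python
import subprocess
import time


def pull_and_save(images, archive_path, pause_seconds=2.0,
                  run=subprocess.run, sleep=time.sleep):
    """Pull images one at a time, pausing between pulls, then save them to one tar."""
    for index, image in enumerate(images):
        if index > 0:
            sleep(pause_seconds)  # space the pulls out to stay under the rate limit
        run(["docker", "pull", image], check=True)
    # docker save accepts several images and writes them into a single archive.
    run(["docker", "save", "-o", archive_path, *images], check=True)


def load_saved(archive_path, run=subprocess.run):
    """Restore every image stored in the archive into the local Docker daemon."""
    run(["docker", "load", "-i", archive_path], check=True)
```

The first function would run in a job that has registry access and the second in any job that receives the archive, so only the first one ever needs to log in.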
You can’t persist images using workspaces or cache. This is a pretty bad issue, and it’s certainly complicated. I’m very confident you’re the first to hit this issue.
Talking to AWS about your rate limit could get you a faster turnaround. I don’t know what’s going to be involved in fixing this from our end… but it’s a very legitimate (and frustrating!) bug report, so we’ll certainly work on addressing it.