Hi! I’m a product manager here at CircleCI. I wanted to provide a small FYI for transparency’s sake; no action is required. CircleCI is commencing the roll-out of a new implementation of how the Docker Layer Caching (DLC) feature works “under the hood”. Users should see minimal impact.
If you see any anomalies with how your jobs that enable DLC are executing, please file a support ticket immediately and do not hesitate to comment on this thread below.
New Implementation Details:
At a high level, the new implementation uses a sparse file as the primary mechanism for providing DLC; volumes are no longer used. The fundamental process of creating a cache (or using one if it already exists) and then caching image layers for use in a subsequent job has not changed between the old and new implementations of DLC.
Some notable differences between the two implementations:
Each job uses the same cache for a given project, and the cache uses a “last write wins” strategy from the most recent job.
There is a “DLC set-up” step at the beginning of each job that uses DLC. Users are not charged for the “DLC set-up” step.
At the end of each job, the cache upload is done asynchronously and does not prevent the workflow from continuing to progress. This means that jobs within the same workflow are unlikely to access a cache uploaded from an upstream job. Users are not charged for this “DLC teardown” step.
Because each job now downloads the latest version of the cache, there is no limit to how many jobs can pull the cache at the same time. This eliminates one of the limitations of the old implementation and should result in more cache hits for users.
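To be clear, nothing changes on the config side: since the change is entirely “under the hood”, DLC is still enabled the same way in `.circleci/config.yml`. A minimal sketch of a job that would pick up the new implementation (the machine image tag, project name, and build step are placeholders, not recommendations):

```yaml
version: 2.1

jobs:
  build:
    machine:
      image: ubuntu-2204:current    # example Linux VM image
      docker_layer_caching: true    # enables DLC; triggers the "DLC set-up" and teardown steps described above
    steps:
      - checkout
      # Layers cached by a previous job for this project can be reused by this build
      - run: docker build -t myapp:latest .   # placeholder build step

workflows:
  main:
    jobs:
      - build
```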
The new implementation I mentioned above was not rolled out in early July.
The rollout is commencing now. Based on internal testing, the majority of jobs should see no negative impact. If you are having issues with DLC, please comment on this thread and open a support ticket via https://support.circleci.com/hc/en-us
With this new approach, I don’t see a volume ID printed in the logs; it looks like it’s now downloading an image file instead. Does this remove the single-machine concurrency limitation that the previous method had?
I’m seeing around an extra minute for a DLC job. The setup is 10-20 seconds slower on average and there’s now a 30+ second Teardown step. I’ve opened a support ticket.
Hey @cdhunt, sorry for the delayed response. The new implementation has a few nuances to it:
Yes, you should be seeing increased cache hits: each job downloads the latest version of the cache, so there is no limit to how many jobs can pull the same cache at the same time. We’ll make this more obvious in our docs and communication to users once we’re past this “experimental” phase.
There is indeed a longer set-up and teardown step. As a result, we’re intentionally not charging users for either the set-up or the teardown step (a change in behavior from the old implementation). That’s called out in the step output UI with an info icon.
Overall, for most customers we’re seeing relatively faster wall-clock time even with the increased set-up time. This is likely due to (1) increased cache-hits and (2) the fact that “under-the-hood” we’re using a more robust, reliable, and performant machine provisioning service with this new DLC implementation.
The teardown step does not prevent workflows from continuing to progress.
Hope that clears up some of your questions. Feel free to comment if you have more or use the support ticket that you opened.
For all who are following this thread, the new implementation of DLC is now rolled out to all users running jobs with a Linux VM resource class. The only exceptions at this time are:
GPU resource classes
Jobs that use Android
We will be rolling this out to DLC jobs that use Remote Docker in the next two months.
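To illustrate the scope: jobs that enable DLC on a Linux machine executor (as in the earlier sketch) are on the new implementation now, while jobs that enable DLC through `setup_remote_docker`, as in the sketch below, remain on the old implementation until that rollout happens. (The image name and build step are placeholders.)

```yaml
version: 2.1

jobs:
  build:
    docker:
      - image: cimg/base:stable       # example convenience image
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true  # DLC via remote Docker: not yet migrated to the new implementation
      - run: docker build -t myapp .  # placeholder build step

workflows:
  main:
    jobs:
      - build
```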
I have seen some fairly steep regressions in most, if not all, of my jobs that use DLC, in what I must admit is absolutely not the use case you had envisioned for the feature. However, it has been working great for around two years now, so I wanted to make sure you were aware of this regression following these new changes.
The project(s) are private so I put in a support ticket that contains all the relevant details.
Request #117715
Replying here in hopes that the right people can get eyes on the details.
Hi @sebastian-lerner, we rarely have any build concurrency in our workflows, so the DLC set-up time is a big regression for us. Any chance we could get a configuration option to bring the old behavior back? That download time is killing me when I’m trying to iterate quickly.
@jarnoux unfortunately at this time there is no way to revert jobs to the old DLC behavior. Longer term, we have identified targeted optimizations that we believe will reduce the impact of the download time, but we don’t have a timeline to share for implementing them yet, unfortunately.
@sebastian-lerner - We are seeing ‘spin up environment’ times of 45-60 seconds now, along with ‘spin up container environment’ times of around 30 seconds, adding up to almost 90 seconds of startup when we used to have closer to 15 seconds total. These changes are a large regression for us as well and significantly impact our ability to run quick builds rapidly. Please let me know if I can provide more details here. I am hoping we can go back to something closer to the old behavior here as well, asap.
@robg Understood. For the vast majority of customers, we see the increase in set-up time offset by faster overall job runtime while the job is actually executing (outlined here). Are you seeing overall job and workflow runtime significantly degraded?
Keep in mind that the final step (DLC teardown) does not block the workflow from continuing.
@sebastian-lerner - Yes, I’m seeing overall job time significantly degraded. A job that used to run, end to end, in 53 seconds now runs for 1 min 49 seconds, so this has more than doubled our overall run time. I’d be happy to privately share any details that would help you investigate this further - let me know how I can do that.
“Spin up environment” is taking 45 to 50 seconds for us lately. When the actual job itself only takes about a minute, this means almost doubling the total job duration. I can’t say for sure when this started but I only recently noticed it happening.