Network-bound nodes?

networking
2.0

#1

Firstly, let me say I love the 2.0 product, which we’ve been happily using for about a month.

We’ve been seeing some dramatic variation in build time, which seems to imply fairly intense network congestion or contention on some build nodes. The behavior is that some builds take much, much longer to do things like pull down docker images, clone git repos, and send build context to remote docker engines.

On a “fast” node, the git clone takes ~10s, but on a “slow” node it takes 3-4 minutes. The build time compounds because we also do docker builds, which require sending context over to a remote docker engine, and there’s a tight correlation between slow git clones and slow docker builds. It all points to network contention on some nodes, but not all of them.

On a “fast” node, we get builds on the order of 3-4 minutes, which is fine for my team, but on a “slow” node that also happens not to have cached our build image, a build can take nearly 20 minutes. We gate merging of PRs on successful builds, so it’s potentially a real productivity hit for us.

Did something change in the network configuration or utilization in the past week? It’s gotten much worse recently.

Thanks!


#2

For reference, looking at the git clone logs, the network throughput varies from ~30MiB/s down to ~500KiB/s.


#3

Nothing changed on our end, but you’re certainly not the only one experiencing the pain. We were just discussing it internally because we’re individually impacted as well, even locally. AWS is having some Route53 issues today that are affecting a lot of AWS customers; Docker Hub seems to be one of them.

There’s not much we can do right now about the slow Docker pulls and such, but we’re definitely doing everything we can to improve reliability across the board with 2.0.


#4

Thanks for the quick response. The real crapshoot seems to be the git clone, which can take anywhere from 5s to 3m.


#5

Are you caching your source? That’s really the only thing you can do to mitigate the pull time.
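
For example, here’s a minimal sketch of the idea, assuming a 2.0-style YAML config (the image and cache key names are placeholders, and step names may differ slightly in your setup): restore the `.git` directory from a cache keyed on branch and revision, run the checkout on top of it so only new objects are fetched, then save the cache back.

```yaml
version: 2
jobs:
  build:
    docker:
      - image: yourorg/build-image:latest  # placeholder; use your existing build image
    steps:
      # Try the most specific cache first, then fall back to older ones.
      - restore_cache:
          keys:
            - source-v1-{{ .Branch }}-{{ .Revision }}
            - source-v1-{{ .Branch }}-
            - source-v1-
      # With .git already restored, checkout only fetches what changed
      # instead of doing a full clone over a possibly congested link.
      - checkout
      # Save the updated .git directory for subsequent builds.
      - save_cache:
          key: source-v1-{{ .Branch }}-{{ .Revision }}
          paths:
            - ".git"
```

The fallback keys let a build with no exact match reuse the most recent cache for the branch (or any branch) rather than cloning from scratch.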


#6

Pull time ordinarily hasn’t been a factor for us; it’s only in the past week or so that it’s gotten slow, and not on all builds. Is it expected that there would be such a range of network throughput from node to node? Some nodes are two orders of magnitude slower than others.

We’ll look into source repo caching; I’m just curious about the network stuff, since it seems to be a significant source of unpredictability.


#7

We will be looking into rate-limiting and other possible factors.


#8

(One problem with source caching, FWIW, is that if the node is network-bound, the cache restore/save can take as long as the pull would have.)


#9

We’ve also started seeing network issues since Monday afternoon (about 3pm PDT). In our case, we do a lot of network-bound activity at once during deploy, including building several docker images in parallel using setup-docker-engine; we’ve started seeing some builds deadlock, and we think they’re behaving badly under load.

Let us know if you figure out what’s changed on the network side of things!


#10

Yeah, things got a lot worse in the past few days.


#11

Our ops team just told me they found the problem and they’re working on a fix. They didn’t have an ETA but it’s their primary focus.


#12

This should be fully resolved now. If you see any strange network shenanigans, definitely let us know.


#13