Recommendation for providing access to a 200GB dataset?

One of our builds in CircleCI involves running tests against a 200GB dataset. Currently the data is downloaded from a third-party site, but the download tends to cause builds to time out.

What strategy would be best for providing on-disk access to this very large dataset? Should we use the CircleCI cache? Or perhaps pre-bake the data into a Docker image?

Could you put the data into Git LFS and then clone it? I think most of the hosted Git providers would charge you a usage fee based on the bandwidth bill you'd be racking up for them, but I don't imagine it would be expensive.

That said, I wonder whether you could run your tests piecemeal; I'm not sure CircleCI will give you a disk big enough to hold all of that. If not, could you stream the data from AWS S3 instead?
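To illustrate the streaming idea, here is a minimal Python sketch of processing a dataset in fixed-size chunks so the full 200GB never has to sit on disk at once. The bucket name, object key, and `run_tests_on_chunk` callback in the comment are placeholders, not anything from your setup:

```python
import io

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk

def process_stream(body, handle_chunk, chunk_size=CHUNK_SIZE):
    """Read a file-like stream in fixed-size chunks, passing each
    chunk to a callback, and return the total bytes consumed."""
    total = 0
    while True:
        chunk = body.read(chunk_size)
        if not chunk:  # empty read signals end of stream
            break
        handle_chunk(chunk)
        total += len(chunk)
    return total

# With boto3 (omitted here to keep the sketch self-contained), the same
# pattern applies to the streaming body of an S3 object:
#   s3 = boto3.client("s3")
#   body = s3.get_object(Bucket="my-bucket", Key="dataset.bin")["Body"]
#   process_stream(body, run_tests_on_chunk)
# ("my-bucket", "dataset.bin", and run_tests_on_chunk are placeholders.)

# Demonstration with an in-memory stream standing in for the S3 body:
seen = []
total = process_stream(io.BytesIO(b"x" * 20), seen.append, chunk_size=8)
print(total)      # 20 bytes consumed
print(len(seen))  # 3 chunks: 8 + 8 + 4
```

Whether this works for you depends on whether your tests can operate on a partial view of the data; if they need random access to the whole set, streaming alone won't be enough.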