I’m Mingyang Yin, a PhD student at the University of Waterloo, supervised by Prof. Shane McIntosh. I am posting to request access to CircleCI build records for my research project. The study we are planning examines the impact of CI runner size (such as CPU and memory) and how developers use it. After comparing several CI services, we found that, unlike many competitors, CircleCI allows users to configure runner sizes. This makes CircleCI an ideal platform for our study.
We tried to use the CircleCI API to collect data about open-source projects that use the platform, but we found that permissions do not allow us to access projects that we do not own.
Would you be willing to grant us access to this data? We will only use the data for academic purposes.
Thanks for considering my request! Don’t hesitate to ask if you have any questions or if you need any additional information from me.
Your best bet will be to open a ticket via the support page:
https://support.circleci.com
I do not work for CircleCI, but you may want to take a look at their privacy policy here:
https://circleci.com/legal/privacy/
This will give you an idea of the complexity of the data rules and laws CircleCI operates under. Once you combine USA, EU, UK, Swiss, and California rules on data collection and transfers, things get rather complicated.
As a side note, and depending on the amount of resources you have available, you may want to read up on self-hosted runners.
The reason is that any stats collected by CircleCI systems are not going to be consistent over time, because:
- CircleCI may choose to change the level of equipment they use from vendors such as AWS and Google. As long as such changes only improve things, CircleCI does not even need to make a formal announcement about them.
- Both AWS and Google may change the systems that they provide their services on. Again, any improvements do not need to be documented.
- The open-source code base changes over time.
- The open-source projects can change their project structures and the tasks performed during a build at any time.
- The open-source projects are unlikely to be changing the size of runners they are using, so you will not gain access to data for a project that allows you to compare between different runners.
The result is that timings between runs are more an indication of performance than a concrete value that you can build upon.
This is where using a self-hosted runner comes in. If you can source a system with any form of virtualization software to deploy one or more defined runners, you can then fork the open-source projects and vary all the build environment parameters in a controlled way, with everything being repeatable for the life of your project.
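As a rough illustration of what "varying the parameters in a controlled way" could look like, here is a minimal Python sketch that only plans the runs; it does not talk to CircleCI or to any virtualization software. The project names, CPU counts, memory sizes, and the `build_run_plan` helper are all hypothetical placeholders, not anything CircleCI provides.

```python
from itertools import product
import random


def build_run_plan(projects, cpu_counts, memory_gb, repetitions, seed=0):
    """Return a shuffled list of runs covering every project/CPU/memory combination.

    All arguments are illustrative assumptions; substitute the forks and
    runner sizes you actually have available.
    """
    plan = [
        {"project": p, "cpus": c, "memory_gb": m, "repetition": r}
        for p, c, m in product(projects, cpu_counts, memory_gb)
        for r in range(1, repetitions + 1)
    ]
    # A fixed seed randomizes the run order (so one configuration does not
    # always run first) while keeping the plan reproducible.
    random.Random(seed).shuffle(plan)
    return plan


# Hypothetical forks and runner sizes, for illustration only.
plan = build_run_plan(
    projects=["example-org/project-a", "example-org/project-b"],
    cpu_counts=[2, 4, 8],
    memory_gb=[4, 8],
    repetitions=3,
)
print(len(plan))  # 2 projects * 3 CPU counts * 2 memory sizes * 3 repetitions = 36
```

Repeating each configuration several times and recording the commit of each fork alongside the timings would let you separate runner-size effects from ordinary run-to-run noise.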
Thanks for your reply. While using self-hosted runners for evaluation can help us get repeatable results, we do think historical build data are meaningful because they reveal the real cost of CI. I will take your advice and open a ticket.
Another thing to note is that most open-source projects that build on CircleCI are unlikely to be optimised for cost, as CircleCI provides 400,000 free credits per month to open-source projects. It is more likely that anyone using this offer will optimise for ease and speed of use.