How to measure the impact of AI software development systems

Empowering your developers with an AI software development system that lets them automate more of the tasks in their job can improve engineering outcomes and quality of life for developers at your organization. For example, ML-enhanced code completion has been shown to improve developer productivity at both Google and GitHub.

Below we outline how you can begin to understand and estimate the return on investment of rolling out an AI software development system to your team.

Step 0: Measure developer productivity and experience

Before you can measure the impact of AI software development systems, you need to have baseline metrics on developer productivity and experience at your organization.

If you don’t already have your own approach, then the SPACE framework for understanding developer productivity is a good place to begin. The five dimensions of the framework are:

  • Satisfaction and well-being
  • Performance
  • Activity
  • Communication and collaboration
  • Efficiency and flow

It’s recommended to capture several metrics across at least three dimensions of the framework, and at least one of them should be a perceptual measure, such as survey data, so that people’s lived experiences are represented.
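As a concrete illustration, here is a minimal sketch of recording a baseline across three SPACE dimensions. The metric names and data sources are hypothetical; substitute whatever your organization already tracks.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class SpaceBaseline:
    """Baseline across three SPACE dimensions (hypothetical schema)."""
    satisfaction_scores: list[float] = field(default_factory=list)       # Satisfaction: 1-5 survey responses
    prs_merged_per_week: list[int] = field(default_factory=list)         # Activity: merged pull requests
    uninterrupted_hours_per_day: list[float] = field(default_factory=list)  # Efficiency and flow

    def summary(self) -> dict[str, float]:
        return {
            "satisfaction_avg": mean(self.satisfaction_scores),
            "prs_merged_avg": mean(self.prs_merged_per_week),
            "flow_hours_avg": mean(self.uninterrupted_hours_per_day),
        }

# Baseline gathered before rolling out the AI dev system
baseline = SpaceBaseline(
    satisfaction_scores=[3.8, 4.1, 3.5],
    prs_merged_per_week=[6, 9, 7],
    uninterrupted_hours_per_day=[2.5, 3.0, 1.75],
)
print(baseline.summary())
```

Capturing the same summary again after rollout gives you a before-and-after comparison to draw on in Step 3.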

Step 1: Understand AI dev system usage

Once you roll out your AI dev system, the first step is to measure adoption. Your development data will reveal a lot about usage at your organization:

  • Usage rates of different models (e.g. DeepSeek Coder 33B vs. Code Llama 70B)
  • Usage rates of different features (e.g. tab-autocomplete vs. “chat” experiences)
  • Acceptance rates of suggestions (e.g. ghost text, /edit, the “apply this” button, etc.)

If no one uses the system, then rolling it out won’t have any impact. Before you move on to the following steps, you need to make sure the system is being adopted by enough developers at your organization.
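If your AI dev system emits telemetry events, a minimal sketch like the one below could compute acceptance rates by feature. The event schema (the `model`, `feature`, and `accepted` fields) is an assumption; adapt it to whatever your system actually logs.

```python
from collections import defaultdict

# Hypothetical telemetry events emitted by the AI dev system
events = [
    {"model": "deepseek-coder-33b", "feature": "tab-autocomplete", "accepted": True},
    {"model": "deepseek-coder-33b", "feature": "tab-autocomplete", "accepted": False},
    {"model": "codellama-70b", "feature": "chat", "accepted": True},
]

def acceptance_rates(events: list[dict]) -> dict[str, float]:
    """Fraction of suggestions accepted, grouped by feature."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for event in events:
        shown[event["feature"]] += 1
        accepted[event["feature"]] += event["accepted"]
    return {feature: accepted[feature] / shown[feature] for feature in shown}

print(acceptance_rates(events))  # {'tab-autocomplete': 0.5, 'chat': 1.0}
```

Grouping by `model` instead of `feature` answers the first question above: which models your developers actually prefer.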

Step 2: Gather developer perspectives on flow experiences and code quality

Once developers are using the AI dev system, you should begin talking to them about their experience. Over time, this can grow into formal user interviews that gather qualitative evidence. Eventually, you will likely want to survey developers across your organization to get a more representative picture.

During the GitHub Copilot technical preview, researchers at GitHub ran a large-scale survey to quantify its impact on developer productivity and happiness, which is a helpful reference when designing your own survey.

You’ll likely also want your survey to include questions about the impact of your AI dev system on code quality at your organization.

As an example, researchers at Harvard and Purdue published a study on the usability of code generation tools powered by large language models, which is worth consulting when writing these questions.
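When survey responses come back, even a simple aggregation makes it easier to compare perceived impact across questions or teams. The sketch below assumes Likert-style (1-5) responses keyed by question; the question wording is illustrative only.

```python
from statistics import mean, stdev

# Hypothetical Likert-scale responses (1 = strongly disagree, 5 = strongly agree)
responses = {
    "The AI dev system helps me stay in flow": [4, 5, 3, 4, 4],
    "Suggestions from the AI dev system are high quality": [3, 4, 2, 4, 3],
}

for question, scores in responses.items():
    print(f"{question}: mean={mean(scores):.2f}, stdev={stdev(scores):.2f}, n={len(scores)}")
```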

Step 3: Monitor impact on engineering outcomes

Business results are a function of the performance and reliability of the software you create, deploy, and maintain. Introducing an AI dev system to your organization should affect not only the developer productivity and experience metrics you track, but also your engineering outcomes.

If you don’t already have your own approach for measuring engineering outcomes, then DORA metrics are a good place to start:

  • Deployment Frequency: how often you successfully release to production
  • Lead Time for Changes: the amount of time it takes a commit to get into production
  • Change Failure Rate: the percentage of deployments causing a failure in production
  • Time to Restore Service: how long it takes you to recover from a failure in production

To learn more about DORA metrics, here is a useful primer, and here is a resource that might be helpful for considering the downstream effects of introducing an AI dev system.
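As an illustration, here is a minimal sketch of computing two of the four DORA metrics from deployment records. The record format is hypothetical; in practice, you would pull this data from your CI/CD system.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records pulled from a CI/CD system
deployments = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 15), "failed": False},
    {"committed": datetime(2024, 1, 2, 10), "deployed": datetime(2024, 1, 3, 11), "failed": True},
    {"committed": datetime(2024, 1, 4, 8), "deployed": datetime(2024, 1, 4, 12), "failed": False},
]

# Lead Time for Changes: mean commit-to-production time
lead_times = [d["deployed"] - d["committed"] for d in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deployments causing a production failure
failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"Lead time for changes: {mean_lead_time}")
print(f"Change failure rate: {failure_rate:.0%}")
```

Tracking these alongside your SPACE baseline from Step 0 lets you see whether the AI dev system is moving engineering outcomes, not just activity.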

Real-world examples

We are in the early days of AI software development systems. That said, some platform development engineers have already begun to share their experiences measuring impact. Here are two talks from September 2023 that are worth watching: