In this blog I want to take you, the reader, with me on my quest to find a comprehensive CI/CD practice for a moderate to large corporate environment. I personally think efficiency is reached when your practices relate to the other processes in the organisation. So this blog isn't so much about tooling as about the interconnections between the tooling and the practice.
The tools
As you might know I am a Microsoft developer, so the tool of choice is Azure DevOps. I implement apps and APIs, so add Azure, Xamarin and appcenter.ms to the list of tools too.
My development process of choice is a ‘real’ Agile environment using Scrum, DevOps and Google design sprints. Agile is all about team trust and team responsibilities. Good CI/CD is very difficult if the team isn’t trusted with the task at hand.
The pipe
Basic strategy
Everything that gets built/produced is under source control. So when I say 'code' it can be all kinds of development work, not limited to programming apps/APIs.
We work on master alone, and only (automated) pull requests get merged to master (CI).
The pull request is part of our Scrum Definition of Done, and thus gated with a review and a four-eyes policy, plus additional checks incorporated into the build, like unit tests and SonarQube analysis.
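To give an impression of the kind of check the gated build runs, here is a minimal unit test sketch in xUnit, roughly as it could live in one of our .NET Core repositories. The class and numbers are made up purely to show the shape of a gate check:

```csharp
using Xunit;

// Hypothetical production class, included only to keep the example self-contained.
public class PriceCalculator
{
    private readonly decimal _vatRate;
    public PriceCalculator(decimal vatRate) => _vatRate = vatRate;
    public decimal Total(decimal net) => net * (1 + _vatRate);
}

public class PriceCalculatorTests
{
    // A check like this runs in every pull request build;
    // if it fails, the merge to master is blocked.
    [Fact]
    public void Total_includes_vat()
    {
        var calculator = new PriceCalculator(vatRate: 0.21m);
        Assert.Equal(121m, calculator.Total(net: 100m));
    }
}
```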
In the CD pipe we always incorporate an automated test step, AT for short: Postman for .NET (Core) APIs, SpecFlow UI tests via appcenter.ms for Xamarin apps, and Protractor tests for Angular web apps.
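To make the AT step a bit more tangible for apps, below is a trimmed-down sketch of a SpecFlow step definition driving a Xamarin app through Xamarin.UITest, the kind of test appcenter.ms runs in its device cloud. The screen and element names are invented for the example:

```csharp
using TechTalk.SpecFlow;
using Xamarin.UITest;

[Binding]
public class LoginSteps
{
    private IApp _app;

    [BeforeScenario]
    public void StartApp()
    {
        // In appcenter.ms the apk/ipa is supplied by the test run;
        // locally you would add .ApkFile(...) to point at a build output.
        _app = ConfigureApp.Android.StartApp();
    }

    [When(@"I log in as ""(.*)""")]
    public void WhenILogInAs(string userName)
    {
        _app.EnterText(c => c.Marked("UserNameEntry"), userName);
        _app.Tap(c => c.Marked("LoginButton"));
    }

    [Then(@"I see the dashboard")]
    public void ThenISeeTheDashboard()
    {
        _app.WaitForElement(c => c.Marked("DashboardTitle"));
    }
}
```

The same test assembly runs locally against an emulator and, unchanged, against a matrix of real devices on appcenter.ms.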
There are different CI/CD build pairs for the different components in the landscape. We strive for a one-to-one relation between a build definition and a component, but there can be a couple of codebases in one repository, especially with the 'start small, specialize later' approach most of our projects take.
If we use components, we prefer 'native' (off-the-shelf) ones as much as possible. If we build shared components ourselves, we avoid interconnections between products: a change in one component must not automatically result in broken (non-buildable) products elsewhere.
Make sure all of this is easy to automate and easy to maintain.
In my field of work this all leans heavily on mechanisms in Azure DevOps. It supplies a lot of tooling to automate things like triggers and gate policies, has integrated component versioning strategies, and all of this actually works out of the box.
But don't over-automate. Seriously. Rigorous gating checks and extensive templating breed laziness. Agile is about people, and about using the right tools for the right jobs. I think the key to working fast isn't automation alone but a balancing act between doing things yourself and letting others do the work for you.
Work
All work starts with the Scrum Sprint. From there we define stories. These stories get refined, and when they are planned into a sprint the tasks get defined. Each task that requires a code change gets exactly one branch attached to it.
So in practice, each task that carries a code change triggers the whole CD pipeline. For each part of your code you therefore know the delta in stability, thanks to the rigorous checks the AT environment performs.
Make sure you know the impact of the business side (features/stories) on your product.
With our practice we know for each task and story how it impacts our total codebase and how well the new code integrates into the old.
I hear some of you thinking: ‘but each merge into master gets released if you do CD’.
Being practical, I know that not all code can be released as soon as it seems to work. I think it is perfectly fine to gather work on the A environment, review it during the Sprint Demo and start the release afterwards. But I want to add that this should be for functional reasons only, like instructing users, instructing 1st/2nd line support, or getting functional approval from the parties involved.
Testing
My CI/CD strategy is built on testing: fail fast and fail often.
Early-stage checks should be red often, because I believe you only learn (and grow) through failure. The CI builds should fail the most, and the AT environment should catch 99.9% of the remaining failures. In acceptance, only functional issues should be left to halt the pipe.
I don't much believe in a dedicated T environment for manual testing. All manual tests are to be automated. If you need some manual testing, or haven't automated something yet, our A (Acceptance) environment is meant for manual (but non-intrusive) testing.
A word on special-case testing with prepared databases and such: because we actively monitor the functional coverage of early testing like unit tests, and combine this with automated integration testing through API and app tests, there is little to no need for manually prepared special-case testing.
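Our API-level AT is actually built with Postman collections, but expressed in C# terms the idea looks roughly like the in-process integration test below (using Microsoft.AspNetCore.Mvc.Testing; the endpoint is hypothetical). The test sets up its own special case instead of depending on a hand-prepared database:

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

// 'Startup' is the API's own startup class; the test host boots the
// complete API in-process, so no separately prepared environment is needed.
public class OrderApiTests : IClassFixture<WebApplicationFactory<Startup>>
{
    private readonly HttpClient _client;

    public OrderApiTests(WebApplicationFactory<Startup> factory)
    {
        _client = factory.CreateClient();
    }

    [Fact]
    public async Task Unknown_order_returns_404()
    {
        // The special case is created by the test itself (an id that
        // does not exist), not by a hand-prepared database.
        var response = await _client.GetAsync("/api/orders/does-not-exist");

        Assert.Equal(HttpStatusCode.NotFound, response.StatusCode);
    }
}
```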
I do not believe in total coverage by all kinds of tests all the time. The coverage should be the outcome of a risk assessment that determines which parts need what kind, width and depth of coverage.
To make sure there is enough coverage, I rely on a very practical rule.
“I always make sure I trust the code that stands ready on the A environment. I even trust it so much that I dare to start a production release, walk away and go home, knowing everything works as intended (with the support phone on me).”
If I don’t pass this challenge, it is time to revise testing decisions.
And a word on load testing. (TL;DR: most of the time it is not worth the investment.)
I only use it on special occasions to get direction; we don't have it automated in our pipe. To me it is just too error-prone. It is hard to analyse the code to get meaningful coverage, and very hard to get production-like benchmarks (especially with monthly changing consumer demand that has far more impact than code changes). It is also very expensive, because for the results to mean anything you want to run it on production-like environments with production loads.
And why do all this when we are in control of support, have live production health monitoring, cloud scaling, and a very fast delivery pipe to correct the code if performance suffers due to bugs?
Monitoring
For me monitoring isn't restricted to the CI/CD pipe. The released product is the most essential part of CI/CD, because it tells the real story about the state of the product that is in the pipe.
So here are some CI/CD metrics. For us the most important are the state of the pipe (has a bug been detected in the code?) and where each code change is waiting. This is on display on a live dashboard.
But I found out that a lot of management is interested in CI/CD metrics too, because they want to feel in control and thus monitor things like failed-build ratios or defect ratios (features released vs. production incidents).
Another part of our metrics is Azure monitoring. We monitor live production performance such as latency, throughput and error rates. These are essential values for product support, but also help determine the quality of the code in production (and thus in the CI/CD pipe). It is difficult, however, to select the right measurements, and even more difficult to define alert thresholds (because you need to know what action to take, and most of the time you just can't do anything about it).
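Most of those latency/throughput/error figures come from the Application Insights SDK out of the box; where the built-in telemetry is not enough we add a custom metric. A rough sketch, with an invented metric name:

```csharp
using Microsoft.ApplicationInsights;

public class PaymentService
{
    private readonly TelemetryClient _telemetry;

    // TelemetryClient is registered by AddApplicationInsightsTelemetry()
    // and injected by the ASP.NET Core container.
    public PaymentService(TelemetryClient telemetry)
    {
        _telemetry = telemetry;
    }

    public void RecordSettlementDelay(double seconds)
    {
        // Shows up as a custom metric in Application Insights, next to the
        // built-in latency/throughput/error-rate telemetry, and can be put
        // on the same dashboards and alert rules.
        _telemetry.GetMetric("SettlementDelaySeconds").TrackValue(seconds);
    }
}
```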
And last but most important: business metrics.
Because I think value-driven development is the best there is, we want to know the added value of each Sprint the team delivers. We monitor this by instrumenting key points in our products and showing the results on a live dashboard. It is hard to get right, and we aren't nearly there yet, but right now a lot of the roadmap is decided on input from these metrics.
Back to CI/CD: you want to know the value of your individual released parts. This can only be done by measuring some kind of business value and relating it to the changes made over time.
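A minimal sketch of what such a business measuring point could look like in a .NET Core API with Application Insights (the event name, route and request type are invented for the example). Counting these events over time and relating the counts to release dates is what gives you the value of an individual released part:

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.AspNetCore.Mvc;

public class TicketRequest
{
    public string Channel { get; set; }
}

[ApiController]
[Route("api/tickets")]
public class TicketsController : ControllerBase
{
    private readonly TelemetryClient _telemetry;

    public TicketsController(TelemetryClient telemetry)
    {
        _telemetry = telemetry;
    }

    [HttpPost]
    public IActionResult Create(TicketRequest request)
    {
        // ... actual business logic ...

        // Business measuring point: counted per day on the live dashboard
        // and related to the releases that went out in the same period.
        _telemetry.TrackEvent("TicketSold", new Dictionary<string, string>
        {
            ["channel"] = request.Channel
        });

        return Ok();
    }
}
```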
A note on support: our metric selection is suited for live support of a real product, but only under a business-hours (8/5) support contract. There is 3rd-party 1st-line support, but they don't have monitoring. So in essence, I strive for proactive monitoring during business hours and proactive error support.
My wishlist
Well, as you might guess I still want to add or change a few things.
- Better measurement of the added value of releases (to the customer/business). Currently our measurements are not consistently implemented, and where they are, it is most of the time in a customer-facing component. We should add monitoring that reports API relevance, the added value of an expensive fail-fast/fail-often pipe, etc.
- Better reporting to management on how we perform. I want metrics/processes that have an eye for the individual team and its struggles, but at the same time raise the bar so you keep improving. (No, not failed-integration-build metrics… don't waste your time on those.)
The real deal
This is all from a real-life situation: my current work at the NS.
Here the team dictates the CI/CD strategy (so different teams have different strategies). The team also dictates the Scrum process: when to accept work, how to implement, how to release (ship) it, and how to do maintenance.
To facilitate the team (and only facilitate) there is a so-called 'cloud' team supplying tooling, and there are architects available for consulting. The boundaries set by these entities are relatively loose: the emphasis is on the team's responsibility to deliver outstanding work.
The amount of freedom is monitored by all involved parties. Not all development teams receive the same amount of freedom. That's okay with me, because trust is something you need to earn. And my team has earned it!