Engineering is often about solving problems. According to Damon Edwards in “DevOps Kaizen,” reliance on non-standard or manual work from individuals or teams that perform heroics is considered waste. However, heroics are often a necessary part of people’s daily work. These heroics (e.g., nightly 2:00 AM problems in production) need to be alleviated or eliminated.
Engineering is also a technical journey, and in the case described below, a transformative one. Given the waste Edwards categorizes, any manual work that can be automated should ideally be automated, self-service, and available on demand.
Here’s an overview:
- Identifying a Problem
- Finding a Starting Point
- Defining Milestones
- Architecting a Solution
- Estimating Budgets
- Building Prototypes
- Continually Refining Risks
- Nearing Completion
- Envisioning What Is Next
The following article shares some lessons and practices that can be used to craft a blueprint for streamlining project lifecycles. It relates those practices to examples from one manual deployment process in dire need of automation. If you are not sure about deployment processes, check out a previous article on how to Decouple Deployments from Releases.
Identifying a Problem
Problems may not be readily apparent at first. They may start as small nuisances, growing gradually and becoming severe when affecting an entire process or client flow.
Problems can stem from staffing, or from siloed, tribal knowledge that limits contributions from other engineers. In application architectures, a problem can look like this case:
“Suppose we have an application that is largely monolithic, each client has its own server in a datacenter, and deployment processes are on documents spread over shared drives. Clients operate 24x7 in a healthcare setting. DevOps teammates schedule deployments during off-hours or when clients deem appropriate. 40% of unplanned system outages over the last year were the result of human error during those off-hours deployments. The team halted all deployments for the month of December.”
At first glance, this may not seem like a big problem for 5 clients, but what if there were 50? With that many clients, the problem could be solved by increasing staffing, which is a non-trivial solution for most organizations, especially given the risk of insufficient onboarding or turnover. The heroics performed by DevOps, regardless of unplanned outages, are not scalable. This case could eventually generate a silent majority of support for change, one with a high degree of influence over the organization. The problem is hard to solve, and assuming hiring is out of the question, it is a good idea to determine where to start taking corrective action with automation.
Finding a Starting Point
With problems of such a large impact, like the unscalable one mentioned above, finding a starting point can take some time. Engineers may have gripes about the process, leadership may not be able to provide bandwidth due to prior commitments, or there may not be enough business value to gain. While business continues as usual, error budgets continue to shrink, and precarious processes are performed poorly. Given the risk of keeping the status quo, determining a starting point helps gain the support and capacity needed to begin the journey of problem solving.
Business value is a great persuasive tool. When problems are not directly client-facing, it can be difficult to come up with a value proposition. A good starting point for automating deployment processes can be just listing out the risks of continuing as-is and flipping them around.
- With stronger risk management, fewer errors in the process result in higher application quality
- Reducing the amount of time spent on manual heroics allows teams to focus on more strategic and value-added activities
- Automated resources execute more effectively, decreasing the error budget reserved for manual operations and beginning to yield cost savings
- Eliminating waste and improving DevOps operations in turn improves the flow of clients
Understanding the risks and value opportunities is the first step. To further gather support to get started, define manageable milestones to improve the group’s confidence and begin to alleviate concerns. It will also help to capture metrics throughout to prove the risk and burden are alleviated by improving the whole deployment process.
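As a rough illustration of capturing metrics, the sketch below computes two simple indicators from a list of deployment records: the share of deployments that caused an unplanned outage, and the average hands-on time per deployment. The record fields, function names, and sample values are all hypothetical and invented for this example; a real team would pull this data from its own change and incident tracking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRecord:
    """One deployment attempt. All field names are hypothetical."""
    client: str
    manual_minutes: int   # hands-on time spent by the team
    caused_outage: bool   # True if the deployment led to an unplanned outage


def change_failure_rate(records: list[DeploymentRecord]) -> float:
    """Fraction of deployments that caused an unplanned outage."""
    if not records:
        return 0.0
    return sum(r.caused_outage for r in records) / len(records)


def average_manual_minutes(records: list[DeploymentRecord]) -> float:
    """Mean hands-on minutes per deployment."""
    if not records:
        return 0.0
    return sum(r.manual_minutes for r in records) / len(records)


# Made-up sample data, for illustration only.
baseline = [
    DeploymentRecord("client-a", 90, True),
    DeploymentRecord("client-b", 75, False),
    DeploymentRecord("client-c", 120, False),
    DeploymentRecord("client-d", 95, True),
]

if __name__ == "__main__":
    print(f"failure rate: {change_failure_rate(baseline):.0%}")
    print(f"avg manual minutes: {average_manual_minutes(baseline):.1f}")
```

Recording the same two numbers before and after each milestone gives a simple way to show whether the risk and burden are actually shrinking.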
Defining Milestones
When beginning any large project, engineers need measurable and attainable goals to help track and report progress. Milestones present a clear picture of how the solution will be achieved. There will probably be acceptance criteria, and possibly pauses along the way, but every project has a start and a finish, often with some X to mark the spot.
Ask how it will be known that the project is complete. Instead of asking “When will this project be done?”, ask “When do we start seeing value?” Try focusing on the “what,” and not the “how,” when defining a to-do list. Get together with others and write down what must be done individually. Finally, define the exit criteria.
Below is a foundation for milestones to get started:
- Kickoff & Planning Capacity
- Current Process Assessment
- Architecture Design & Tool Selection
- Prototype
- Review, Refine & Determine MVP
- Test in Lower Environments
- Documented Knowledge Sharing
- Plan Rollout
- Promote to Production
- Validation & Optimization
It can start with simply wanting to know how to track progress and setting a goal with checkpoints along the way. Defining milestones like these will help determine a more accurate timeline or help provide reasons for uncertainty when presented with a project deadline.
Architecting a Solution
To move quickly with architecting solutions, collaborate in design board tools, and communicate regularly in shared chat channels. Standards may keep designs on rails, so keep designs simple first and iterate to a more refined document.
With short timelines, familiar tools may seem like the appropriate choice. Ultimately, the solution's requirements will offer a clear picture of what can be used. Transforming a manual deployment process can take months to years. Requirements like availability, reliability, and uptime are critical for ensuring scheduled automated deployments go smoothly.
Be flexible with decisions when designing a solution or picking tools. Even when there is a common understanding of the choices, there may be disagreements about the surrounding context, such as goals or the consequences of those choices. Tooling choices are better made with an understanding of the current process. Here is a sample set of stages of a deployment process to give an idea of what tools might be needed:
- Confirm change requests with clients 1 week before
- Collect and prepare feature changes for delivery and staging 1-3 days before
- Notify end users of a pending update window 15 minutes before
- Trigger the deployment process at the scheduled date/time
- Execute smoke tests after the deployment process completes
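The timing of the stages above can be sketched as a small scheduling helper that works backward from the deployment window. This is only an illustration: the stage labels, the `build_schedule` function, and the sample date are assumptions, and the "1-3 days" preparation stage is represented by its earliest bound of 3 days. Smoke tests are left out because they run when the deployment finishes rather than at a fixed time.

```python
from datetime import datetime, timedelta

# Lead time before the deployment window for each timed stage.
# The "1-3 days before" stage uses its earliest bound (3 days).
STAGE_LEAD_TIMES = [
    ("confirm change requests with clients", timedelta(weeks=1)),
    ("collect and prepare feature changes", timedelta(days=3)),
    ("notify end users of pending update window", timedelta(minutes=15)),
    ("trigger deployment", timedelta(0)),
]


def build_schedule(deploy_at: datetime) -> list[tuple[datetime, str]]:
    """Return (start_time, stage) pairs in chronological order."""
    schedule = [(deploy_at - lead, stage) for stage, lead in STAGE_LEAD_TIMES]
    return sorted(schedule)


if __name__ == "__main__":
    # Hypothetical off-hours deployment window at 2:00 AM.
    for when, stage in build_schedule(datetime(2030, 1, 15, 2, 0)):
        print(f"{when:%Y-%m-%d %H:%M}  {stage}")
```

Each entry in the resulting schedule hints at a tool that is needed: something to track approvals, something to stage artifacts, something to notify users, and a scheduler to trigger the deployment itself.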
A few expectations for automating deployment processes are to improve the stability of release preparation and to shore up any error-prone steps or common problem areas that existed previously. There may be multi-cloud integrations like:
- Cloud CI/CD services for building and releasing artifacts
- Cloud-managed serverless clusters ready to serve containerized app images
- Legacy solutions that run on old versions of Java or Ruby
- Databases backed up or replicated to various scalable services
- Approval gates of production change requests
- Email servers for notifying clients of scheduled maintenance
- Notification systems for alerting users in application sessions
- Monitoring policies for alerting DevOps if something goes wrong
- Metrics observability for KPI measurement
All of these can be in different cloud providers requiring secure credentials, but they are all critical to ensure success at each stage of deploying software.
Once agreement is reached on standards for tooling and on designs that capture the basic requirements, a risk assessment of the current situation, and expectations, it is time to estimate budgets and begin some prototypes.
Estimating Budgets
Budgets are not a concern for most engineers. It is those higher up in leadership who look for cost optimizations or forecast monthly to yearly spending. On top of estimating hosting and operational costs of the various multi-cloud workflows mentioned when architecting solutions, it is also important to account for costs such as:
- Tools and licenses not already acquired
- Rates for additional contributors or support channels
- Sandboxing or prototyping environments and resources
- Non-production environments and how they size up compared to production
- Ramp-up in costs for continued client adoption
Budgets may start high at first, due to the need to duplicate infrastructure while modernizing or to migrate from one cloud provider to another. Estimate on the higher side, even if confidence is high. Remember that managers and leaders can assist by providing insight into previous expenses.
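One way to make "estimate on the higher side" concrete is to sum the known line items and pad the total with a contingency percentage. The sketch below assumes a flat 20% contingency and uses invented monthly figures; the `CostItem` type, the `estimate_monthly_budget` function, and every number here are hypothetical and should be replaced with real quotes and past expenses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """A single budget line. Costs are hypothetical, in an arbitrary currency."""
    name: str
    monthly_cost: float


def estimate_monthly_budget(items: list[CostItem], contingency: float = 0.2) -> float:
    """Sum the line items and pad the total by a contingency fraction."""
    if contingency < 0:
        raise ValueError("contingency must be non-negative")
    subtotal = sum(item.monthly_cost for item in items)
    return round(subtotal * (1 + contingency), 2)


# Made-up line items mirroring the cost categories listed above.
items = [
    CostItem("tools and licenses", 400.0),
    CostItem("support channels", 250.0),
    CostItem("sandbox environment", 150.0),
    CostItem("non-production environments", 600.0),
]

if __name__ == "__main__":
    print(f"padded monthly estimate: {estimate_monthly_budget(items):.2f}")
```

The sketch does not model the ramp-up in costs from continued client adoption; that would need a per-client figure and an adoption forecast, both of which depend on the organization.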
Building Prototypes
Prototyping can happen alongside defining milestones. It can be conceptual and low-code at first. The goal of prototyping is to demonstrate key features and interactions, but they do not all have to be connected. It is about building confidence on top of the prior planning and design.
Prototyping new software workflows may be tricky depending on access levels within cloud providers with certain organization policies. While teams can separate duties, anticipate a few blockers on making progress, like having to wait for tertiary requests to be filled to either get more access or establish new connectivity between services.
In the example case, a new cloud database to store client ops metadata is kept private. Part of an initial design intends to improve data models and preload some of the data during prototyping to keep API development swift. A small VM could be created on the same private network, which engineers could connect to through secure shells. Cloud providers offer more secure alternatives like identity federation proxies to avoid extra costs of VMs, but it may be more complex to orchestrate. With prototyping, it is about choosing paths to minimize blockers, keeping the project within the limitations of the contributors.
The benefits of prototyping for automating deployment processes start to become apparent:
- Early visualization and actualization of the automation pipeline
- Pair programming sessions are beneficial
- Cadences for check-ins and quick internal demos for feedback, progressively becoming less manual
- Identifying and collaborating on risks
- Staying aligned with the objectives
- Estimates become more accurate
Continually Refining Risks
Roadblocks are inevitable in large projects. These blockers can extend timelines or change stakeholder decisions for continuing the development journey. Adding an initial buffer to milestones or timelines can help when capacity is impacted. In any case, be transparent with the project group when progress is impacted.
Common events that may cause unplanned blockers include holidays, PTO, and higher-priority planned features. Throughout the project lifecycle, be sure to review timelines in anticipation of such events. If an individual is feeling stuck, consider where other progress might be made in the project:
- Refine designs or documentation
- Collaborate, plan, and agree on alternate objectives (pick from “Nice-to-haves” instead of “Needs”)
- Break down tasks further to focus on smaller aspects
- Request feedback from stakeholders for the current state
- Seek out involvement from other teams
- Ask managers how else to contribute
For the new automated deployment process in the cloud to communicate with client servers in datacenters, networking channels must be opened. A quick solution would be to open more ports on each client’s firewall. With future opportunities to uplift the remaining client servers out of datacenters, such as replicating real-time databases or new cloud-native client solutions, a larger investment in establishing a direct tunnel from cloud to datacenter would be favored. For this example, if this requirement were discovered much later in the project timeline, the timeline would be extended.
Nearing Completion
The final stages of this automation project signify the need for meticulous attention to detail. It is a time of anticipation and readiness for the next steps. As teams approach completion, they focus on perfecting the project to meet the desired outcomes.
Monitoring strategies for analyzing performance bottlenecks and alerting mechanisms for off-hours schedules are key factors in automated processes. Looking back at the problem, the manual process is not scalable given a limited team size, and it is prone to errors when deployments are performed during off-hours. Consider how effectively teams would be able to respond to and resolve deployment issues (those heroics mentioned at the top of the article). These types of strategies include:
- Health checks of client servers in datacenters
- Connectivity tests for networking tunnels
- Smoke tests for basic functionality and core APIs
- Process logging analysis for errors, warnings, or unusual patterns
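Two of the checks above, connectivity tests and log analysis, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the function names are hypothetical, the log pattern assumes lines that contain the literal words ERROR, WARN, or WARNING, and a real setup would more likely rely on the monitoring tooling already chosen during architecture design.

```python
import re
import socket
from typing import Iterable

# Hypothetical pattern; real log formats vary by application.
PROBLEM_PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?)\b")


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Connectivity test: True if a TCP connection opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def find_problem_lines(log_lines: Iterable[str]) -> list[str]:
    """Log analysis: return lines that contain an error or warning marker."""
    return [line for line in log_lines if PROBLEM_PATTERN.search(line)]


if __name__ == "__main__":
    sample_log = [
        "INFO deployment started",
        "WARNING response time above threshold",
        "ERROR database unreachable",
        "INFO deployment finished",
    ]
    for line in find_problem_lines(sample_log):
        print(line)
```

Checks like these are most useful when they run automatically after each off-hours deployment and feed the alerting mechanism, so that a person is only woken up when something actually fails.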
Envisioning What Is Next
When contributions to projects are complete, it does not always mean those projects are finished. Before taking the final steps to roll out projects, it is good to gather and reflect on what has worked well and what has not. Evaluate the team dynamics, and ask those involved, “If this had to be done again, how would it be approached differently?” Rate the established KPIs to compare the new processes against the old.
Problems may start small and grow, affecting daily processes until a critical mass is reached. Issues can arise from staffing, siloed knowledge, or outdated architectures. Address problems with a clear starting point, focusing on business value and risks of the current process. Set achievable goals to track progress, defining clear milestones for the solution. Collaborate on designs, select tools, and calculate costs for tools, licenses, environments, and client adoption, estimating on the higher side initially. Prototype key features to build confidence, anticipate blockers, and align with project objectives. Expect roadblocks, refine designs, collaborate on alternate objectives, and review timelines to address unexpected events. Adapt to new requirements, shift project scopes, and identify areas for enhancement in processes and technologies.
The transformation journey described in this article may span several years. New challenges may arise to further accelerate these new ways of working across the organization. Business value drivers may continue to scale up, requiring more delivery and technology excellence, forming new opportunities which will reshape the team's operating models. Envisioning the next challenge may focus on the culture across the business. For this deployment automation case, the problem led to an opportunity to improve efficiency, reduce errors, enhance reliability, and increase speed in deploying software.
Edwards, Damon. “DevOps Kaizen: Find and Fix What Is Really Behind Your Problems.”