The year 2022 saw an end to pandemic border closures as the world slowly started to open up again. Travel resumed, events picked up momentum and the volume of Zoom calls dipped ever so slightly.
Although the pandemic wreaked havoc across many industries, it did provide an opportunity for savvy organisations to build up their digital muscle and expedite projects that may not have seen the light of day had it just been business as usual over the past few years.
While it is great that digital projects have gotten off the ground, I have personally observed that there is room for improvement in terms of engineering execution.
Below are four key predictions I see gaining momentum in 2023 and beyond.
Full lifecycle observability: Insights in context
Full lifecycle observability is about two things: Fixing problems and getting ahead of them. By embedding observability earlier in the software lifecycle, engineers can fix performance problems that they ordinarily would not recognise until the code is running in production. It sounds simple enough, but few organisations have achieved it.
Traditionally, developers write code and ship it, then someone else (like a site reliability or quality assurance engineer) is tasked with running it and fixing performance issues. With full lifecycle observability, engineers are responsible for the performance, quality and reliability of their code because they have all the telemetry data in front of them, in context.
They can understand how their code will behave in production because the data is no longer kept in a silo within a separate team or tool. They have end-to-end observability across all their environments so they are able to make informed decisions and catch problems before they reach production. Think about full lifecycle observability as not just observing production but observing the development, test, and staging environments through the one unified lens, accessible to all teams.
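The "one unified lens" idea can be sketched in a few lines: every piece of telemetry carries an environment tag, so data from development, staging and production lands in the same store and can be queried together. This is an illustrative sketch, not a real vendor API; the names (`record_span`, `TELEMETRY_STORE`, `slowest`) and figures are invented for the example.

```python
# A minimal sketch of one unified telemetry lens: every span carries a
# deployment environment attribute, so dev, staging and production data
# share one store and can be compared side by side.
# All names and numbers here are illustrative, not a real vendor API.

TELEMETRY_STORE = []  # stand-in for a shared observability backend


def record_span(name, duration_ms, environment):
    """Record one span, tagged with the environment it came from."""
    span = {
        "name": name,
        "duration_ms": duration_ms,
        "deployment.environment": environment,
    }
    TELEMETRY_STORE.append(span)
    return span


def slowest(name, environment=None):
    """Query the same store across one environment, or all of them."""
    spans = [
        s for s in TELEMETRY_STORE
        if s["name"] == name
        and (environment is None or s["deployment.environment"] == environment)
    ]
    return max(spans, key=lambda s: s["duration_ms"]) if spans else None


# The same checkout span is recorded from three environments...
record_span("checkout", 120, "development")
record_span("checkout", 480, "staging")
record_span("checkout", 135, "production")

# ...so an engineer can spot the staging regression before it ships.
print(slowest("checkout")["deployment.environment"])  # staging
```

Because the store is shared rather than siloed per team, the same query answers questions about any environment, which is the essence of catching problems before they reach production.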
FinOps: Cost as a golden signal
Given the current macroeconomic environment, cost optimisation — right down to the cost per transaction — is going to be more and more important for organisations. If a business is doing thousands of transactions per minute, what is the financial flow-on effect of one small tweak? Is it positive or negative? Today’s major decisions are based on the traditional golden signals of latency, throughput and errors but what is missing is cost.
Companies are starting to spin up FinOps departments, which are responsible for ensuring that the budget is being consumed appropriately. However, in my opinion, that is a step backwards unless the data is shared. We cannot tell engineers that they need to be responsible for their code right up to deployment unless they know the cost implications.
By shifting left and enlisting cost as a golden signal, engineers have a cost dimension to consider when planning what to work on and when shipping code. But cost cannot exist in a vacuum; like every other golden signal, it needs context. Cost alone does not tell you much: it tells you what you are spending on, but not what that spending supports.
To understand the weight of the cost, you need a third dimension: The relationship that cost has to the business and the importance of the business function that it is supporting.
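As a rough sketch of that third dimension, raw cost per transaction can be weighted by how important the supported business function is. The function names, weights and dollar figures below are invented purely for illustration; real FinOps tooling would draw these from billing and business data.

```python
# A hedged sketch of cost as a golden signal with business context.
# All figures and the criticality weights are invented for illustration.

def cost_per_transaction(total_cost, transactions):
    """Raw unit cost: spend divided by transaction volume."""
    return total_cost / transactions


def weighted_cost(total_cost, transactions, business_criticality):
    """Scale raw unit cost by the importance of the business function it
    supports, so a cheap-but-critical service is not starved and an
    expensive low-value one stands out."""
    return cost_per_transaction(total_cost, transactions) / business_criticality


# Two services with identical spend and volume but different business weight:
checkout = weighted_cost(total_cost=900.0, transactions=60_000,
                         business_criticality=1.0)   # revenue-critical
reporting = weighted_cost(total_cost=900.0, transactions=60_000,
                          business_criticality=0.25)  # internal, low value

print(round(checkout, 4), round(reporting, 4))  # 0.015 0.06
```

Both services cost the same per transaction, but once the business dimension is applied, the internal reporting service is four times as expensive for the value it delivers, which is the kind of signal an engineer can act on.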
This is how Amazon Web Services operates. They have armies of data scientists that look at all these different facets and experiment by tweaking certain levers and analysing the potential cost savings. They then complete a sensitivity analysis to assess the probability of those outcomes unfolding.
Security: The second missing golden signal
Security is another golden signal that is currently largely absent from the start of the development process. Most vendors have built security mechanisms for security professionals, not developers, so engineers are conditioned to outsource the responsibility of secure code to the security team and expect them to catch any vulnerabilities.
If, instead, we provided sufficient signals to engineers, backed by controls and policies, engineers would not, for example, be able to merge code unless security thresholds were met. That makes security a forethought in the development process, rather than an afterthought.
Some might argue that these additional steps would be overbearing for engineering teams, but ultimately, if the code does not meet security or even cost thresholds, the work is merely deferred: instead of fixing issues upfront, engineers end up fixing them after deployment. These kinds of security controls would therefore streamline software development, because without such safeguards, the issues still have to be fixed. It just becomes a longer, less efficient process with more risk while internal politics play out.
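A merge gate of this kind can be sketched simply: the build collects vulnerability findings and the merge is allowed only if they sit under agreed limits. The severity names and thresholds below are assumptions for illustration; real pipelines would wire this into a scanner and branch protection.

```python
# A minimal sketch of security as a merge gate: findings from a scan are
# checked against agreed per-severity limits before a merge is allowed.
# Severity names and thresholds are assumptions for illustration only.

THRESHOLDS = {"critical": 0, "high": 0, "medium": 5}


def can_merge(findings):
    """Return (allowed, reasons). findings maps severity -> count."""
    reasons = []
    for severity, limit in THRESHOLDS.items():
        if findings.get(severity, 0) > limit:
            reasons.append(f"{severity}: {findings[severity]} > allowed {limit}")
    return (not reasons, reasons)


# One high-severity finding is enough to block the merge:
ok, why = can_merge({"critical": 0, "high": 1, "medium": 2})
print(ok, why)  # False ['high: 1 > allowed 0']
```

The point is that the engineer sees the blocking reason at merge time, with the finding in front of them, rather than hearing about it from a security team weeks later.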
Preventative observability: Catching a falling knife
Crisp, or critical path analysis of large-scale microservices architectures, is a concept that was born within Uber to address minor performance degradations that were very difficult to pinpoint using existing technologies.
To address it, Uber developed a tool that uses machine learning to identify latent performance issues in very complex architectures. For example, a transaction might touch 1,000 different services, a handful of which show a minor deviation in their performance. If those services are on the critical path, end-to-end performance can be significantly impacted even though each service looks fine in isolation. Uber open-sourced the result as Critical Path Analysis, or Crisp.
While Crisp sounds like the holy grail, in reality, it is probably years away from being introduced by commercial vendors. However, some level of machine intelligence that reduces the need for human capital is available and I would argue, essential.
To prevent problems, engineers need to identify small deviations before they turn into much larger problems. This requires leading indicators that are hard for humans to see just by looking at a chart, or expected to be picked up by regular anomaly detection methods. By introducing technology like Crisp with specialised algorithms and machine learning, we can reduce the human capital required to detect and remediate latent performance issues in large-scale microservices architectures.
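The core idea can be illustrated with a toy deviation detector: compare each service's latest latency against its own recent baseline and flag even small statistical drifts, but only for services on the transaction's critical path. This is not how Crisp itself works internally; it is a simplified sketch, and all service names and latencies are invented.

```python
import statistics

# A toy sketch of preventative detection: flag small latency drifts on the
# critical path before they compound into user-visible problems.
# Not Crisp itself; service names, latencies and the threshold are invented.


def deviation_score(baseline, latest):
    """Z-score of the latest latency against the service's own baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return (latest - mean) / stdev if stdev else 0.0


def flag_latent_issues(services, z_threshold=2.0):
    """services: name -> (baseline latencies ms, latest ms, on_critical_path).
    Only services on the critical path are flagged, since those are the
    deviations that degrade the end-to-end transaction."""
    return [
        name
        for name, (baseline, latest, on_path) in services.items()
        if on_path and deviation_score(baseline, latest) > z_threshold
    ]


services = {
    # Each service looks fine in isolation; "payments" drifts by under a
    # millisecond, but it sits on the critical path.
    "payments": ([10.0, 10.2, 9.9, 10.1, 10.0], 10.8, True),
    "search":   ([35.0, 36.0, 34.5, 35.5, 35.2], 36.0, False),
}
print(flag_latent_issues(services))  # ['payments']
```

A sub-millisecond drift like this would never trip a conventional alert threshold, which is exactly why machine assistance is needed to surface it while it is still a leading indicator rather than an outage.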
Peter Marelas is the chief architect for APJ at New Relic