IT experts worldwide, whether they know it or not, owe a great deal to the aviation industry and the vast body of research accumulated in a domain where failure is not an option. The following article is by no means meant to be an exhaustive summary (the sky is the limit!), but showcases a couple of notable historic moments where IT and aeronautics concerns interlock. These can be used as a handy metaphor or thinking tool to help tackle the problems we all regularly run into as software architects, business analysts or project managers.
So here goes!
The control stick – an argument for agile
Based on the great article by Roger Sessions: link
There have been many great arguments voiced in support of an iterative, agile approach to software development. The one that resonates with me the strongest comes from a time when IT as we know it didn’t yet exist.
Enter Air Force Colonel John Boyd, a master aircraft designer and one of the best dogfighters in military history. Operating a fighter jet in one-on-one combat is an extremely complex task involving the analysis and evaluation of data coming at you from a variety of sources (not to mention the obvious influx of adrenaline associated with someone else trying to shoot you down). Boyd managed to summarize this in a simple mental model that you might know as OODA. Basically, a fighter pilot executes the following four stages in rapid “adaptive” cycles:
- He Observes his surroundings (including the cockpit gauges and the world outside)
- He Orients himself with regard to this state
- He Decides on the next course of action
- He Acts on that decision
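The four stages can be sketched as a tight control loop. The following is a minimal, purely illustrative Python sketch – the observe/orient/decide/act callbacks and the sensor readings are invented placeholders, not anything from Boyd’s actual work:

```python
def ooda_loop(observe, orient, decide, act, cycles=3):
    """Run the OODA cycle repeatedly; the shorter each pass, the faster we adapt."""
    for _ in range(cycles):
        state = observe()          # Observe: gather raw data (gauges, the world outside)
        model = orient(state)      # Orient: interpret the data against experience
        decision = decide(model)   # Decide: pick the next course of action
        act(decision)              # Act: execute, changing the environment

# Toy usage: a "pilot" reacting to three successive sensor readings.
readings = iter([10, 50, 90])
actions = []
ooda_loop(
    observe=lambda: next(readings),
    orient=lambda r: "danger" if r > 60 else "safe",
    decide=lambda m: "evade" if m == "danger" else "hold",
    act=actions.append,
)
print(actions)
```

The key variable, as Boyd found, is not how clever any single stage is but how quickly the whole loop closes.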
But what good does this do us? Especially from an IT standpoint? The part where it becomes truly interesting is when we home in on a very special case study from the Korean War: the F-86 battling the MiG-15. The F-86 would win dogfights against the MiG-15 roughly 90% of the time. The problem both sides could not get their heads around was obvious: the MiG-15, at least on paper, was undoubtedly the superior aircraft. It was much faster, more nimble and provided better visibility, yet it lost in a consistent manner. Boyd needed to pin down the reason for this and use it to the US military’s advantage in the long run.
After many hours of in-depth research it turned out that the reason was trivial: the F-86 was equipped with a hydraulic control stick, whilst the MiG-15 had a manual one. Because of this, MiG-15 pilots participating in a dogfight became increasingly fatigued and with time took longer and longer to complete new maneuvers. Maybe they could OODA better, but from a certain point the F-86 could OODA faster*. Cycle time was the key.
Since the 1950s OODA has come out of the military barracks and earned itself a place among the most popular business strategy tools. It offers empirical evidence for favoring business agility over business perfection on any given occasion. The same can be attributed to software development, where shortening the feedback loop by means of “agile” is basically the only way not to get outmaneuvered in today’s fast-paced, IT-propelled economy. Sadly, it’s still not unusual to see giant, “big bang”-based transformation projects crashing down like the old MiG-15 to the unwarranted amazement of everyone on board.
* Which does not imply you should act as fast as possible – your decision might be not to act until the “last responsible moment”, once you have sufficient information to actually ACT ON.
The F-16 fighter jet – beware of disguised requirements
Requirements gathering for a software system is usually just half of the story. Requirements validation is where the actual fun starts – we need to ensure that each item is specific, realistic and testable (among other things). I bet at some point most of you have had to tackle requirements for a system to be “fast” or “user friendly”. While you can filter out these kinds of requirements fairly quickly, there is a class of problems that are far more insidious and difficult to spot – I’m talking about “solutions disguised as requirements”. Again, the history of aviation provides us with a classic anecdote to back this up.
Harry Hillaker and his team of engineers tasked with designing the F-16 Falcon had a really tough nut to crack. Besides being under a lot of pressure from their superiors, they basically had to deal with what seemed to be a contradictory set of requirements: they were to create a plane that was cheap and lightweight, while being able to reach speeds of up to Mach 2.5. Based on basic physics alone, this was virtually impossible to pull off. Luckily, the first thing the team did was to ask the Air Force one fundamental question: “why?”
Question: Why does the jet need to fly at Mach 2-2.5?
Answer: In order to easily escape from combat.
With the root of the problem fleshed out, the team could propose a much cheaper alternative that would still satisfy all critical needs, basically focusing on the F-16’s acceleration and agility instead of max speed. Their inquisitiveness ultimately prevented millions of dollars going down the drain on needless R&D efforts.
It’s pretty obvious how this all ties back to the requirements engineering discipline in modern-day IT. Next time the business folk ask for an application to be “written in Angular”, be ready to apply the 5 whys technique known from lean manufacturing – usually it takes no more than 5 questions to get to the root problem behind every requirement.
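The 5 whys drill-down can even be sketched as a tiny loop. The Python sketch below is purely illustrative – the interview answers are made up to mirror the F-16 anecdote, not taken from any real elicitation session:

```python
def five_whys(stated_requirement, answer_why, max_depth=5):
    """Drill down from a stated requirement to the underlying need by
    repeatedly asking "why?". `answer_why` maps a statement to the reason
    behind it, or returns None once we have hit the root need."""
    chain = [stated_requirement]
    for _ in range(max_depth):
        reason = answer_why(chain[-1])
        if reason is None:
            break
        chain.append(reason)
    return chain

# Hypothetical interview, mirroring the F-16 anecdote:
answers = {
    "The jet must reach Mach 2-2.5": "It must escape from combat easily",
    "It must escape from combat easily": "Survivability is the real goal",
}
print(five_whys("The jet must reach Mach 2-2.5", answers.get))
```

In practice `answer_why` is a human conversation, of course – the point is that each answer becomes the subject of the next “why?” until the chain bottoms out.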
The perfect cockpit – there is no “one size fits all” architecture
Heavily borrowed from the book “Microservice Architecture” by Nadareishvili, Mitra, McLarty and Amundsen
“85% of Statistics are False or Misleading”
– World Science Festival 2010
As software architects we often fall into the trap of crafting “one size fits all” designs that we are confident will satisfy the needs of all potential customers. This is especially the case when we’ve just jumped on the bandwagon of an emerging technology or architecture style. A lesson the US Air Force learned the hard way was: the prototypical average customer you design for…does not exist.
Back in 1926, when the army designed its first cockpit, it used standard dimensions based on an average derived from the physical measurements of hundreds of male pilots. In 1950 an inquiry into the causes of an increasing number of pilot errors led to the notion that pilots had gotten bigger since 1926 and the cockpit design needed to be refactored.
Lt. Gilbert S. Daniels, who had majored in physical anthropology, was assigned the gargantuan task of measuring over 4,000 pilots with respect to 140 dimensions. Based on his former research at Harvard, Daniels harbored doubts about “averages”, which inspired him to go beyond the task at hand. He wanted to find out how many of the 4,063 pilots actually were “average”. He selected 10 core dimensions and applied a generous rule: a pilot qualified as average on a given dimension if he fell within the middle 30% of the range for that dimension. An “average pilot” would need to score 10 out of 10, meaning he would measure within the appropriate range on all 10 dimensions. Once Daniels finished crunching the numbers, the result was surprising even to him: not a single one of the 4,063 pilots was average on all 10 dimensions.
Now please consider as an additional caveat that all pilots had already been pre-selected because they appeared to be average-sized. Wow.
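Daniels’ experiment is easy to replay as a toy simulation. The sketch below makes a simplifying assumption purely for illustration – that the 10 dimensions are independent and uniformly distributed – under which the chance of any one pilot being average on all 10 is 0.3^10, roughly 6 in a million:

```python
import random

random.seed(42)

def count_average(n_pilots=4063, n_dims=10, band=0.30):
    """Count pilots who fall within the middle `band` of the population
    on ALL `n_dims` dimensions (dimensions modeled as independent uniforms)."""
    lo, hi = (1 - band) / 2, (1 + band) / 2   # the middle 30% of a uniform [0, 1)
    hits = 0
    for _ in range(n_pilots):
        if all(lo <= random.random() < hi for _ in range(n_dims)):
            hits += 1
    return hits

# Expected count: 4063 * 0.3**10 ≈ 0.02 – almost surely zero "average" pilots.
print(count_average())
```

Real body measurements are correlated, so the true odds were better than this toy model suggests – and the result was still zero.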
This brings us to the jaggedness principle, as coined by Todd Rose: if we design as though the average person is average in every measure, we will ultimately design a system that caters for no one. How does this “flaw of averages” translate into IT, you might ask? Well, on an architecture level this poses an argument against one-size-fits-all frameworks in favor of detailed Domain-Driven Design (DDD) conducted with each individual customer. It teaches us to exercise caution before we shove a well-marketed COTS package down the throats of our baffled IT folk. In terms of basic requirements engineering, it persuades us to avoid mean values and use buckets and percentiles instead (“a request should be processed below 50 ms 90% of the time”). Above all, it teaches us about the variety of our surroundings and the need for continuous adaptation.
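The “buckets and percentiles” advice is easy to demonstrate with made-up latency samples. In the sketch below (the numbers and the 50 ms budget are hypothetical), the mean looks comfortably within budget while the 90th percentile tells the real story:

```python
import math
import statistics

# Hypothetical latency samples (ms): most requests are fast, a few hit a slow path.
latencies = [10] * 17 + [200] * 3

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least p%
    of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

mean = statistics.mean(latencies)   # 38.5 ms -- looks comfortably under 50 ms
p90 = percentile(latencies, 90)     # 200 ms  -- 1 request in 10 is 4x over budget
print(mean, p90)
```

The same service “passes” on average and fails the percentile-based SLO – which is exactly why the latter belongs in the requirement.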
Postmortem and just culture – battle the blame game
By now many of you must have heard the term “postmortem” in the context of reliability engineering, as adapted by leading technology companies (from Google through Etsy to Spotify). According to the definition maintained by Google SREs, a postmortem is “a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring”. While the practice in itself is worthwhile and sound, its true power emerges when coupled with the principle of “blameless culture”. In a climate of finger pointing and “cover your ass” mentality a postmortem would rarely focus on addressing the incident’s root cause and would instead become a tool for shifting blame around the teams and individuals involved. Constructive postmortems create a framework for strengthening the entire organization and approaching problems as an opportunity to learn, not sweep things under the rug or construct cover stories on top of facts. The basic message is: “you didn’t crash the system, you uncovered its flaw”.
Google has developed a whole range of techniques for easing postmortems into the corporate environment, including peer reviews, “postmortems of the month”, “postmortem reading clubs” or “wheel of misfortune” role-playing games. Etsy, on the other hand, annually gives out a “3-armed sweater” award to the engineer who committed the most surprising error and fosters a culture of active postmortem mailing lists (so-called PSAs).
Making the team involved accountable for preparing a postmortem sure beats being yelled at by your manager. The latter is most often the result of a “Theory X” style of management and of relentlessly falling into the trap of the “fundamental attribution error”:
The fundamental attribution error (…) is the claim that in contrast to interpretations of their own behavior, people place undue emphasis on internal characteristics of the agent (character or intention), rather than external factors, in explaining other people’s behavior. [wiki]
Unsurprisingly, blameless culture, to a significant extent, originated from the aviation industry. Since the stakes here are extremely high (human safety), it becomes obvious that covering up faults in order to avoid blame can have tragic consequences – much graver than in the case of a random IT system malfunction. Assuming you can’t completely engineer failure out of a system, human errors (gross negligence and criminal acts aside) are seen as opportunities to improve safety while the “blast radius” is still small. A few examples proving these statements are not groundless:
- The Aviation Safety Reporting System (run by NASA) grants limited immunity from enforcement action to any flight crew that reports a hazardous incident.
- From the ICAO Safety Management Manual: (…) The State has established an independent accident and incident investigation process, the sole objective of which is the prevention of accidents and incidents, and not the apportioning of blame or liability. Such investigations are in support of the management of safety in the State.
- From the EUROCONTROL webpage: The EU not only formally enacted the concept of a just culture as part of EU law, with the introduction of Regulation (EU) No 691/2010, but it also introduced elements of this culture in Regulation (EU) No 996/2010 governing air accident and incident investigation, which also addresses the need to achieve a balance between the objectives of the judiciary in determining whether criminal intent was involved, and the need of the aviation industry to be able to run a real-time self-diagnostic system without unnecessary interference from the justice system.
The flight of AF 447 – overreliance on automation
Inspired by the fantastic podcast by 99% Invisible, which can be found here: children of magenta, pt. 1
The last example strikes a more somber note, as it shows that even with the best intentions and safety controls the airline industry cannot fully avoid tragic mistakes. Furthermore, the safety mechanisms put in place to minimize human error can effectively backfire, if we fall into the trap of over-relying on them. This was the case for the transatlantic Air France Flight 447 scheduled from Rio de Janeiro to Paris on the 31st of May 2009.
But let’s start from the very beginning. The first rudimentary “auto-pilot” was invented in 1912, allowing the plane to fly straight and level without requiring human action. In the 1950s autopilots could already be instructed to fly along a defined route. By the 1970s a complete paradigm shift was already underway. As per “99% Invisible”:
(…) even complex electrical systems and hydraulic systems were automated, and studies were showing that most accidents were caused not by mechanical error, but by human error. These findings prompted the French company Airbus to develop safer planes that used even more advanced automation. (…) Airbus set out to design what they hoped would be the safest plane yet—a plane that even the worst pilots could fly with ease. Bernard Ziegler, senior vice president for engineering at Airbus, famously said that he was building an airplane that even his concierge would be able to fly.
As a result, Airbus fitted its planes with a “fly-by-wire” (FBW) system that went a step beyond the autopilot: sensors would send signals to the flight computers, which would in turn stabilize the aircraft and block operations outside the plane’s so-called “flight envelope”. Basically, this would intercept any unsafe human action and prevent pilots from accidentally entering an aerodynamic stall. Unlike the FBW system fitted on Boeing airplanes, the one on the Airbus could not be turned off manually. It could, however, turn itself off in case of an error. Unfortunately, this is exactly what happened on the tragic flight of AF 447.
According to Wikipedia, which summarizes BEA’s report on the crash:
(…) temporary inconsistencies between the airspeed measurements – likely due to the aircraft’s pitot tubes being obstructed by ice crystals – caused the autopilot to disconnect, after which the crew reacted incorrectly and ultimately caused the aircraft to enter an aerodynamic stall from which it did not recover
The “fly-by-wire” system effectively disengaged and no longer offered protection against aerodynamic stall, leaving the pilots confused and unprepared for the scenario. Der Spiegel vividly describes the situation on board:
One alarm after another lit up the cockpit monitors. One after another, the autopilot, the automatic engine control system, and the flight computers shut themselves off.
Since, according to the BEA report, the crew lacked practical training in manually handling the aircraft both at high altitude and in the event of anomalies, it took no more than 5 minutes for the plane to crash into the Atlantic, along with the crew and all passengers.
This tragic accident sends a clear message to exercise caution when implementing any kind of complex automation:
- Beware of hubris – no matter how smart your team is and how complex and fail-safe your design is, your systems will fail. This is the unavoidable truth. Working in IT gives us the liberty to reinforce this notion within our organization by introducing errors in a deliberate manner. Think the Simian Army. Josh Evans at Netflix equates this to a vaccine, where antibodies are injected into the organism in order to help it fight off real threats in the future. While Netflix relies on applying randomness to the infrastructure, Google on the other hand conducts annual, multi-day DiRT (Disaster Recovery Testing) exercises – the objective is to ensure company-wide business continuity not only in case of an unforeseen critical system loss but manpower loss as well.
- Don’t let your skills erode – The AF 447 tragedy illustrates the sad paradox outlined by author and aviator William Langewiesche: “We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation”. From an IT standpoint you should prevent your ops from falling out of touch with how the underlying system actually works and how it behaves under stress. Google SRE culture addresses this through a set of practices like: reverse engineering classes, breaking real production servers, disaster role-playing and on-call rotation. I for one firmly believe that, at least at the time of writing, a human working side by side with a computer will beat both a human and a computer working alone (see this article for proof, especially if you’re into chess). This is why we can’t remove the human from the equation just yet – or perhaps ever.
- Avoid cognitive capture – Charles Duhigg calls this scenario out as an example of “cognitive capture” – a phenomenon in which we fall into the trap of focusing on sensors and instrumentation instead of our immediate surroundings. How many times, while driving a car, have you caught yourself focusing on your speedometer and GPS instead of the road and traffic signs? The same goes for IT operations. You should always remember that your monitoring infrastructure is as fallible as the systems it is monitoring. This is why you should always be able to access your system directly and analyze the data right at the source.
- Remove “alert noise” – over the years a plethora of cutting-edge, pluggable monitoring software has emerged in the IT space, allowing us to trigger multi-channel alerts in response to complex event chains. Unfortunately, little or no thought has been devoted to the notion of managing alerts themselves, and we are just now starting to play catch-up (see alerta.io as an example). While a lack of alerting infrastructure is a recipe for disaster, the other end of the spectrum, “alert spam”, may render your existing infrastructure virtually useless. Hundreds of transient alerts flashing on your dashboard and a giant bulk of e-mail notifications piling up in your inbox (the place alerts usually “go to die”) will most likely cause a really significant alert to go unnoticed. In light of this it is highly recommended to:
- perform sanity checks of your monitoring setup
- provide a tool to de-duplicate, prioritize and fan out alerts in real-time
- fine-tune your alerting rules each time you miss a vital piece of information due to “alert noise” – usually in response to a postmortem exercise
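De-duplication and prioritization can be sketched in a few lines. The Python below is a hypothetical, minimal illustration (the alert fields and severity levels are invented; real tools such as alerta.io do far more):

```python
SEVERITY = {"critical": 0, "warning": 1, "info": 2}

def dedupe_and_prioritize(alerts):
    """Collapse repeated alerts (same source + message) into a single entry
    with a count, then order the survivors so critical items surface first."""
    merged = {}  # dicts preserve insertion order in Python 3.7+
    for alert in alerts:
        key = (alert["source"], alert["message"])
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**alert, "count": 1}
    return sorted(merged.values(), key=lambda a: SEVERITY[a["severity"]])

# Hypothetical noisy stream: one repeated warning drowning out a critical alert.
noise = [
    {"source": "db1", "message": "disk 90% full", "severity": "warning"},
    {"source": "db1", "message": "disk 90% full", "severity": "warning"},
    {"source": "api", "message": "5xx rate spike", "severity": "critical"},
    {"source": "db1", "message": "disk 90% full", "severity": "warning"},
]

for alert in dedupe_and_prioritize(noise):
    print(f'[{alert["severity"]}] {alert["source"]}: {alert["message"]} (x{alert["count"]})')
```

Four raw events collapse into two lines, with the critical one on top – the property you want your dashboard and inbox to have.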
I really hope you enjoyed this article. Don’t forget to leave a comment.