When the intuition points the wrong way (2/2)

People who are not security experts are often misled by their intuition and feelings.

In the first post on the subject, I explained how people often head in the wrong direction when they try to define their security needs. It usually takes a short working session before they understand their mistake, and even then, some remain reluctant to admit it because they know everyone else would make the same mistake they just did.

Here is another example that is even more common. In fact, across all the environments I have been in and all the people I have worked with, I have yet to meet a single person who knew about this one. This time, the need to express and define is availability. Here, people will do and pay a lot to protect against a low risk while doing little, if anything, to protect against a much higher one.

The consequences of downtime

There are three patterns that a loss of availability can follow. The first two are easy enough to understand: the exponential loss and the linear loss. Both start low, then grow and keep growing.

What puzzles everyone is the third pattern: the logarithmic loss. That one starts very high and decreases over time.

Whenever I present this, people always react the same way:

What? The loss decreases?! How is that possible? What is lost is lost. How can you reduce your losses by remaining down? You cannot change the past.

The best way to explain these three patterns is to start with this last one and illustrate each case with an example. A case of logarithmic loss would be an assembly line.

Imagine the assembly line working normally. The object moves from one station to the next, each one adding its little part. At the end of the line comes a completed, functional object, whatever the line is building. And then... BOOM! The line stops.

Even if it stops for only a few seconds, it is too late. Everything in progress must be removed and discarded. Paint and glue have started drying and must be cleaned off. Even after the restart, not a single item will come out until every station has completed its work once. For just a few seconds of downtime, the loss is very high.

But now that the line is stopped... Go check what is making that strange noise in station No 3. Now is the time to do a proper fix on station No 5 instead of the dirty patch we improvised last week. While at it, do the regular maintenance on station No 7 that was planned for next week.

A logarithmic loss is a case where a stop is so harmful that once it has happened, it turns into an opportunity. The same way a risk is a cost, a loss, an opportunity is a gain, a benefit. Should the line restart too quickly without seizing any opportunity, the only outcome of the incident will be the loss and nothing else. Should one take the time to seize these opportunities, the resulting benefits will help compensate for the loss, as in the sketch below.
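To make the shape concrete, here is a minimal sketch in Python of how such a loss could be modeled. Everything in it is hypothetical: the amounts and the logarithmic form are invented for illustration, not taken from any real assembly line.

```python
import math

def logarithmic_loss(hours_down, upfront_loss=50_000, gain_per_log_hour=8_000):
    """Hypothetical shape: the full loss is paid the moment the line stops
    (scrapped work in progress, cleaning, restart), and each extra hour of
    downtime is an opportunity (repairs, maintenance) that offsets part of it."""
    opportunity_gains = gain_per_log_hour * math.log1p(hours_down)
    return upfront_loss - opportunity_gains

for t in (0, 1, 4, 12, 24):
    print(f"{t:>2} h down -> net loss {logarithmic_loss(t):,.0f}")
```

The net loss starts at its maximum and shrinks the longer the stop lasts, which is exactly the counter-intuitive behavior people object to.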

A case of linear loss could be a hydro power dam, one without a reservoir. Should something prevent the dam from producing power, the loss will start at zero and grow at a regular (linear) rate. Water will flow by without producing energy, and once that water has passed, it will never come back.

In IT, most losses are exponential. These are cases where whatever cannot be done at a specific time will be buffered and re-processed later: clients will come back, queued tasks will run afterwards, and so on. At first, the loss is small and does not grow much. The thing is, patience grows thinner and thinner, and if you cannot re-process everything in time, the consequences increase drastically and very fast (exponentially).
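For comparison, here is the same kind of sketch for the other two patterns, again with invented numbers. The point is only the contrast in shape: the linear loss grows steadily, while the exponential one looks harmless at first and then explodes.

```python
import math

def linear_loss(hours_down, rate_per_hour=5_000):
    """Run-of-river dam: every hour of downtime costs the same; water
    that flowed by without producing energy never comes back."""
    return rate_per_hour * hours_down

def exponential_loss(hours_down, scale=1_000, impatience=0.4):
    """Typical IT outage: work buffers up and can be re-processed, so the
    early loss is tiny, but patience wears thin and the cost of the
    growing backlog compounds."""
    return scale * (math.exp(impatience * hours_down) - 1)

for t in (1, 4, 12, 24):
    print(f"{t:>2} h down: linear {linear_loss(t):>9,.0f}   "
          f"exponential {exponential_loss(t):>12,.0f}")
```

With these made-up coefficients, the exponential loss is far below the linear one for the first few hours, then dwarfs it by the end of the day.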

This model is essential to understand because, by instinct, people do the exact opposite of what they should.

What people do wrong when trying to do high availability

When they decide to do HA, people start by deploying solutions that will prevent an asset from going down. Worse, they often stop working on the case once they are done with that first HA solution. Unfortunately, that is the proper plan for a logarithmic loss, not an exponential one.

Should one face a logarithmic loss, this logic is right: do all you can to avoid stops, but once down, there is no need to worry about recovering fast. The thing is, that case is almost nonexistent in IT. It is quite the opposite.

Because losses are exponential, going down is not great, but it is not that bad either. What is critical is to ensure recovery before the exponential skyrockets into high losses. As such, the focus must be on recovery much more than on preventing the stop in the first place.

When doing HA the wrong way, one pays for a solution that only helps prevent a small loss, the one at the beginning of the exponential curve. On top of that rather small benefit, the environment remains exposed to a high risk because recovery is not ensured when needed. So a high security cost added to a high risk cost. Not great at all.

Instead, efforts should go into ensuring that recovery will complete in time and to the best state possible, before an incident grows out of control. That limits the cost of the solution while reducing the cost of the higher risks. If one is still not satisfied after that, then it is time to look at solutions that prevent downtime in the first place.
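As a rough back-of-the-envelope illustration of that trade-off, one can plug entirely made-up incident rates, recovery times and solution prices into the exponential model sketched earlier. None of these numbers come from a real environment; they only show why cutting the downtime short dominates the outcome.

```python
import math

def exponential_loss(hours_down, scale=1_000, impatience=0.4):
    return scale * (math.exp(impatience * hours_down) - 1)

def yearly_cost(incidents_per_year, hours_to_recover, solution_cost):
    """Expected yearly cost = price of the HA solution + expected outage losses."""
    return solution_cost + incidents_per_year * exponential_loss(hours_to_recover)

# Prevention-first: an expensive clustering setup keeps incidents rare,
# but nobody worked on recovery, so the one incident lasts a full day.
prevention_first = yearly_cost(incidents_per_year=1, hours_to_recover=24,
                               solution_cost=200_000)

# Recovery-first: cheaper tooling (tested restores, rehearsed runbooks)
# accepts more incidents but cuts each one short, before the curve takes off.
recovery_first = yearly_cost(incidents_per_year=2, hours_to_recover=2,
                             solution_cost=50_000)

print(f"prevention-first: {prevention_first:>12,.0f}")
print(f"recovery-first:   {recovery_first:>12,.0f}")
```

Under these assumptions, the recovery-first strategy comes out orders of magnitude cheaper, simply because it exits the exponential curve before it takes off.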

Again, explaining and enforcing such practices is the role of a senior security advisor. When the needs are properly defined and the solutions are aligned with them, security costs are kept to a minimum while risks are also reduced to a minimum. That is the proper way of doing security.