5.4 Summarize risk management processes and concepts

  • Risk Types
    • External
    • Internal
    • Legacy Systems
    • Multiparty
    • IP Theft
    • Software Compliance / Licensing
  • Risk Management Strategies
    • Acceptance
    • Avoidance
    • Transference
      • Cybersecurity Insurance
    • Mitigation
  • Risk Analysis
    • Risk Register
    • Risk Matrix / Heat Map
    • Risk Control Assessment
    • Risk Control Self Assessment
    • Risk Awareness
    • Inherent Risk
    • Residual Risk
    • Control Risk
    • Risk Appetite
    • Regulations that Affect Risk Posture
    • Risk Assessment Types
      • Qualitative
      • Quantitative
    • Likelihood of Occurrence
    • Impact
    • Asset Value
    • Single-Loss Expectancy (SLE)
    • Annualized Loss Expectancy (ALE)
    • Annualized Rate of Occurrence
  • Disasters
    • Environmental
    • Person-Made
    • Internal vs External
  • Business Impact Analysis
    • Recovery Time Objective (RTO)
    • Recovery Point Objective (RPO)
    • Mean Time to Repair (MTTR)
    • Mean Time Between Failures (MTBF)
    • Functional Recovery Plans
    • Single Point of Failure
    • Disaster Recovery Plan (DRP)
    • Mission Essential Functions
    • Identification of Critical Systems
    • Site Risk Assessment


Now we are going to look at sources of risk and what we can do about them.  There are two key definitions that we need to understand.  The first is the likelihood.  Likelihood is the percent chance that a risk will happen.  The second is the impact.  Impact is the harm that the risk will cause.

People assume that all risks are bad, but in the project management world, a risk can be a good thing or a bad thing.

Risk Types

We can categorize risks into some common areas

  • External.  An external risk is one that the organization can’t control, and likely one that cannot be easily predicted.  They include COVID-19, natural disasters, and economic collapse.

  • Internal.  An internal risk is one that the organization can control.  They include employee fraud, safety hazards, and product defects.

  • Legacy Systems.  A Legacy System is a computer system or application that is critical to the organization but that cannot be upgraded. 

    Many companies run applications written in old programming languages.  There are few people who understand the languages enough to make changes to these applications.  These applications might not be able to run on newer hardware. 

    The three main risks are that

    • We can’t protect the application from security breaches

    • The hardware that the application runs on could fail, and we don’t have replacement hardware that is compatible with the application

    • We can’t make changes to the application because nobody knows how

  • Multiparty.  A multiparty risk is one that affects multiple organizations.  Ransomware is a common multiparty risk.  It might infect computers in one organization and then spread to computers at other organizations.

  • IP Theft.  The theft of intellectual property is a risk for all businesses.  There are four types of intellectual property

    • Copyright.  A copyright is a right to produce a work like a book, a movie, a video, a song, etc.  A copyright starts when you create the work, and you don’t have to register the copyright to own it.  The term depends on the country; in the United States, a copyright generally expires seventy years after the author dies.

      If you are in the media business or software business, then copyrights are important to you because you make money selling copies of your product. 

      People will steal your copyrighted works either because they don’t want to pay for the product or because they want to resell it without your permission.

    • Patent.  A patent is a right to sell an invention.  You can patent an invention, a business process, or a drug.  You must apply for a patent and the process to obtain one is lengthy.  A patent expires after twenty years.

      If you have a great invention, then people will try to copy it and sell it.  People might make the copy slightly different to not infringe the patent.  Enforcing the patent is difficult.

      A patent troll is somebody who buys many patents and sues people when they try to sell something that infringes on those patents.  The patent troll has no intention of using the patents.  This behavior is widely criticized but is not illegal in itself, and sometimes it is cheaper to settle with the patent troll than to try to fight them.

      Trademarks are easier to enforce.  If you develop something and patent it, you should also put your trademark on it.  This will help consumers associate your brand with the product.

      When you patent something, the whole world will know about it, so you should not patent an idea that you want to keep a secret.

    • Trademark.  A trademark is a right to a logo, name, motto, sound recording, etc.  A trademark is something that identifies your product or service to customers.  You should register your trademark; a registration can last indefinitely as long as you keep using the trademark and renew the registration.

      You must protect your trademark, or else it becomes diluted, and then you will have trouble stopping people from using it.

    • Trade Secret.  A trade secret is a process that has commercial value.  It might be a manufacturing process, a secret recipe, or a business process. 

      You must do whatever you can to keep your trade secret a secret.  If you don’t, and somebody else finds out and uses it, there is no legal way to stop them.

      Foreign governments and agencies try to steal trade secrets.

  • Software Compliance / Licensing.  A large organization will use many types of computer programs and applications.  Each application may have a different type of license.  A program may be licensed on a per user, per machine, per processor, or per concurrent use basis. 

    You might have to pay for each license separately.  It is important to keep track of all the licenses so that you do not buy extra ones.  If a user leaves the organization or replaces his computer, then you should remove any software licenses so that they can be reused.

    When you have lots of users, the software publisher might give you an “enterprise” license.  In other words, install the program wherever you want and keep track of how many users you have.  If you pay for 10 licenses but you are using the software in 20 places, and the publisher finds out, then you might pay a fine.
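    The license count check described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name `license_shortfall` and the application names are hypothetical, and a real license audit would pull its numbers from an inventory tool rather than from hand-typed dictionaries.

```python
def license_shortfall(purchased, installed):
    """Return the applications that are in use in more places than licensed.

    purchased: dict mapping application name -> number of licenses paid for
    installed: dict mapping application name -> number of places it is in use
    """
    shortfall = {}
    for app, in_use in installed.items():
        owned = purchased.get(app, 0)  # no purchase record means zero licenses
        if in_use > owned:
            shortfall[app] = in_use - owned
    return shortfall


# Hypothetical inventory: 10 licenses paid for, but 20 installations found.
purchased = {"ExampleCAD": 10, "ExampleOffice": 50}
installed = {"ExampleCAD": 20, "ExampleOffice": 45}
print(license_shortfall(purchased, installed))  # {'ExampleCAD': 10}
```

    Running the same comparison in the other direction (licenses owned but not installed) would show which licenses can be reused or dropped.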

Risk Management Strategies

How can we deal with a risk?

  • Acceptance.  Risk Acceptance means literally just that.  We accept the risk.  If the impact and/or likelihood of the risk is low, then spending money to fight it might be a bad idea.

    For example, if you are crossing the street and the risk is getting hit by a bus, then acceptance means accepting the risk that you will get hit by the bus.

  • Avoidance.  We can avoid the risk.  That means not doing the activity that causes the risk.  We use risk avoidance when we can’t accept the risk, the activity is not critical, and the cost of transferring the risk is too high.

    For example, if you are crossing the street and the risk is getting hit by a bus, then avoidance means not crossing the street.  If you don’t cross the street, you won’t get hit by the bus.  But you also won’t get to the other side.

  • Transference.  We can transfer the cost of the risk to a third party.  This is commonly known as insurance.

    For example, if you are crossing the street and the risk is getting hit by a bus, then transference means that you will cross the street, but you will buy personal injury insurance.  The insurance will pay you for your injuries, but it won’t put you back to the state you were before the accident.

    • Cybersecurity Insurance.  Cybersecurity insurance is common to have.  Check your commercial general liability policy and make sure it includes cybersecurity risk.

  • Mitigation.  We can do things to reduce the likelihood that the risk will happen or the impact that it will cause.

    For example, if you are crossing the street and the risk is getting hit by a bus, then mitigation means reducing the risk or the impact.  You can check both ways before crossing the street.  This reduces the risk that you get hit by the bus.  You can also wear body armor when crossing the street.  This reduces the impact of the risk.  If you do get hit, it might hurt less.

Risk Analysis

Some tools that we can use to identify the risks and their impacts

  • Risk Register.  The risk register is a list of risks that our organization faces.  Each risk in the risk register has a description, probability (likelihood), and impact.  For each risk, we can write out what we will do to mitigate the risk. 

    We list our risks in a table format.  For example,


| Risk | Description | Probability | Impact | Mitigation Strategy |
|------|-------------|-------------|--------|---------------------|
| Data Loss | Data loss due to hard disk drive failure | Low | Low | Use a storage appliance with RAID |
| Data Loss | Data loss due to a fire | Low | High | Use a high availability system to back up data to a second location |
| Data Loss | Data loss due to theft | Medium | Medium | Use physical and logical security to prevent theft |

In my line of work, we use a risk register when evaluating worksite safety.  Each day, each worker must write out all of the tasks that they will complete and the resulting risks.  Then they must think about how they will mitigate each risk, and what the probability/impact will be afterwards.

For example, if I am installing a switch, my risk register might look like this

| Task | Description | Probability | Impact | Mitigation Strategy | Risk After Mitigation |
|------|-------------|-------------|--------|---------------------|-----------------------|
| Unbox switch | Hand injury from box cutter | Low | Low | Wear cut resistant gloves | Low/Low |
| Remove old switch | Hand injury from sharp edges | Low | High | Wear cut resistant gloves | Low/Low |
| Climb ladder | Fall off ladder | Medium | Medium | Wear boots when climbing ladder | Low/Low |
| | Back injury | Low | Low | Ensure proper posture when climbing ladder | Low/Low |
| Rack mount new switch | Hand injury from sharp edges | Low | High | Wear cut resistant gloves | Low/Low |
| Power on switch | Electrocution | Low | High | Check outlets and power cords to ensure that they are not frayed | Low/Low |


If I can’t reduce a risk to make it low enough, then that task is not safe.  If I can’t perform a task safely no matter what, then I shouldn’t perform the task.  We might need to use a Risk Matrix when creating the Risk Register.
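A risk register like the ones above can also be kept as structured data so that unsafe tasks are flagged automatically.  The sketch below is one possible layout, not a standard format: the `RiskEntry` class, its field names, and the rule that a task is safe only when both residual ratings are "Low" are assumptions made for illustration, and the third entry is invented to show a task that fails the check.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    task: str
    description: str
    probability: str           # "Low", "Medium", or "High" before mitigation
    impact: str                # "Low", "Medium", or "High" before mitigation
    mitigation: str
    residual_probability: str  # rating after mitigation
    residual_impact: str       # rating after mitigation

    def is_safe(self):
        # Assumption for this sketch: a task is safe only when both
        # ratings after mitigation are "Low".
        return self.residual_probability == "Low" and self.residual_impact == "Low"


register = [
    RiskEntry("Unbox switch", "Hand injury from box cutter", "Low", "Low",
              "Wear cut resistant gloves", "Low", "Low"),
    RiskEntry("Power on switch", "Electrocution", "Low", "High",
              "Check outlets and power cords to ensure that they are not frayed",
              "Low", "Low"),
    # Invented entry: a hazard with no control that brings the risk down.
    RiskEntry("Hypothetical task", "Hypothetical hazard", "Medium", "High",
              "None available", "Medium", "High"),
]

unsafe_tasks = [entry.task for entry in register if not entry.is_safe()]
print(unsafe_tasks)  # ['Hypothetical task']
```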

  • Risk Matrix / Heat Map.  The risk matrix plots the impact of the risk against the likelihood.  We can then classify each risk based on its likelihood and impact.

    For example

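| Likelihood / Impact | Low | Medium | High |
|---------------------|-----|--------|------|
| Very Likely | High | Very High | Very High |
| Likely | Medium | High | Very High |
| Possible | Low | Medium | High |
| Unlikely | Low | Low | Medium |
| Very Unlikely | Low | Low | Medium |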
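
The matrix above is a lookup table, so it translates directly into a nested dictionary.  This is a minimal sketch; the names `RISK_MATRIX` and `classify` are hypothetical, and the ratings are copied from the example matrix rather than from any standard.

```python
# Rows are likelihood categories, columns are impact categories.
RISK_MATRIX = {
    "Very Likely":   {"Low": "High",   "Medium": "Very High", "High": "Very High"},
    "Likely":        {"Low": "Medium", "Medium": "High",      "High": "Very High"},
    "Possible":      {"Low": "Low",    "Medium": "Medium",    "High": "High"},
    "Unlikely":      {"Low": "Low",    "Medium": "Low",       "High": "Medium"},
    "Very Unlikely": {"Low": "Low",    "Medium": "Low",       "High": "Medium"},
}


def classify(likelihood, impact):
    """Look up the overall risk rating for a likelihood/impact pair."""
    return RISK_MATRIX[likelihood][impact]


print(classify("Likely", "High"))    # Very High
print(classify("Unlikely", "Medium"))  # Low
```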


The number of impact categories can vary.  The number of likelihood categories can vary.  We can develop a different type of risk matrix for each risk.  In my example, the probability and impact are qualitative, but we can also create a quantitative matrix.  We can also classify the impact according to multiple categories (for example we can separate the impact as the financial impact, the impact to our reputation, the impact to health and safety, etc.).

For example

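Financial impact

| Likelihood | Less than $1000 damage | More than $1000 but less than $100,000 damage | More than $100,000 damage |
|------------|------------------------|-----------------------------------------------|---------------------------|
| Greater than 90% | High | Very High | Very High |
| 50% to 90% | Medium | High | Very High |
| 10% to 50% | Low | Medium | High |
| 2% to 10% | Low | Low | Medium |
| Less than 1% | Low | Low | Medium |

Health and safety impact

| Likelihood | Minor injury to one person will result | Minor injury to multiple people will result | Serious injury will result | Death will result |
|------------|----------------------------------------|---------------------------------------------|----------------------------|-------------------|
| Greater than 90% | High | Very High | Very High | Very High |
| 50% to 90% | Medium | High | Very High | Very High |
| 10% to 50% | Low | Medium | High | High |
| 2% to 10% | Low | Low | Medium | Medium |
| Less than 1% | Low | Low | Low | Medium |
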
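
A quantitative matrix works the same way as a qualitative one, except that the numbers must first be sorted into bands.  The sketch below covers the financial matrix only.  The function names are hypothetical, and two things are assumptions: how values that fall exactly on a boundary are treated, and the choice to put every probability below 2% into the lowest band, since the example does not say where values between 1% and 2% belong.

```python
def likelihood_row(probability):
    """Map a probability (0.0 to 1.0) to a row index, highest band first."""
    if probability > 0.90:
        return 0
    if probability >= 0.50:
        return 1
    if probability >= 0.10:
        return 2
    if probability >= 0.02:
        return 3
    return 4  # assumption: everything below 2% falls in the lowest band


def damage_column(damage):
    """Map a dollar amount to a column index, smallest band first."""
    if damage < 1_000:
        return 0
    if damage <= 100_000:
        return 1
    return 2


# Ratings copied from the financial impact example.
FINANCIAL_MATRIX = [
    ["High",   "Very High", "Very High"],  # greater than 90%
    ["Medium", "High",      "Very High"],  # 50% to 90%
    ["Low",    "Medium",    "High"],       # 10% to 50%
    ["Low",    "Low",       "Medium"],     # 2% to 10%
    ["Low",    "Low",       "Medium"],     # lowest band
]


def classify_financial_risk(probability, damage):
    return FINANCIAL_MATRIX[likelihood_row(probability)][damage_column(damage)]


# A 30% chance of $250,000 in damage.
print(classify_financial_risk(0.30, 250_000))  # High
```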
  • Risk Control Assessment.  The Risk Control Assessment is an evaluation of how well our controls are working to prevent risks.  It is performed by a third party.  The two questions are

    • Do we have adequate controls to prevent the risks?  If we don’t identify our risks first, then the control assessment is meaningless.

    • Are our controls implemented properly?

  • Risk Control Self Assessment.  The Risk Control Self Assessment is an evaluation of how well our controls are working, but it is performed internally.  It might be biased, because we are less likely to give ourselves a negative evaluation, but it may be cheaper and faster than an assessment by a third party.

  • Inherent Risk.  The Inherent Risk is the level of risk that is present in an activity before we do anything about it.

  • Residual Risk.  After we apply controls (a way to mitigate the risk), the remaining risk is the Residual Risk.

  • Control Risk.  The control risk is the risk that a control will fail.

  • Risk Appetite.  The Risk Appetite is our willingness to take a risk.  Our risk appetite might be affected by

    • Our industry.  Some industries might take more risks than others.  A technology company might take more risks than a law firm or a hospital.

    • The age of our business.  An older, well-established business might be less willing to take risks.  A newer business might be willing to risk everything to grow or succeed.

    • The investors.  When the business is owned by other people, we might not be willing to risk their money.  Or they might not be willing to let us risk their money.

    • The government.  The government might not let us take certain risks.

  • Regulations that Affect Risk Posture.  The government may set regulations that limit the amount of risk we can take. 

    For example, if we are a bank, then we might not be able to put our money in risky investments.  If we are a nuclear plant, we might not be able to take a risk with our security.

  • Risk Assessment Types

    • Qualitative.  A qualitative risk assessment evaluates risks through the use of categories.  As a result, it may be more subjective.

    • Quantitative.  A quantitative risk assessment uses data and numbers to evaluate a risk’s likelihood and impact.

  • Likelihood of Occurrence.  The likelihood is the percent chance that a risk will occur.  It might be measured as a percentage or it might be a category.

  • Impact.  The impact of a risk is how the risk will affect the business and to what extent.  For example, a risk may only have a financial impact, but the financial impact is $1,000,000.

Areas where a risk can impact a business

    • Life.  A life impact is physical or psychological harm to a human being or animal.  This could cause the individual to suffer or die.  The life impact may be short term or long term.  A life impact may be to an employee, to a vendor, or to third parties.  Life impacts usually also cause financial impacts (the organization will be forced to compensate those who are harmed) and reputational impacts.  For example, an explosion at a chemical plant could kill and injure employees and contractors who work at the plant.  The resulting pollution would introduce harmful chemicals to the environment, which could cause long-term cancer rates to increase in the surrounding population.

    • Property.  A property impact is damage to property such as a building, equipment, or property belonging to third parties.  For example, an explosion would cause damage to the plant.  A property impact is also a financial impact.  It may take a long time to repair or replace the damaged property.  Some property cannot be repaired (for example a historical building that burns down).

    • Safety.  Impact to safety is when there is a risk of personal injury.  When a work environment is not safe, work must stop until the deficiency is corrected.

    • Finance.  A risk that causes a strictly financial impact is rare.  Usually, the financial impact is the consequence of some other impact.  For example, damage to property usually results in a financial impact when it is repaired or replaced.  An example of a strictly financial impact is a bad investment.

    • Reputation.  The impact could also be reputational.  If the organization was found to be negligent in creating a risk (that resulted in harm to people or damage to property), then its reputation is also impacted.  If the organization’s operations are disrupted, and it is no longer able to meet customer demands, then its reputation is also impacted.

  • Asset Value (AV).  How much is the asset worth?  When we are thinking about the impact of the risk, we need to know what the risk will damage.  And when we know what the risk will damage, then we need to know how much those items are worth.

  • Single-Loss Expectancy (SLE).  The SLE is how much loss we can expect when the risk takes place.  The Exposure Factor (EF) is the percent change in the Asset Value that will occur with the risk.

    If an asset is worth $1,000,000 and a fire reduces its value by 50%, then the Exposure Factor is 0.50.

    Thus

    SLE = AV x EF


    Or

    $500,000 = $1,000,000 x 0.50
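
    The same calculation as a short Python sketch.  The function name `single_loss_expectancy` is hypothetical, and the range check on the Exposure Factor is an added safeguard, not part of the formula.

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = AV x EF, where EF is a fraction between 0 and 1."""
    if not 0.0 <= exposure_factor <= 1.0:
        raise ValueError("exposure factor must be between 0 and 1")
    return asset_value * exposure_factor


# The fire example: a $1,000,000 asset that loses 50% of its value.
print(single_loss_expectancy(1_000_000, 0.50))  # 500000.0
```
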
  • Annualized Rate of Occurrence (ARO).  The ARO tells us how many times per year a risk will occur. 

    If the ARO is greater than 1, then the risk will take place more than once per year, on average.  If the ARO is less than 1, then the risk will take place less than once per year, on average.

    For example, if the ARO is 5, then we expect the risk to occur five times per year.

  • Annualized Loss Expectancy (ALE).  The ALE tells us how much money we will lose each year due to the risk.

    For example, if we expect a theft to cost us $5,000 (SLE), and we have five thefts per year (ARO), then the ALE is

    ALE = SLE x ARO

    $25,000 = $5,000 x 5
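
    In Python, the calculation might look like the sketch below.  The function name `annualized_loss_expectancy` is hypothetical.

```python
def annualized_loss_expectancy(sle, aro):
    """ALE = SLE x ARO."""
    return sle * aro


# The theft example: $5,000 per theft, five thefts per year.
print(annualized_loss_expectancy(5_000, 5))  # 25000
```

    The ALE is useful for comparison.  If a control that prevents the thefts costs less per year than the ALE, the control is probably worth buying.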

Disasters

There are several types of disasters.

  • Environmental.  An environmental disaster is something like a fire, flood, hurricane, earthquake, or other type of inclement weather.

  • Person-Made.  A person-made disaster is something that a human did.  It could include an act of terrorism or an arson.

  • Internal vs External.  An external disaster is one that takes place outside the facility but impacts our business.  For example, if there is a hurricane and it takes out a power plant, that is an external disaster.  If, as a result, we don’t receive enough power to continue operating, then we are impacted.

    An internal disaster is one that takes place within our facility.  For example, there is a hurricane that takes out our facility.

Business Impact Analysis

  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO).  The RPO is the maximum loss of data that a business could endure and survive.  How many days’ worth of data can a business afford to lose?  For example, suppose a business could tolerate up to one week’s worth of data loss.  If data was backed up on Monday and data was lost on Thursday, then only three days of data are gone and the business could survive.  The RPO tells us how often we must back up our data.


RTO is the maximum time a business could survive while waiting for business operations to be restored.  How fast does a business need to recover?  For example, if a business could tolerate disruption at its factory for four hours, and the factory shut down for three hours, the business could survive.  If the factory shut down for six hours, the business would not survive.  The RTO tells us how fast we need to react to restore the business.

The shorter the RTO and RPO, the more expensive a disaster recovery plan will be.  RTO and RPO can vary from business to business and can vary within different units of the same business.
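
The two examples above can be checked with Python’s standard `datetime` module.  This is a rough sketch: the function names `meets_rpo` and `meets_rto` and the calendar dates are hypothetical, and it assumes that all data written since the last backup is lost.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, failure_time, rpo):
    """True if the data lost since the last backup is within the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(outage_duration, rto):
    """True if operations were restored within the RTO."""
    return outage_duration <= rto


# Backup on a Monday, data lost on the following Thursday, RPO of one week.
last_backup = datetime(2024, 1, 1, 9, 0)  # a Monday
failure = datetime(2024, 1, 4, 9, 0)      # the following Thursday
print(meets_rpo(last_backup, failure, timedelta(weeks=1)))  # True

# Factory outages measured against an RTO of four hours.
rto = timedelta(hours=4)
print(meets_rto(timedelta(hours=3), rto))  # True
print(meets_rto(timedelta(hours=6), rto))  # False
```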

  • Mean Time to Repair (MTTR).  The MTTR is the average time it takes to repair a device.  For example, if a server breaks down, how many hours or days will it take until it is restored?

The MTTR is the time to resolve the issue, not the time to respond to the issue.  For example, if a server fails, and a repairman arrives within three hours, but it takes an additional hour to troubleshoot and repair the issue, then the repair time is four hours.
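
As a sketch, the MTTR can be computed by adding the response time and the fix time for each incident and then averaging.  The function name `mean_time_to_repair` is hypothetical, and the second incident in the last line is invented for illustration.

```python
def mean_time_to_repair(incidents):
    """incidents: list of (hours until technician arrives, hours to fix) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Repair time runs from the failure until the issue is resolved,
    # so it includes the wait for the technician.
    total_hours = sum(response + fix for response, fix in incidents)
    return total_hours / len(incidents)


# The server example: three hours to arrive plus one hour to fix.
print(mean_time_to_repair([(3, 1)]))          # 4.0

# With a second, invented incident that took two hours in total.
print(mean_time_to_repair([(3, 1), (1, 1)]))  # 3.0
```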

The shorter the MTTR, the more expensive it will be.

Different types and severities of incidents can have different response times.  The organization must weigh the response time against the impact to the business.  Critical incidents may require response times measured in hours while trivial issues may allow response times measured in days or even weeks.

The system’s availability is the percentage of time that it is operational.
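
Availability is uptime divided by total time.  The sketch below assumes that uptime and downtime are tracked in hours; the function name `availability` and the figures are hypothetical.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of the total time that the system was operational."""
    total_hours = uptime_hours + downtime_hours
    if total_hours <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total_hours


# Invented figures: four hours of downtime in an 8,760-hour year.
print(f"{availability(8_756, 4):.3%}")  # 99.954%
```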

  • Mean Time Between Failures (MTBF).  MTBF is the Mean Time Between Failures.  This is the average amount of time between failures of a device.  Some devices can be repaired, and some devices cannot.  For example, a hard disk drive that fails irreparably after 300,000 hours has an MTBF of 300,000 hours.  A computer server that fails (but can be repaired) after 100,000 hours and then again after another 300,000 hours has an MTBF of 200,000 hours (average of 100,000 and 300,000).

The MTBF of electronics and industrial equipment is usually measured in hours of operation (since devices that are not in use are less likely to fail).  Some devices have a failure rate that is measured in cycles or per use (for example, airplanes are rated based on the number of times they take off and land, and not the amount of time they spend in the air; takeoffs and landings put more stress on the airplanes’ components than the actual flying).

A manufacturer should disclose the MTBF on each device that they manufacture so that customers can make informed purchasing decisions.  If a device is inexpensive but has a high failure rate, then the long-term cost may be much higher.

If an organization purchases hard drives with a 300,000-hour MTBF, but they have deployed 100,000 hard drives, then they can expect that (on average) one hard drive will fail every three hours (or about eight per day).
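
Both MTBF calculations in this section are simple averages, as the sketch below shows.  The function names are hypothetical, and the fleet estimate assumes that every drive fails at the same constant rate, which is a simplification.

```python
def mtbf(hours_between_failures):
    """Average of the operating hours recorded between successive failures."""
    return sum(hours_between_failures) / len(hours_between_failures)


def hours_between_fleet_failures(device_mtbf_hours, device_count):
    """Expected hours between failures somewhere in a fleet of identical devices."""
    return device_mtbf_hours / device_count


# The server example: failures after 100,000 hours and another 300,000 hours.
print(mtbf([100_000, 300_000]))  # 200000.0

# The hard drive example: 100,000 drives with a 300,000-hour MTBF.
interval = hours_between_fleet_failures(300_000, 100_000)
print(interval)       # 3.0 hours between failures
print(24 / interval)  # 8.0 failures per day
```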

Electronic devices usually fail on what is called a bathtub curve (high failure rate at the beginning and end of their life span, and low failure rate in the middle).

  • Single Point of Failure.  A Single Point of Failure is a component that, when it fails, will bring down an entire system.  A Single Point of Failure can be a physical object or a process in an organization.

Examples of single points of failure

    • Motherboard in a server; if the motherboard fails, the entire server stops functioning

    • Router in a computer network; if the router fails, the entire network stops functioning

    • Having only one accountant to approve accounts payable invoices; if the accountant is sick, the business will be unable to pay vendors

The organization should carefully identify single points of failure in its equipment and in its processes.  It should replace each single point of failure with a redundant system, when possible.

Redundant systems are more expensive than non-redundant systems.  Some systems cannot be made redundant, but there are usually workarounds.  The organization may choose to accept the risk associated with the single point of failure.  The business must understand the risk associated with a single point of failure.

Examples of redundant systems

    • Server with multiple power supplies (even in the event of the failure of a power supply or power source, the server continues to operate)

    • Having two servers run in parallel with a load balancer; the failure of one server will not affect other servers or the application that they are running

    • Having multiple individuals trained to cover specific roles, such as dispatchers, accounts payable clerks, system administration, and engineers.  These individuals should be in geographically different areas, to protect from natural disasters.

  • Disaster Recovery Plan (DRP).  Disaster Recovery is a process where an organization can resume normal operations in the event of a disaster (natural disaster, strike, data loss, fire, war, ransomware attack, or protest).  An organization must

    • Plan out a cost-effective disaster recovery plan considering all the different causes of disruption.  For example, an organization located in Florida should consider hurricanes, but an organization in Wyoming should not.

    • Identify the amount of downtime the organization can accept before having to resume normal operations.  An organization such as an insurance company may not accept any disruption to its operations.  A retail store may accept a disruption of one or two weeks.  The shorter the disruption, the more expensive the recovery plan.

    • Practice the disaster recovery plan, holding regular drills with the key responders.

    • Review and revise the disaster recovery plan to take advantage of new technologies and consider new threats.

    • The more effective the disaster recovery plan, the more it will cost.  The disaster recovery plan may cost the organization, even when no disaster has taken place.  For example, maintaining a second office for emergency use may cost the organization tens of thousands of dollars per month.  Is the potential harm caused by the disaster (multiplied by its likelihood) more expensive than the cost of maintaining the office?
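
The question above is a comparison between an expected loss and a known cost.  The sketch below is one way to frame it; the function name and all of the dollar figures and probabilities are invented for illustration, and it assumes that the second office would prevent the entire loss, which a real plan rarely does.

```python
def standby_site_is_worthwhile(disaster_loss, annual_probability, monthly_site_cost):
    """Compare the expected yearly loss with the yearly cost of the site."""
    expected_annual_loss = disaster_loss * annual_probability
    annual_site_cost = monthly_site_cost * 12
    return expected_annual_loss > annual_site_cost


# Invented figures: a $10,000,000 loss with a 5% chance per year,
# against a second office that costs $30,000 per month.
print(standby_site_is_worthwhile(10_000_000, 0.05, 30_000))  # True

# The same office when the chance of disaster is only 1% per year.
print(standby_site_is_worthwhile(10_000_000, 0.01, 30_000))  # False
```
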
  • Mission Essential Functions.  A Mission Essential Function is a component of a business that must always operate.  It is fundamental to the existence of the business (the business will shut down without it).

A business must identify all its mission essential functions.  A business should also identify vendors who are essential to its own functions.  The business must protect all its mission essential functions and ensure that they remain operational in the event of a disaster, or that they can be quickly restored.  Other non-essential functions can wait.

Examples of mission essential functions

    • Hospital emergency room

    • Manufacturing facility

    • Security at a nuclear power plant

    • Control room for a power grid

Examples of non-essential functions

    • Janitorial staff

    • Human resources

    • Marketing

  • Identification of Critical Systems.  The business must identify systems that are critical to its operation.  These are systems that provide essential functions.

If a critical system fails, then the business will not be able to operate.  A critical system may be spread across multiple offices or states.

  • Site Risk Assessment.  A Site Risk Assessment is a risk assessment for a specific location.   When an organization has multiple physical locations, it must conduct a separate assessment for each one.  A general risk assessment can’t properly evaluate the risk for each location.  Each facility may have different weather patterns, different assets, different crime rates, different types of workers, etc.