Industrial Safety and Cybersecurity for Critical Installations
Published on : Tuesday 09-02-2021
Complex industrial installations are technologically advanced and require control and safety systems where connectivity is starting to be a necessity, says Piotr Ciepiela.
In recent years, we have seen a significant number of successful cyber-attacks on critical infrastructure (CI) facilities. Widely recognised and discussed cases (such as attacks on power grids and ransomware incidents affecting multiple critical manufacturing companies) have brought this new cyber risk to the attention of CI owners and of the societies whose safety depends on the proper operation of this infrastructure.
As a result, we are now observing a significant rise in investments and new regulations aimed at protecting CI installations. The management and cybersecurity departments of CI operators are keenly aware of how cyber-attacks can make the essential services they support unavailable by impacting basic process control systems and their components, such as DCS (Distributed Control System), PLC (Programmable Logic Controller) or SCADA (Supervisory Control And Data Acquisition) class applications.
However, non-automation professionals still have a limited understanding that large and complex installations such as refineries or power grids have another layer of industrial control systems: the so-called Safety Instrumented Systems (SIS), which protect us from large malfunctions that would result in ecological catastrophes and loss of life. These very reliable, basic systems have a single purpose: to shut down a technological installation in a safe manner if process parameters go beyond acceptable levels.
In practice, those systems are the last active line of defence that can allow us to avoid a catastrophe, such as a refinery explosion. Beyond SIS, there are passive measures dedicated to minimising the impact, such as explosion-proof infrastructure. Ensuring the proper reliability of these critical systems is the goal of a significant part of automation engineering, within a larger domain called functional safety. It is imperative to pay more attention to the possible exposure of SIS to cyber threats, and to how functional safety specialists can consider those threats in the reliability analysis of the systems they design and deploy to keep people and systems safe.
Redundancy in designing Safety Instrumented Systems
Redundancy of components has long been a widely applied measure to increase the reliability of safety-related automation systems. In industrial safety systems, the essential idea of redundancy is to operate components in redundant pairs in such a way that, if one element fails, the danger for (or from) an industrial process can still be effectively detected by another device. Most often, redundant pairs are found among processor modules and power supplies, up to deployments where entire systems with communication modules and input/output cards are duplicated for safety reasons.
The level of redundancy varies and depends mainly on the process: a risk-based approach is used that considers the probability of a process malfunction and the potential losses in case of SIS failure. To ensure that the whole system provides the expected level of reliability, the reliability of every single component that determines its proper operation needs to be taken into account. The most commonly used measure of the reliability of components and systems is Mean Time Between Failures (MTBF): the average failure-free operating time, determined by the manufacturer on the basis of extensive testing and statistical analysis. MTBF is expressed in hours and, assuming a constant failure rate, marks the operating time after which the probability of failure-free operation drops to 36.8%. With MTBF, it is possible to calculate the probability that a device will function without failure for a given period of time.
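The 36.8% figure follows from the exponential reliability model, in which the probability of failure-free operation over a time t is exp(-t / MTBF). The minimal sketch below illustrates that calculation; the MTBF value is a made-up example, and the constant failure rate is a modelling assumption rather than a property of any particular device.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for t_hours.

    Assumes a constant failure rate (exponential model): R(t) = exp(-t / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if t_hours < 0:
        raise ValueError("t_hours must not be negative")
    return math.exp(-t_hours / mtbf_hours)


# Hypothetical MTBF, chosen only for illustration.
mtbf = 100_000.0
one_year = 8_760.0  # hours in a non-leap year

print(f"R(MTBF)   = {reliability(mtbf, mtbf):.3f}")  # 0.368, i.e. 1/e
print(f"R(1 year) = {reliability(one_year, mtbf):.3f}")
```

Evaluating the function at t equal to MTBF always gives 1/e, regardless of the MTBF value, which is where the 36.8% comes from.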
As described above, information about expected reliability is based on statistical analysis and testing. Safety systems engineers design and deploy SIS on the basis of this information, and safety depends heavily on its accuracy. If the information is not valid, a technological installation may have systems that do not provide sufficient certainty of a correct reaction to a potentially catastrophic process disruption, which can be caused by human error, a mechanical issue, or a disruption of the basic process control system. The cause of an SIS malfunction can vary. However, the fact is that the reliability data does not and cannot include the possible impact of cyber threats. Redundancy cannot improve the situation in that case: if one component has a vulnerability that can be exploited, the second component of the same type will have it too. The easy solution would be to deploy SIS in a way that avoids the risk completely, by not connecting it to the network. However, complex industrial installations are technologically advanced and require control and safety systems where connectivity is starting to be a necessity.
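The limit of redundancy described above can be made concrete with a deliberately simplified probability sketch. It assumes a pair of identical components that fails on demand only if both fail, that random failures are independent, and that a successful exploit of a shared vulnerability disables both components at once. All probabilities are hypothetical values chosen for illustration.

```python
def single_failure_probability(p_random: float, p_exploit: float = 0.0) -> float:
    """Failure probability of one component: exploited, or failed at random."""
    return p_exploit + (1.0 - p_exploit) * p_random


def pair_failure_probability(p_random: float, p_exploit: float = 0.0) -> float:
    """Failure probability of a redundant pair of identical components.

    Assumption: random failures are independent, so both fail with p_random**2,
    while an exploit of the shared vulnerability takes out both at once.
    """
    return p_exploit + (1.0 - p_exploit) * p_random ** 2


p_random = 0.01   # hypothetical random failure probability per component
p_exploit = 0.01  # hypothetical probability that the shared flaw is exploited

print("single, random only :", single_failure_probability(p_random))
print("pair, random only   :", pair_failure_probability(p_random))
print("pair, with exploit  :", pair_failure_probability(p_random, p_exploit))
```

In this toy model, duplicating the component shrinks the random-failure term from p to p squared, but the exploit term is untouched, so the pair can never be more reliable than the shared vulnerability allows.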
Hazard and operability studies
Catastrophic industrial accidents in the past, caused by human error, improper design or improper operations, have led to the development of analysis methods dedicated to ensuring functional safety. One of those, proven in use, is the Hazard and Operability Studies (HAZOP) method. This method is based on a systematic review of design assumptions and technological processes to identify all possible deviations of parameters, and it supports the selection of proper countermeasures to lower the risk of such events to acceptable levels. HAZOP is a standardised risk analysis method; analyses are conducted according to the methodology described in the IEC (International Electrotechnical Commission) 61882 standard.
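HAZOP studies typically enumerate deviations by pairing guide words (such as "no", "more" or "less") with process parameters. The sketch below shows that enumeration step only, using a small illustrative subset of guide words and parameters; the lists and the function name are assumptions for this example, not a complete or authoritative rendering of the IEC 61882 procedure.

```python
from itertools import product

# Illustrative subsets only; a real study selects these per process node.
GUIDE_WORDS = ["no", "more", "less", "reverse"]
PARAMETERS = ["pressure", "flow", "temperature"]


def candidate_deviations(guide_words, parameters):
    """Return every guide word / parameter pairing as a candidate deviation."""
    return [f"{word} {param}" for param, word in product(parameters, guide_words)]


deviations = candidate_deviations(GUIDE_WORDS, PARAMETERS)
for deviation in deviations:
    print(deviation)
```

Not every pairing is physically meaningful (for example "reverse temperature"), so in practice the study team discards those and examines causes, consequences and safeguards for the rest.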
The safety of industrial processes is based on independent protection layers. These layers help ensure that a single incident will not cause an immediate hazard to human life and health, the environment or production. Individual protection layers, in addition to being independent, must also have adequate reliability to provide the proper Risk Reduction Factors. The protection layers can be divided into Prevention Layers, which reduce the likelihood of hazards occurring, and Mitigation Layers, which reduce the impact of a failure.
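The role of independence and reliability can be illustrated with a common simplification used in layer-of-protection style calculations: if layers are truly independent, the frequency of the hazardous outcome is the initiating event frequency multiplied by each layer's probability of failure on demand (PFD), and a layer's risk reduction factor is 1 / PFD. The frequencies, PFD values and layer names below are hypothetical.

```python
from math import prod


def risk_reduction_factor(pfd: float) -> float:
    """Risk reduction factor of one layer, defined here as 1 / PFD."""
    if not 0.0 < pfd <= 1.0:
        raise ValueError("pfd must be in the interval (0, 1]")
    return 1.0 / pfd


def mitigated_frequency(initiating_per_year: float, layer_pfds) -> float:
    """Hazard frequency after all layers, assuming the layers are independent."""
    return initiating_per_year * prod(layer_pfds)


# Hypothetical layers and PFD values, for illustration only.
layers = {"basic process control": 0.1, "safety system": 0.01}

for name, pfd in layers.items():
    print(f"{name}: risk reduction factor of about {risk_reduction_factor(pfd):.0f}")

print("mitigated frequency per year:", mitigated_frequency(0.1, layers.values()))

# If one layer is defeated (PFD = 1.0), its risk reduction is lost entirely.
print("with safety system defeated :", mitigated_frequency(0.1, [0.1, 1.0]))
```

The multiplication is only valid while the layers fail independently, which is why the independence attribute matters as much as the reliability of each layer.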
Random vs. systematic failures
Incorrect operation of individual protection layers can result from random hardware failures or systematic errors. To reduce the probability of random failures, devices with adequate reliability and hardware redundancy are used. In order to reduce systematic errors, appropriate system design processes are used, based on systematic review and verification.
The vulnerabilities of OT (Operational Technology) systems have recently become one of the most serious causes of system malfunction. System vulnerabilities can be classified as the most serious systematic errors that can cause a degradation of risk reduction factor for protection layers.
Cyber threats for protection layers
In modern industrial plants, most systems are connected to Ethernet or industrial networks. This allows for easier management of production efficiency. However, as a result, any system or asset vulnerability can be exploited in a cyber-attack. If several systems share the same vulnerabilities, then not just one but several systems may be attacked simultaneously. This can lead to a situation in which none of the protection measures will be able to perform its function correctly.
As a result, a serious industrial accident may occur, causing loss of life, environmental pollution or financial loss. It is therefore important that risk assessments related to process safety also address the cybersecurity issues of OT systems.
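One simple way to look for this kind of exposure is to check which vulnerabilities appear in more than one system that is supposed to act as an independent layer. The following is a minimal sketch; the system names and vulnerability identifiers are invented placeholders, not real advisories.

```python
from __future__ import annotations

from collections import defaultdict


def shared_vulnerabilities(systems: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each vulnerability found in two or more systems to those systems."""
    systems_by_vuln: dict[str, list[str]] = defaultdict(list)
    for system, vulns in systems.items():
        for vuln in vulns:
            systems_by_vuln[vuln].append(system)
    return {
        vuln: sorted(names)
        for vuln, names in systems_by_vuln.items()
        if len(names) > 1
    }


# Hypothetical inventory of known vulnerabilities per system.
plant = {
    "process controller": {"VULN-A", "VULN-B"},
    "safety controller": {"VULN-A"},
    "operator station": {"VULN-C"},
}

for vuln, names in shared_vulnerabilities(plant).items():
    print(f"{vuln} affects several layers: {', '.join(names)}")
```

Any identifier reported here marks a point where a single attack could degrade more than one protection layer at the same time.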
S-HAZOP vs. process HAZOP: Proactive prevention of cyber-attacks
1. HAZOP analysis of the process allows the identification of hazards resulting from variations in process parameters such as pressure, flow, temperature.
2. S-HAZOP (Security HAZOP) allows the identification of how cyber-attacks can affect process safety, and how security incidents may override process parameters and lead to major industrial accidents.
S-HAZOP is a variation of HAZOP which can be used to identify and classify the vulnerability of OT systems to cyber-attacks. This method helps to recognise which security measures should be implemented, which system components should be covered by these security measures and how these security measures decrease risk levels across the facility. A team consisting of security professionals, assisted by staff members from manufacturing, identifies potential risks and their parameters, such as the likelihood of occurrence and the effects on production. Based on these parameters, a risk matrix is created, for example by utilising the security controls of IEC 62443, the leading industrial cybersecurity standard. The matrix indicates which security measures are not sufficient and what technical and procedural controls (such as periodic maintenance or testing) should be deployed to lower the risk. Based on the S-HAZOP analysis, a team of professionals can propose a security solution tailored specifically to the facility and its potential problems. S-HAZOP utilises best practices developed by functional safety and cybersecurity experts to enable protection of SIS against modern cyber threats.
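The risk matrix step can be sketched as a likelihood-times-impact scoring. The 1 to 5 scales, the thresholds and the scenario names below are assumptions made for this example; they are not taken from IEC 62443 or from any specific S-HAZOP procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), hypothetical scale
    impact: int      # 1 (minor) to 5 (catastrophic), hypothetical scale


def risk_level(scenario: Scenario) -> str:
    """Classify a scenario using illustrative score thresholds."""
    for value in (scenario.likelihood, scenario.impact):
        if not 1 <= value <= 5:
            raise ValueError("likelihood and impact must be between 1 and 5")
    score = scenario.likelihood * scenario.impact
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


# Hypothetical scenarios for illustration.
scenarios = [
    Scenario("remote change of safety setpoint", likelihood=3, impact=5),
    Scenario("ransomware on operator station", likelihood=4, impact=3),
    Scenario("loss of historian data", likelihood=2, impact=2),
]

for s in sorted(scenarios, key=lambda s: s.likelihood * s.impact, reverse=True):
    print(f"{risk_level(s):6} {s.name}")
```

Scenarios that land in the highest band are the ones where existing measures are most likely to be insufficient and additional technical or procedural controls should be considered first.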
OT asset management
Another aspect of proper industrial control systems management is becoming increasingly important: OT asset management. Well recognised in the corporate and IT world, asset management brings visibility and a proper understanding of the actual equipment used in production. Despite this, it is still quite rare for a proper asset inventory to be created and maintained in the OT environment. Assets should be digitalised to accelerate the change management process. In addition, today’s inventories usually account only for the main systems and do not consider the overall data structure or all components, such as HMIs (Human-Machine Interfaces), engineering stations or smaller industrial systems. Full visibility and management of assets are essential for performing S-HAZOP and achieving effective OT security, and thus for protecting all devices from being hacked.
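A basic visibility check compares the recorded inventory with the devices actually observed in the plant. The sketch below assumes both lists are already available as sets of asset identifiers; how they are collected, and the identifiers themselves, are hypothetical.

```python
def inventory_gaps(inventory: set, observed: set) -> dict:
    """Compare a recorded asset inventory with the assets actually observed."""
    return {
        # Present in the plant but missing from the inventory.
        "unrecorded": sorted(observed - inventory),
        # Listed in the inventory but not observed.
        "not_seen": sorted(inventory - observed),
    }


# Hypothetical asset identifiers for illustration.
inventory = {"dcs-01", "plc-07", "sis-01"}
observed = {"dcs-01", "plc-07", "sis-01", "hmi-03", "eng-station-02"}

gaps = inventory_gaps(inventory, observed)
print("unrecorded assets:", gaps["unrecorded"])
print("inventory entries not observed:", gaps["not_seen"])
```

Here the HMI and the engineering station show up only in the observed set, which mirrors the typical gap where smaller components are left out of the inventory.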
Piotr Ciepiela, EY Global Cyber Architecture, Engineering & Emerging Technologies Leader, has over 14 years of experience managing international, complex OT and IoT security projects. A globally recognised leader in critical infrastructure and industrial control systems (ICS) security, Piotr is the co-founder of operational technology (OT) and IoT teams at EY. He participates in the creation of international OT and cybersecurity standards, supports various governments in critical infrastructure protection and has contributed to current regulations, methods and standards.