Building high availability into industrial computers

As mission critical applications become more commonplace, so will the need for high-availability computing.

Industrial computer applications and specifications have evolved significantly over the last 30 years. The first industrial computers provided human-machine interface (HMI) support and supervisory control and data acquisition (SCADA) for automated machinery within factories. Today, machine-to-machine (M2M) applications and big data analytics have created a need for rugged computing in outdoor and harsh environments. Industrial computing applications now range from solar farm monitoring to parking lot kiosks.

The first industrial computers were more robust than their commercial-grade counterparts, but were not purposely designed for mission critical applications. Today, features such as ingress protection rating, vibration testing, and temperature ratings help qualify a computer as industrial, but there’s little industry standardization on how to measure and report reliability, availability, and serviceability.

In 1984, IBM released the first personal computer marketed for industrial applications. The IBM 5534 was a beefed-up, dark-brown-colored version of the traditional light-beige IBM XT computer. Features such as double-fan cooling, a hardened metal casing, a high-output power supply, a thermal sensor, and a lockable drive door cover made the 5534 more suitable for factory floor applications.

By the 1990s, industrial computers flourished, running mostly SCADA and HMI software applications. There was a growing movement to replace programmable logic controller (PLC)-based control systems with PC control software. By 2000, most industries abandoned the notion of PC-based control due to operating system instability and version changes. Today, we find that PLC and distributed control system hardware have remained the control platforms of choice, while the industrial computer is the preferred platform for SCADA and HMI applications. Use cases for industrial computers include security appliances and remote authentication servers. However, there is uncertainty about whether the industrial PC is an appropriate choice for mission critical process applications.

Linux distributions, such as Red Hat, CentOS, and Ubuntu, have addressed operating system stability and lifecycle concerns. However, there are fewer commercial off-the-shelf industrial applications for Linux than for Microsoft Windows-based operating systems. Virtualization hypervisors, such as VMware vSphere, Microsoft Hyper-V, and Stratus Technologies' everRun software, provide fault tolerance between the application and the hardware. Designing a fault-tolerant system exclusively with application software, operating systems, or hypervisors adds complexity, and the redundant elements intended to increase reliability can introduce more points of failure.

A simplified approach to designing a highly dependable industrial computing system starts with the hardware. Power supplies, fans, memory, and disk storage cause most computer failures, especially when exposed to excessive heat, dust, and electrostatic discharge (see Figure 1). To mitigate these failure modes, mission critical system designers should consider three key measures: reliability, serviceability, and availability.

Reliability

Reliability is the probability that a device will perform its required function under stated conditions for a specific period of time and is quantified as the mean time between failures (MTBF). Manufacturers typically determine this number from product testing, product modeling (e.g., MIL-HDBK-217 or Telcordia), or measuring product failures.

While testing and modeling prior to product launch provide useful estimates, these approximations do not correlate well with field performance experienced by end users. Field failure data provide greater MTBF accuracy, assuming the manufacturer accounts for all field failures.

The common method for calculating the MTBF from field data is:

MTBF (in years) = (number of products in service during a 12-month period) / (number of observed failures during that period)

For example, a 100-year MTBF implies that for every 100 products in service for one year, one hardware failure will occur. The larger the sample size, the more accurate the MTBF.
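
As a rough illustration, the calculation can be expressed as a short Python function; the figures passed to it below are hypothetical examples, not measured field data.

```python
def field_mtbf_years(units_in_service: int, failures_observed: int) -> float:
    """Estimate MTBF in years from one 12-month window of field data.

    units_in_service: products fielded during the 12-month period
    failures_observed: hardware failures recorded in that same period
    """
    if failures_observed == 0:
        raise ValueError("no failures observed; this estimator cannot be applied")
    return units_in_service / failures_observed

# The article's example: 100 units in service, 1 failure -> 100-year MTBF
print(field_mtbf_years(100, 1))      # 100.0
# A larger (hypothetical) fleet gives a statistically stronger estimate
print(field_mtbf_years(5000, 12))    # ~416.7 years
```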

Eliminating the top causes of hardware failure significantly increases the MTBF. Design measures that address these causes include fanless cooling, solid-state storage, error-correcting code (ECC) memory, and conformal coating.

Figure 2: Heat sinks and conduction heat pipes eliminate cooling fans, which frequently fail. Courtesy: Schweitzer Engineering Laboratories

Fans: Microprocessors typically rely on fan cooling for fast clock rates and wide bus architectures. Fans wear out, frequently fail, and pull dust and debris into the computer chassis. Internal dust creates a thermal blanket over printed circuit boards, resulting in trapped heat that causes premature component failures. Industrial computers are typically exposed to higher ambient temperatures than their commercial counterparts, so a cooling method is still required. Passive cooling techniques, such as finned convection heat sinks and conduction heat pipes, eliminate fans and their associated failures (see Figure 2).

Physical media drives: In general terms, solid-state drives (SSDs) have three times the MTBF of magnetic, spinning hard-disk drives (HDDs). Because SSDs contain no moving parts, there is no chance of a mechanical failure. SSDs also outperform HDDs in locations with vibration and elevated temperatures. When selecting an SSD, it is important to consider the difference between single-level cell (SLC) and multilevel cell (MLC) technologies. SLC SSDs offer 30 times more writes and increased data retention over lower-priced MLC SSDs.

Error-correcting code (ECC) memory: Electromagnetic interference inside a computer can cause a single bit of dynamic random access memory to flip to its opposite state. This phenomenon may cause an unnoticeable pixel change on a screen or a devastating system crash. ECC memory stores extra check bits with each word of data and uses an error-correcting code to detect and correct these single-bit errors.
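
As a rough sketch of the principle, the Python toy below encodes four data bits with a Hamming(7,4) code and corrects a simulated single-bit flip. Real ECC DRAM uses wider codes (typically single-error-correct, double-error-detect over 64-bit words), so this is illustrative only.

```python
# Toy Hamming(7,4) code: the same single-error-correcting principle used
# (in larger form) by ECC memory. Illustrative sketch only, not how any
# particular DRAM controller implements ECC.

def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Detect and correct a single flipped bit, then return the data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks codeword positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks codeword positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks codeword positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # position of the flipped bit (0 = no error)
    if syndrome:
        c[syndrome - 1] ^= 1         # flip it back
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                         # simulate a single bit flip from interference
print(hamming74_correct(word))       # [1, 0, 1, 1] -- original data recovered
```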

Conformal coating: The protective coating or polymer film that conforms to the circuit board topology is called “conformal coating.” It protects electronic circuits from harsh environments that contain moisture or corrosion-causing chemical contaminants.

Serviceability

Serviceability describes how quickly a computer can be returned to operation after a system fault and is quantified by the mean time to repair (MTTR). MTTR is more challenging to calculate than MTBF because it depends on the time needed to obtain a spare, how the site is staffed, and how the computer is configured. Depending on user conditions, the MTTR can range from seconds to weeks.

Strategies designed to reduce MTTR include redundant power supplies, redundant drives, and out-of-band management (OOBM).

Redundant power supplies: While some industrial computers have extremely reliable power supplies, many do not, necessitating a redundant supply. For computers with highly reliable power supplies, the benefit comes from power input diversity. Powering each supply from a different external source, such as an ac wall outlet and a dc uninterruptible power supply (UPS), keeps the computer running if one source fails. Further, maintenance personnel can eliminate the MTTR altogether by hot-swapping power supplies without disrupting the system.

Redundant drives: A redundant array of independent disks (RAID) is a data storage virtualization technology that enables multiple drives to mirror data to each other while acting as one logical drive. A mirrored array can significantly reduce the MTTR because the surviving drive continues serving data in the event of a single drive failure.
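
The concept can be sketched in a few lines of Python; the MirroredVolume class below is purely hypothetical and models only the mirroring behavior, not a real RAID controller or driver.

```python
class MirroredVolume:
    """Toy model of a two-drive mirror (RAID 1-style) presented as one volume."""

    def __init__(self):
        self.drives = [dict(), dict()]   # two physical drives (block -> data)
        self.healthy = [True, True]

    def write(self, block, data):
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                drive[block] = data      # every write is duplicated on each healthy drive

    def read(self, block):
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                return drive[block]      # any surviving drive can serve the read
        raise IOError("all drives failed")

vol = MirroredVolume()
vol.write(0, b"setpoint history")
vol.healthy[0] = False                   # simulate a single drive failure
print(vol.read(0))                       # data remains available from the mirror
```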

OOBM: This strategy involves a group of technologies that allows the owner of remote computer assets to perform many maintenance and recovery tasks, such as restoring an operating system or performing a system reboot, over the network. Without OOBM, system specialists have to travel to the remote computer. By eliminating the need to travel to a remote site, OOBM significantly reduces the MTTR.

Availability

Availability is a function of reliability and serviceability and defines the percentage of time when the system is operational. It is expressed with the equation:

Availability (A) = MTBF / (MTBF + MTTR)

Maximizing availability requires increasing the MTBF and decreasing the MTTR. A common way to express availability is in terms of "nines" and the downtime each level implies. Five-nines availability (99.999%) may sound like a good design goal, yet it still permits roughly five minutes of downtime per year, which may be catastrophic to a process or enterprise (see Table 1).
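
A short worked example in Python ties the two measures together; the MTBF and MTTR values used here are assumptions chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_minutes(avail: float) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1 - avail) * 365 * 24 * 60

# Assumed figures: a 100-year MTBF (876,000 hours) and an 8-hour repair time
a = availability(876_000, 8)
print(f"availability: {a:.6%}")              # ~99.999087%
print(annual_downtime_minutes(0.99999))      # five nines -> ~5.26 minutes per year
```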

A paradox arises when redundant components, such as power supplies and drives, are added to increase availability: each added component is another potential point of failure, so the overall MTBF can decrease. This is why it is critical to keep redundant designs simple and comprehensive.
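
The trade-off can be illustrated with back-of-the-envelope arithmetic; the 50-year supply MTBF, 8-hour MTTR, independent failures, and ideal failover below are all assumptions made for the sake of the example.

```python
HOURS_PER_YEAR = 8760
mtbf_supply = 50 * HOURS_PER_YEAR    # assumed: one supply has a 50-year MTBF
mttr = 8.0                           # assumed: 8-hour repair time

# Availability of the power function with a single supply
a_single = mtbf_supply / (mtbf_supply + mttr)

# With two supplies in parallel, power is lost only if both are down at once
a_redundant = 1 - (1 - a_single) ** 2

# But the expected number of component failures (service events) doubles
events_single, events_redundant = 1 / 50, 2 / 50

print(f"single supply availability:    {a_single:.7f}")      # ~0.9999817
print(f"redundant supply availability: {a_redundant:.10f}")  # ~0.9999999997
print(f"service events per year: {events_single:.2f} -> {events_redundant:.2f}")
```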

Although classifying computers as “industrial” is still highly subjective, there are objective metrics to consider when choosing the level of availability required by an application. As mission critical applications become more commonplace, so will the need for high-availability computing. By optimizing MTBF and MTTR, computing systems can meet or exceed the availability requirement.

Tim Munro is market manager, Computing Systems at Schweitzer Engineering Laboratories (SEL Inc.). He has worked in the industrial automation industry for 25 years and has expertise in motion control, PLCs, SCADA systems, and networks. He has degrees in electrical engineering technology and control systems engineering technology.
