When data-center downtime is discussed, technological malfunctions, such as equipment failure, or unexpected events, such as natural catastrophes, usually come to mind. According to a study by Uptime Institute (http://uptimeinstitute.org/), however, 70 percent of data-center downtime is caused by human error, which can be avoided by following some simple steps.
Training and Procedural Plans
Ongoing training emphasizing a holistic approach to data-center management is essential to preventing human error. Ensure all individuals with access to a data center, including information-technology (IT), emergency, security, and facility personnel, have basic knowledge of equipment so that it is not shut down by mistake.
Having a documented method of procedure (MOP) mitigates, if not eliminates, the risk associated with performing maintenance. Do not limit an MOP to one vendor, and ensure backup plans are in place in case of unanticipated events.
Sometimes, data-center managers become too comfortable operating their systems and deviate, intentionally or unintentionally, from procedure, inadvertently shutting down equipment. Keeping operational procedures up to date and adhering to them is critical.
Also critical is adherence to secure-access policies. Organizations without data-center sign-in policies run the risk of security breaches. Requiring escorts for visitors, such as vendors, enables data-center managers to know who is entering and exiting their facility at all times.
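For facilities that keep an electronic sign-in sheet, the escort rule can be enforced in software. The following is only a minimal sketch in Python, not a description of any particular access-control product; the class name AccessLog, its fields, and the rule that an escort cannot sign out ahead of a visitor are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AccessLog:
    """Hypothetical sign-in sheet that tracks who is inside the data center."""

    staff: set[str]  # people authorized to enter unescorted
    inside: dict[str, str | None] = field(default_factory=dict)  # person -> escort
    events: list[tuple[datetime, str, str]] = field(default_factory=list)

    def sign_in(self, person: str, escort: str | None = None) -> None:
        if person in self.staff:
            escort = None  # authorized staff need no escort
        elif escort not in self.staff or escort not in self.inside:
            # Visitors (vendors, etc.) need a staff escort who is already inside.
            raise PermissionError(f"{person} needs a signed-in staff escort")
        self.inside[person] = escort
        self.events.append((datetime.now(), "in", person))

    def sign_out(self, person: str) -> None:
        if person not in self.inside:
            raise KeyError(f"{person} is not signed in")
        # Assumed rule: an escort may not leave while a visitor depends on them.
        if any(e == person for e in self.inside.values()):
            raise PermissionError(f"{person} is still escorting a visitor")
        del self.inside[person]
        self.events.append((datetime.now(), "out", person))

    def occupants(self) -> list[str]:
        """Everyone currently inside, in alphabetical order."""
        return sorted(self.inside)
```

Calling occupants() at any moment answers the question of who is in the facility, and the events list preserves the entry and exit history for later review.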
Emergency "Off" Buttons and Labeling
Emergency "off" buttons generally are located near doorways in data centers. Often, they are not covered or labeled and are mistakenly pushed, shutting down power to the entire data center. Unintentional shutdowns can be avoided by labeling and covering emergency "off" buttons.
Incorrectly labeled protection devices, such as circuit breakers, can have a direct adverse impact on data-center load. For a power system to be operated correctly and safely, all switching devices and the facility one-line diagram must be labeled correctly. Procedures for double-checking device labeling should be in place.
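One way to support such a double-check is to compare the labels recorded during a walk-through against those on the one-line diagram. The Python sketch below assumes both have been transcribed into simple mappings of device ID to label text; the function name check_labels, the device IDs, and the label strings are hypothetical.

```python
def check_labels(one_line: dict[str, str], field_survey: dict[str, str]) -> list[str]:
    """Return a list of discrepancies between diagram labels and field labels."""
    problems = []
    # Walk every device that appears in either source.
    for device_id in sorted(one_line.keys() | field_survey.keys()):
        expected = one_line.get(device_id)
        found = field_survey.get(device_id)
        if expected is None:
            problems.append(f"{device_id}: labeled in the field but missing from the one-line diagram")
        elif found is None:
            problems.append(f"{device_id}: on the one-line diagram but no field label recorded")
        elif expected.strip().upper() != found.strip().upper():
            # Ignore case and stray whitespace; flag real wording differences.
            problems.append(f"{device_id}: diagram says {expected!r}, field label says {found!r}")
    return problems


# Illustrative data only.
diagram = {"CB-1": "UPS A input", "CB-2": "PDU 3 feed"}
survey = {"CB-1": "ups a input", "CB-2": "PDU 4 feed", "CB-3": "Spare"}
for line in check_labels(diagram, survey):
    print(line)
```

A report like this does not replace a physical verification of what each breaker actually feeds; it only highlights where the paperwork and the labels disagree so that someone can investigate before switching anything.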
Food, Drinks, and Other Contaminants
Food and drinks should be prohibited in data centers. Liquids pose the greatest risk of shorting out critical computer components. The best way to communicate your data center's food-and-drink policy is to post a sign outside the door stating the policy and how strictly it is enforced.
Poor maintenance of indoor-air quality can allow dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all who enter a data center wear antistatic booties or by placing a mat outside the data center. Another good practice is to avoid packing and unpacking equipment inside a data center. Bringing boxed or palletized equipment inside increases the chances that fibers from boxes and pallets will end up in server racks and other IT equipment.
Conclusion
By implementing these best practices, data-center managers can significantly decrease the chances of data-center downtime caused by human error. Implementing training programs and exercises that emphasize a holistic approach to data-center management is a great starting point for avoiding one of the main factors contributing to data-center downtime.
A final note: I recommend reading Uptime Institute's "Tier Standard: Operational Sustainability," a set of specifications aimed at helping data-center managers increase uptime. The report addresses human error and how it can impact long-term performance. To download, go to http://bit.ly/efib0O.
As manager of technical training for Emerson Network Power Liebert Services, Mark Cousino develops and delivers power-, cooling-, monitoring-, and battery-related training programs intended to maximize uptime. In 2009, he was named Employee of the Year for his role in the conception, design, construction, and grand opening of the Emerson Network Power Learning Center.