5.2.3 CHART System Maintenance
We have identified three major components of the CHART system requiring specific maintenance plans: field equipment, the communications network and the central system hardware and software. The field devices and the communications network that are installed in the field operate in a physical environment that is hostile to electrical and electronic systems. Equipment is subjected to extremes in weather including widely varying temperature, precipitation and wind. It is also subject to damage and disruption from electrical disturbances, construction, accidents and vandalism. Because many of the field elements are the direct interface to the travelers, the CSC/PBFI Team understands that reliability and rapid restoration are essential. Communications equipment and system hardware and software located at the operations centers are typically not subjected to such environmental extremes; however, they must support a substantial workload and must provide virtually continuous operations.
Maintenance for field equipment is currently handled by MDSHA maintenance staff and under existing contracts with equipment vendors and/or maintenance contractors. Maintenance for the communications system is provided under the Statewide Network contract. In-house resources provide hardware and software maintenance for the central systems with as-needed assistance from system providers.
5.2.3.1 Maintenance Sub-Systems
The following are the major sub-systems that have been identified as candidates for operations and maintenance support:
- Variable message signs
- Traveler advisory radio
- Weather stations
- Service monitoring
- Billing reconciliation
- Electronics (switches, routers, modems)
- Management system
- COTS (including database, operating system, network, development tools)
- CHART II
- Computer equipment
- CCTV Monitors/video wall
Clearly, the task of operating and maintaining the current system is a challenge. Funding is always a major consideration, but proper planning can alleviate many of the problems and concerns associated with system maintenance. MDSHA and the CSC/PBFI Team have both had experience with different maintenance options. We will utilize that experience and our familiarity with emerging practices to build a long-term strategy for CHART II.
5.2.3.2 Backup and Recovery
One of the key ingredients of long-term O&M success is the ability of the system to recover from outages and emergency situations. The following discussion demonstrates how the CSC/PBFI design addresses this critical issue, making O&M easier and less costly to manage over the life of the system.
5.2.3.2.1 Failure and Backup Policy
The CSC/PBFI Team has designed the hardware architecture to be highly fault tolerant and redundant, with the goal of maximizing system availability. Failures will be monitored, and a mechanism will be put in place to notify the appropriate personnel automatically when an agreed-upon level of failure is reached. Once a failure is detected, an attempt will be made to restart the affected device. Configurable settings will control the thresholds at which devices are considered truly failed and at which alarms are triggered.
The system is designed with disaster recovery in mind. With a combination of system backups, data backups, data redundancy and network redundancy, the system can be restored and run from alternate sites in order to continue operations during disaster situations. System backup and restoration procedures will be put in place to enable the infrastructure to be rebuilt. Daily backups will be performed so that data lost in a disaster can be rebuilt. State-of-the-art disk storage technology, RAID 5, will be used to add redundancy to the storage hardware. The network infrastructure provides redundancy and allows the use of alternate sites in the case of a significant problem at the SOC.
5.2.3.2.2 Failures and Alarms
Failures and alarms are monitored, logged, and managed as they occur. Certain devices, such as field controllers and variable message signs (VMS), are intelligent devices and will raise alarms as conditions require. The system will log and pass along alarms as required. Communications with all devices will be monitored. When a device stops responding, a failure will be noted. All failures and alarms are logged for future analysis.
Once a failure or alarm is detected, an attempt to resume operation of the device will be initiated. Thresholds will be maintained for each device to determine the severity of the failure and/or alarm. When the threshold is reached, an alarm will be raised and the appropriate personnel will be paged automatically in accordance with the system configuration. For example, the threshold for considering a detector station failed may be set at three consecutive failure-to-respond errors. Individual errors would be logged, but the alarm would not be triggered until the threshold is reached.
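The threshold behavior described above can be illustrated with a minimal Python sketch. It is not the CHART II implementation: the class name DeviceMonitor, the device identifier, and the decision to reset the count on any successful response are assumptions made for illustration. Only the example threshold of three consecutive errors comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceMonitor:
    """Tracks consecutive failure-to-respond errors for one device (hypothetical)."""
    device_id: str
    threshold: int = 3            # configurable; 3 matches the example in the text
    consecutive_errors: int = 0
    alarmed: bool = False
    log: list = field(default_factory=list)

    def record_poll(self, responded: bool) -> bool:
        """Record one poll result; return True only when an alarm is newly raised."""
        if responded:
            # Assumption: any successful response resets the count and the alarm state.
            self.consecutive_errors = 0
            self.alarmed = False
            return False
        self.consecutive_errors += 1
        # Every individual error is logged, whether or not it raises an alarm.
        self.log.append(f"{self.device_id}: no response ({self.consecutive_errors})")
        if self.consecutive_errors >= self.threshold and not self.alarmed:
            self.alarmed = True
            return True           # the caller would page personnel at this point
        return False


monitor = DeviceMonitor("detector-17")
polls = (False, False, True, False, False, False, False)
results = [monitor.record_poll(ok) for ok in polls]
print(results)
```

In this run the alarm is raised once, on the third consecutive missed response after the successful poll, while all six missed responses appear in the log.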
The proposed system also includes a function that monitors all of the software processes that operate in the system. If a software process fails, an attempt to restart it will be triggered automatically.
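One way such a process monitor could work is sketched below in Python. The class ProcessWatchdog, the process name "poller", and the stand-in worker command are hypothetical; the actual monitoring function may be built quite differently. The sketch polls each child process and restarts any that has exited.

```python
import subprocess
import sys


class ProcessWatchdog:
    """Restarts monitored child processes that have exited (illustrative only)."""

    def __init__(self):
        self.commands = {}    # name -> argv list used to (re)start the process
        self.children = {}    # name -> subprocess.Popen handle
        self.restarts = {}    # name -> number of restarts performed

    def start(self, name, argv):
        self.commands[name] = argv
        self.children[name] = subprocess.Popen(argv)
        self.restarts[name] = 0

    def check(self):
        """Poll every process once; restart those that exited. Returns their names."""
        restarted = []
        for name, child in list(self.children.items()):
            if child.poll() is not None:          # None means still running
                self.children[name] = subprocess.Popen(self.commands[name])
                self.restarts[name] += 1
                restarted.append(name)
        return restarted


watchdog = ProcessWatchdog()
# Stand-in worker that exits immediately with an error status.
watchdog.start("poller", [sys.executable, "-c", "import sys; sys.exit(1)"])
watchdog.children["poller"].wait()    # let the stand-in worker fail
restarted = watchdog.check()
watchdog.children["poller"].wait()    # reap the restarted stand-in as well
print(restarted)
```

A production monitor would call check() on a timer and would normally limit the number of restart attempts before escalating to an alarm; both are omitted here for brevity.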
The architecture has been designed so that all servers are redundant. The redundancy provides a backup method of processing should a processor fail. Provisions have been made for redundancy in networking, power supplies, cooling units, and processors. The networking is redundant through the use of multiple network interface cards. Separate network interface cards are also included to service dedicated mirror links. The power supply is hot pluggable to allow the installation of a new power supply without powering down the unit. Servers include redundant fan modules to ensure proper cooling in the event that the prime cooling unit fails.
The processing capabilities of the servers are redundant through automatic fail-over operations in the event that the primary server fails. Mirroring of the data between the primary and backup servers is accomplished through dedicated links to eliminate any network degradation that the additional I/O could cause.
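The fail-over decision can be sketched as a heartbeat check. The Python example below is a simplified illustration, not the proposed mechanism: the ServerPair class, the five-second timeout, and the use of heartbeats at all are assumptions, since the text does not say how a primary failure is detected.

```python
class ServerPair:
    """Chooses the active server of a primary/backup pair from heartbeats (sketch)."""

    def __init__(self, heartbeat_timeout=5.0):
        self.heartbeat_timeout = heartbeat_timeout    # seconds; assumed value
        self.last_heartbeat = {"primary": None, "backup": None}
        self.active = "primary"

    def heartbeat(self, server, now):
        """Record that `server` reported in at time `now` (seconds)."""
        self.last_heartbeat[server] = now

    def _alive(self, server, now):
        last = self.last_heartbeat[server]
        return last is not None and (now - last) <= self.heartbeat_timeout

    def evaluate(self, now):
        """Return the server that should be active at time `now`."""
        if not self._alive(self.active, now):
            standby = "backup" if self.active == "primary" else "primary"
            # Fail over only if the standby itself is healthy.
            if self._alive(standby, now):
                self.active = standby
        return self.active


pair = ServerPair(heartbeat_timeout=5.0)
pair.heartbeat("primary", now=0.0)
pair.heartbeat("backup", now=0.0)
before = pair.evaluate(now=3.0)     # both servers healthy
pair.heartbeat("backup", now=8.0)   # the primary has gone silent
after = pair.evaluate(now=9.0)
print(before, after)
```

Timestamps are passed in explicitly so the behavior is easy to follow. Note that this sketch does not fail back automatically when the primary returns; whether to do so is a design choice the text leaves open.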
5.2.3.2.4 Failure Tolerance
Additional design features have been included to provide tolerance of failures. These include state-of-the-art hard drive architecture and power systems. The hard drives are hot swappable and operate in a RAID 5 configuration, which allows for both performance gains and recovery from disk failures. The power system is fault tolerant through the use of an uninterruptible power supply (UPS) along with monitoring software that will shut down the servers in a predefined manner when an extended power failure occurs. A communications management system that will provide monitoring of the communications network is being developed under a separate contract.
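The recovery property of RAID 5 mentioned above rests on XOR parity: the parity block is the XOR of the data blocks in a stripe, so any single lost block equals the XOR of the surviving blocks and the parity. The short Python sketch below shows that arithmetic on one stripe. The block contents are made up, and a real array also rotates parity across the drives, which is not modeled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives (illustrative contents).
data = [b"CHART-01", b"CHART-02", b"CHART-03"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Because the rebuild needs every surviving block, the array tolerates one failed drive at a time, which is why hot-swappable drives and prompt replacement matter.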
5.2.3.2.5 Backup and Restoration
Procedures are being designed that will allow for the restoration of both the computer system infrastructure and the data being produced. In order to rebuild the infrastructure for any computer, binary images along with applicable documentation will be produced and delivered to MDSHA. This will allow operators to restore any individual computer or server in an estimated 30-minute window. Data will be backed up using Digital Linear Tape (DLT) drives and Computer Associates ARCserve for Novell or ARCserve for Windows NT software. Backups will be performed daily and will therefore be scheduled to run automatically.
The restoration process for both the infrastructure and the data should be tested by actually restoring machines at least once a year to confirm that backup and recovery still work. More frequent backup/restore tests should be implemented for the system data. All data and infrastructure media should be rotated off-site for storage and management, with the most recent versions kept at an off-site location and previous versions kept on-site.
This extensive backup and recovery system is designed to ensure that the CHART II system will be available at the times it is most needed. O&M will be performed in an environment where the personnel on the job can be confident in the system's ability to perform in times of need.
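The rotation and test rules above could be tracked with a few lines of code. The Python sketch below is only an illustration: the function names, the sample dates, and the 365-day reading of "at least once a year" are assumptions, and the backup software may provide its own media tracking.

```python
from datetime import date


def assign_locations(backup_dates):
    """Place the most recent backup set off-site and all earlier sets on-site."""
    if not backup_dates:
        return {}
    newest = max(backup_dates)
    return {d: ("off-site" if d == newest else "on-site") for d in backup_dates}


def restore_test_overdue(last_test, today, max_days=365):
    """True when the last full restore test is more than `max_days` old."""
    return (today - last_test).days > max_days


# Illustrative dates only.
sets = [date(1999, 3, 1), date(1999, 3, 2), date(1999, 3, 3)]
locations = assign_locations(sets)
overdue = restore_test_overdue(last_test=date(1998, 1, 15), today=date(1999, 3, 3))
print(locations[date(1999, 3, 3)], overdue)
```

A shorter max_days value could be passed for the system data, for which more frequent restore tests are recommended.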