How to Maintain z/OS Availability During System Enhancements: Our Recommended Tactics, Tools, and Services
You don’t need to lose z/OS availability when you install system enhancements.
With the right strategy, you can keep your mainframe and its applications running at peak operations while you perform maintenance, fix bugs, and deploy updates.
And we wrote this series to give you that strategy.
This is part two in our series on maintaining z/OS availability during enhancements.
In part one, we outlined our strategy to maintain availability. In it we explored:
- Why z/OS systems are so prone to problems, and why you need to establish a dedicated risk-mitigation strategy when performing system enhancements.
- The primary risks you will face when performing z/OS enhancements.
- What steps you must follow when performing z/OS enhancements to mitigate your risks and avoid unexpected outcomes.
You can read part one here.
In this final part of our series, we’ll dig into the details of our strategy. We’ll cover:
- Our 6 best methods to maintain availability during z/OS enhancements.
- The 4 primary tools we recommend to maintain z/OS availability.
- The services we offer to anyone who wants to improve their availability.
As we previously outlined, to maintain availability while enhancing your z/OS infrastructure, you will need to perform a complex set of planning and preventative measures that include monitoring, analysis, and resource management.
To do so as efficiently and effectively as possible, we recommend the following smart technological solutions and methods.
We consider Parallel Sysplex critical for maintaining z/OS availability, and we will explain how to use it in much greater depth later in this article. But at the highest level, know this: before Parallel Sysplex was created, application availability was equal to system availability. If your system went down, your applications went down, every time. But with Parallel Sysplex, you can keep your applications up and running even if your system goes down. This solution allows you to seamlessly share data between multiple operating system images without compromising the data’s integrity.
Here’s how it works. If something changes in one of your mainframe’s components and causes an unexpected failure, then Parallel Sysplex can redirect the load and execute procedures from applications on that failing mainframe to a stable system. In addition, Parallel Sysplex can also distribute workloads across multiple mainframes during peak hours to maintain optimal z/OS infrastructure performance even when nothing has gone wrong, per se.
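To make the redirect-and-balance idea concrete, here is a minimal Python sketch. It is a conceptual model only and does not reflect how Parallel Sysplex is implemented internally; the `SystemImage` class, the `route` function, and the system names `SYSA` and `SYSB` are hypothetical names we made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SystemImage:
    """Hypothetical stand-in for one system image in the cluster."""
    name: str
    healthy: bool = True
    active_work: int = 0


def route(systems):
    """Send one unit of work to the healthy system with the least active work."""
    candidates = [s for s in systems if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy system available")
    target = min(candidates, key=lambda s: s.active_work)
    target.active_work += 1
    return target.name


systems = [SystemImage("SYSA"), SystemImage("SYSB")]

# While both systems are healthy, work is spread across them.
balanced = [route(systems) for _ in range(4)]

# Simulate an unexpected failure on SYSA: new work goes to SYSB only.
systems[0].healthy = False
after_failure = [route(systems) for _ in range(2)]
```

The same routing function covers both cases described above: balancing during normal operation and redirecting work away from a failed system.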
We consider Parallel Sysplex to be essential for maintaining z/OS performance at all times, and we will dig further into our recommended use of this solution shortly.
By deploying multiple data stores, you will be able to create clearly defined disaster recovery plans, and you will be able to more easily recover your data if you do experience a system failure.
With the right strategy you can even maintain continuous availability with a single data center. You simply have to mirror all disks from their primary data store to a secondary data store, which will make full volume dumps available if a disaster occurs. With this strategy you will lack automation, but you can bring it back in by implementing GDPS Metro HyperSwap Manager at the same time.
Here’s a practical example of what that looks like and how these solutions can work in a real-world disaster scenario.
For one of our banking clients we configured Parallel Sysplex and data sharing for their SAP workload. In this setup we combined dynamic workload balancing with their GDPS solution. With all of this in place, if our client experiences a failure on their primary disk subsystem, the controlling system will avoid an outage by automatically initiating a HyperSwap and transparently switching all of the systems in their Parallel Sysplex over to their secondary volumes.
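The swap behavior in this scenario can be sketched in a few lines of Python. This is a toy model under our own assumptions, not GDPS or HyperSwap code; the `MirroredDisk` class, its method names, and the `ACCT001` key are hypothetical.

```python
class MirroredDisk:
    """Toy model of a synchronously mirrored disk pair with automatic swap."""

    def __init__(self):
        self.copies = {"primary": {}, "secondary": {}}
        self.failed = set()
        self.active = "primary"

    def write(self, key, value):
        # Synchronous mirroring: every surviving copy is updated
        # before the write returns.
        for name, data in self.copies.items():
            if name not in self.failed:
                data[key] = value

    def read(self, key):
        return self.copies[self.active][key]

    def fail(self, name):
        # Mark a copy as failed; if it was the active one, swap to a survivor.
        self.failed.add(name)
        if name == self.active:
            survivors = [n for n in self.copies if n not in self.failed]
            if not survivors:
                raise RuntimeError("no surviving copy to swap to")
            self.active = survivors[0]


disk = MirroredDisk()
disk.write("ACCT001", 250)
disk.fail("primary")                       # simulated primary subsystem failure
balance_after_swap = disk.read("ACCT001")  # served from the secondary copy
disk.write("ACCT001", 300)                 # new writes land on the survivor
```

Because both copies were kept in step before the failure, the application keeps reading and writing without noticing the swap.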
With this method, you will redirect traffic and system load to your currently working components, and it will allow you to artificially create failure situations to periodically test your system.
This method allows you to automatically restore uncompleted operations by automatically redistributing traffic and performing load balancing when one of your system’s components fails. If you properly set this up, then when a problem occurs your system will automatically flip your in-progress operations from one mainframe to another.
You will always need to establish solid z/OS system monitoring. Now, system monitoring will not prevent any problems on its own, but it will augment or even drive your other methods of maintaining availability.
The right system monitoring method will:
- Make it possible to see what operations are being performed within a certain time period.
- Show you statistics on resource usage, compared against critical indicators of system loads.
- Help you identify which of your components have the highest risk of overloading your system.
- Alert you to any sudden jumps in system resource usage.
- Report on system operations to help you schedule your high-load system enhancements during safe times.
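As an illustration of the “sudden jumps” alert in the list above, here is a minimal Python sketch. It assumes resource usage arrives as a simple list of periodic samples; the `find_spikes` function, the window size, and the 1.5x factor are hypothetical choices, not settings from any monitoring product.

```python
from statistics import mean


def find_spikes(samples, window=3, factor=1.5):
    """Return indexes where a sample exceeds `factor` times the
    average of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


# CPU busy percentages collected at regular intervals (made-up values).
cpu_busy = [40, 42, 41, 43, 90, 44, 42]
spike_indexes = find_spikes(cpu_busy)
```

Comparing each sample against a short rolling baseline, rather than a fixed limit, lets the same check work for quiet and busy periods alike.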
Finally, we recommend using existing software, or building homegrown solutions, to establish control over external access to your mainframe’s data.
The data access control software will:
- Control access to your mainframe’s data and resources.
- Prevent unwanted access to your mainframe’s data and resources.
- Make a record of all access to your mainframe’s data and resources.
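The three responsibilities above (control, prevent, record) can be sketched together in Python. This is a simplified illustration under our own assumptions and is not how any specific security product works; the `PERMISSIONS` table, the `check_access` function, and the user and resource names are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical permission table: (user, resource) -> allowed actions.
PERMISSIONS = {
    ("OPER1", "PROD.PAYROLL"): {"READ"},
    ("ADMIN1", "PROD.PAYROLL"): {"READ", "UPDATE"},
}

audit_log = []


def check_access(user, resource, action):
    """Allow only explicitly permitted actions and record every attempt."""
    allowed = action in PERMISSIONS.get((user, resource), set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "allowed": allowed,
    })
    return allowed


results = [
    check_access("OPER1", "PROD.PAYROLL", "READ"),    # permitted
    check_access("OPER1", "PROD.PAYROLL", "UPDATE"),  # not permitted
    check_access("GUEST", "PROD.PAYROLL", "READ"),    # unknown user
]
```

Note that denied attempts are logged as well as successful ones, since the record of refused access is often the more useful one during an investigation.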
IBM has developed and provided a wide range of tools that make it easier to keep your z/OS systems available. In addition, there are a few external software and hardware-based tools you can use to supplement IBM’s suite of solutions.
Primarily, we recommend the following tools and solutions. We use all of them with our own clients, and know that they deliver results in complex real-world scenarios.
First, we use a few software and hardware-based tools to make remote copies of our clients’ data, so that the data is available again as quickly as possible after a disaster occurs.
The primary data replication tools and methods we use and recommend include:
- Metro Mirror: This tool synchronously copies any updates to your primary data volumes to your remote volumes, and all interactions related to these changes happen between the disk subsystems themselves.
- XRC: This method mirrors your data asynchronously. It uses a z/OS component named System Data Mover (SDM) to retrieve updates from your primary disk subsystem and then apply those changes to your secondary volumes.
- Global Mirror: This tool also mirrors your data asynchronously, but unlike XRC it performs all interactions directly between your disk subsystems rather than through SDM.
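The practical difference between synchronous mirroring (Metro Mirror) and asynchronous mirroring (XRC, Global Mirror) is how far the remote copy can lag behind the primary. The Python sketch below models only that difference; the `SyncReplica` and `AsyncReplica` classes and the record keys are hypothetical and do not represent the actual replication products.

```python
from collections import deque


class SyncReplica:
    """The remote copy is updated before the write completes."""

    def __init__(self):
        self.primary, self.secondary = {}, {}

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value


class AsyncReplica:
    """Writes complete immediately; updates are shipped to the remote copy later."""

    def __init__(self):
        self.primary, self.secondary = {}, {}
        self.pending = deque()

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def drain(self, max_updates):
        # Apply up to `max_updates` queued updates to the secondary, in order.
        applied = 0
        while self.pending and applied < max_updates:
            key, value = self.pending.popleft()
            self.secondary[key] = value
            applied += 1
        return applied


sync, lagging = SyncReplica(), AsyncReplica()
for i in range(5):
    sync.write(f"REC{i}", i)
    lagging.write(f"REC{i}", i)

lagging.drain(3)  # only part of the backlog reaches the remote site

sync_missing = len(sync.primary) - len(sync.secondary)
async_missing = len(lagging.pending)
```

If the primary site were lost at this point, the synchronous copy would be complete, while the asynchronous copy would be missing the updates still in the queue.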
Establishing your data replication tools and methods is not enough on its own. You also have to test your tools and methods by simulating various hardware and software failure scenarios — before you suffer an incident.
As we mentioned previously, we consider GDPS essential for maintaining z/OS system availability when you are performing system enhancements. GDPS delivers business IT resilience and disaster recovery, and addresses problems around application availability, data integrity, and data replication.
We believe you can only achieve the highest possible availability levels — and avoid application component loss — by using GDPS to run several failure-isolated copies at all times. To do so, you must be able to share integrated data on the record level and dynamically send incoming requests across your available servers. GDPS is able to do this by linking separate systems in the cluster via multiple software and hardware components. This allows GDPS to eliminate single points of failure while presenting all of its systems as a single image to your applications and end users.
You can further protect your systems from downtime by utilizing the GDPS Continuous Availability tool to switch workloads between two sysplexes during either planned or unplanned outages.
In addition, we recommend using IBM System Automation with GDPS. By combining these tools, you will be able to manage automations within your workloads and systems, you can ensure your critical policy repository functions properly, and you can further accelerate recovery by configuring automated disaster recovery for many different scenarios.
We use three monitoring tools and methods in our day-to-day support for our clients. These monitoring tools help to ensure the data center’s GDPS maintains effective performance levels and high availability.
- IBM OMEGAMON family: With this tool you can monitor and manage the performance and availability of individual z/OS systems, as well as the workload performance and resource utilization of the Parallel Sysplexes in which they participate. We mainly use OMEGAMON to monitor Db2 threads and to perform batch reporting — usually accounting reports and useful exception reports that trigger alarms when a transaction has exceeded a threshold. We schedule these batch reporting jobs to automatically run and save themselves at regular intervals so we always have the data we need to investigate and prove online or past performance issues.
- Resource Measurement Facility (RMF): With this performance monitor you can collect data for long-term performance analysis, capacity planning, and root cause analysis, and give your stakeholders evidence of performance issues. RMF gives you comprehensive visibility into the many data types within your z/OS infrastructure, and allows you to analyze both system performance and transactions. We often use RMF reporting for Enclaves to see if individual transactions are performing well or if they are hogging system resources.
- System Management Facilities (SMF): With this feature you can collect and record system and workload-related metrics like CPU usage, direct access volume activity, configurations, system security, and other system resources and application activities. These records form the foundation of performance and security reports.
These three tools work well together. Some of our favorite combinations and use cases include:
- We often analyze RMF reports to improve our Parallel Sysplex’s configurations. With these metrics we can improve performance of Coupling Facility, z/OS itself, and subsystems like Db2, CICS, MQ, and application transactions.
- We isolate and fix system degradations by using RMF III Interactive Performance Analysis to identify which of our workloads are delayed and why.
- We regularly dump and collect SMF records to use later on. If we identify a bottleneck, we can run RMF batch reporting for the bottleneck’s timeframe and see which indicators exceeded their normal, historical values. We can then also look at sysplex-level coupling performance indicators, such as the number of synchronous and asynchronous operations and the average time per operation on CF links. (Any performance degradation in the CF creates performance degradation across the data center, so it must be avoided.)
- We often combine two or more reports to develop a more comprehensive picture of performance incidents. For example, we might run an OMEGAMON for Db2 statistics report to check the health of the z/OS system as a whole, and then run an RMF for Enclaves or accounting report to see resource usage by individual workload. By combining these two, performance problems have nowhere to hide.
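The step of checking which indicators exceeded their normal, historical values can be sketched as a simple baseline comparison. This Python example assumes the values have already been extracted from reports into plain numbers; the `exceeded` function, the indicator names, and the two-standard-deviation limit are hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def exceeded(history, observed, sigmas=2.0):
    """Return the indicators whose observed value is more than
    `sigmas` standard deviations above their historical mean."""
    flagged = []
    for name, value in observed.items():
        past = history[name]
        limit = mean(past) + sigmas * pstdev(past)
        if value > limit:
            flagged.append(name)
    return sorted(flagged)


# Historical values per indicator (made-up numbers).
history = {
    "cf_sync_service_time_us": [10, 12, 11, 9, 10],
    "cpu_busy_pct": [60, 65, 62, 63, 61],
}

# Values observed during the bottleneck's timeframe.
observed = {"cf_sync_service_time_us": 25, "cpu_busy_pct": 64}

suspects = exceeded(history, observed)
```

Here only the coupling facility service time stands out, which would point the investigation toward the CF rather than toward overall CPU load.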
As a note: We also use Workload Manager (WLM) to make sure the client’s critical workloads always receive the resources they need, in their order of priority. WLM allows us to define and adjust the performance goals for each of our client’s workloads — based on their own SLAs — and to then make sure different classes of transactions don’t interfere with each other, and that each critical workload always receives the resources it needs, when it needs them.
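To show the basic idea of serving workloads in order of priority, here is a deliberately simplified Python sketch. Real WLM manages workloads against performance goals and is far more sophisticated; this example only hands out a fixed capacity by importance, and the `allocate` function, the workload names, and the numbers are hypothetical.

```python
def allocate(capacity, workloads):
    """Grant capacity to workloads in order of importance.

    `workloads` is a list of (name, importance, demand) tuples,
    where importance 1 is the highest in this sketch.
    """
    grants = {}
    for name, _importance, demand in sorted(workloads, key=lambda w: w[1]):
        grant = min(demand, capacity)
        grants[name] = grant
        capacity -= grant
    return grants


grants = allocate(
    100,
    [("BATCH", 3, 60), ("ONLINE", 1, 50), ("REPORTS", 2, 30)],
)
```

With 100 units available and 140 requested, the most important workloads receive their full demand and the least important one absorbs the shortfall.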
Finally, we use and recommend the IBM QRadar Security Intelligence Platform family of products. IBM QRadar has been a top-ranked SIEM product for over 11 years, and it provides a unified architecture to integrate a wide range of security capabilities, including:
- Security Information and Event Management (SIEM)
- Log management
- Anomaly detection
- Incident forensics
- Configuration and vulnerability management
We use the IBM Security zSecure suite of products, many of which feed into QRadar. This suite provides cost-effective security administration, threat detection, and automated audit and compliance reporting.
We use this suite in much the same way as our other monitoring tools. We use it to collect both online and offline (historical or batch) reports. We use the suite’s Security Event Management (SEM) component to perform real-time monitoring, to manage events, and to configure notifications. We use its Security Information Management (SIM) component for longer-term storage, analysis, and reporting of events.
Within the suite, we use zSecure Alert and Audit to perform security operations on the mainframe itself. We capture events on the mainframe, and then forward them to a central cross-platform solution. (zSecure is able to send both alerts and enriched, non-collated events, depending on your requirements.)
zSecure Adapters for SIEM and zSecure Audit ship with out-of-the-box event feeds into IBM QRadar SIEM. These feeds support basic z/OS events and RACF, Integrated Cryptographic Services Facility (ICSF), Db2, CICS, z/OS Communications Server, IBM Security Key Lifecycle Manager (ISKLM), WebSphere Application Server (WAS), and other events.
The main source for events is, you guessed it, SMF records. zSecure enriches the information by correlating it with your security database and system environment, then transfers it as a SIM feed using file polling triggered by the SIEM solution.
This information can also be sent in near real-time if you use the Syslog protocol. To do so, you can follow one of two methods:
- Use the SMF in-memory (INMEM) resource feature. Note that the INMEM feature requires the use of SMF log streams (as opposed to data sets).
- Or use the zSecure SMF Collector.
You can also establish real-time monitoring through zSecure Alert, which sees SMF records before they have been written to disk, and can also listen to WTO messages. zSecure Alert can aggregate events across time intervals and trigger alerts only when it sees that certain thresholds have been exceeded. (A pluggable component IBM zSecure Alert DSM is provided by QRadar SIEM.)
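The aggregate-then-alert pattern described here can be sketched in Python. This illustrates the general technique only and is not zSecure Alert code or configuration; the `alerts_by_interval` function, the event type names, and the threshold of 3 are hypothetical.

```python
from collections import Counter


def alerts_by_interval(events, interval_seconds, threshold):
    """Count events per (time interval, event type) and return
    (interval_start, event_type, count) for counts above `threshold`."""
    counts = Counter(
        (timestamp // interval_seconds, event_type)
        for timestamp, event_type in events
    )
    return sorted(
        (bucket * interval_seconds, event_type, count)
        for (bucket, event_type), count in counts.items()
        if count > threshold
    )


# (timestamp in seconds, event type), made-up sample events.
events = [
    (5, "LOGON_FAIL"), (12, "LOGON_FAIL"), (40, "LOGON_FAIL"),
    (55, "LOGON_FAIL"), (70, "LOGON_FAIL"), (75, "DATASET_DENIED"),
]

alerts = alerts_by_interval(events, interval_seconds=60, threshold=3)
```

Four failed logons inside the first minute cross the threshold and raise one alert, while the isolated events in the second minute stay quiet. That is the point of aggregating before alerting: single events do not flood your operators.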
By combining RMF, OMEGAMON, QRadar SIEM, and zSecure, you are able to monitor everything, including security events, to find anything that could impact the availability of your data center.
While any organization can use the strategies, tactics, and tools that we have outlined and recommended to maintain its z/OS availability, we also recognize that many organizations lack the internal resources to put all of these approaches to work.
To help organizations that need to maintain z/OS availability but whose teams are already stretched thin, we offer a wide range of relevant services. We’ve offered these for over 25 years, and know we can bring them successfully to your organization.
- Configuring Parallel Sysplex from scratch, covering both software and hardware, including PR/SM, LPARs, Coupling Facility (CF), CF structures, channels (links), and subchannels.
- z/OS system installation and configuration for execution in a Parallel Sysplex environment, including management of Couple Data Sets (CDS) and policies.
- Networking, Sysplex Distributor, Virtual IP Address (VIPA) management. With Sysplex Distributor, clients receive the benefits of workload distribution provided by Workload Manager (WLM).
- Installation and configuration of System Automation for z/OS on control systems.
- Installation and configuration of z/OS components and software products for execution in a Parallel Sysplex, such as CICSPlex and Db2 Data Sharing.
- Installation and configuration of customer software. For instance, for one of our customers, a bank using SAP, we manage Db2 Data Sharing as the SAP database, with SAP Sysplex Failover Support configured.
- Regular maintenance of the operating system and software products, with well-established scenarios and approaches for software upgrades that preserve continuous availability.
- Storage management for the usage across systems in the Parallel Sysplex environment.
- Setting up disaster recovery (DR) scenarios using data replication technologies.
- Installation and configuration of QRadar and IBM zSecure, as well as cooperative incident investigation and remediation recommendations.
In short: If you don’t know where to start with all of this work, if you are unsure of which availability features will help you most, or if you simply need help bringing these availability features to life — then reach out today.
At IBA Group we have delivered z/OS availability in the most demanding of environments, where uptime is most important, including large banks and transportation companies. Along the way we have learned how to deliver complete, high-quality system configurations, monitoring, and implementations of any size.