How to Maintain z/OS Availability During System Enhancements: Our Core Strategy

April 20, 2021 | VALERI ARANOUSKI

It’s a common problem for Db2 administrators.

You are caught in a bind.

On the one hand, you must continuously enhance your z/OS infrastructure by performing regular system maintenance, fixing bugs, and deploying software updates.

On the other hand, every time you perform one of these enhancements you risk compromising the performance of your z/OS infrastructure and the programs that run on it.

You can’t stop performing these enhancements. They keep your infrastructure fundamentally stable, they provide your end users with new critical features, and they ensure your applications remain supported by their developers.

But you also can’t afford to make a mistake when performing these enhancements. One mistake from one enhancement can create a significant performance drop. Over time, these performance drops can add up to millions of dollars in losses, while damaging your organization’s reputation and your team’s credibility.

In short: You must find a way to continue to perform your enhancements while mitigating the risk of mistakes, downtime, and unpredictable outcomes.

We wrote these articles to show you how to maintain z/OS availability during system enhancements.

This is part one of a two-part series. In it, we will explore:

  • Why z/OS systems are so prone to problems, and why you need to establish a dedicated risk-mitigation strategy when performing system enhancements.
  • The primary risks you will face when performing z/OS enhancements.
  • What steps you must follow when performing z/OS enhancements to mitigate your risks and avoid unexpected outcomes.


Why z/OS Breaks During Routine System Enhancements

Let’s be clear about one thing — we are strong supporters of z/OS.

A mainframe running z/OS offers organizations the most reliable operating system available for hosting and running their most mission-critical applications.

But z/OS is also a very complex operating system. To maintain peak performance, it relies on many interrelated components and utilizes a suite of equally complex software products and user applications, like CICS and Db2.

Any change to any one of these components or applications can have an unforeseen impact on any of the system’s other components or applications — and that can lead to unexpected performance loss for any program running on the z/OS mainframe.

Consider this practical example, using Global Locks in Db2 in z/OS.

If you make a change to your Global Locks, you may impact your z/OS mainframe’s:

  • The Coupling Facility Control Code (CFCC) firmware.
  • Sysplex Services for Data Sharing and Cross-system Coupling Facility (XES/XCF) in z/OS.
  • Internal Resource Lock Manager (IRLM) of Db2 for z/OS.
  • The Db2 for z/OS product itself.
  • And, finally, any application that takes Db2 locks, since the change might propagate across the other Db2 members of its Data Sharing group.

For another practical example, consider the Db2 optimizer component.

The Db2 optimizer’s algorithms are so complex — and the access path for SQL queries depends on so many factors — that even one Program Temporary Fix (PTF) can disrupt the component’s delicate balance and create unexpected, far-reaching performance degradation.
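To make that risk concrete, here is a minimal, hypothetical sketch of the kind of access-path comparison one might run in a sandbox before and after applying a PTF. The connection string, query numbers, and sample SQL are placeholders, and the sketch assumes the ibm_db Python driver and a PLAN_TABLE populated under the connecting user's schema:

```python
# Hypothetical sketch: capture a statement's access path before and after
# sandbox maintenance, so a PTF-induced access path change is caught there
# rather than in production. Connection details and SQL are placeholders.
import ibm_db

CONN_STR = ("DATABASE=TESTDB;HOSTNAME=sandbox.example.com;PORT=446;"
            "PROTOCOL=TCPIP;UID=testuser;PWD=secret")

def capture_access_path(conn, queryno, sql_text):
    """Run EXPLAIN for one statement and return its PLAN_TABLE rows."""
    ibm_db.exec_immediate(conn, f"EXPLAIN PLAN SET QUERYNO = {queryno} FOR {sql_text}")
    stmt = ibm_db.exec_immediate(
        conn,
        "SELECT QBLOCKNO, PLANNO, METHOD, TNAME, ACCESSTYPE, MATCHCOLS, ACCESSNAME, PREFETCH"
        f" FROM PLAN_TABLE WHERE QUERYNO = {queryno} ORDER BY QBLOCKNO, PLANNO"
    )
    rows, row = [], ibm_db.fetch_assoc(stmt)
    while row:
        rows.append(tuple(sorted(row.items())))
        row = ibm_db.fetch_assoc(stmt)
    return rows

if __name__ == "__main__":
    conn = ibm_db.connect(CONN_STR, "", "")
    query = "SELECT * FROM ORDERS WHERE CUST_ID = 12345"
    before = capture_access_path(conn, 1001, query)
    # ... apply the PTF in the sandbox, rebind/re-prepare, then re-run ...
    after = capture_access_path(conn, 1002, query)
    if before != after:
        print("Access path changed -- review it before promoting this maintenance.")
```

If a comparison like this flags a change, that is the moment to dig into the new access path — not after the fix reaches production.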

In sum: even a minor PTF, a small configuration update, a new deployment of application code, or really any change to any of your z/OS components can break your system and generate performance-degrading outcomes.

Ultimately, the smallest change and smallest resulting problem can cascade into:

  • Failure to realize performance improvement from your enhancement.
  • Failure to fix the bugs your enhancement was meant to address (while potentially creating new performance losses on top of them).
  • Unplanned outages for your individual software components or for your system as a whole.
  • Configuration resets (for example, parameters that were applied dynamically to the running system can be lost after an IPL).
  • The true nightmare — data loss or a complete loss of software availability.

You can’t leave these outcomes up to chance. You must do everything you can to reduce the potential for problems — every time you perform a z/OS enhancement.

Here’s how.

First Things First: Learn the Most Common z/OS Enhancement Issues

Before you can mitigate your risks, you first need to know what they are. And the biggest and most common z/OS enhancement risks all come down to one thing — high loads.

In most cases, z/OS enhancement problems happen because whoever planned the enhancement accidentally created excess system load or increased their transaction volumes past a safe threshold.

Specifically, most z/OS enhancement problems are created by:

  • Creating high load during peak hours. If you perform enhancements during periods when a large share of your system resources is already in active use, you will likely overload some component within your system.
  • Creating high load due to poorly written legacy programs. If you are using modern software development approaches, this won’t be a problem. But if you are still running a lot of legacy systems, they can spike resource usage.
  • Creating high load due to human error. If you have younger mainframe professionals take over for your experienced mainframe experts, then it’s possible they will lack the know-how to provide high-quality software support.
  • Creating system overload due to regular maintenance. If you perform your enhancements at the same time that some of your components are being updated, maintained, or are otherwise unavailable, you can cause excess load.
  • Creating system overload due to poor configuration. If your system or any of its components is configured incorrectly by inexperienced mainframe professionals, you can run into problems even during light-load hours.

There are other examples, but these are the most common reasons why you might experience a problem during an enhancement due to excess system loads.
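As a simple illustration of how the first of those risks might be caught before it bites, here is a hypothetical pre-flight check that refuses to open a maintenance window when recent utilization is already high. The CSV feed (for example, exported from RMF/SMF reporting), the column names, and the 70% threshold are placeholders for whatever your own monitoring pipeline and safe limits look like:

```python
# Hypothetical pre-flight check: refuse to start an enhancement window if the
# most recent CPU utilization samples show the system is already under heavy load.
import csv
import sys

SAMPLES_FILE = "lpar_cpu_samples.csv"   # placeholder; columns: timestamp, cpu_busy_pct
BUSY_THRESHOLD_PCT = 70.0               # placeholder safe limit for starting maintenance
MIN_SAMPLES = 12                        # e.g. one hour of 5-minute samples

def recent_average_busy(path):
    """Average the most recent CPU-busy samples from the exported report."""
    with open(path, newline="") as f:
        samples = [float(row["cpu_busy_pct"]) for row in csv.DictReader(f)]
    if len(samples) < MIN_SAMPLES:
        raise ValueError("not enough samples to judge the current load")
    return sum(samples[-MIN_SAMPLES:]) / MIN_SAMPLES

if __name__ == "__main__":
    avg = recent_average_busy(SAMPLES_FILE)
    if avg > BUSY_THRESHOLD_PCT:
        print(f"Average CPU busy {avg:.1f}% exceeds {BUSY_THRESHOLD_PCT}% -- postpone the window.")
        sys.exit(1)
    print(f"Average CPU busy {avg:.1f}% -- acceptable to proceed.")
```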

In addition, there is one more common source of problems during system enhancements — cybersecurity incidents. It’s possible to experience system outages due to hacker attacks or compromised third-party applications if you aren’t continuously monitoring these issues and addressing them when they occur.

Unfortunately, each of these errors is easy to make, even for experienced mainframe professionals. The z/OS infrastructure and the applications it runs are highly complex, and it is difficult to analyze your infrastructure’s availability in real time.

Even worse, when you do run reports to assess your infrastructure’s availability, those reports often don’t correlate well with each other, forcing you to interpret them separately.

All of this makes it very difficult to both define the full picture of your infrastructure’s current availability at any given moment, and to determine which factors you must consider when planning capacity optimization or enhancement.

However, even though it’s difficult to plan seamless z/OS enhancements, that doesn’t mean it’s impossible. With the right strategy, you can diligently and meticulously plan your enhancements and ensure they go off smoothly and incident-free.

Here’s what that strategy looks like.

How to Deploy Error-Free Enhancements: Key Steps

At IBA Group, we have spent many years performing system enhancements on many of our clients’ z/OS infrastructures. Over these real-world deployments, we have developed a simple but effective strategy for performing these enhancements risk-free.

Our strategy is based around a few key steps.

First, we don’t re-invent the wheel or chase any shiny new approaches — we just keep it simple.

We have learned that IBM mainframes have everything you need to ensure high availability, performance, and security, even in the middle of system enhancements. You simply have to be very disciplined and meticulous about following IBM’s recommendations and best practices for performing enhancements on z/OS infrastructure.

You do have to adapt these practices to the specifics of your infrastructure, but ultimately IBM’s own recommendations have everything you need.

Second, we always develop a detailed and clear maintenance plan that assesses all possible risks.

We always take into account all of the risks we outlined above — as well as any additional risks that might be present in our client’s mainframe environment — and we develop plans to mitigate those risks.

We go beyond just creating a formal description of our planned changes, and instead detail every little element of our enhancements. This preparation is key and we really go “overboard” on it.

Here are just some of the details we might include for a regular maintenance process, such as applying a Recommended Service Upgrade (RSU) or a Program Update Tape (PUT), or building your own package of missing and recent PTFs:

  • We carefully read all HOLDDATA ACTIONs and list all actions to perform.
  • We classify our action list, build an order of execution, estimate execution times, define responsible team members, and leave important comments. (A minimal sketch of such a checklist follows this list.)
  • We automate as many pending actions as possible, we save commands, jobs, and scripts into datasets, and we put it all into our automation tools (and carefully describe all the above in our maintenance plans).
  • As a team, we collaborate on and agree to an enhancement approach based on our infrastructure. For instance, in a Parallel Sysplex, we would jointly define the strategy for upgrading LPARs and rebalancing workloads to ensure high availability with no downtime. We have similar conversations to define and prepare rollback actions.
  • We correctly configure and schedule startup times both for the mainframe itself and for all programs running on it that might consume the same resources we need to perform our enhancements.
  • We take into account the many types of hacker attacks we might suffer when running external software outside the mainframe, and we deploy tools to control traffic and resource usage when integrating with third-party providers.
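To give a feel for what that preparation produces, here is a minimal, hypothetical sketch of the kind of classified, ordered action checklist that can be built from HOLDDATA ++HOLD ACTION entries. The fields, categories, SYSMOD names, owners, and time estimates are illustrative placeholders, not a fixed format any particular plan must follow:

```python
# Hypothetical action checklist built from HOLDDATA ACTION entries.
# Field names, categories, SYSMOD IDs, owners, and estimates are placeholders.
from dataclasses import dataclass
from typing import List

@dataclass
class PendingAction:
    sysmod: str          # PTF carrying the ++HOLD ACTION
    description: str     # what the hold text tells us to do
    category: str        # e.g. "pre-APPLY", "post-APPLY", "post-IPL"
    owner: str           # responsible team member or team
    est_minutes: int     # rough execution time estimate
    automated: bool = False
    notes: str = ""

def execution_order(actions: List[PendingAction]) -> List[PendingAction]:
    """Order actions by phase, with automated steps first inside each phase."""
    phase_rank = {"pre-APPLY": 0, "post-APPLY": 1, "post-IPL": 2}
    return sorted(actions, key=lambda a: (phase_rank.get(a.category, 99), not a.automated))

plan = [
    PendingAction("UI00001", "Run sample job to reallocate a target library", "pre-APPLY",
                  "system programmer", 15, automated=True),
    PendingAction("UI00002", "Refresh LLA/VLF after target libraries are updated", "post-APPLY",
                  "automation team", 5, automated=True),
    PendingAction("UI00003", "New subsystem parameter default -- review before next IPL", "post-IPL",
                  "DBA team", 30, notes="Coordinate with the data sharing group owners."),
]

for step, action in enumerate(execution_order(plan), start=1):
    print(f"{step}. [{action.category}] {action.sysmod}: {action.description} "
          f"({action.owner}, ~{action.est_minutes} min, automated={action.automated})")
```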

Third, we communicate our plan with anyone who might be involved in it or impacted by it.

Before we actually put our plan into motion and perform our z/OS enhancements, we first make sure that everyone involved knows what to do and when to do it.

We always double-check our maintenance window with our client, and we share all of the details of our plan with their teams. We discuss what we are going to do, we combine it with anything they were already planning to do, and we finally adjust our plan as needed to make as big and smooth an upgrade as possible.

Finally, we test our plan before we deploy it.

We always build a sandbox or test environment at the same level as our production environment. We perform SMP/E RECEIVE and APPLY with the CHECK option to generate datasets and reports in that test environment. We review everything that happened during the test, examine all of the data, note anything unexpected, and iron out any issues before we run our enhancements on the client’s real mainframe.
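As one hypothetical way to drive such a dry run from a script, the sketch below submits an SMP/E APPLY CHECK job through the z/OSMF jobs REST interface, assuming z/OSMF is available in the sandbox. The host, credentials, JOB card, CSI dataset name, and target zone name are placeholders, and the JCL is deliberately minimal:

```python
# Hypothetical sketch: submit an SMP/E APPLY CHECK dry run to the sandbox via
# the z/OSMF jobs REST interface. Host, credentials, dataset and zone names
# are placeholders; adapt the JCL to your own SMP/E environment.
import requests

ZOSMF_HOST = "https://sandbox-zosmf.example.com"
USER, PASSWORD = "ibauser", "secret"   # placeholders

APPLY_CHECK_JCL = """//APPLYCK  JOB (ACCT),'SMPE APPLY CHECK',CLASS=A,MSGCLASS=H
//SMPE     EXEC PGM=GIMSMP,REGION=0M
//SMPCSI   DD DISP=SHR,DSN=SMPE.GLOBAL.CSI
//SMPCNTL  DD *
  SET BOUNDARY(TGT1).
  APPLY PTFS GROUPEXTEND CHECK.
/*
"""

def submit_job(jcl: str) -> dict:
    """Submit JCL through z/OSMF and return the job name and ID for follow-up."""
    resp = requests.put(
        f"{ZOSMF_HOST}/zosmf/restjobs/jobs",
        data=jcl,
        headers={"Content-Type": "text/plain", "X-CSRF-ZOSMF-HEADER": "true"},
        auth=(USER, PASSWORD),
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    job = submit_job(APPLY_CHECK_JCL)
    print(f"Submitted {job['jobname']}({job['jobid']}); "
          "review the SMP/E reports before running the real APPLY.")
```

The value is in the reports the CHECK run produces: they show what the real APPLY would do, without changing anything in the target libraries.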

This is a lot of work, but it’s worth it.

In total, we put a lot more time and effort into preparing to apply our z/OS enhancements than we actually take to apply those enhancements. We also take a similar approach for other activities like regular disaster recovery tests, changing hardware or software configurations, performing maintenance on storage, and the like.

While this is much higher effort than most enhancement deployments, we’ve simply found that a clearly detailed, communicated, and tested maintenance plan with a lot of automation and proactivity is the key to ensuring smooth, risk-free, no-stress system upgrades.

How to Deploy Error-Free Enhancements: Use Cases

Remember — our strategy is not theory. We have developed it over many years of hands-on, real-world work with our clients. We have learned that it’s worth putting in the effort upfront to avoid and mitigate potential problems during enhancements, and we know that this upfront effort mitigates some of the biggest risks you might encounter when performing your own z/OS enhancements.

Consider a few common enhancement use cases we have run into, where our strategy prevented some big problems.

System update

We were working with a local bank that asked us to update a system on their mainframe. Our client used a Parallel Sysplex configuration to ensure high availability. Our team of system programmers took this configuration into account when we designed our maintenance plan.

We redirected all workloads from mainframe B to mainframe A. This freed mainframe B to disconnect from the application server, apply its updates, and perform an IPL, while mainframe A took on the entire workload that would normally have been split between the two.

After mainframe B completed its upgrade and came back online, we followed the same approach to upgrade mainframe A.

Because of these preparations, we were able to transfer all traffic and working procedures to another mainframe and avoid any visible interruptions of service.
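The ordering matters more than the tooling here. The following hypothetical sketch outlines that rolling-upgrade order; the helper functions are placeholders for whatever routing, workload management, and automation mechanisms a given shop actually uses:

```python
# Hypothetical outline of the rolling-upgrade order described above.
# The helpers are placeholders for real routing/automation mechanisms.
def drain_workload(from_system: str, to_system: str) -> None:
    print(f"Redirecting all workloads from {from_system} to {to_system} ...")

def upgrade_and_ipl(system: str) -> None:
    print(f"Disconnecting {system} from the application server, applying updates, performing IPL ...")

def rebalance(systems: list) -> None:
    print(f"Restoring the normal workload split across {', '.join(systems)} ...")

def rolling_upgrade(systems: list) -> None:
    """Upgrade one member at a time so the sysplex never loses availability."""
    for i, target in enumerate(systems):
        peer = systems[(i + 1) % len(systems)]   # a remaining member absorbs the load
        drain_workload(target, peer)
        upgrade_and_ipl(target)
    rebalance(systems)

rolling_upgrade(["MAINFRAME_B", "MAINFRAME_A"])
```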

Emergency Shutdown of the Main Storage or Logical Partition

We were updating a client’s mainframe and had to prepare for a worst-case scenario — the mainframe experiencing a power outage. We had to perform careful configuration and testing to make sure the system remained accessible even in this type of worst-case scenario. We were able to perform the proper level of verification, identify the system’s most vulnerable spots, and define disaster and data recovery procedures to resolve any issue that might come up. By testing the system thoroughly, we found these problems in the sandbox and configured their remediation without risk of real-world harm.

A Sharp Spike in the Use of System Resources

We were working with a client who was at risk of sharp spikes in system resource usage. There were several reasons why this might occur — poorly written software, poorly configured systems, external factors (when using third-party providers), and the like. We had to monitor these surges, determine whether they were part of the usual load on the system, and address them as soon as possible. We deployed some very useful tools for monitoring system resource usage and for investigating these problems when they did occur.
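As a simple illustration of the kind of check such tools can automate, here is a hypothetical spike detector that flags samples jumping well above the recent baseline. The sample data, window size, and 1.5x factor are placeholders; in practice the samples would come from whatever monitoring feed (RMF, SMF, or a vendor monitor) you export:

```python
# Hypothetical spike detector: flag samples that exceed 1.5x the rolling average
# of the previous window, so surges can be investigated quickly.
from collections import deque

def find_spikes(samples, window=12, factor=1.5):
    """Return (index, value, baseline) for samples exceeding factor x the rolling average."""
    recent = deque(maxlen=window)
    spikes = []
    for i, value in enumerate(samples):
        if len(recent) == window:
            baseline = sum(recent) / window
            if value > factor * baseline:
                spikes.append((i, value, baseline))
        recent.append(value)
    return spikes

# Example: CPU-busy percentages sampled every 5 minutes, with one obvious surge.
cpu_busy = [35, 38, 36, 40, 37, 39, 41, 36, 38, 37, 40, 39, 92, 41, 38]
for idx, value, baseline in find_spikes(cpu_busy):
    print(f"Sample {idx}: {value}% busy vs. a ~{baseline:.0f}% baseline -- investigate this surge.")
```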

Making Our Enhancement Strategy Work for You

At this point, you have a good top-level view of what our strategy is for performing z/OS system enhancements, and why it’s so important to put in all this work.

In part two of this series, we will dig into a lot more detail regarding how we prepare system enhancements to ensure z/OS availability, consistency, and performance.

But if you require immediate assistance, reach out to IBA Group today.
