Seven Standards Every OpenShift Customer Should Have on Day 1

This blog was originally featured on the Red Hat OpenShift Blog:
https://www.openshift.com/blog/seven-standards-every-openshift-customer-should-have-on-day-1

The explosion of containers in the enterprise has been awesome. The benefits of containers have proven to be a game-changer for companies wishing to reduce costs, expand their technical capabilities, and move to a more agile, DevOps-driven organization. The container revolution is bringing new opportunities to companies that have not embarked on a technology refresh in some time. Containers and Kubernetes are a brand new and completely novel way of managing your applications and infrastructure. This is not the same as the last revolutionary jump from bare metal to virtual machines, as the container can eliminate large redundant portions of the software stack and change the fundamental nature of managing operating systems for an enterprise.

Many of these companies are accelerating their container journey by choosing the market leader in enterprise Kubernetes platforms, Red Hat OpenShift Container Platform. OpenShift does so many things for you on Day 1. OpenShift represents the best of the Kubernetes ecosystem, delivered in a single platform, thoroughly tested and secured. It is a complete enterprise solution with everything a company needs to get started and can eliminate the huge technical barriers and waste associated with platform building.

However, OpenShift is not a silver bullet. While its capabilities are a huge reason why companies that chose OpenShift are seeing amazing benefits and returns on the investment, one of the driving forces for these benefits is having a solid plan in place upfront. In order to help ensure success, here is a list of seven areas every customer should focus on prior to moving any workloads onto the platform.

1: Standardize Naming and Metadata

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Everything in OpenShift and Kubernetes has a name. Every service has a DNS name and the only restriction is that it complies with DNS naming conventions. And now that your application architects have jumped off the microservice cliff, that “monolith” has been split into 1,342 separate and distinct services, each with their own database. I forgot to mention that everything in OpenShift is also either hierarchical, related, or follows a pattern. Get ready for a naming explosion. If you do not have standards ready, it is going to be the Wild West out there.

Have you figured out how services will be implemented? Are you going to have one big namespace where everyone hosts their databases? Is everyone going to put their databases in the same “databases” namespace but then put their Kafka clusters in their own namespace? Do we need a “middleware” namespace? Or should it be “messaging”? Then you get the email from that one group who thinks they are special and they always get their way and they said they want their own namespaces. We have 17 lines of business; couldn’t we prefix all the namespaces with the standard LOB prefix?

Before anything goes into production, start mapping and naming exercises. The work you do up-front will save you so much time and effort on the back end. Standardize everything. It does not matter so much that your standards are the best, but rather that they exist, are consistent, and are followed.

There is also a ton of value in metadata. Standardize what assets you want to track and make sure the appropriate resources get the metadata applied. Start with recommended labels. For example, putting a “support_email” annotation in the namespace metadata can save you precious time in getting second-level support for an outage. You can also use the information to cut down on the uber-hyphenation of resource names. Get everyone involved, from architects to operations, and start brainstorming on what is going to be needed, and get the standards in place on Day 1.
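To make the idea concrete, here is a minimal sketch of a check that could run against a Namespace manifest before it is merged. The required labels, the "support_email" annotation, and the line-of-business prefixes are illustrative assumptions, not a recommended standard; the point is only that a naming and metadata standard can be expressed as code.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character, at most 63 characters long.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

# Hypothetical company standard. The label keys come from the Kubernetes
# recommended labels; the annotation and prefixes are made up for illustration.
REQUIRED_LABELS = {"app.kubernetes.io/part-of", "app.kubernetes.io/managed-by"}
REQUIRED_ANNOTATIONS = {"support_email"}
LOB_PREFIXES = {"fin", "hr", "ops"}


def check_namespace(manifest: dict) -> list[str]:
    """Return a list of standards violations for a Namespace manifest."""
    problems = []
    meta = manifest.get("metadata") or {}
    name = meta.get("name") or ""

    if len(name) > 63 or not DNS_LABEL.match(name):
        problems.append(f"name {name!r} is not a valid DNS label")
    if name.split("-", 1)[0] not in LOB_PREFIXES:
        problems.append(f"name {name!r} has no approved LOB prefix")

    labels = meta.get("labels") or {}
    annotations = meta.get("annotations") or {}
    for key in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing label {key}")
    for key in sorted(REQUIRED_ANNOTATIONS - set(annotations)):
        problems.append(f"missing annotation {key}")
    return problems
```

An empty list means the manifest passes; anything else can be printed as review feedback.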

2: Standardize Your Company’s Base Images

One of the more powerful features of containers is your ability to mix and match everything in the stack. As an enterprise, you can go and pick out your favorite flavor of OS and start building stuff. It is easy, but it is also one of the biggest missed opportunities for an enterprise. The really cool thing about images is the layering. You can abstract away the images from your developers and standardize at the image level.

Take a basic Java application. Your application development team cannot go wrong with choosing OpenJDK, but when it comes to managing CVEs, upgrading libraries, and overall hygiene, we all know that delivering business value can sometimes make technical debt, like old versions of Java, take a backseat. Luckily, this can be easily managed and automated away by the enterprise. You can still use the power of your vendor’s base images, but you can identify and control the upgrade cycles by creating your own base images.

Using the example above, the team needs Java 11. You need to make sure that they are using the latest version of Java 11. Create a company base image (registry.yourcompany.io/java11) using the vendor base image as a starting point (registry.redhat.io/ubi8/openjdk-11). When the base image gets updated, you can “help” your application teams to use the updates. Also, you get an abstraction layer to seamlessly plug libraries or Linux packages into a standard image.
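One way to "help" teams stay on the company images is a small check in the build pipeline. The sketch below assumes the policy is simply "every external base image must come from registry.yourcompany.io"; it handles only plain FROM lines, optional flags such as --platform, and multi-stage aliases, so treat it as a starting point rather than a full Dockerfile parser.

```python
# Assumed policy: base images must be pulled from the company registry
# used in the example above.
APPROVED_REGISTRY = "registry.yourcompany.io/"


def unapproved_base_images(dockerfile_text: str) -> list[str]:
    """Return base images in a Dockerfile that bypass the company registry."""
    stages = set()   # aliases declared by earlier "FROM ... AS name" lines
    offenders = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        builds_on_earlier_stage = image in stages
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
        if builds_on_earlier_stage:
            continue  # refers to a previous build stage, not a registry image
        if not image.startswith(APPROVED_REGISTRY):
            offenders.append(image)
    return offenders
```

A pipeline step could fail the build whenever the returned list is not empty.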

3: Standardize Health and Readiness Checks

Everything needs a health check. As a human, you should be getting one every year. As an application, the checks should be more frequent. However, there are two questions that an application needs to answer rather than one:

Am I running?
Am I ready?

There are tons of other application metrics that can make your monitoring lives easier, but these two will be the cornerstone of not only monitoring, but also scaling. The first question is usually answered simply because “Am I running?” is usually a function of network connectivity and the endpoint being able to return a response. The second question will be application-scoped, meaning that every application will need to answer “Am I ready?” according to its own standards. For example, an application requiring very low latencies may need to execute a lengthy process to refresh caches and warm up the JVM during startup. This application’s “running” answer could diverge from its “ready” answer by several minutes. However, a stateless REST API with a relational database could be “running” and “ready” simultaneously.

The most important point is to not diverge from these two Boolean expressions. Running means running; it is hard to be kinda running. Ready means ready and does not have shades of ready. You should not be ready for “these types” of requests but not for “those types.” It is all or nothing.
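As a rough illustration, the two questions can be modeled as two Boolean flags behind two endpoints. The paths /health/live and /health/ready below are placeholders; substitute whatever existing convention your company picks. The handler uses only the Python standard library and is a sketch, not a production probe.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler

# Placeholder paths; use the convention your company standardizes on.
LIVE_PATH = "/health/live"
READY_PATH = "/health/ready"


@dataclass
class AppState:
    running: bool = True   # "Am I running?"
    ready: bool = False    # "Am I ready?" (flipped once warm-up is done)


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to an HTTP status: 200 for yes, 503 for no."""
    if path == LIVE_PATH:
        return 200 if state.running else 503
    if path == READY_PATH:
        # All or nothing: an application that is not running is never ready.
        return 200 if (state.running and state.ready) else 503
    return 404


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers both probes from the shared STATE object."""

    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()
```

To try it, pass ProbeHandler to http.server.HTTPServer and set STATE.ready to True when cache refresh and warm-up finish. Each answer is a plain yes or no, with no in-between states.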

The other side of the coin is standardization. How does one check readiness? Even this basic question can create monitoring nightmares if standards are not implemented. Just look at the Quarkus standards vs. Spring Boot standards. They did not mean to diverge. It is just an age-old problem with standards. The only difference is that in this case, your company has the power to develop and enforce the standards.

Side note: Don’t make up a new standard. Pick an existing one and use it. Don’t bikeshed.

4: Standardize Logging

Speaking of monitoring, the combination of cheap storage and big data management systems has created a new monster in the enterprise: logging. What used to be ephemeral, inconsequential, just-in-case types of unstructured and downright archaic console logs have become a cottage industry in itself with data science-y types attempting to machine-learn their way to hyper-optimized operations and monitoring. But we all know the truth. As soon as someone starts to try to piece together log data from hundreds of applications, all with absolutely no standards or even forethought into the fact that this might be a possibility, you will spend an inordinate amount of dollars buying log management tools and building transformations all to just get started. That is before you even figure out that the println statements saying “Got here” and “this works” don’t have much insight into operations.

Standardize the structure. It bears repeating: It is more important to be consistent than it is to be correct. Be able to write a single log parser for every application in your enterprise. Yes, you will have one-offs. Yes, you will have exceptions you cannot control, especially from COTS apps. Do not “throw the baby out with the bathwater” on this one. Get detailed. For example, every log’s timestamp should follow the ISO 8601 standard, be in UTC, and carry five decimal places of fractional seconds (2018-11-07T00:25:00.07387Z). The log levels should be all CAPS and be TRACE, DEBUG, INFO, WARN, ERROR. Define the structure, and then define the elements with excruciating detail.
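Here is a minimal sketch of what such a standard could look like in code, using Python's standard logging module. The JSON layout and field names are assumptions chosen for illustration; the timestamp format and level names follow the example above. Python has no built-in TRACE level, so only the other four appear, and CRITICAL is folded into ERROR.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python's level numbers onto the standardized, upper-case names.
LEVEL_NAMES = {
    logging.DEBUG: "DEBUG",
    logging.INFO: "INFO",
    logging.WARNING: "WARN",
    logging.ERROR: "ERROR",
    logging.CRITICAL: "ERROR",
}


def format_timestamp(moment: datetime) -> str:
    """ISO 8601, UTC, five decimal places of fractional seconds."""
    utc = moment.astimezone(timezone.utc)
    fraction = f"{utc.microsecond:06d}"[:5]  # truncate 6 digits to 5
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + fraction + "Z"


class StandardFormatter(logging.Formatter):
    """Emit one JSON object per log line (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": format_timestamp(created),
            "level": LEVEL_NAMES.get(record.levelno, "INFO"),
            "logger": record.name,
            "message": record.getMessage(),
        })
```

Attaching StandardFormatter to a logging.StreamHandler gives every application the same line shape, which is what makes a single parser possible.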

Once the structure is standardized, then have everyone follow the same flow using the same architectural designs. This goes for application logs as well as platform logs. And do not diverge from the out-of-the-box solution unless it is absolutely necessary. The OpenShift platform’s EFK stack should be able to handle all of your scenarios. It was picked for a reason, and when you are doing platform upgrades, it is one less thing to have to worry about.

5: Implement GitOps

One of the great things about OpenShift is that at the end of the day, everything is configuration or code, which means everything can be versioned and managed in source control. Literally everything. This can revolutionize the way value is delivered and eliminate the bureaucracy involved in moving to production.

As an example, the traditional model of ticketing can be fully replaced by the git pull request model. Let’s say an application owner wants to update the size of their application’s resources to accommodate some new, more heavyweight features by changing the memory constraints from 8 GB of memory to 16 GB. In previous models, the developer would have to open a ticket and get someone else to do the task, typically an “ops” person. Ops has been stuck for a long time implementing changes to the environment in a way that really adds no value to the process and, at worst, creates additional cycle time in getting things implemented. Ops can take a look at the request and take one of two paths: They could just implement it by going out to the production environment, changing some configuration by hand, and then restarting the apps. This takes time (they have a queue) and possibly introduces the risk of human mistakes, such as “fat fingering” the value (160 GB?). But they could also question the change, which would create a circuitous set of unneeded communications regarding the cause and effect of the change and may require some management intervention.

Using GitOps, these changes go into a git repo and are communicated out as a pull request. You can then expose the pull requests, especially for production changes, out to a wider array of stakeholders for approval. Security is brought in earlier and can see the incremental changes. Standards can be enforced programmatically using tools in the CI/CD toolchain. Once the pull request is approved, it is versioned and auditable. It can also then be tested in pre-production environments in a standard flow, eliminating human mistakes.
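As a sketch of programmatic enforcement, a pipeline step could compare the old and new memory values in a pull request and flag suspicious jumps such as the fat-fingered 160 GB. The growth limit of 4x is a made-up policy, and the parser supports only a subset of Kubernetes quantity suffixes, so this is an illustration under those assumptions.

```python
import re

# Subset of Kubernetes quantity suffixes (binary and decimal).
UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}
QUANTITY = re.compile(r"^(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?$")

# Hypothetical policy: anything growing more than 4x needs a human to look.
MAX_GROWTH_FACTOR = 4


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '16Gi' into a number of bytes."""
    match = QUANTITY.match(quantity)
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * UNITS.get(suffix, 1)


def review_memory_change(old: str, new: str) -> str:
    """Return 'ok' or 'needs-human-review' for a memory limit change."""
    old_bytes, new_bytes = memory_bytes(old), memory_bytes(new)
    if old_bytes == 0 or new_bytes / old_bytes > MAX_GROWTH_FACTOR:
        return "needs-human-review"
    return "ok"
```

With these assumptions, going from 8Gi to 16Gi passes automatically, while 8Gi to 160Gi is held for review.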

The biggest shift here is not going to be for the developers. They have been comfortable with the source-control model because they live in that world. The traditional system administrators and security controllers will need to get in and learn the new paradigms, and it can be a huge change for them. But once they realize the power and simplicity, it is not a hard sell.

6: Create Blueprints

The move from monolith to microservice has created more power around application patterns. The monoliths made it hard to categorize an application. It has a REST API, but it also does some batch processes. It is event driven as well. It uses HTTP, FTP, Kafka, JMS, and Infinispan, but also talks to three different databases. How do you build a blueprint for something that has a myriad of enterprise integration patterns? You do not.

By taking that monolith and breaking it down, the patterns become much simpler and easier to categorize. It is now four separate applications and those have patterns:

REST API for managing data in a database
Batch process that checks an FTP site for updated data and pushes it to a Kafka topic
Camel adapter that takes data from the Kafka topic and sends it to the REST API
REST API that represents the summarized information gathered from a data grid acting as a state machine

So now you have some blueprints. Each blueprint can then be standardized. REST APIs will follow OpenAPI standards. Batch jobs will be managed as an OpenShift batch job. Integrations will use Camel. You can create blueprints for APIs, batch jobs, AI/ML, multicast applications, and whatever else you want. You can then define how they are deployed, how they are configured, and what patterns to use. Using these standards will allow your organization to quit reinventing the wheel and work on solving the novel problems, such as actual business functionality. Yak shaving is sometimes a necessary evil, but the organization’s ability to leverage previously spent efforts will create an opportunity for exponential gains.

7: Prepare for the APIs

The microservice architecture is coming. With it comes APIs. Do not wait and then decide how to manage them later on. Get in front of this.

First, you need standards. Use the OpenAPI standards as the starting point, but get down in the weeds on this one too. There is a balance to strike, because over-standardization creates too many constraints. Quick litmus test questions: When someone uses a POST to create a new thing, are you going to return 201 or 200? Will someone be allowed to update a thing using a POST and not a PUT? What is a 400 versus what is a 500? That is the level of detail you need.
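To show the level of detail involved, here is one possible set of answers written down as code. It assumes a standard in which a successful create returns 201, bad input from the caller returns 400, and a failure on the server side returns 500. The function name, the "name" field, and the save callback are all hypothetical.

```python
from http import HTTPStatus
from typing import Callable


def create_thing(payload: dict, save: Callable[[dict], str]) -> tuple[int, dict]:
    """Handle a POST that creates a thing; return (status code, body).

    `save` is a hypothetical persistence callback that returns the new id
    and raises an exception if the backing store fails.
    """
    name = payload.get("name")
    if not isinstance(name, str) or not name:
        # The caller sent something invalid: that is a 400, not a 500.
        return int(HTTPStatus.BAD_REQUEST), {"error": "name is required"}
    try:
        new_id = save({"name": name})
    except Exception:
        # The request was fine but the server failed: that is a 500.
        return int(HTTPStatus.INTERNAL_SERVER_ERROR), {"error": "could not save"}
    # A new resource now exists, so this standard answers 201 rather than 200.
    return int(HTTPStatus.CREATED), {"id": new_id, "name": name}
```

Whether you agree with these particular answers matters less than having them written down once and applied everywhere.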

Second, you are going to need a service mesh. It is one of those concepts that is so powerful that it is eventually going to make its way into the core Kubernetes offering. Why a service mesh? Traffic is going to become an issue. You are going to want to manage not only north-south traffic but also east-west. You are going to want to offload authentication and authorization from your application to the platform. You are going to want the power of Kiali for visualizing traffic within the mesh. You want canary deployments and blue-green. You want dynamic traffic-flow control. Service mesh is a must-have and is best implemented on Day 1.

Third, you are going to want a centralized API management solution. You want to be able to find and reuse APIs in a single place; a one-stop shop. Developers will want to be able to go to the API shop, search for an API, and get the documentation on how to use it. You’ll also want to manage versioning and deprecations in a uniform manner. If you are creating APIs for external consumers, this can also be the north-south endpoint for all things security and load management. 3scale can even help you monetize the APIs. Also, at some point, an executive will want a report to answer the question, “What APIs do we have?”

In conclusion, identifying and documenting standards for your organization can be daunting in itself, but enforcing and monitoring those standards is where the bulk of the effort lives. From the start, the powerful combination of organizational entropy and the natural tendency to avoid conflict with coworkers will be working against the standards. These battles are tiny and sometimes imperceptible: a label missing here, a name that doesn’t exactly follow the standard but is close enough. Standards adoption typically dies the death of a thousand cuts, with little or no awareness by anyone in the organization. Enforcing standards is like exercise: no one really wants to sweat and strain, but we all know that it’s necessary to live a long, healthy life.

However, there is hope. That hope lives in automation. Every single one of the standards above can be implemented via automation. The GitOps process can check that labels and annotations exist on all the correct YAML files. The CI/CD process can enforce the image standards. Everything can be codified, checked, and enforced. The automation can also be extended to support new standards or to change existing ones. The power in automating standards processes is that a computer doesn’t avoid conflict; it merely states facts. Therefore, with enough forethought and an investment in automation, the platform you are investing so much in today can provide much more return on investment in the future in the form of productivity enhancement and stability.