We all know what cloud computing means, but what about “native”?
According to Merriam-Webster, “native” can be defined as “inborn, innate”. So, cloud-native apps can roughly be identified as software that was born in the cloud: applications that were designed from the very beginning to live on the cloud. But this raises the expected question: what does a cloud-native application do differently from a traditional, non-cloud-native one? To answer this question, you need to be aware that running a traditional application on an infrastructure that you don’t own is risky.
By not “owning” the infrastructure, we mean that you don’t have access to the data centers in which the machines are hosted, and you cannot decide which hardware your application physically uses, know whether there are hardware issues, or control how they are managed. The cloud provider does all the heavy lifting for you, with a promise that your application will remain online even if an outage occurs on the provider’s side. This promise is formally referred to as a Service Level Agreement (SLA). With an SLA asserting 99.95% availability, the provider commits that outages on its side will make the service unavailable for no more than 0.05% of the time. Translating the percentage to an actual number reveals that you can expect your business to be offline for as much as 4 hours and 22 minutes per year.
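The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime` is made up for this example, and the calculation assumes a 365-day year and that the SLA is measured over the whole year (real SLAs are often measured per month).

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime(availability_percent: float) -> timedelta:
    """Return the maximum yearly downtime permitted by an availability SLA."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.95, 99.99):
        downtime = allowed_downtime(sla)
        hours, remainder = divmod(int(downtime.total_seconds()), 3600)
        print(f"{sla}% availability -> up to {hours} h {remainder // 60} min per year")
```

For 99.95% this works out to roughly 4.38 hours, which is the 4 hours and 22 minutes mentioned above.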
If your application is mission-critical, the above may entail thousands of dollars in losses, a damaged company reputation and, in extreme cases, lawsuits against you. It seems that 99.95% is not such a reassuring percentage after all. It’s all the cloud provider’s responsibility; they should take more measures to give you higher availability levels. Experiencing unplanned downtime because of cloud infrastructure issues is not your fault. Or is it?
Let’s see how Netflix was able to survive a major outage that occurred on AWS (Amazon Web Services), Netflix’s cloud provider.
Netflix, the giant media streaming company, runs on AWS. On the 20th of September, 2015, AWS experienced a major service outage in its us-east-1 (North Virginia) region. The problem lasted for about five hours before everything returned to full function. Many clients were affected by this outage, including IMDb, Airbnb, Tinder, and Netflix. However, Netflix’s losses were minimal. For them, the outage lasted only a few minutes, and the reason was simple: they expected it.
Netflix has long followed the “Chaos Engineering” model. Simply explained, imagine that a monkey somehow gained access to your data center. What could it possibly do? Pull cables, power off routers and switches, and break things. To sum it up: chaos. The engineers at Netflix designed what they called “Chaos Monkey”, a program that “randomly terminates virtual machine instances and containers that run inside your production environment. Exposing engineers to failures frequently, incentivizes them to build resilient services.” Notice that Chaos Monkey is not run against development, testing, or even staging environments. It’s deployed to production. With Chaos Monkey, Netflix ensured that even if entire servers went down during the busiest hours of the day, the service would keep functioning without unplanned downtime.
Later on, they introduced Chaos Kong, a program that is similar to Chaos Monkey but, instead of powering off servers, simulates shutting down entire regions. Hence, when the unexpected outage happened, they were already prepared for it and could easily fail over to another healthy region.
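To make the idea concrete, here is a toy simulation of random instance termination. It is not Netflix’s tool and does not touch any real infrastructure: the `Instance` class, the `web-N` names, and both functions are hypothetical stand-ins invented for this sketch.

```python
import random
from typing import List, Optional


class Instance:
    """A stand-in for a virtual machine or container (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True


def terminate_random_instance(fleet: List[Instance], rng: random.Random) -> Optional[Instance]:
    """Terminate one randomly chosen live instance, as a chaos experiment would."""
    live = [instance for instance in fleet if instance.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def handle_request(fleet: List[Instance]) -> str:
    """Route a request to the first live instance; fail only if none are left."""
    for instance in fleet:
        if instance.alive:
            return f"served by {instance.name}"
    raise RuntimeError("no healthy instances: the service is down")


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that runs are repeatable
    fleet = [Instance(f"web-{n}") for n in range(3)]
    for _ in range(2):
        victim = terminate_random_instance(fleet, rng)
        print(f"terminated {victim.name}; request {handle_request(fleet)}")
```

As long as at least one instance survives, requests keep being served, which is the property that such experiments are meant to verify in a real system.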
Netflix is served by different AWS regions (this is a hypothetical example and not how the real Netflix infrastructure is laid out)
Netflix ensured that its application was cloud-native: it was resilient enough to keep functioning even if an entire region of its cloud provider went offline.
It’s the shared responsibility of developers and operators to ensure that the application is designed (code-wise and infrastructure-wise) to withstand failures and remain running.
High availability is not a new concept. It has been around perhaps since computers started to be used in mission-critical projects. In an active-passive design, for example, a replica of the component is always on standby; it’s not serving requests, but the system can quickly fail over to it once the active replica ceases to function. In more complicated designs, both replicas serve requests simultaneously (active-active design). Such patterns were hard to implement because most applications were not highly available by architecture. Workarounds had to be made, third-party tools were used and, more importantly, the costs were high. Because of that, those designs treated failure as an “unexpected event”, something that’s very unlikely to happen. Only the most critical parts of the system (for example, the database) were made highly available. Cloud-native design, on the other hand, treats malfunction as the rule rather than the exception. The design assumes that every component may fail at any time, so the system must be ready for that.
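The active-passive pattern described above can be sketched as follows. This is a minimal illustration rather than a production failover mechanism: replicas are modeled as plain Python functions, and `call_with_failover`, `failing_primary`, and `healthy_standby` are names invented for this example.

```python
from typing import Callable, Optional, Sequence

# A replica is modeled as a function that takes a request and returns a response.
Replica = Callable[[str], str]


def call_with_failover(replicas: Sequence[Replica], request: str) -> str:
    """Try the active replica first, then each standby in order."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as error:
            last_error = error  # this replica is down, try the next one
    raise RuntimeError("all replicas failed") from last_error


def failing_primary(request: str) -> str:
    """Simulates an active replica that has ceased to function."""
    raise ConnectionError("primary is unreachable")


def healthy_standby(request: str) -> str:
    """Simulates the passive replica taking over."""
    return f"standby handled {request!r}"


if __name__ == "__main__":
    print(call_with_failover([failing_primary, healthy_standby], "GET /jobs"))
```

In an active-active design, by contrast, requests would be spread across all healthy replicas all the time instead of going to a standby only after a failure.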
A little more than a decade ago, it was perfectly acceptable from a client’s point of view that a system could go “down for maintenance”. Smart devices were only smartphones, most applications didn’t require constant connectivity to the internet, and people were wondering what YouTube was all about. We read the news in the newspaper, watched movies on TV and played games on offline consoles. But today, everything is connected somehow to the internet and to everything else. People expect the world to be always online and available. Your company’s product app experienced a few minutes of unplanned (or even planned) downtime? You might as well search for another line of business.
Achieving that level of resilience is a collective responsibility. Architects design a system that does not need to be shut down or restarted for patching or upgrading. Developers ensure that such a design is correctly implemented and tested. Operators guarantee that the underlying infrastructure is never a single point of failure.
In the Industrial Revolution 4.0, you can’t afford downtimes
Now, let’s have a look at what cloud-native applications require nowadays.
As we’ve just discussed, planned or unplanned downtime is no longer acceptable. Some companies perform complex calculations to figure out the cost of downtime per minute. For major software players like AWS, Microsoft, and Google, this number is in the thousands of dollars.
It is no secret that the recent 4th Industrial Revolution is the main driver behind the Agile Software Development Methodology and DevOps. Cloud computing lowered the entry barrier for the software market to such an extent that virtually anybody could build a business around a web application. As a result, competition rose to unprecedented levels. Applications are now in a state of “continuous improvement”: there are always new features that need to be rolled out, bugs appear and get fixed, then new features are required, and so on. To achieve such a level of agility, practices like continuous integration, continuous delivery, and continuous deployment have to be followed. The DevOps culture was also born to ensure that the application lifecycle runs smoothly, efficiently and as fast as possible. Monolithic designs could not cope with all those dramatic changes or meet the new demands efficiently. Large applications had to be broken down into several small, atomic, and independently deployable microservices. When a new feature is ready to be deployed, only the component that needs this feature is updated rather than the whole application. The microservices pattern allows even more agility. A cloud-native application is typically built using a microservices architecture.
Today, a smart device is no longer a fancy name for a cell phone. Tablets, smart TVs, Apple TV, car audio systems, and many others are all potential clients for your application. Users expect the same experience they have on their desktops, even when using other devices. A cloud-native application is capable of running on most devices with the same quality level. Following the microservices pattern, the application can be designed to have one backend that provides the core functionality and several frontend components. For example, a cloud-native application should support at least iOS and Android in addition to the desktop version. Many technologies were born to address this need, e.g., React and React Native.
Let’s see how the above concepts could be applied to a hypothetical cloud-native job-search application. We’ll cover one workflow as an example: the user registration process. So, assume that our user picks up her phone, downloads the application and wants to try it for the first time. The client application would contact its backend as follows:
The above workflow can be depicted using the below diagram:
Note that the above scenario is narrated at a very abstract level. The exact implementation largely depends on the infrastructure type. For example, each microservice can be deployed to a compute instance, with the load balancer managed by the cloud service provider. It can also be deployed to a Docker container, with an orchestration system like Kubernetes managing it. We also didn’t cover how the services should communicate with each other. Again, this depends on your specific environment and business requirements. For example, you can use synchronous communication by sending and receiving direct API calls. Alternatively, you can use a message queuing system like RabbitMQ and follow the asynchronous communication model.
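The difference between the two communication styles can be sketched with nothing but the Python standard library. Note the assumptions: an in-process `queue.Queue` stands in for a real message broker such as RabbitMQ, a direct function call stands in for an API call over the network, and the “welcome email” service and all function names are hypothetical rather than part of the workflow described above.

```python
import queue
import threading
from typing import List


def send_welcome_email(user: str) -> str:
    """Hypothetical downstream service, modeled as a plain function."""
    return f"welcome email sent to {user}"


def register_sync(user: str) -> str:
    """Synchronous style: the caller blocks until the downstream call returns."""
    return send_welcome_email(user)


def register_async(user: str, outbox: queue.Queue) -> str:
    """Asynchronous style: enqueue a message and return immediately."""
    outbox.put(user)
    return f"registration of {user} accepted"


def email_worker(outbox: queue.Queue, results: List[str]) -> None:
    """Consumer that processes messages until it receives the None sentinel."""
    while True:
        user = outbox.get()
        if user is None:
            break
        results.append(send_welcome_email(user))


if __name__ == "__main__":
    outbox: queue.Queue = queue.Queue()
    results: List[str] = []
    worker = threading.Thread(target=email_worker, args=(outbox, results))
    worker.start()
    print(register_sync("alice"))
    print(register_async("bob", outbox))
    outbox.put(None)  # tell the worker to stop
    worker.join()
    print(results)
```

In the synchronous version, a slow or failed downstream service directly delays or breaks the caller. In the asynchronous version, the caller only depends on the queue being available, and the work is completed whenever a consumer picks it up.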
So far, we have discussed the merits of cloud-native apps and the microservices design. However, as with any other technology, pattern, or method, it is not a silver bullet that solves all the world’s problems. There are some scenarios where using the cloud-native design may work against you. Let’s have a brief look at those use cases: