We all know what Cloud Computing means, but what about “native”?
According to Merriam-Webster, “native” can be defined as “inborn, innate”. So, cloud-native apps can roughly be identified as software that was born in the cloud: applications that were designed from the very beginning to live on the cloud. But this raises the expected question: what does a cloud-native application do differently from a traditional, non-cloud-native one? To answer this question, you need to be aware that running a traditional application on an infrastructure you don’t own is risky.
Why Is Running Non-Cloud-Native Applications On The Cloud Risky?
By not “owning” the infrastructure, we mean you don’t have access to the data centers on which the machines are hosted, you cannot decide which hardware your application is physically using, whether there are hardware issues and how they are being managed, and so on. The cloud provider does all the heavy lifting for you with a promise that your application will remain online even if an outage occurs on the provider’s side. This promise is formally referred to as a Service Level Agreement (SLA). With an SLA asserting 99.95% availability, the provider guarantees that your application will be down due to outages on its side for no more than 0.05% of the time. Translating that percentage into an actual number reveals that you can still expect your business to be offline for roughly 4 hours and 23 minutes per year.
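That figure is simple arithmetic. Here is a minimal sketch of the calculation (the function name is ours, purely for illustration):

```python
# Estimate the maximum yearly downtime implied by an SLA availability percentage.
def max_downtime_minutes_per_year(availability_pct: float) -> float:
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year
    return minutes_per_year * (1 - availability_pct / 100)

downtime = max_downtime_minutes_per_year(99.95)
hours, minutes = divmod(downtime, 60)
print(f"{int(hours)}h {minutes:.1f}m")  # → 4h 22.8m
```

Run the same function with a “four nines” SLA (99.99%) and the budget shrinks to under an hour per year, which is why SLA decimals matter so much.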
If your application is mission-critical, the above may entail thousands of dollars in losses, a harmed company reputation and, in extreme cases, lawsuits raised against you. It seems 99.95% is not such a relieving percentage after all. Surely it’s all the cloud provider’s responsibility; they should take more measures to give you higher availability levels. Experiencing unplanned downtime while your application is running, due to cloud infrastructure issues, is not your fault. Or is it?
Let’s see how Netflix was able to survive a major outage that occurred on AWS (Amazon Web Services), Netflix’s cloud provider.
Netflix And Chaos Engineering
Netflix, the giant media-streaming company, operates on AWS. On the 20th of September, 2015, AWS experienced a major service outage in its us-east-1 (North Virginia) region. The problem lasted for about five hours before everything went back to full function. Many clients were affected by this unavailability period, including IMDb, Airbnb, Tinder, and Netflix. However, Netflix’s losses were minimal. For them, the outage lasted only a few minutes, and the reason behind that was simple: they expected it.
Netflix has long been following the “Chaos Engineering” model. Simply explained: assume that a monkey somehow gained access to your data center. What could it possibly do? Pull cables, power off routers and switches, and break things. To sum it up: chaos. The engineers at Netflix designed what they called “Chaos Monkey”: a program that “randomly terminates virtual machine instances and containers that run inside your production environment. Exposing engineers to failures frequently incentivizes them to build resilient services.” Notice that Chaos Monkey is not run against development, testing, or even staging environments. It’s deployed to production. With Chaos Monkey, Netflix ensured that even if entire servers went down during the busiest hours of the day, the service would still function without unplanned downtime.
Later on, they introduced Chaos Kong, a program similar to Chaos Monkey except that, instead of powering off servers, it simulates shutting down entire regions. Hence, when the unexpected outage happened, they were already prepared for it and could easily fail over to another healthy region.
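The core idea can be illustrated with a toy sketch. Note that the function and instance names below are invented for illustration; this is not Netflix’s actual implementation, just the “kill something at random and check the service survives” principle:

```python
import random

def chaos_strike(instances: list[str]) -> list[str]:
    """Toy chaos experiment: 'terminate' one random instance, return the survivors."""
    victim = random.choice(instances)
    return [i for i in instances if i != victim]

def service_is_available(instances: list[str], min_replicas: int = 1) -> bool:
    """The service survives as long as enough replicas are still running."""
    return len(instances) >= min_replicas

fleet = ["web-1", "web-2", "web-3"]
survivors = chaos_strike(fleet)
# With 3 replicas, losing any single one still leaves enough capacity.
assert service_is_available(survivors, min_replicas=2)
```

The real value is not the killing itself but the habit it enforces: engineers design every component assuming this strike can land at any moment, in production.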
Netflix is served by different AWS regions (this is a hypothetical example and not how the real Netflix infrastructure is laid out)
Netflix ensured that its application was cloud-native: resilient enough to keep functioning even if an entire region of its cloud provider went offline.
It is the shared responsibility of developers and operators to ensure that the application is designed (code-wise and infrastructure-wise) to withstand failures and remain running.
But This is Old News; All Large Applications are Designed That Way!
High availability is not a new concept. It has been around perhaps since computers were first used in mission-critical projects. In an active-passive design, for example, a replica of the component is always on standby; it does not serve requests, but the system can quickly fail over to it once the active replica ceases to function. In more complicated designs, both replicas serve requests simultaneously (active-active design). Such patterns were hard to implement because most applications were not highly available by architecture. Workarounds had to be made, third-party tools were used and, more importantly, the costs were high. Because of that, those designs treated failure as an “unexpected event”, something very unlikely to happen. Only the most critical parts of the system (for example, the database) were made highly available. Cloud-native design, on the other hand, treats malfunction as the rule rather than the exception. Part of the design entails that every component may fail at any time, so it must be ready for that.
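The active-passive pattern can be sketched in a few lines. This is a minimal, in-memory illustration (class and replica names are ours): route every request to the active replica, and promote the standby the moment the active one fails:

```python
class Replica:
    """A trivial stand-in for one deployed copy of a component."""
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"

def failover_handle(active: Replica, standby: Replica, request: str) -> str:
    """Active-passive routing: try the active replica, fall back to the standby."""
    try:
        return active.handle(request)
    except ConnectionError:
        # The active replica failed; the standby takes over.
        return standby.handle(request)

primary, backup = Replica("primary"), Replica("backup")
primary.healthy = False                            # simulate an outage
print(failover_handle(primary, backup, "GET /"))   # → backup handled GET /
```

In an active-active setup, both replicas would appear behind a load balancer instead, and the failure handling becomes “stop routing to the unhealthy one”.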
Architecting, Developing, And Operating in The Cloud Age
A little more than a decade ago, it was perfectly acceptable from a client’s point of view that a system could go “down for maintenance”. Smart devices were only smartphones, most applications didn’t require constant connectivity to the internet, and people were wondering what YouTube was all about. We read the news in the newspaper, watched movies on TV, and played games on offline consoles. But today, everything is somehow connected to the internet and to everything else. People expect the world to be always online and available. Your company’s product app experienced a few minutes of unplanned (or even planned) downtime? You might as well search for another line of business.
Achieving that level of resilience is a collective responsibility. Architects design a system that does not need to be shut down or restarted for patching or upgrades. Developers ensure that such a design is correctly implemented and tested. Operators guarantee that the underlying infrastructure is never a single point of failure.
In the Industrial Revolution 4.0, you can’t afford downtimes
Now, let’s have a look at what cloud-native applications require nowadays.
As we’ve just discussed, planned or unplanned downtimes are no longer acceptable. Some companies perform complex calculations to figure out the cost of downtime per minute. For major software players like AWS, Microsoft, and Google, this figure runs into thousands of dollars per minute.
Microservices, Agile, And Continuous Everything
It is no secret that the recent 4th Industrial Revolution is the main driver behind the Agile software development methodology and DevOps. Cloud computing lowered the entry barrier to the software market to the extent that virtually anybody could build a business around a web application. As a result, competition rose to unprecedented levels. Applications are now in a state of “continuous improvement”: there are always new features to roll out, bugs appear and get fixed, then new features are required, and so on. To achieve such a level of agility, practices like continuous integration, continuous delivery, and continuous deployment have to be followed. The DevOps culture was also born to ensure that the application lifecycle runs smoothly, efficiently and as fast as possible. Monolithic designs were obviously incompatible with these dramatic changes and inefficient at meeting the new demands. Large monolithic applications had to be broken down into several small, atomic, and independent microservices. When a new feature is ready to be deployed, only the component that needs this feature is updated, rather than the whole application. The microservices pattern allows even more agility. A cloud-native application is typically built using the microservices architecture.
Build Once, Run Anywhere
Today, a smart device is no longer a fancy name for a cell phone. Tablets, smart TVs, Apple TV, car audio systems, and many others are all potential clients for your application. Users expect the same experience they have on their desktops even when using other devices. A cloud-native application is capable of running on most devices with the same quality level. Following the microservices pattern, the application can be designed to have one backend that provides the core functionality and several frontend components. For example, a cloud-native application should support at least iOS and Android in addition to the desktop version. Many technologies were born to address this need, e.g., React and React Native.
A Practical Example: Cloud-Native Job App
Let’s see how the above concepts could be applied to a hypothetical job-search cloud-native application. We’ll cover one workflow as an example: the user registration process. So, assume that our user picks up her phone, downloads the application, and wants to try it for the first time. The client application would contact its backend as follows:
- The user opens the homepage of the application, which displays the latest jobs. The jobs component is served by two replicas behind a load balancer.
- The user needs to search for a suitable job. Only registered users can do that, so the jobs component redirects the client to the registration component.
- The registration service is handled by two application instances behind a load balancer. The user enters her information and credentials. The client posts the data to the registration service.
- The registration service stores the data to a stateful component that saves the user information and redirects the user to the login page (the login app).
- The user supplies the newly created credentials to the login app, the login app creates an authentication token and stores it on another stateful service.
- Finally, the login app redirects the user to the home page where she can search for jobs.
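The registration and login steps above can be sketched in code. Everything here is a hypothetical stand-in: the two dictionaries play the role of the stateful components, and the functions play the role of the registration and login services:

```python
# In-memory stand-ins for the two stateful components in the workflow.
users: dict[str, str] = {}      # user information and credentials
sessions: dict[str, str] = {}   # authentication tokens

def register(username: str, password: str) -> str:
    """Registration service: store the user, then redirect to the login app."""
    users[username] = password
    return "redirect:/login"

def login(username: str, password: str) -> str:
    """Login app: verify credentials, issue a token, store it in the token store."""
    if users.get(username) != password:
        raise PermissionError("invalid credentials")
    token = f"token-for-{username}"   # a real service would generate a random token
    sessions[token] = username
    return token

assert register("jane", "s3cret") == "redirect:/login"
token = login("jane", "s3cret")
assert sessions[token] == "jane"      # the user can now search for jobs
```

In a real deployment each function would be its own service behind a load balancer, and the dictionaries would be databases or caches.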
The above workflow can be depicted using the below diagram:
Note that the above scenario is narrated at a very abstract level. The exact implementation largely depends on the infrastructure type. For example, each microservice can be deployed to a compute instance, with the load balancer managed by the cloud service provider. It can also be deployed to a Docker container, with an orchestration system like Kubernetes managing it. We also didn’t cover how the services should communicate with each other. Again, this depends on your specific environment and business requirements. For example, you can use synchronous communication by sending and receiving direct API calls. Alternatively, you can use a message-queuing system like RabbitMQ and follow the asynchronous communication model.
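As a toy illustration of the asynchronous model, a plain in-memory queue can stand in for a broker like RabbitMQ (the event shape and service names are invented for this example):

```python
import queue

# An in-memory queue standing in for a message broker such as RabbitMQ.
events: "queue.Queue[dict]" = queue.Queue()

def publish_user_registered(username: str) -> None:
    """The registration service publishes an event instead of calling peers directly."""
    events.put({"type": "user_registered", "user": username})

def jobs_service_consume() -> str:
    """The jobs component consumes the event whenever it is ready to."""
    event = events.get()
    return f"jobs component now knows about {event['user']}"

publish_user_registered("jane")
print(jobs_service_consume())   # → jobs component now knows about jane
```

The key property is decoupling: the registration service neither knows nor cares which services consume the event, or when, which is exactly what makes asynchronous designs resilient and what introduces the consistency delays discussed later.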
When Not To Design Your Apps To Be Cloud-Native?
So far, we have discussed the merits of cloud-native apps and the microservices design. However, as with any other technology, pattern, or method, it is not a silver bullet that solves all the world’s problems. There are some scenarios where the cloud-native design may work against you. Let’s have a brief look at those use cases:
- Sometimes, the application is so compact, with so few functions, that breaking it into disparate components would be overkill and add unnecessary layers of complexity. Embedded systems are a typical example.
- Microservices, by definition, break the whole application into small interconnected components. Because of network latency and other factors inherent to distributed systems, it may take a few seconds or even minutes for all the application’s services to become consistent with each other. In our job-application example, it may take a few seconds for all the application’s parts to be notified that a new user was added and to act accordingly. Obviously, the application design should take that into consideration: once the user clicks the signup button, the application should not respond with a success message until all the concerned services acknowledge receiving and processing this event. However, some mission-critical applications, like financial and stock-market systems, cannot tolerate delayed consistency. Think for a moment how stock traders may incur great losses because they received delayed information about a stock price.
- Some legacy applications that were built many years ago cannot use the cloud-native or microservices approaches without rewriting them all over again. Deciding to rebuild an application, especially if it was a complex one, is a decision that requires a lot of consideration. You should weigh the gains and losses of moving such legacy applications to the cloud in terms of time, money and maintenance overhead. After all, it might be the best decision not to convert the software to a cloud-native one.
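The acknowledgment rule from the consistency point above, don’t report success until every concerned service has processed the event, can be sketched like this (the services here are hypothetical placeholders):

```python
def signup(username: str, services: list) -> str:
    """Report success only if every concerned service acknowledges the new user."""
    acks = [svc(username) for svc in services]   # notify each service, collect acks
    if all(acks):
        return "success"
    return "pending"   # at least one service has not caught up yet

# Placeholder services: each returns True once it has processed the event.
profile_svc = lambda user: True
search_index_svc = lambda user: True
billing_svc = lambda user: False   # simulate a lagging service

assert signup("jane", [profile_svc, search_index_svc]) == "success"
assert signup("jane", [profile_svc, billing_svc]) == "pending"
```

A system that cannot tolerate the “pending” state, such as a trading platform, is a poor fit for this eventually-consistent style.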
TL;DR
- Cloud-native application is a term that refers to software designed to run on unpredictable and ever-changing infrastructure, such as the cloud.
- In order for an application to survive in the competitive market, it must be “always-on”; planned or unplanned downtimes must be avoided at all costs. It must also be designed in a way that allows it to improve over several iterations in which new features are added or bugs get fixed. DevOps practices like CI/CD support such an agile development methodology. Finally, a modern application must be compatible with devices other than desktops.
- You should be aware that some business requirements cannot be fulfilled by a cloud-native app and/or a microservices design: for example, embedded systems, applications that are greatly affected by delayed data consistency, and legacy systems that are not worth rebuilding for the sole purpose of becoming cloud-native.