If you work with containers, you've probably heard the terms DevSecOps and SecDevOps. You might have seen them used separately or interchangeably. But is there a difference? While the jury is still out on that, one thing is certain: information security is our number one priority.
This is the first in a two-part series where we discuss the unspoken challenges of SecDevOps and shifting security left with Scott Surovich, Global Container Engineering Lead at HSBC.
The first part of the series explores the challenges involved in shifting security left in FinTech, getting all stakeholders to buy in, governance policies, and more. In the second part, we explore vendors' roles, open-source tools, and the challenges of working in a highly regulated industry like finance and banking.
Surovich has worked at HSBC bank for over 19 years. For almost two decades, he has worked in several different areas, including Windows engineering, Linux engineering, Citrix engineering, and virtualization. However, his passion for Kubernetes was ignited with the release of Kubernetes 1.6, and the rest, as they say, is history.
Today, he’s a Kubernetes evangelist and the co-author of Kubernetes and Docker - An Enterprise Guide. In his free time, Surovich plays hockey and enjoys 3D printing (and even has a collection of eight 3D printers!). As this is a two-part interview with Surovich, make sure to look out for Part II on the Magalix website.
1. Magalix: You've got almost 4,000 offices. We're talking 90,000 plus servers, and 21 data centers. You guys have also made a bet in the cloud and you have pools of infrastructure everywhere. So, could you kind of talk us through some of the day-to-day experience of managing something as Goliath as this?
Scott: Yeah, it takes a fleet of people to do that… you have 40,000 IT professionals. Some numbers go up, some go down. We have of course partnered with Amazon and Google, and with some Azure. All three, to be honest, and if you're going out into the China area there's of course Ali [Baba]. So yeah, we're consumers all across the board. Running something like Kubernetes as well, we try internally to, I'm going to say, stay with the same type of offering across all of them, because as you can imagine, it would be a nightmare if each of them went in their own direction. Some things have to. There's always the cloud-native portion that the vendors will offer when there's benefit to that. But one of the decisions we can talk about in a little bit is that we don't always use the native offering for certain things, because we have to consider a lot of things in a highly regulated industry like ours. We honestly, I think, have some of the best people I've ever worked with. Again, I've been here 19 years, and why would I do that? I like the people that I work with. I trust the people. We've got some talent that's really good, and that's what makes us successful.
2. Why did you guys decide to shift left? When we're dealing with a world of dynamic infrastructure, you know, secrets, credentials, all these things are taking a shift to the left. So, number one, where did that decision come from? What did that story look like for you? Why did you guys make that decision? As a guess, was it top-down or bottom-up?
Scott: I'll take the last part first. It's both. Our execs keep buying the technology. They know what the shifts are. They talk to the CEOs at Google, the CEOs at Amazon, the CEOs of vendors, so they know where that goes. Of course, the geeks like us down in the trenches, we're always playing with it and wanting it to go that way, and then of course, the developers. If you want to be quick, you know, if you're going to throw around the terms, you want to be agile, you want to get into that, you have to shift.
You want new features going out to your customers. You have to stay ahead of the competition. In the legacy way, or heritage way, depending on which term you want to use, you can't do that, obviously. You've got your VM with all your processes. You want to upgrade one? Well, now you need downtime.
How do you schedule that? That's why releases were slow, and if they weren't slow, they were painful, because you had people working [at] two in the morning. You were doing 80-hour weeks because you were trying to work outside of maintenance windows, because systems just weren't made to be upgraded in a rolling fashion. Or you didn't have the capacity to say, okay, you know what, I want to do a rolling upgrade. Let's install 20 new VMs or 20 new physical servers just to shift workloads over. It just couldn't be done, and of course, in today's world, not [with] security.
Security is always important, but even more so now [that] nobody wants to be on the front page of anything. I mean, that's just [the] brand. You've got to keep your brand. But we want to protect the customers, we want to protect ourselves, so let's check that at every step of the process and catch it early. Find out what the problems are. Hopefully, no problems get deployed. As they say, fail fast, just so you can get stuff resolved.
3. How long has it taken you guys to kind of get to this point of saying, "hey look we're going to go the container route, we're doing multi-cloud, and now we want to shift our mindset left." Can you give us a rough idea of how long that's kind of taken for you guys to get to this point?
Scott: Yeah, it's not done. I think it's going to be ongoing. Our company is still evolving too. It's kind of like when somebody says, "how often do you upgrade Kubernetes?" I'm like, "oh boy, you're going to go every three months at a minimum, not counting the minors in there for any kind of security [fixes]." So, I think the process has changed too. There are tools coming out. There are vendors offering solutions when you're getting into the policies… It was that kind of stuff when we started. You know, you're doing OPA policies. Great, we have them, we've got multiple clusters. How do we make sure that those policies are in fact installed on each one, and if we update one or add one, what do we do? How do we keep a library of those? You know, policy as code: what tools can you use?
So, it's evolving. But I'd say the journey [has] probably been two [or] three years now. We were doing some stuff in the cloud before that as well, but it's just now that we're getting the teams forming... It's breaking down those barriers in a typical enterprise to get people talking and agreeing, and again, it works surprisingly well, more because of the people. When you all agree that this is the way you want to go, that's when you get people lined up, and you don't quite run into the silos where they'll say, "you know, this is my space; we've got to do it this way." You try to work with everybody.
4. You mentioned OPA, and it's a chapter in your book. I wanted to talk about that for a second. Also, what motivated you to write the book, Kubernetes and Docker – An Enterprise Guide? Can you kind of walk us through your thought process behind that?
Scott: Honestly, because for a lot of the topics that we cover in there, you start to look, and you can find blogs, you can find little things here and there… you get a blog with two pages and be like, "cool, maybe I did those steps, but what did I do?"
Sure, certain people are going to pick it up. If you've been in the industry a while, you'll figure it out… You need a starting point. But that doesn't help the other people trying to get in. Marc, who's my co-author, and I just kind of said, "you know, there'd be some good topics here, and we live and breathe this every day, why don't we just write some stuff down?" And of course, we hoped for the best: that people care, that they want to read it, that they like the topics, and it has been well received.
It's all timing. If you go out and look at the books today, you'll find a ton of intro books: here's how you create a StatefulSet, here's how you do a pod, here's what you do here, and that's great. You've got to start there.
But back when this all started, people weren't thinking about policies. They weren't even thinking of PSPs. Most people got their clusters deployed and then put some basic functionality in there for some security and logging. Especially authentication, though, that was a big push for us. We've noticed a lot of people really did not understand how to integrate an OIDC provider and how you would integrate that with Active Directory or LDAP or SAML, you know, whatever you want to use. That's how we evolved from there, saying, "great, you've got the users that you want to put in the policies, now what do you use for those?" We do deal with PSPs because you have to, currently.
As we all know, it's being deprecated, and that's why we had OPA in there; logically, that'll be the replacement for PSPs. Not for me to say… but it makes logical sense. The project graduated a month ago, so I think you'll see a lot of people hopping into that, and then of course security.
5. I know you touched a lot on OPA and its footprint. I guess for those that are just starting the SecDevOps journey, the [SDLC], OPA, policy-as-code, like you said, they'll see these blog posts, but there's really no translation from those instructions into how to apply them to a real-world use case. In your own terms, for those that are just starting, what is OPA and what do you use it for?
Scott: It's a great question, and it's funny, because you hear about OPA now and all you ever think about is people talking about Kubernetes, which of course is not the only use case. But that's what's bringing it to the front, which is great. For the industry as a whole, what it can give you is great. But I boil it down to what people say: it's a decision engine, basically. Can you, or can't you?
That's what it comes down to. Okay, you want to do what? Okay, you want to deploy this, and you want to use hostPath on your deployment? Then it's going to say, "oh no, that's bad. We don't want you to do that." So OPA just has something in there that says, yep, if this contains that, [then] deny…
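As a rough illustration of the kind of "if this contains that, deny" rule Surovich describes, here is a minimal OPA Gatekeeper ConstraintTemplate that rejects pods mounting hostPath volumes. The template and package names are hypothetical; this is a sketch of the pattern, not HSBC's actual policy.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyhostpath        # hypothetical template name
spec:
  crd:
    spec:
      names:
        kind: K8sDenyHostPath
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyhostpath

        # Deny any pod that declares a hostPath volume.
        violation[{"msg": msg}] {
          volume := input.review.object.spec.volumes[_]
          volume.hostPath
          msg := sprintf("hostPath volume %q is not allowed", [volume.name])
        }
```

A matching K8sDenyHostPath constraint would then bind this rule to pods, and the admission webhook rejects any request that produces a violation.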
Now, you can take that further, for validation too. It's not just "deny"; it's, is this acceptable? So, it's great on the Kubernetes side, but now let's add that to Terraform. Let's do that… and have the same kind of engine, instead of having to have 20 products to do it. Because of course Terraform has its own thing already. But the more you can centralize multiple functions into one, the better, because as we talked about earlier, stuff's moving so fast that we have to stay on top of it, and if you start doing that with 20 different tools, it becomes a big, big challenge.
6. Again, for those that are just starting out on this journey, I think one of the things that could mislead people is thinking that OPA can solve all my problems. Once I have OPA installed I'm secure, right? Or I've got my governance covered. So, from your own experience, what is OPA good for?
Scott: That's a tricky question. First, I can tell you one thing about our tooling. We had our first ever kind of cluster that we were kicking around while we were still trying to figure out what offering we'd even use. So, what flavor? One of the first tools we brought in was OPA. We had OPA just because we needed the guard rails there, and honestly, like you said, it can do so much when you get into Rego… now of course, learning it is another matter. There's definitely a learning curve on getting into that.
We've got somebody on my team that's awesome at it and really enjoys Rego. To enjoy that, I don't know what you've got to be thinking. But he loves it. He lives and breathes it. You don't need that, though, because of course the Slack channel is great. There are people to help, and there are of course collections of policies, so you can get different policies if you go to the sites. But let's assume it can't stop something, or you don't have a policy to stop something that you don't even know about.
So, if you're running OPA, not Gatekeeper, you don't really have that. You won't even know something's happening until it happens, and people always say, I've got these guard rails in place already. It could be a PSP, it could be that you're running OpenShift, which has some built in already. But you have to have extra, and it's not just security, though, it's stability, as I see it. Everybody thinks OPA might just do the security side, you know. You have to have a trusted registry. You can't create a hostPath. You can't use an IPC mechanism, you know. You can't run with elevated privileges or run as root. But that's only half of it to me.
Stuff like checking ingress rules to make sure you don't have duplicates. I'm a big fan of multi-tenant simply because I'm utilizing my hardware correctly, not wasting resources. But I have to make sure now that people playing in the same playground don't bump heads on certain things, like an ingress rule, and ingress was one of the first reasons we brought it in, to be honest. So not even a security standpoint. Because when people are starting to test Kubernetes, to kick it around, they'd make an ingress called test. Pretty logical. A lot of people do that. The problem is, when 10 people do that, you don't know which one is which. You've got these ingress rules, and then stuff like that happened.

So, it's an ever-evolving process. You ask questions like, what doesn't it cover? Once something is deployed, OPA is not going to help you a lot, because it makes its decision at admission, and you have to know that. It's not a limitation. It's a design. That's what it's meant to do. So, you know, if your pod is already running... I think people have to realize how it works and understand what else you might need.
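The ingress collision Surovich describes can also be caught at admission time. Below is a hedged sketch, modeled on the pattern used in the public Gatekeeper policy library, of a ConstraintTemplate that rejects an Ingress whose host is already claimed elsewhere in the cluster. It assumes Gatekeeper has been configured to sync Ingress objects into `data.inventory`, and all names are illustrative.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8suniqueingresshost   # illustrative template name
spec:
  crd:
    spec:
      names:
        kind: K8sUniqueIngressHost
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8suniqueingresshost

        # Flag an incoming Ingress whose host is already used by a
        # different Ingress anywhere in the cluster. Requires Gatekeeper's
        # sync config to replicate Ingress objects into data.inventory.
        violation[{"msg": msg}] {
          host := input.review.object.spec.rules[_].host
          other := data.inventory.namespace[ns][_]["Ingress"][name]
          other.spec.rules[_].host == host
          not same_object(other, input.review.object)
          msg := sprintf("ingress host %q already used by %v/%v", [host, ns, name])
        }

        same_object(a, b) {
          a.metadata.namespace == b.metadata.namespace
          a.metadata.name == b.metadata.name
        }
```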
7. That’s a good segue to talk a little bit about the hurdles. You talked about whether you had this test cluster or this cluster you're kicking around testing ingress, applying the policy, just kind of checking to see what's happening. But as you start learning more and as that project internally matures and you start thinking about okay, I want to move this into a production workload, what are some of the technical challenges that you ran into when making that shift from your sandbox to the production workloads?
Scott: Yeah, some of the challenges. Onboarding in general, I think we've seen this a lot. It's a whole new way for developers, [and] I hate to just say developers, users of the cluster. Because you're going to get a lot of the off-the-shelf apps now that can come in like that. So, they're not developing but you might get a helm chart, you might get a deployment, you know, it could be a couple things.
It is a different world from virtualization and even PCF, you know, Cloud Foundry stuff. What we tend to run into a lot is wasted resources. For example, requests. You know, it's getting into the geeky side of Kubernetes, obviously. But people will try to deploy something and say, "well, I want four CPUs as my request." What they don't realize is that [it's] a reservation in Kubernetes, and nobody else can touch those now.
So, you're running a server at 100% requested, so you can't schedule anything else. The Kubernetes scheduler will say, yep, nothing goes there. But yet it's at 2% utilization. So, you have to intimately know your application, to know what it needs to turn the lights on and start up, and then limits, of course, as well. If you don't set those values, you start getting phone calls that say, I'm getting out-of-memory problems, and it's because your limit's too low. So, you've got to optimize, and in the old days, what we did on VMs is we just said, throw resources at it; whatever is not used will just be used by somebody else. You don't have that with requests. A request is pretty selfish, and it says, "nope, that's mine."
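To make the request/limit distinction concrete, here is a minimal pod spec showing the two settings Surovich contrasts; the names and image are placeholders. The request is reserved on a node whether or not it is used, while the limit caps actual consumption.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                  # hypothetical workload name
spec:
  containers:
    - name: app
      image: registry.example.com/demo-app:1.0   # placeholder image
      resources:
        requests:
          cpu: "250m"       # reserved on the node, even if the app sits idle
          memory: "256Mi"   # counted against the node's allocatable memory
        limits:
          cpu: "1"          # CPU is throttled above this
          memory: "512Mi"   # the container is OOM-killed above this
```

Asking for four whole CPUs as the request, as in Surovich's example, would block that capacity for everyone else even at 2% utilization, which is exactly the waste he describes.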
8. Because you could throw in a policy that says you have to set your limits, and maybe, if you take it a step further, you could say those limits can't be above, say, two gigs of memory. I'm sure once you start implementing these policies, no matter how much you try to educate, especially in a large organization like HSBC, sometimes that message doesn't come across as law. So, I'm guessing what could happen is you put in a policy, you try to educate, and then the developer says, "hey, wait a minute, where's my stuff?" Nothing's deployed, and so, I'm sure there's some kind of cultural challenge there too. So, could you talk a little bit about any of these cultural challenges that you may have run into?
Scott: We came into the cultural side because, and I'll go back to this again, it's changed a lot. So, legacy days: if you're a Windows or Linux person, especially in a big organization, your life is deploying your platform, giving people what they need with their server, and then you often walk away.
So, in my previous roles, I was like, okay, "yep," I owned, I don't know, 600 Windows servers under my name, a couple hundred Linux machines at a time. I couldn't have told you every application that was on there. But I didn't need to, to be honest. So, here's your stuff. That's great.
I couldn't know all the applications, as you saw on that list. We've got so many servers out there, and we are a global organization. I don't just deal in the United States. You get questions from China, from [the] UK, and it's not just me; my UK counterparts will get questions from the US people. So, you know, it was done that way. We wanted to make sure that when we got into the agile space, the microservices space, we gave the developers as much control as you can with guardrails, and that's a cultural shift.
It's, okay, you know what, here's your stuff: in our case, let's say, for example, in a multi-tenant setup, a namespace with enough guard rails on it, so quotas as an example. You definitely don't want to say no quotas. You can do almost everything you want there. Obviously, you can't do cluster-level stuff. You can't do CRDs or DaemonSets, but you want an ingress, it's yours.
You want some PVCs, it's yours. All within, again, default guardrails. Now of course, people outgrow that. That's understood. We have to increase those quotas. So, we have automation in place for namespaces, setting up the initial quota. We're trying to figure out other ways to automate.
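The per-namespace guard rails Surovich describes, quotas on a tenant namespace with cluster-level objects off limits, might look something like the ResourceQuota below. The namespace, name, and numbers are made up for illustration.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-default-quota   # hypothetical default quota name
  namespace: team-a            # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "8"          # total CPU the tenant may reserve
    requests.memory: 16Gi
    limits.cpu: "16"           # total CPU the tenant may burst to
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
    count/ingresses.networking.k8s.io: "5"
```

When a tenant outgrows these numbers, raising the quota is a one-object change, which is what makes the namespace automation Surovich mentions straightforward.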
Internally, you want to be like the cloud. You have to be. You have competition now outside your own organization for your own people, and they'll go to the best logical place, from cost to offerings to solutions. So, the biggest shift is communication. Because in my earlier years, I didn't have to worry about the developers as much, and it's not that I didn't want to. I didn't have time, and I don't know what benefit I would have offered. It's just a server.
Now, in this case, we have communication with them. I deal with people I've never had to in the past, and it's made the job fun. It makes it stressful, because you've got a lot of questions coming in. But you learn so much, and honestly, you look back now and say, well, what could we have done better in the virtualization days had we had this communication, rather than making a developer feel like they were being tied down by something we thought they needed or wanted, but we were wrong? Now we get into that. Do you need a service mesh? Okay, great. We'll look into that. You need a multi-cluster mesh? Okay, we've got that on the radar. We'll decide how we're going to do this…
So yeah, it's great, and I actually came from a developer background. That was one of the first things I did many years ago. Honestly, back then I just kind of got bored with it. It was COBOL, and to be honest, I didn't have a lot of fun in COBOL. So, I went to the infrastructure space. I try not to forget that I liked programming, and I wouldn't want somebody putting up all the guard rails. There have to be some. There's got to be some give and take on both sides to say, yeah, we realize that you can't just give us everything the way we want, but you know, it's a compromise.
9. I have one last one on the hurdles. This is interesting, because if you're in a small organization, you might have a gold-star team player that seems to know how to do everything. But in a large organization, you try to implement SecDevOps, and it's not just policies; there's image scanning, where the developer is pulling their base containers, there's code analysis, and there are all these other, for lack of a better word, steps involved in securing the software development life cycle. In a large organization especially, you may not have access to the CI/CD. You may not even know the person that manages that. They may be on the other side of the globe. So, what are some things that you have seen that get everyone kind of on the same page? Like, how does the CI person do their thing? How does someone like you do their thing? Then how do the security folks get involved, and so on and so forth?
Scott: Yeah, we definitely ran into it, especially at the start. It makes some sense, especially before we had official projects. I mean, it's a large organization; we have different priorities. The reason Kubernetes has actually been very successful for us, or that shift in general, let's not just say Kubernetes, is again the geekiness of the people I work with.
I mean, we actually had phone calls at night where we would just talk about Kubernetes and what we should do, with people all over the globe. I might have been up at 3:00 in the morning because I was talking to somebody in Hong Kong, and they did the same thing. So, we got that communication going, and that started it. I look back now and think, if we didn't do that, we probably wouldn't be where we are today. We just had to jump, and that always helps because you had that head start, and that's why we've partnered with vendors like Google. They ask for opinions, as I'm sure you guys have seen, because it is still in its infancy in general. So that helped to a point, obviously.
We had different teams involved. On those phone calls we had networking guys. It's great; some of those people I'd known for years, but not at that level. When I used to need networking things, I'd send in a request, so I knew who these people were. But we never had to truly bounce ideas off each other. Probably one of the unique things on this side of the house, the microservices side, and this is not just our company, I've seen it in other ones where I've talked to people, is that you see a collaboration you just didn't see in distributed computing before, where you did get the us-versus-them thing. It could very well be that people realized this stuff's merging.
At some point, if you read the documents now, people are saying, wow, in 2021 Kubernetes is going to become the orchestrator of the data center. So that's going to be VMs, storage, networking. It does a bunch of that already, as we know, but you see the vendors pushing that more. So, like you brought up scanning: obviously, cybersecurity, they're involved. Where we store our registries, where we store containers, you probably take some kind of artifact repository that you had before, and maybe it can serve as a registry. But that's probably a different team as well.
So, you have to talk among each other. We've just taken it upon ourselves as the leaders of certain technologies to have calls, and we'll have cybersecurity on, we'll have all the different cloud people that we have on. It might be the specialist for Amazon, specialists for GCP, the cyber people, it might be somebody who's a developer and we try to do these calls.
We do that on a bi-weekly basis. We try to do a customer forum where you bring in the developers and talk about what your roadmap might be. Let them offer input. Let them hear what's going on, instead of designing in a vacuum or a bubble and then suddenly they have the product and they just don't like it.
At Magalix, we engage Kubernetes experts to help you better understand the advantages of shifting left. In fact, we can help you define, deploy, and manage governance policies with an OPA policy execution engine, following Kubernetes’ best practices.
If you want to learn more about SecDevOps and how shifting left helps ensure robust security, you can get your copy of Scott Surovich and Marc Boorshtein’s book, Kubernetes and Docker - An Enterprise Guide: Effectively containerize applications, integrate enterprise systems, and scale applications in your enterprise, HERE.