Resilience in Action

Resilience in Action E15: Scaling SRE and DevOps operations for the company that mouses around and its galaxies far far away with Brian Scott

September 09, 2022 Kurt Andersen, Blameless Season 1 Episode 15

Brian Scott is the 1st Technology Evangelist/Engineering Advocate at the Walt Disney Company. In his first 7 years with the company, he served as Staff SRE and Manager of Systems Reliability Engineering building teams in Disney Studios, Lucasfilm, Streaming, 21st Century Fox, and Disney Imagineering, supporting Online Properties at scale across the Enterprise. He has been in DevOps and Web Development for over 20 years working at such companies as MySpace, RazorGator, and many other startups before joining Disney focusing on Public Cloud Governance and Automation, DevOps & SRE. Brian is a contributor to Open Source, a father & huge Go fan! His personal blog is https://brianlscott.com

Kurt Andersen (00:28):

Hello, I'm Kurt Andersen, and welcome back to Resilience in Action. Today, we are talking with Brian Scott. Brian was one of the participants in the SREcon Americas conference back in March, at which he talked about his role as an SRE evangelist. So I wanted to follow up with him, and find out more about that role and how it has come to be a thing at his company. Brian, would you like to introduce yourself to our audience?

Brian Scott (00:59):

Yeah, definitely. Thanks. So I'm Brian Scott, currently in a Chief Technologist role, or Tech Evangelist role, with a large media and entertainment company based in the Burbank area. We do like to mouse around, and yeah, I've actually been giving a few talks recently around SRE and how to implement it across a large enterprise, most recently at SREcon, but then also at SCaLE, which is the Southern California Linux Expo, here in Southern Cal.

Kurt Andersen (01:36):

Cool. So how'd you come to get involved with SRE? Let's start with that before we talk about how to implement it in a company.

Brian Scott (01:45):

Oh yeah, definitely. Definitely. I've been in the industry for roughly about 25 years, was very fortunate to start out in tech right out of high school, I actually ran the network at my high school, and then immediately after that worked for a very large retail company, primarily brick and mortar stores, doing IT, and then moved around to some startups, some pretty cool startups in the dot-com era. So places like MySpace, a lot of high-scale applications, and moved into really supporting large-scale dot-coms. And the company that I'm at now, I've been there for roughly 10 years, started out in a group that was more focused on taking advantage of the internet, this big, giant boom of the internet. How do we take advantage of that? How do we rebrand ourselves from an entertainment company really wanting to take advantage of online properties?

I started out in this very small group supporting the flagship company brand, this company's online presence, helping launch various new products very quickly. And so mostly at that time, so this is 2012, 2013, 2014, a lot of online gaming. Launching new games, taking advantage of all the IP that this company was actually building and creating, and we quickly noticed that the company started growing in many different areas. All these other acquisitions started happening, most recently with the galaxy far, far away. And so we really had to strategize and figure out how we were going to scale operations and development. Now, the company that I currently work for is broken out into multiple different companies, but we do operate all as one. From an SRE perspective, we really partner with a lot of the different segments and business units, and so for me, growing in that internet group and then starting to expand, the need to start building out other teams to help all these segments and business units really became apparent.

So during my early career there, I was fortunate enough to help grow and build a lot of these SRE teams, and embed them into different business units across the company, and really start doing the actual knowledge share. And I think that's the actual secret: "Hey, Team A over here that's embedded with this sports BU is collaborating now with this SRE team, and now I'm supporting something like streaming." And getting to actually share ideas and technology in many different ways on how they're solving many of the same problems, but just in different ways. And so at the time I had roughly about five or six teams, and not only did we have our own stand-ups, and our own ways of working with the Project Managers within each segment, but I would bring them together every month. So we actually started collaborating, and shared those ideas, automation, things like [inaudible 00:04:59] modules. At that time, roughly about 2012, 2013, it was all about Chef. So Chef was really brought into the company, and that's where we really built this library of cookbooks to allow SREs, or operators at the time, to actually start developing these cookbooks, and we would host them in this online library, mainly at the time on GitHub.

And the way that we helped teams collaborate on all this automation is we would pick librarians, folks from all across the company that would meet up once a month, and we would go through this software life cycle of reviewing all the requests that people made for features to be brought into these cookbooks. So that's early stages of really how SRE, and DevOps, and all the automation really came to be within this media company. And then there's a fundamental shift towards containers, and so that's where I'm at now, and really that five to six year journey of how we started to really help teams, or help our business units, build these SRE teams focused in different pillars.

Kurt Andersen (06:15):

So initially, these five or six teams were all, you were managing these five or six teams, is that right?

Brian Scott (06:21):

That's correct.

Kurt Andersen (06:22):

Then moved an extra level of interaction by getting a librarian, almost like a team lead, from these different teams, if I'm understanding correctly?

Brian Scott (06:33):

Each one of our teams, we try to incorporate a team lead, that is usually someone who's very passionate about leading a team, someone who's typically in a staff or a senior staff role. And so staffs and senior staffs are treated more like leaders on a team, really engaging with the business unit to ensure that the SRE team is meeting the demand, and the needs of the products that that business unit is actually creating. And so a lot of times it's that SRE team's job to hold quarterly business reviews with that business unit, to talk about all the products that they're managing for them from an infrastructure perspective, talk about all the incidents as well, all the postmortems, and then also from a cost optimization standpoint, work with that business unit's finance teams as well, to help them cost optimize for whatever products they're running either on-prem or in the cloud. That team lead role is really important.

Kurt Andersen (07:38):

Yeah, it sounds like it. How do you pick which business units, or sub organizations get SRE teams? Because SRE are always in demand, and it seems like there is always more demand than there are people.

Brian Scott (07:59):

As my leader always says, SRE is a scarce resource, and so we really have to focus on high value targets. So really for us, the core of our SRE organization is within corporate. But do understand that every pillar within this media company, they do have their own technology centers. They have their SRE teams that work directly for each pillar, but we're looked at more as professional services. Although we have a dedicated SRE team embedded in every one of these pillars within the media company supporting all the various businesses, we also work with their technology teams that directly work for that business unit and/or that segment. A lot of the work that we get is actually word of mouth. Or, hey, this one business unit, their leaders or their CTO is talking to a different CTO, and suddenly we get an email and it turns into a whole new engagement.

A lot of our engagements are short and long. We have engagements where, "Hey, a product team just needs an SRE for a fraction of a day to fix something." It could be just connecting two teams together. This dev team wants to understand, "Hey, how do we incorporate SSO or single sign on into our application?" And where we might dive in with them for two or three weeks, and then that engagement will end, and we'll move on to the next one. Or in some cases, we have engagements that may last six months where we're building out infrastructure, helping the Dev team with CICD, fully embedded in their scrums. And then at that point in time, we may hand off the work onto the centralized technology team that's in that segment, and move onto the next thing. Our engagements can go from anywhere from a few hours, to a few weeks, to even a few years. Currently I've been personally embedded with one current BU that focuses on a galaxy far far away, and I've been there for roughly six months.

These engagements are very variable, depending on what that business unit or that product team needs. So again, a lot of it is very much word of mouth, or especially since my role is the evangelism, me going out and actually talking to teams. And a good example of that is, I may come across someone in Slack having a problem. And I might say, "Hey, do you want to jump on Zoom for 15 minutes, and we can power through it together?" This could be the first time I'm meeting this person, or it could be a person that I just randomly talk to every so often. Sometimes I'm just reaching out to people who I never really met before, and then next thing I know it turns into a full-on engagement, and they want more of our help.

Kurt Andersen (10:59):

Okay. Interesting. Tell me a little bit more about this tech evangelist role. Certainly you don't just troll Slack all day, looking for people who need your help.

Brian Scott (11:09):

Yeah, definitely. Obviously managing teams was taking a lot of my time. And so during that time as an SRE manager, I really started getting into the knack of helping out many different teams across the entire enterprise. Now this company is built of roughly 220,000 employees, so I'm meeting someone new every single day. Yeah. Pretty good size, right? So naturally a good 30, 40% of my time was actually helping teams learn and discover new technology, but also helping them find a SRE team to actually go out and support their needs.

A lot of my time started really moving already towards this evangelist role where I'm evangelizing technology, I'm helping business units understand what tech, or what new tech is actually out there, helping teams solve problems, but also connecting teams together. Think of it as being the bees cross-pollinating across all these many different flowers. And so my leader and I were talking, and we're like, "You know what? I think we need a dedicated evangelist role." Now, in some other companies, at Microsoft, or Amazon, you may hear the term DevRel, where you have these cloud developer advocates, or Microsoft has the cloud advocates. Well, we thought it might be that time for this company to have an SRE evangelist. So I immediately moved into that role, and so I'm now supporting our teams in a direct capacity, from an IC perspective. That's an individual contributor, and I'm now not only sending out newsletters with updates on what different types of technology teams are actually working on, but I'm also creating internal podcasts, working on blogs, where I'm actually doing interviews with engineers internally, and posting them onto a medium that other engineers can actually watch and consume. But I'm also distilling a lot of technology updates in bite-size chunks for our executive leadership to actually consume, and quickly get an understanding of what different teams are actually working on at a high level.

Kurt Andersen (13:39):

Okay. Interesting. That's two sides to the role that I hadn't anticipated, educate the executives being one of them.

Brian Scott (13:48):

Yeah. I think it is important, because obviously they have a lot of things to think about. Being able to showcase what our engineering teams are working on, and how they're not only bringing happiness to other employees of the company, but how the value that they're adding eventually affects the customers that we're actually serving on the outside. I think it's very important, and it really gives the engineers that visibility that they may not have time for.

Kurt Andersen (14:20):

Right. In your experience, what's been the hardest part of the SRE mindset for organizations to grasp and take hold of?

Brian Scott (14:37):

I think it's the ability to fail fast, putting in a good agile process, that's probably the biggest part of it, to be honest, is being able to come up with, "Hey, SRE works in sprints as well. Not just like developers. We can be part of your sprint cycles. We can work in a very agile way. We can be able to actually fail fast, and actually treat our whole process as a software development life cycle process." A lot of organizations, even within our own business, within this media company, are still operating in a very much waterfall way. So getting them to think about agile, getting them to think about, "Hey, it's okay if we're deploying 20, 30 times a day." It's more about how do we create the ability for our businesses to move fast, fail fast, but also be successful, and have a way to think about reliability in a different way.

Some of those things are, "Hey, let's treat postmortems as a learning process. Let's remove the actual blame part of it. Let's make it more about educating others and being successful together. And that incident is again, just a learning opportunity. That doesn't mean that we're doing something wrong. It means that we need to iterate and improve." So it's really taking all the principles of SRE, and really educating these product teams and/or these business units that may have never operated in that model before. Especially for business units that are used to going out to some external vendor to get their website made, or to get their product made. And then suddenly this vendor comes in, and now it's this black box, or it's built in a way that's not very reliable and/or scalable. So we within SRE tend to think of ourselves as technology Sherpas. We also work with our business units if they have external vendors, and we do vendor reviews, we do third party reviews of, let's say, some external SaaS solution. If we're trying to bring in some vendor solution, like a CI/CD product or something, we'll do a full review with our security teams, and really help the BU understand the pros and cons of using such a technology.

Kurt Andersen (17:01):

Okay. Yeah, that's a great role. I have not necessarily encountered other teams aspiring to that role, or taking that degree of proactive involvement with the business units, but that's great. Because you've got the technology insights to really enhance some of those things such as the vendor reviews.

Brian Scott (17:23):

I think one of the great things too, because the company that I'm in does have a lot of hands in different pots, if you will, is that our SREs are able to actually move around and gain different knowledge, and share knowledge. So I might take someone who's working on streaming, and move them into a place where, "Hey, we're deploying physical attractions on-prem." So you can think of the Happiest Place on Earth. Well, now we're taking the SRE mindset and we're applying it to these physical attractions in, let's say, a park, and allowing these SREs to gain a whole different knowledge, and bring that knowledge from this other engagement that they were already doing for X number of years, and really learning something new, and applying those same principles, and vice versa.

Kurt Andersen (18:17):

Yeah. I imagine there's a few different aspects when you're coordinating with physical hardware than if you're just doing it in software.

Brian Scott (18:25):

Exactly, exactly. Yeah. It's really, again, I think there are some core principles to SRE that apply to almost everything, and I would say one of those core pillars, which Kelsey Hightower tends to talk about, is empathy engineering. It's really taking it to the next level and thinking about your engineering, and thinking about how it's actually going to affect another engineer, who's either going to take over your work, or who's going to assist you in your work, and really making it a pleasant experience for everyone on your team to contribute to that same Infrastructure as Code, or to the same CI/CD pipelines that you're currently building.

Kurt Andersen (19:05):

How have you seen reliability evolve over the last 10 years or so? Is it becoming more common for executives and engineers to consider reliability? Or is it still an uphill push to get that in there?

Brian Scott (19:26):

Yeah. I would say now it's probably more normal practice. Before, I would have to talk to developers and/or the vendor and say, "Okay, did you build in good session management? And can sessions actually live across multiple instances of your application?" But now I would say reliability is much more commonplace today than it was even five years ago, especially now with cloud. I think there are even some companies that are coming full circle, that are coming back to on-prem, and taking a lot of the principles they learned about reliability in the cloud, and bringing that back to on-prem. A good example of this is five years ago, we'd think, "Oh, we live in a single data center. We're fine." Well, no. Now it's like, "Okay, well your app is in a single data center, but the product team has defined that they need a level of reliability to recover from an outage in that data center. So now you're technically in a single AZ, versus a multi-AZ or a multi-region setup."

So now we're taking a lot of the principles that we learned around creating SLIs, creating SLOs, and I think now it's very much more common practice that teams are thinking about reliability in a very different way. Now, obviously there is an emergence of serverless infrastructure, and these platforms like Fly, and Heroku, as well as Vercel, that are now allowing you to not even have to think about that. You just deploy your static site or your front end, and a lot of the reliability aspects are already taken care of for you. But then there is the other notion of this where, "Hey, now that we're built all on containers, well now we can migrate applications pretty quickly to a new site, and it might be okay for us to be down for that one or two minutes while that workload is shifting."

And I think to end it, I think what I'm really happy to actually start seeing is, "Hey, there is no such thing as 100% uptime." Which is great because before, a lot of the uphill battle was like, "No, we have to be up 100% of the time." And typically an SRE's answer to that is, "Okay, well, the more nines you want equals more dollar signs." So it's really getting the product teams to think about, "Okay, what are our SLIs and SLOs? And given the outcomes that we expect from our product and how they affect our customer, is it okay if we can be down for a few minutes?" And I think a big example of this is during the WWDC conference over at Apple, I think it was earlier this year or last year, Apple took down the Apple store for 20 minutes while they were updating the site, and it took the internet by storm. And it really showed, "Hey, it's okay to go down for a little while, because they're making the experience better." So I think at the end of the day, everyone needs to pull down these shades, and just realize, "Hey, we're all humans." So we have to think about reliability. We got to think about how do we reduce failure cases. Not make it perfect, but how do we keep ourselves curious, and keep on iterating over making things better?
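To put rough numbers on "the more nines you want equals more dollar signs," the sketch below converts an availability target into the downtime it permits. It is an illustration only: the 30-day window and the function name `allowed_downtime_minutes` are assumptions, not something described in the conversation.

```python
# Illustrative only: how much downtime a given availability target permits.
# The 30-day window is an assumption; teams pick their own SLO windows.
MINUTES_PER_30_DAYS = 30 * 24 * 60


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return period_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # Each extra nine shrinks the budget by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Each additional nine cuts the permitted downtime tenfold, which is why the cost conversation with product teams tends to start from what customers can actually tolerate.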

Kurt Andersen (22:47):

Yeah. Iterating, making things better, continuous improvement, all part of the SRE game, isn't it?

Brian Scott (22:54):

Yep, yep. So yeah, I think SRE definitely plays a key role in that.

Kurt Andersen (23:01):

Yeah. How do you structure your teams, or how is on call handled? Because that's obviously a perennial topic of interest to people. Do you go with multiple teams in separate geographies? Do you just have a single team that's large enough that they can handle a reasonable rotation? Is it a mix?

Brian Scott (23:24):

Yeah, great question. So I'll be very transparent. I do not believe in the NOC, and I'll tell you why. I served on a NOC for many years, and there is definitely a place for a NOC type of center within an organization. One of its roles is communication to executives and to other teams, which I think is great. Now from an on-call perspective, when I was managing teams, I focused on building very small teams. To give you an example of that, one of my teams was a small team of three SREs supporting a streaming service that's not one of the top three, but it's a separate streaming service, with, I would say, a fairly high number of users and a fairly large scope in terms of infrastructure. We ran a number of different [inaudible 00:24:33] clusters. We only had three people, and we designed it in a way where we really took advantage of automation to allow our team of three to not only support infrastructure of roughly, just to give you a scale, 500 nodes or more within a cloud provider, but to empower our developers to not only contribute to infrastructure as code, but to do so in a way that reduced the load, and the amount of work that the SRE team had to do during an actual incident.

The reliability piece really played into that. When you carefully plan out your infrastructure, and carefully design a good model, you can operate large scale infrastructure with a very small team. That's exactly how I built my teams. My teams, again, I had five teams, no team was bigger than three to five people, and each team operated roughly anywhere from one very large product to a couple of thousand smaller products. You can imagine a team where we were running, as an example, let's just say 1,000 different websites, and those are ranging from Node.js, to Python, to WordPress, to very elaborate microservice-type, what is it, build-outs.

But we built the automation not only through chat bots and through a GitFlow system within GitLab, but we really empowered again the Devs to push out changes as needed. So if there was an incident at, let's say, 2:00 AM, and that particular feature or bug only affected a subset of users, because we had already sat with the product owners to define the SLIs and SLOs for all of our services, we ended up moving all the noise down, to where we really only reached out to an actual on-call person, or an on-call engineer, if it really impacted the actual customer. Now, as far as how we designed that on-call model, we have what's called an incident manager, and the incident manager role rotates across different managers. So not only the SRE Manager, but the QA Manager, the Product Manager, including the Software Manager, and the incident manager role would rotate every Sunday, just as the SRE on-call role would change every Sunday. The incident manager's job when they got the actual page was, "Hey, this particular feature, well, I have to reach out to Team A, B and C." So while they focused on communications and pulling in other teams, the SRE can actually focus on the actual incident and getting it resolved.
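The Sunday handoff described here can be sketched as a small scheduling function. This is a minimal illustration, not the team's actual tooling: the roster order, the reference date `EPOCH_SUNDAY`, and the function names are all assumptions; only the four manager roles and the weekly Sunday rotation come from the conversation.

```python
from datetime import date

# Roles come from the conversation; the ordering is an assumption.
INCIDENT_MANAGERS = ["SRE Manager", "QA Manager", "Product Manager", "Software Manager"]

# An arbitrary Sunday used as "week zero" (assumed reference point).
EPOCH_SUNDAY = date(2022, 1, 2)


def week_index(day: date) -> int:
    """Number of whole Sunday-to-Saturday weeks elapsed since EPOCH_SUNDAY."""
    return (day - EPOCH_SUNDAY).days // 7


def on_duty(roster: list[str], day: date) -> str:
    """Return who holds the rotating role on the given day."""
    return roster[week_index(day) % len(roster)]


if __name__ == "__main__":
    for day in (date(2022, 9, 10), date(2022, 9, 11)):
        print(day.isoformat(), on_duty(INCIDENT_MANAGERS, day))
```

The same function could drive the SRE on-call rotation with a different roster, since both roles change hands on the same day.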

Kurt Andersen (27:40):

Very good. That sounds like it's by the book. You've got SLOs. You only alert people when it's actually affecting the customer. That's awesome.
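One way to picture "only alert people when it's actually affecting the customer" is a paging check that compares a measured SLI against its SLO target. The sketch below is a simplified assumption: the `Slo` class, the request counts, and the single-threshold rule are hypothetical, and real setups often add time windows and burn-rate logic on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical objective agreed with the product owner."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of good events in the window."""
    if total_events == 0:
        return 1.0  # no traffic means nothing was violated
    return good_events / total_events


def should_page(slo: Slo, good_events: int, total_events: int) -> bool:
    """Page a human only when the measured SLI drops below the SLO target."""
    return sli(good_events, total_events) < slo.target


if __name__ == "__main__":
    checkout = Slo(name="checkout-availability", target=0.999)
    print(should_page(checkout, good_events=9_995, total_events=10_000))
    print(should_page(checkout, good_events=9_900, total_events=10_000))
```

A bug that touches only a small subset of requests stays above the target and waits for business hours; a drop that breaches the objective wakes someone up.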

Brian Scott (27:52):

Now, I would say that not every team was perfect. Obviously there were some teams that didn't have SLIs and SLOs, and focused more on, again, firefighting, but I think every team is going to be at a different level when it comes to SRE and on-call. But I would definitely say it's a lot more mature today than it was a few years ago, to where now we're relying on services like PagerDuty rather than an actual human actually paging out.

Kurt Andersen (28:24):

Right. So what are you seeing that's interesting to you on the technology forefront? You mentioned before we started the recording, that you're doing a lot with Kubernetes and education around Kubernetes for folks in your tech evangelism role. What are you seeing coming next, that's interesting to you?

Brian Scott (28:43):

Oh, good question, man. I get asked this question a lot. I do want to see serverless take more of a role, which it already is in Kubernetes. I think right now, we're seeing things like Knative, which powers Google Cloud Run. We're seeing Lambda, we're seeing all these providers really push on the serverless front, and we're realizing that Kubernetes is actually powering a lot of this functionality on the back end. Kubernetes is now turning into this fabric that allows teams to not only jump around between cloud providers, but it's becoming this common API that any company can now leverage, to give the sense of serverless to empower developers to move fast. I think Kubernetes is now finding its home. It's now becoming more commonplace than just some cool, edgy tech that we should go try out. And secondly, I think AI and ML are taking center stage in a lot of the SRE and DevOps tooling that we are starting to actually use.

We're seeing AI incorporated into incident management, chat bots, making decisions. We're now seeing the emergence of CDKs. Not only are SREs and developers shifting from things like infrastructure as code with things like Pulumi and Terraform, but we're seeing SREs and developers now incorporate infrastructure as code using CDKs directly into their applications. So a good example of this is, I'm an SRE, or I'm a developer, and my Python application needs Redis and MySQL. Oh, now I can literally have my application talk to, let's say, Amazon, via the CDK, and pre-provision those services on start, without having an SRE pre-build those services before they actually deploy the actual application. Really allowing the application to define its needs, versus the human defining and building its needs. I want to say self-service is one, AI/ML, and then Kubernetes becoming a backbone fabric for infrastructure.
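The idea of an application declaring its own needs and having them provisioned on start can be sketched without any cloud dependency. Everything below is hypothetical: `Need`, `FakeProvisioner`, and the `.example.internal` endpoints are in-memory stand-ins invented for illustration, not the AWS CDK or any real SDK. The point is only the shape of the pattern: declare dependencies in code, then ensure them idempotently at startup.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Need:
    """A dependency the application declares for itself (hypothetical type)."""
    kind: str  # e.g. "redis" or "mysql"
    name: str  # logical name chosen by the application


@dataclass
class FakeProvisioner:
    """In-memory stand-in for a cloud API. NOT a real CDK or SDK client."""
    existing: dict = field(default_factory=dict)

    def ensure(self, need: Need) -> str:
        """Create the resource if it is missing and return its made-up endpoint."""
        key = (need.kind, need.name)
        if key not in self.existing:
            self.existing[key] = f"{need.kind}://{need.name}.example.internal"
        return self.existing[key]


def provision_on_start(needs: list[Need], provisioner: FakeProvisioner) -> dict[str, str]:
    """Resolve every declared need to an endpoint before the app serves traffic."""
    return {need.name: provisioner.ensure(need) for need in needs}


if __name__ == "__main__":
    app_needs = [Need("redis", "session-cache"), Need("mysql", "orders-db")]
    endpoints = provision_on_start(app_needs, FakeProvisioner())
    for name, endpoint in endpoints.items():
        print(name, endpoint)
```

Because `ensure` only creates what is missing, restarting the application does not duplicate resources, which is the property a real provisioning layer would also need.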

Kurt Andersen (31:17):

Okay. How much of a role do you find for an internal developer platform? A number of companies like Spotify and Netflix have built these internal frameworks that almost meet what you're describing there in terms of check the boxes of the other ancillary supporting services you need when your service instantiates. Do you find that to be a productive direction to go? Or have you used any of those?

Brian Scott (31:51):

Yeah, so we built several of our own within the company I work for, just like everyone else. I would say there is a certain use case for it. I think it makes sense depending on the use case. If you are a small team, or an organization that wants to empower your developers to move in what I call the golden road, where there are guardrails, and everything's built-in from a security perspective, that's great. Where I think you start to go downhill is if you pigeonhole your engineers into a platform like that for a very long time. And I say that because after a while, while you're keeping your developers and your engineers on this golden path with this platform that you built, you somewhat put blinders on them, and they forget how the platform runs from a technical perspective, and all the gears that move behind it. But also you're stifling innovation, you're not allowing your engineers to actually innovate, because you're putting them on this golden path that they can't really veer off of. So I would say platforms like that are good and bad, as long as you have the visibility and the knowhow to say, "Okay, it's time to make a change and innovate on this platform," or to make sure that you're incorporating the latest and greatest technology that may bring more value to your engineers if you're going to provide a platform as such.

Kurt Andersen (33:28):

Ah, great insight. The double edged sword of the golden road. Yeah. Awesome. Well, we are about at time for this episode. Any parting thoughts that you'd like to share with the audience, Brian? And thank you very much for having been a guest.

Brian Scott (33:48):

Yeah. Thank you so much. I would say to everyone out there as a Technologist, stay curious, always learn, and again, really push others to be successful, really be that great partner to want to teach others.

Kurt Andersen (34:04):

Excellent. Well thank you, Brian. And this has been another episode of Resilience in Action.