ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

What is O11Y? Observability Demystified - Chris Riley - Splunk

By Harness Season 1 Episode 12

In this episode of ShipTalk, we are joined by Chris Riley who is a Senior Developer Advocate at Splunk. Chris's background spans years of professional services and being an advocate. Chris is part of the Observability Panel at {unscripted} 2021 and gives us a background in O11Y, aka Observability. 

With any new technology, there is an internal need to advocate for change. Chris also shares tips and tricks for those bringing in new technology on how to advocate for change and leave their mark on organizations.

Ravi Lachhman:

Well, hey, everybody. Welcome back to another episode of ShipTalk. Very excited today to be talking to my buddy Chris Riley, who is a Developer Advocate at Splunk. Chris, why don't you introduce yourself to the listeners if they don't know you?

Chris Riley:

Oh, there's people on here that don't know me? I'm just kidding. That was like the most arrogant intro you could think of. Yeah, my name is Chris Riley. I'm a Senior Tech Advocate at Splunk, which I usually say means that my career as a developer was not very successful. But I could not let go of the process and the activity that happens in application development. So I actually started out my career in managed services, pure IT, turning servers off and on every other week, and moved into application development, and had a lot of fun there working on some really cool stuff, but it wasn't my forte. And now I spend most of my time speaking at events, and also helping tech-enabled enterprises reconcile technology capabilities with reality, because as you know, there's a lot of people who think tools solve everything.

Ravi Lachhman:

Oh, absolutely. And it's the funny thing, right? Like, tools by themselves don't do anything. It's all about the culture. It's all about your opinion, or how opinionated your team members are. And that journey is not a science. It's certainly an art, an art form. So fun fact, Chris is actually going to be speaking at our conference, {unscripted}. So depending on when you listen to this podcast, Chris is going to speak or has already spoken at the conference. But if you're watching this video on YouTube, there's something behind Chris, there's this "O11Y", I can't even say it out loud. But Chris, why don't you tell us a little bit about what that acronym stands for?

Chris Riley:

You know, I shouldn't be embarrassed about this. I plaster it all over the place. I actually did a Twitch stream earlier this week, and I got the same question: what are you doing? What is this thing? So first of all, I'd say our industry weaponizes terminology. I am not gonna, you know, debate that. It happens. The way I embrace new terminology is this: does the term represent a problem space? And is that problem space unique enough that when you're having a conversation with somebody, you need to label it, right? So a lot of these terms fizzle away because they just become second nature. Like, we don't really think about DevOps so much. That's kind of just the standard practice. So "O11Y", if you're really cool, you call it Ollie. Ollie stands for observability. And this particular awesome graphic is related to a little game that we built, the Quest for Observability. But "O11Y", if you've clued into this standard, you take a long word, take the first letter, the last letter, and then you count the characters in between. There's your industry term. And I've told people that if you ever catch me in public, where it's not solicited like it is now because you're asking me, saying Ollie, or typing it out or anything, call me out. And the reason why is because it alienates the people who can benefit from modern observability practices. Because they won't even enter the conversation after they see that. But they stand to benefit from it. So I don't like that it makes a members-only club. Where it has become very useful in my day-to-day activities is with those people who are very familiar with the term. It creates an amount of efficiency, I'll say, in conversation, because you know what, Ravi, I always miss the V. Every time I type observability, I always miss the V. So with O11Y or Ollie, I will never miss a V.
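
The naming rule Chris describes (first letter, last letter, and a count of the characters in between) can be sketched in a few lines of Python. The function name `numeronym` is an assumption made for illustration; it is not something from the episode.

```python
def numeronym(word: str) -> str:
    """Shorten a word to first letter + count of inner letters + last letter."""
    # Words of three letters or fewer gain nothing from shortening.
    if len(word) <= 3:
        return word
    # len(word) - 2 is the number of characters between the first and last.
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("observability"))         # o11y
print(numeronym("internationalization"))  # i18n
print(numeronym("kubernetes"))            # k8s
```

The sketch keeps whatever letter case it is given, so `numeronym("OBSERVABILITY")` produces the capitalized form shown behind Chris in the video.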

Ravi Lachhman:

Now I know how they create those acronyms. I've been so bad with those particular acronyms. But there's some for, like, internationalization or Kubernetes. It's like I have to pray that Google understands what I'm searching for. And it's so funny because I constantly misspell observability. It's not even in my Chrome spellcheck. It's like, oh, I hope I spelled that right the first time when I right-clicked "add to dictionary". But yeah, a little bit about observability. Why don't you give us a little bit of a crash course? In your words, Chris, like, what on earth is observability? So I'll take a step back. If you were to Google observability, it would be coming from a manufacturing term. So like the Wikipedia definition of it, it's looking at the output of a system to determine how well the system performed. For example, if you look at a car after the car has been manufactured, well, if there's a lot of panel gaps, and you're missing a door, you can infer that the manufacturing process is not good. But not putting words in Chris's mouth here. Chris, will you tell us a little bit of the journey and story, for those who don't know what it means, and what is observability?

Chris Riley:

Yeah, so I think your definition is better than mine. It comes from manufacturing, it comes from networking. The more technical implementations of this come from the networking world, prior to how we use it today, which is largely associated with cloud native development and cloud applications. It doesn't need to be, but that's where the association has come from. And so I think a lot of people have slightly differing definitions that they are extremely passionate about. And hopefully you all can come and join the panel, because those might come out. So my perspective on observability is that it's very similar to DevOps, very similar to DevSecOps, very similar to SRE, Site Reliability Engineering, in that the strategy component is absolutely critical. So understanding that the product of your software delivery is representative of the process that built it is kind of the foundation. But how you get there, mostly in terms of the data that you look at to get there, is where some of the debate comes from. And the reason that matters is if you assume that observability is this new set of data, traces and spans, so this is a new approach to getting data out of your application, surfacing issues and doing discovery, then you have limited the practice of observability to a handful. I mean, it's not a handful, but a limited set of companies who can do it, which is largely cloud native companies running in Kubernetes and microservices. Is it fair to say that those are the only ones that can benefit from a practice of observability? I would say no, it isn't fair, because traditional monitoring had a completely different approach. You didn't look at the output, but you focused on your immutable infrastructure, that one server you gave a name. And if anything went wrong with that server, you went to the log directly, or you looked at events, and you got a flood of information.
And you were kind of able to figure out what was going on. In any distributed system, and it doesn't have to be microservices, in any distributed complex system, you're wasting a lot of time if you do that. If you go immediately to the log as your source of information, you could be on the wrong path. And burning cycles when something is down is bad news, especially if you have an e-commerce application or so forth. So in the world of observability, we have this expectation that the system tells us what's wrong and gives us the context we need to troubleshoot it. So we look at observability as a collection of practices, which includes incident response, APM, infrastructure monitoring, real user monitoring, synthetics, whatever you want to call it, basically monitoring at every layer of your tech stack. And that is relevant for everybody. And what I like about using the terminology, again: does it address a problem space? Yes. The problem space is that supporting distributed systems is complex. Is that problem space big enough that it's worthwhile to have a label? Yes. Because when we talk about traditional monitoring, we're usually talking about logs. And observability goes beyond the log. And so that's where I kind of embraced the term as a modern form of monitoring.

Ravi Lachhman:

Yeah, that makes perfect sense. Just going back to my engineering days, we had built a lot of distributed systems, and a big challenge would be following the user journey, because a particular user can traverse a dozen endpoints over a dozen different pieces of infrastructure because of how we built it to be robust. You know, so we have multiple nodes for each endpoint, in case there's a failure or for scaling reasons, and just tracing that particular user was almost impossible. I know there was some early tracing stuff we looked at, like Jaeger. And that's not Jägermeister, not to be confused. When I first heard the term, I thought about the drink. I was like, maybe that will help us. But learning more about it, it's not the drink. As systems grow more complex, the firepower just to even observe them becomes significant. And also just under that firefight situation, right? If you had infinite time, of course, you can figure something out. You can kind of look at a GUID, kind of log into like 30 different boxes and see. Or in this day and age, with Kubernetes, potentially the node could go away, or the pod can go away. So you're SOL at some point. So it's really about being able to respond really quickly, right? I'm just digging into some other points here. Can we talk about maybe some modern challenges of observability? So let's say I'm a brand new engineer, and I just, you know, played the Quest for Observability. I'm like, yes, I need the observability. Like, where can I start? Like, how? How can I observe, Chris?

Chris Riley:

Yeah, this is a great question. And I think this is kind of why I gravitated so much to supporting the tech-enabled enterprise. I use the term tech-enabled enterprise versus tech company very deliberately, because everybody's building software. But the enterprise is not used to building software. Like, they're not Facebook, they're not Google, they will not be Facebook or Google. So the set of challenges they are trying to solve, and the technology they're trying to embrace to do that, is completely different. But I think one thing is true for everybody, where I see a colossal mistake, and it's rather boring, which is data: GDI, getting data in. So most companies undervalue how important it is to get your ingest strategy correct. And there's a lot of new approaches to this. One that we really like, and I love, is OpenTelemetry, which is an open source project for ingesting data into your management plane or your monitoring tools. And what I like about it is that it unshackles your infrastructure, because what happens to a lot of enterprises is that agents, proprietary agents, or auto-instrumentation or whatever it is, can actually be the determining factor in the monitoring tool or the observability tool that you decide to use. The other aspect, besides the instrumentation side of it, is just the quality of data. If you're mixing metaphors in terms of how one service reports something and how another service reports something and how you get the data, where one gives you logs and the other one is webhooks, and you don't take that into consideration when you build your dashboards, which is what people tend to only think about, their dashboards, then it can be the source of a lot of problems. So that's the first thing, all the way on the left: how do we make sure that we get the right data in? And then if you think about the dashboarding process, think about the outcome: how are you going to use the data?
Everybody wants, not everybody, but a lot of people want the vendors to come to them and say, hey, if you just pick these metrics, you're good, you're solid. I mean, we have golden signals, we have RED, etc. They're all useful to start the conversation and get you started. But they may not be the metrics that are best for your organization. And certainly in a microservices environment, they're not the end-all for the wide variation of stacks you're dealing with across all of your microservices. So RED is great from a global perspective. But at the service level, it might be extremely limiting and won't help the service owners as well. So determining what you're going to do with the metric, instead of just throwing some metrics out there, is also something that I see organizations neglect. And then finally, the complete right side, which you mentioned: incident response. Incident response is a strategy. It's related to, but different than, incident management. Incident management is a system of record. Incident response is mobilization and context, and it's a very short window. In incident response, you have to have a strategy, you have to have an on-call strategy. You have to get away from spray and pray, which a lot of organizations do: blast it to everybody and see what happens. Or what we call lazy mobilization, which is picking out the one person who always fixes everything, and let's burn them out as quickly as possible. One thing you'll notice is that all of these things I just described are not technology things. These are all strategies and considerations you have to make before you implement the technology. And that's where I think a lot of organizations fall short. And they hit this kind of hype cycle where they adopt new tech, it looks like it's doing great things, they're benefiting from the dashboards. And then they realize that it's not the information that they needed, or they're not using it correctly.
And they have to resolve that, and the window between that and being truly effective can be really long. And that's where the danger zone, I think, is for a lot of organizations.
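
Chris's point that RED reads well from a global view but can be limiting for a service owner can be illustrated with a small sketch. RED is commonly expanded as Rate, Errors, Duration. Everything named below is an assumption for illustration and not from the episode or any particular monitoring tool: the `Request` record, the `red` and `red_by_service` functions, the sample data, and the choice of the median as the duration summary.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    service: str        # hypothetical service name
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def red(requests, window_s):
    """Rate (requests/second), Errors (fraction failed), Duration (median ms)."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "errors": 0.0, "duration_ms": 0.0}
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate": total / window_s,
        "errors": failed / total,
        "duration_ms": median(r.duration_ms for r in requests),
    }


def red_by_service(requests, window_s):
    """Compute the same three numbers separately for each service."""
    grouped = defaultdict(list)
    for r in requests:
        grouped[r.service].append(r)
    return {name: red(reqs, window_s) for name, reqs in grouped.items()}


# Made-up sample: four requests observed in a 60-second window.
sample = [
    Request("checkout", 120.0, True),
    Request("checkout", 480.0, False),
    Request("search", 35.0, True),
    Request("search", 40.0, True),
]

overall = red(sample, window_s=60)
per_service = red_by_service(sample, window_s=60)

print(overall["errors"])                  # 0.25
print(per_service["checkout"]["errors"])  # 0.5
print(per_service["search"]["errors"])    # 0.0
```

In this made-up data the global error fraction looks moderate, while the per-service view shows that every failure belongs to one service. Even then, the three numbers say nothing about why that service is failing, which is the gap Chris describes for service owners.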

Ravi Lachhman:

Excellent points. I mean, so for some of the listeners, it really is like a journey. There's a whole science behind what you can infer. So I'll give a little bit of an example. I'll do my entire career in four sentences or less. So from a software engineering perspective, if you think about what you're logging, it's either implicit or explicit. For me as a software engineer, it was always explicit: I had to put log statements in my code. So I controlled what was going into the log, and then I would assume, okay, this is running somewhere else, so whatever black-box system is running it, they have some sort of way of tracking if something totally wonky happens, which is outside of our white-box control. But as time goes on, right, like, I moved on, I changed jobs or changed projects. To Chris's point, your typical organization is very heterogeneous. They have thousands of these applications that potentially no one owns; there's no developer anymore. And so how do you come up with an approach to kind of cover all of that? And that's very challenging. There's no lowest common denominator yet. You have things that are running in COBOL, you have things that are running in Go, and you have things that you don't know where they are, but you've got to bill for them once in a while. So, going back to the metrics you were talking about, like RED metrics and Golden Signals and whatnot, those are attempts, and I think they're good attempts, at a lowest common denominator: this is what you should be looking for. But there's a lot. I mean, I spent the last year or two focusing on it. My mind was blown by how much science goes on behind it. Yeah, it was fun. Changing gears a little bit, Chris. Chris is a Developer Advocate. One of the hard skills you mentioned is a company adopting new technology. How do you advocate for things?
Like, just if you're an engineer, and you want to bring in a new technology, or you're passionate about something, like, how do you even start advocating for change or advocating for something new?

Chris Riley:

That's a great question. I mean, advocacy, it's a fun role. It's a challenging role. You know, usually the best advocates come from a technical position, historically. Sometimes you hear the term evangelism. I think that they're somewhat synonymous. I prefer advocacy. It seems a little less intrusive, because this is not intrusive. It's more a process of stewardship, which is going to get to the strategy that I'm going to imply. Sometimes you hear Developer Relations, and there's a lot of opinions about how the two work together. I'll say Developer Relations tends to be something kind of radically different, more like building a game like this. So advocacy, how do you advocate? Well, first of all, you have to agree that everybody is selling, advocating, always, even in your role, it doesn't matter. And it usually comes down to: I want to use a library, or I want to buy this tool, or I want to convince my peers to use the same automation that I'm using, or, you know, to even care about the automation. So usually all the moments where you're, like, hitting your head against the desk, like, why don't they listen to me, are the moments when you should be building rapport and relationships. You know, you can go and find all the cheesy stuff out there on how to do this, but there's a lot of empathy that's needed, and a lot of stewardship. So I think the days of going to somebody and saying "do this because it's the right thing" are totally gone. We're all too busy for that. I don't care who you are, it doesn't matter how right the person talking to you is, if they try to shove it down your throat, you're not going to listen. So thinking in terms of stewardship is the approach that I like to take. And you have to believe in what you present. And that's one thing that people don't understand about advocates, and you've probably run into this: I'm not promoting anything I don't already believe in.
Because if I did, if you're not being genuine, that comes across. Eventually, people are aware of that. So you have to be genuine, you have to think about stewardship. The best trick is give-get, right? You want them to get something from you, which is generally a concept, an idea, some sort of a decision. Give them something. And by give, it can be really small. So help them be more efficient. If I'm trying to convince somebody that pipeline analytics, for example, matters, I'll give them a metric that I know they care about, even if I don't care about it. So you need to facilitate, and also tie what you're working on to their objectives. Again, it can feel really annoying. Once I figured out this trick, I was annoyed by how well it worked, focusing on their goals, not mine. But it does. And, you know, if you're very utilitarian like me, the outcome is what matters. So I think everybody's an advocate. I would encourage you, if you're in a technical role and you enjoy this process and this journey of working with people, to consider advocacy. But everybody is advocating, to the point, Ravi, that I've seen companies have internal advocates as a part of their DevOps service organization. And they often do things like run dojos, internal dojos, and just continually spread and steward best practices. And as a result, that's kind of the function of SREs these days. They're no longer on call for their code. They're stewarding the best practices of supporting applications and services.

Ravi Lachhman:

Yeah, absolutely. I think it boils down to a lot of what you see in expertise roles. So, like, SRE and AppSec engineers, people who have to take on the stewardship role. They're stewards of their domain knowledge. So they have to make sure that they disseminate that across the organization, because unfortunately, there's not an SRE in every sprint team. There's not an AppSec engineer in every, you know, two-pizza team, from whenever Bezos had that conversation with someone at Amazon. But I eat a lot of pizza, so my team would be three people, because that's what me and two other people can eat.

Chris Riley:

Well, you decide, is it a smaller team or more pizza?

Ravi Lachhman:

That could be another podcast for another time. And that's absolutely right. Chris hit the nail on the head. This is a very timely podcast: if you go on LinkedIn or Indeed, there's like a huge rush of firms that are not software companies hiring for internal advocates, right? And a lot of times I'll put myself in the shoes of, let's say, a staff engineer. You know, if you're Mr. or Mrs. Staff Engineer, or looking to be a principal or chief, there's a huge part of advocacy in your job, because you're supposed to be a change agent for the company. Well, how do you do that? Like Chris said, the days of, like, really dictating "this is the stack, this is what we'll use", those days are dying, right? Organizations are being much more accepting of, hey, you know what, let's be pragmatic, because people come and go, right? Like, no one's gonna be at the same firm for 40 years. So your legacy is during the time you're there. But hey, very important skill sets to have. So kind of coming into, you know, the homestretch here for the podcast, I kind of like to ask an intrinsic question to every one of the podcast guests. So I'll go ahead and ask Chris this. So Chris, imagine you're just fresh out of university, and you're walking down the street, and current Chris was able to time travel to the day you graduated university. Then you ran into yourself with your university cap on. What would be some advice, any advice, that you would tell young Chris entering the real world?

Chris Riley:

So I think I have one. But I have to say that I think I have a lot to learn. Chris today has a lot to learn from that Chris also. And I think those lessons are around grit and determination and passion. Not that I'm not passionate about what I do, but certainly Chris in the college years was much more passionate. So you underestimate how much can be done in a long period of time, and you overestimate how much can be done in a short period of time. So college Chris: extremely impatient. And if I have any of my peers listening to me now, or coworkers, or people who know me, they'd say, "You're extremely impatient today, Chris." So yes, that's true. I was even more impatient then, and that impatience leads to a lot of angst and frustration, which can come out in ways that are not productive at all. And so patience is something that I'm still learning. And playing the long game is something that I'm also still learning, because all I cared about was the short game back then. And so I think there's a lot to learn both ways. And you'll notice I didn't say the word maturity, and I didn't say the word experience, because both of those terms kind of bug me. It's like, "be a grown-up" kind of stuff. I don't believe in that. I think 23-year-old Chris actually had an amount of grit that I wish I still had today. But I also think that he could have chilled out a little bit. Chill, Chris. I think that would have helped him be more effective in accomplishing his goals.

Ravi Lachhman:

Awesome. Yeah. Very, very stellar advice. We all can benefit from chilling a little bit. Chill once in a while. But hey, Chris, thank you so much for being on the podcast. I'm really looking forward to your session at {unscripted}. Or I enjoyed your session at {unscripted}, depending on when someone's listening.

Chris Riley:

Did I do well? That's if we're time traveling right now. How did it go? Was it a bloodbath? Or was it, like, super mellow?

Ravi Lachhman:

We'll find out. There's certainly some personalities on the panel session that Chris is on. Be sure to catch it. But Chris, thank you so much for coming on the podcast. I always enjoy having you on. And I'll catch everybody next time.

Chris Riley:

Thank you. And yeah, make sure you attend the panel and the entire event. It's gonna be awesome. Harness does a great job with virtual events. And there's a lot out there, I understand that. So you've got to find the ones that grab your attention.
