Top 7 Site Reliability Engineer (SRE) Interview Questions


Site Reliability Engineer (SRE) positions are open – in the thousands. A recent search of Indeed found 9,475 SRE jobs open in the United States alone. Hiring is solid for this role as organizations across all industries seek to boost the performance and reliability of their systems, whether it’s customer-centric services or mission-critical internal applications. If you have the right mix of skills, this is a high-ceiling opportunity, just like its close cousin, the role of DevOps Engineer.

That said, SRE interviews can be more difficult to prepare for than some other IT jobs. It is still a new area and role for many companies, although it has its roots in traditional IT operations as well as DevOps. This is also a role where soft skills are just as important as technical IQ. Computer prowess is only part of the job.

What is an SRE?

Here’s how Eveline Oehrlich, Research Director at the DevOps Institute, defines the role of the SRE team: “Site Reliability Engineering (SRE) is Google’s approach to service management, presented in a book of the same name. It is a set of post-production practices for operating large-scale systems, with an engineering focus on operations.”

Oehrlich continues: “[SRE team members are] software engineers who are intended to perform operations functions instead of a dedicated operations team. Reliability for production systems, and therefore for their users, is supported by an engineer who applies SRE principles to manage availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. They can also function as support engineers, taking advantage of monitoring, capacity, and optimization automation tools. They focus on the non-functional requirements of availability, performance, security, and maintainability.” (Read Oehrlich’s full article: DevOps vs. ITIL 4 vs. SRE: Stop the Arguments.)

How to Prepare for a Site Reliability Engineer (SRE) Interview

“When we move beyond technical skills and experience, the SRE’s role really boils down to helping others assess the trade-offs and pressures on them to deliver quickly and safely,” says Kit Merker, COO at Nobl9. “There is pressure on one side of the organization to deliver brilliant new functionality, and on the other to ensure that we are secure, operational, and stable. This conflict exists in every organization, from two people in a garage to a large engineering organization the size of Facebook or Netflix.”

If you roll your eyes when career discussions turn to people skills or the wide range of “soft” skills, the SRE field is probably not the best fit for you. These skills can be the most difficult part of the job in some organizations, especially those with deep-rooted processes and culture.

“The rise of site reliability engineering shows the importance of the impact of technology on our daily lives,” says Ravi Lachhman, Evangelist at Harness. “Like DevOps, SRE is more than a skill set; organizations must enable and foster SRE cultures and practices.”

The romantic notion that SREs are cape-clad superheroes who rush in and save the day when systems break down is mainly just that: a romantic notion. It does happen, Lachhman says, but their real goal is to make sure that outages don’t happen in the first place; SREs are obsessed with the science of availability and measurement.

“SREs are viewed as experts and help drive practices, architectures and general recommendations on system robustness and reliability across the organization,” explains Lachhman.

Since the role is still new in many organizations, this status cannot be assumed. There is a certain amount of evangelism involved in developing this trusted expert status, and that means working closely with individuals and teams across the organization. It is as much a social as a technical role.

“The best candidates are those who have a compelling story that SRE is about socio-technical systems, not just computer systems,” Merker said. “Humans are the most important part of any system – not the code or the services.”

7 site reliability engineer (SRE) interview questions

Keep this human aspect in mind when searching for your next (or first) SRE job; likewise, keep it in mind when hiring SREs. It will inform at least some of the questions you will answer (or ask) in an interview. Below, we’ll break down seven sample questions you can use to prepare for either side of the interview.

Question 1: How do you decide whether the team should work on new features or pay off technical debt?

SREs play a growing role in negotiating the tension between creating new functionality and reducing technical debt: most organizations cannot do both simultaneously week after week. While this issue may be rooted in technical decisions, it relates to the “socio-technical” nature of SRE.

This is one of Merker’s favorite questions, and he leaves it deliberately open – he wants to hear the candidate dig for more data and context.

“If they have strict rules, I’m less than impressed with their response,” Merker says. “What I’m looking for is curiosity about the customer and the business, an understanding of various roles in the business, and a desire to get data (if possible) to back up different points of view.”

For SRE candidates, this topic is an opportunity to show how you approach seemingly insurmountable conflicts. Everyone thinks their goal or problem is the most important. How do you actually set priorities that people can (most of the time) agree on and work toward? When is technical debt acceptable (or inevitable)? How do you repay it?

“A big part of SRE is mediating between these different interests and finding practical, actionable answers to somewhat impossible questions,” Merker said. “There is no exact answer; it’s the discovery process to find what really matters that makes me want to say STRONG COMMITMENT!”

Question 2: How do you go about defining SLOs and SLIs and how do you make the necessary adjustments?

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are fundamental metrics for SREs. SLOs are the reliability targets set for a particular application; SLIs are the actual measurements of performance against those targets.
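
As a rough illustration of that distinction, the sketch below computes an availability SLI from request counts and checks it against an SLO target. The function names, the 99.9% target, and the sample counts are assumptions made for this example; they do not come from the interviewees, and real SLOs are usually negotiated per service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical numbers for one measurement window.
SLO_TARGET = 0.999  # assumed objective: 99.9% of requests succeed
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, SLO_TARGET)}")
```

Keeping the measurement (the SLI) separate from the target (the SLO) makes it easier to renegotiate the target later without touching how the indicator is collected.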

Lachhman notes that the SRE function is often central to defining and refining SLOs and SLIs. Developers often don’t know the standard or benchmark for the applications they create and maintain, especially if SRE is a relatively new dimension of the larger team.

Hiring managers should probe how the candidate identifies and defines SLOs and SLIs; if you are the candidate, be prepared to talk about how you approach these metrics. Additionally, make sure you can describe a thoughtful process for re-evaluating and tuning these metrics over time.

“Like any measure, they have to evolve,” says Lachhman. “Negotiating changes to SLO/SLI metrics is par for the course.”

Question 3: Which of the three pillars of observability is most important to you? Where do you think you need more visibility?

The three pillars here are logging, metrics, and tracing. Observability as a whole is intrinsic to the SRE domain.

“The science of measuring a system is at the heart of why SREs are hired,” says Lachhman, highlighting the “four golden signals” of site reliability engineering as a basis for thinking about this question.

“Which pillar would help you determine these [signals] the best?” asks Lachhman. “These will eventually lead to your SLO/SLI metrics. Showing interest in one or more of the pillars shows that you are ready to evolve in your role.”
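
The four golden signals are latency, traffic, errors, and saturation. A minimal sketch of how they might be derived from raw request records follows. The `Request` record, the choice of 5xx responses as errors, the 95th-percentile cutoff, and the use of CPU utilization as the saturation measure are all assumptions for illustration; a real system would take these from its metrics or tracing backend.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,  # assumed to be measured elsewhere
    }


# Hypothetical one-minute window: 100 requests, two of them failing.
sample = [
    Request(latency_ms=float(i), status=500 if i % 50 == 0 else 200)
    for i in range(1, 101)
]
signals = golden_signals(sample, window_seconds=60, cpu_utilization=0.7)
print(signals)
```

Here the error rate and latency figures could serve directly as SLIs, which is the link between the signals and the SLO/SLI metrics that Lachhman describes.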

Typically, measurement is essential in any SRE role, so keep this in mind if you’re looking to pivot into this role from another IT field – it’s a data-driven discipline.

Question 4: How have you implemented process improvements and other changes in the past?

True, the “E” in SRE stands for engineering, and SREs have technical skills. But this role requires more people skills and change-agent capabilities than some other IT roles.

“Although the SRE role is an engineering role, it is atypical of what you think of as an engineering role,” says Oehrlich of the DevOps Institute. “While in some organizations monitoring practices, on-call procedures, and other standard processes are already well established, an SRE should reflect on and challenge existing working methods. It takes creativity and tenacity.”

Plenty of job descriptions pay lip service to creativity and tenacity. In SRE, however, these are genuinely critical traits, especially when dealing with egos, cultural resistance to change, and other challenges.

“As a hiring manager, I would ask for examples where the person has demonstrated such qualities, how they go about it and what has been achieved,” says Oehrlich.

Let’s take a look at three more questions you should expect:

