Staff, Site Reliability Engineer
Walmart
This job is no longer accepting applications
Position Summary...
What you'll do...
Are you passionate about improving customer experience for millions of customers of Walmart and its subsidiaries? As a Staff Site Reliability Engineer in the Customer Engagement Services (CES) Tech Org, you’ll lead efforts to ensure our customer service platforms are resilient, scalable, and lightning-fast. You’ll architect reliability frameworks, drive automation across incident response and observability, and collaborate with engineering and product teams to embed SRE principles into every layer of the stack. This role offers the excitement of solving real-world challenges at massive scale, where every improvement directly enhances customer satisfaction and operational excellence. If you're energized by building systems that empower associates and delight customers, this is your opportunity to lead with purpose.
About Team: Customer Care Technology
The CES team builds best-in-class customer service experiences for hundreds of millions of Walmart customers and customer service agents globally. We are a group of software engineers, data scientists, and machine learning experts pushing the boundaries of GenAI technology in complex enterprise applications. The CES Technology team is part of the Enterprise Business Systems organization in Walmart Global Tech. We partner with our product, business and UX teams to drive significant measurable business impact. Our mission is to help customers save money and live better.
What you'll do:
- Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
- Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
- Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
- Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
- Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
- Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
- Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
- Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
What you'll bring:
- 8+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
- A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
- Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
- Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
- Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
- Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
- A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.
For information about PTO, see https://one.walmart.com/notices.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.
For information about benefits and eligibility, see One.Walmart.
The annual salary range for this position is $110,000.00-$220,000.00
Additional compensation includes annual or quarterly performance bonuses.
Additional compensation for certain positions may also include:
- Stock
Minimum Qualifications...
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Option 2: 6 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Preferred Qualifications...
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
- Experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- Master's degree in site reliability engineering, site and system administration, infrastructure management, or related area and 2 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- SRE certification (for example, IBM Cloud Site Reliability Engineer).
- We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.
Primary Location...
2501 Se J St, Ste A, Bentonville, AR 72716-3724, United States of America