Lead / Head of SRE
Overview
We are seeking an experienced Lead / Head of Site Reliability Engineering (SRE) to establish, scale, and lead a global SRE organization for our greenfield platform. This is a unique opportunity to build SRE practices, culture, and tooling from the ground up while partnering closely with engineering, product, and security teams to ensure our systems are scalable, resilient, and secure.
This role is both strategic and hands-on: you will define and execute the SRE vision while remaining deeply involved in supporting infrastructure and applications in AWS.
Responsibilities:
Strategic Leadership & Team Building
Build, scale, and lead a global SRE organization across multiple time zones.
Hire, mentor, and develop top SRE talent, fostering a culture of operational excellence, collaboration, and continuous improvement.
Define and own the SRE vision, roadmap, and success metrics in alignment with company goals.
Operational Excellence & Process Design
Establish and document all SRE processes, runbooks, and playbooks from scratch for a greenfield environment.
Define and enforce SLAs, SLOs, and SLIs, ensuring measurable reliability and availability targets.
Build and implement incident management processes, including on-call rotations, escalation paths, and postmortem practices.
Champion a blameless culture and lead root cause analyses to drive systemic improvements.
Hands-On Technical Leadership
Lead application support efforts: monitor, troubleshoot, and resolve production issues in collaboration with engineering teams.
Contribute to the development of tooling, scripts, and automation to eliminate toil and streamline operations.
Build and maintain observability stacks (metrics, logging, tracing) and ensure actionable alerting.
Drive cost optimization, performance tuning, and capacity planning for infrastructure and applications.
Cross-Functional Collaboration
Partner with Product, Engineering, and Security teams to ensure resiliency is built into every stage of the development lifecycle.
Act as the primary advocate for reliability and operational efficiency within the organization.
Report on key reliability metrics and provide high-level insights into system health.
Qualifications:
7+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Operational Support, including at least 3 years in a leadership role.
Proven experience building SRE or Operational Support functions from scratch or scaling them.
Hands-on expertise with AWS services (EC2, ECS/EKS, Lambda, VPC, RDS, S3, IAM, CloudWatch, etc.) and cloud-native architectures.
Strong background in infrastructure-as-code (Terraform, CloudFormation), CI/CD tooling, and automation.
Proficiency in application support and development practices, including debugging, performance tuning, and collaborating with software engineers.
Deep understanding of reliability engineering principles, incident response, observability, and security best practices.
Strong coding/scripting skills in languages like Python, Go, or Bash.
Excellent leadership, communication, and stakeholder management skills.
Track record of defining SLAs/SLOs, improving MTTR, and driving automation initiatives.
Passionate about mentorship, process improvement, and building high-performing teams.
Centric Software provides equal employment opportunities to all qualified applicants without regard to race, sex, sexual orientation, gender identity, national origin, color, age, religion, protected veteran or disability status or genetic information.