COMPANY OVERVIEW
A well-established client of ours in Kuala Lumpur is seeking a Site Reliability Engineering Lead.
JOB RESPONSIBILITIES
Team Leadership: Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement. Define clear goals, performance metrics, and development plans for the team.
System Reliability & Performance: Design and implement strategies to improve system reliability, scalability, and performance. Conduct root cause analysis of production incidents and develop preventive solutions.
Infrastructure Management: Oversee the deployment, monitoring, and management of production environments. Collaborate with development teams to design cloud-native infrastructure and architecture.
Automation & CI/CD: Drive automation of operational processes, reducing manual intervention and response times. Optimize CI/CD pipelines to ensure smooth and rapid deployments.
Incident Management: Establish incident response protocols and lead efforts during major incidents. Ensure robust monitoring and alerting systems are in place to proactively detect issues.
Collaboration & Communication: Act as a liaison between engineering, operations, and other teams to align objectives. Share insights and best practices with internal stakeholders to enhance overall system resilience.
JOB REQUIREMENTS
Technical Expertise: Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.). Proficiency in programming/scripting languages (Python, Go, Shell, etc.). Deep knowledge of Kubernetes, containerization, and distributed systems.
Leadership Skills: Proven track record of leading SRE or DevOps teams and managing large-scale production environments. Strong decision-making, prioritization, and problem-solving capabilities.
Monitoring & Metrics: Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems. Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
Soft Skills: Excellent communication and collaboration skills for working with cross-functional teams. Ability to mentor and upskill team members, fostering a learning-oriented culture.
Experience: At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.