Infrastructure SRE Team Lead

Lesaka Technologies
Cape Town
ZAR 600 000 - 1 000 000
2 days ago
Job description

A vacancy exists for an Infrastructure SRE Team Lead within the Micro Merchant Division – Kazang, based in Century City, Cape Town.

This role is ideal for a seasoned Infrastructure SRE professional looking to take on a leadership position and drive innovation within a dynamic team.

We are seeking an experienced Infrastructure Site Reliability Engineer (SRE) Team Lead with deep expertise in Linux-based, open-source environments to lead a team ensuring the reliability, scalability, and performance of our critical systems. This role involves technical leadership, strategic planning, and hands-on implementation of automated solutions for system monitoring, optimization, and infrastructure management. You will collaborate with the DevOps and engineering teams, guiding best practices in CI/CD, observability, and infrastructure automation, while mentoring a team to enhance system resilience and operational efficiency.

Key Responsibilities include, but are not limited to:

  • Lead and mentor a team, fostering a culture of reliability, automation, and continuous improvement.
  • Provide technical guidance and career development support for team members.
  • Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
  • Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
  • Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
  • Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
  • Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
  • Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
  • Perform regular system maintenance, including updates, patches, and optimizations.
  • Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
  • Identify and mitigate risks to system reliability, scalability, and security.
  • Ensure compliance with organizational and regulatory standards in system design and operations.
  • Manage on-call rotations and incident response protocols.

Minimum Requirements:

  • Bachelor of Science degree or a related tertiary qualification.
  • A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and cloud infrastructure (AWS), and a desire to progress into a leadership capacity.
  • Proven ability to mentor and develop team members.

Competencies required:

  • Excellent leadership and communication skills.
  • Strategic thinker with a proactive and results-oriented approach.
  • Ability to build and maintain strong cross-functional relationships.
  • High attention to detail and ability to enforce best practices.
  • Passion for technology and continuous learning.
  • Strong problem-solving and analytical skills.
  • Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
  • Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
  • Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
  • Skilled in implementing monitoring solutions and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
  • Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.
  • A proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
  • Ability to work effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
  • Ability to embrace change, learn new technologies quickly, and adjust strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.