Senior Site Reliability Engineer (SRE)/Team Lead

Be among the first applicants.

ASIA GULF CLOUD PTE. LTD.

Singapore

SGD 100,000 - 125,000

4 days ago

Job description

Position Summary:

We are looking for a skilled and driven Senior Site Reliability Engineer (SRE) / Team Lead to join our digital banking platform. In this leadership role, you will manage a team of 5–6 engineers and take end-to-end ownership of system reliability, scalability, and operational excellence. You’ll work closely with development, product, and security teams to ensure our platform is secure, stable, and high-performing, built for scale and resilience.

Key Responsibilities:

Lead, mentor, and grow a high-performing team of SREs to support critical infrastructure and services.
Define and enforce SLOs/SLIs/SLAs across services and ensure adherence to reliability targets.
Design, build, and maintain scalable, fault-tolerant, and secure cloud-native infrastructure (AWS/GCP/Azure).
Drive incident management processes: lead root cause analysis, post-mortems, and continuous improvement.
Own observability: implement and optimize monitoring, alerting, and logging systems (e.g., Prometheus, Grafana, ELK, Datadog).
Automate operational tasks and infrastructure provisioning with tools like Terraform, Ansible, Helm, or custom scripts.
Collaborate with development teams to embed reliability and performance practices early in the SDLC (DevOps/DevSecOps mindset).
Partner with security and compliance to ensure infrastructure aligns with banking regulations and audit requirements.
Establish best practices for CI/CD pipelines and release management.
Foster a culture of accountability, collaboration, and continuous improvement within the team.

Must-Have Qualifications:

Bachelor’s degree or above in Computer Science, Engineering, or a related field.
7+ years of SRE, DevOps, or Infrastructure Engineering experience, with 2–3+ years in a team leadership role.
Strong expertise with container orchestration (Kubernetes), infrastructure-as-code (Terraform), and cloud platforms (AWS/GCP/Azure).
Deep understanding of Linux systems, networking, and distributed system design.
Proven experience with incident response, service reliability metrics, and system hardening.
Experience managing production systems with 24/7 availability and scaling in high-demand environments.
Excellent problem-solving skills and strong communication across technical and business teams.
Familiarity with agile development, CI/CD, and modern GitOps practices.

Nice to Have:

Prior experience in the fintech or banking sector.
Knowledge of security practices, especially for regulated environments.
Hands-on experience with Kafka, Redis, PostgreSQL, or other common distributed systems.
Familiarity with service meshes (e.g., Istio, Linkerd) and API gateways.
Certifications in Kubernetes, AWS, or related DevOps technologies.