DBS is a leading financial services group in Asia, with over 280 branches across 18 markets. Headquartered and listed in Singapore, DBS has a growing presence in the three key Asian axes of growth: Greater China, Southeast Asia and South Asia. The bank's capital position and its "AA-" and "Aa1" credit ratings are among the highest in Asia-Pacific. DBS has been recognised for its leadership in the region, having been named “Asia’s Best Bank” by The Banker, a member of the Financial Times group, and “Best Bank in Asia-Pacific” by Global Finance. The bank has also been named “Safest Bank in Asia” by Global Finance for seven consecutive years from 2009 to 2015.
Business Function
Group Technology and Operations (T&O) enables and empowers the bank with an efficient, nimble and resilient infrastructure through a strategic focus on productivity, quality & control, technology, people capability and innovation. In Group T&O, we manage the majority of the Bank's operational processes and aspire to delight our business partners through our multiple banking delivery channels.
Job Objective
DBS Bank is looking for a Platform SRE Engineer with experience working on enterprise-level data engineering, analytics, and observability applications. The SRE Engineer will be responsible for ensuring high availability of the platform services and for driving continuous improvements to increase the platform’s efficiency and resiliency. The SRE Engineer will also develop automation to remove toil and increase the team’s productivity.
Responsibilities
- Implement and administer Elastic Stack, Confluent Platform (Kafka), Prometheus, Grafana, NGINX.
- Configure Elasticsearch index templates and index lifecycle management (ILM) policies for data retention.
- Develop monitoring and alerting solutions using Elastic Watcher, Kibana, or Grafana.
- Perform application maintenance, patching, and version upgrades for the Elastic Stack, Confluent Platform, Grafana, Prometheus, NGINX, and other open-source APM tools.
- Proactively monitor the platform service availability and help resolve issues.
- Automate routine cluster management tasks and optimize processes using APIs and scripting, reducing manual effort and improving efficiency.
- Conduct performance tuning and capacity planning to ensure applications meet scalability and reliability requirements.
- Design and develop data engineering pipelines.
- Multi-task and prioritize in a fast-paced, team-oriented environment.
- Continuously review and enhance monitoring processes and methodologies to improve efficiency and effectiveness.
- Identify strategic/tactical solutions and provide risk assessments and recommendations.
- Collaborate with the Dev Leads to ensure that the dev team’s needs are met through the CI/CD framework, component monitoring and stats, incident escalation, etc.
- Develop code (Python, Shell scripting etc.) with quality, scalability, and extensibility.
- Excellent technical, analytical, time management, and organizational skills including technical documentation. Agile Scrum skills desired.
- Develop custom monitoring dashboards and reports to provide actionable insights and drive decision-making processes.
- Contribute to internal knowledge bases, create documentation, and share insights with the team to promote a culture of learning and collaboration.
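The ILM and index-template responsibilities above can be sketched in code. The following is a minimal illustrative example, not a prescribed configuration: it builds an ILM policy (hot-phase rollover, delete after 90 days) and an index template that references it. The policy name `logs-retention`, the `logs-*` pattern, and all threshold values are assumptions for illustration only.

```python
import json

# Hypothetical ILM policy: roll over the hot index when it grows large or
# ages out, and delete data after 90 days. All values are illustrative.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_primary_shard_size": "50gb",
                        "max_age": "7d",
                    }
                }
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}

# The policy would be installed with the Elasticsearch REST API:
#   PUT _ilm/policy/logs-retention      (body: ilm_policy)
# and referenced from an index template so new indices pick it up:
index_template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "index.lifecycle.name": "logs-retention",
            "index.lifecycle.rollover_alias": "logs",
        }
    },
}

print(json.dumps(ilm_policy, indent=2))
```

In practice the two `PUT` calls would be made via `curl` or a Python HTTP client against the cluster; the retention windows would be agreed with the data owners rather than hard-coded as above.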
Deliverables
- Ensure on-time delivery of tasks and projects.
- Ensure continuous uptime of applications and services.
- Ensure no security or audit issues.
Job Dimensions
- Comply with bank standards to track and follow up on the assigned projects.
- Cover all areas in application and infrastructure operations of the platform.
Requirements
- University graduate (computer science or related field) with good experience working with contemporary technologies and scripting languages.
- Strong communication skills and the ability to explain protocols and processes to the team and management.
- A passion for learning and using new technologies from open-source communities.
- A passion for coding.
- Min 10 years of total IT work experience.
- Working knowledge of Grafana, Prometheus, NGINX, and the Elastic Stack (Elasticsearch / Logstash / Kibana / Beats), including data ingestion, management, monitoring, and analytics. Able to perform L1/L2 ELK-related tasks.
- In-depth experience in Unix/Linux/Shell/Python scripting.
- Knowledgeable and experienced in SRE (Site Reliability Engineering) practices covering monitoring, observability, performance management, automation, and resiliency.
- Good understanding of network routing, load balancing, and networking protocols, including TCP/IP, HTTP, and DNS.
- Ability to contribute to discussions on design and strategy.
- Adequate knowledge of database systems (RDBMS such as MariaDB, SQL, and NoSQL), object-oriented programming, and web application development.
- Good problem diagnosis and creative problem-solving skills.
- Experience in Node.js and Spring Boot would be a plus.
- Experience in automation tools (e.g. Ansible) & DevOps pipelines would be a plus.
- Knowledge of and experience with observability stacks (AppDynamics, Dynatrace, other APM tools, and OpenTelemetry) is an added advantage.
- Experience in architecting a highly resilient Confluent Kafka infrastructure and in-depth knowledge of Kafka.
- Experience in developing CI/CD pipelines with toolsets such as Bitbucket, Jenkins, and JIRA.
- Strong hands-on experience with the Linux platform and containers, including configuring reverse proxies and SSL/TLS.
- Strong, committed, and reliable team player, able to take direction but also willing to contribute to discussions on design and strategy.
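On the automation and monitoring side of the role, the following is a minimal sketch of the kind of toil-reducing check an SRE might script: interpreting the output of Elasticsearch's `_cluster/health` API and deciding whether to raise an alert. The `should_alert` helper and its thresholds are hypothetical, not part of any DBS tooling.

```python
# Sketch of an automated health check: classify Elasticsearch
# _cluster/health output and decide whether to page the on-call.
# The alerting thresholds below are illustrative assumptions.
def should_alert(health: dict) -> bool:
    # A red cluster always warrants a page.
    if health.get("status") == "red":
        return True
    # A yellow cluster with unassigned shards suggests replicas
    # are not being placed and merits attention.
    if health.get("status") == "yellow" and health.get("unassigned_shards", 0) > 0:
        return True
    return False

# In production this dict would come from a live call, e.g.:
#   GET _cluster/health    (via curl or a Python HTTP client)
sample = {"status": "yellow", "unassigned_shards": 3, "number_of_nodes": 5}
print(should_alert(sample))  # → True
```

A check like this would typically run on a schedule (cron, or a Prometheus exporter) and feed the same alerting pipeline as Watcher or Grafana rules, so routine triage no longer needs a human in the loop.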