Site Reliability Engineer (SRE)

Dice

Dubai

AED 200,000 - 300,000

Job description

Mandatory Skills: Kubernetes, Java API, Cloud Services, DevOps Tools
Optional Skills: AWS, Agile Scrum, API Gateway

Client Overview: Our client, a leading provider of digital Global System for Mobile Communications/General Packet Radio Service (GSM/GPRS) wireless voice and data technology standards, is looking for dynamic and driven professionals to join a rapidly growing high-performance team.

Position: Site Reliability Engineer, ACE Platform Engineering

This role will support critical API Platform, DevOps, and other activities for the Digital Services Group. Responsibilities include:

Provide consulting services for improved system stability, availability, performance, and reliability.
Assist in determining the impact of operational issues and provide input into their resolution via data extraction and quantification.
Work through day-to-day support issues, ensuring their effective and timely resolution in the production environment, and troubleshoot customer-impacting problems.
Forecast and plan for a rapidly growing environment.
Support multiple applications, specifically those running on Solo Gloo, Kubernetes, PCF, Google Cloud Platform, and Java-based systems in an enterprise environment.
Support Gloo running on Kubernetes, as well as Grafana, Prometheus, Cassandra, Postgres, and Spring Boot or other Java-based applications running on PCF and WebLogic.
Apply monitoring and create complex alerts and dashboards for production systems.
Provide capacity analysis and tuning analysis for cloud applications on Linux and container platforms.
Be available to provide 24x7 on-call support on a rotating basis with other team members.
Lead efforts in troubleshooting, recovery, and root cause investigation.
Perform analysis of user requirements and problems to automate or improve systems and review system capabilities, workflow, and scheduling limitations.
Follow and develop detailed work plans, schedules, project estimates, resource plans, and status reports.
Facilitate Disaster Recovery (DR) exercises to ensure that the team is fully prepared for any event.
Lead root cause analysis sessions to understand what causes issues in Production and propose solutions to prevent recurrence.
Ensure documentation is created and remains updated for any related work.
Evaluate product and service solutions.

Skill Requirements:

Strong hands-on experience in Kubernetes, infrastructure, and support.
Strong experience in DevOps practices for microservices using Kubernetes as the orchestrator.
Strong experience with Cloud configurations and services.
Strong experience in API microservices.
Experience with tools such as NGINX, Docker, Postman, SoapUI, ELK, Splunk, AppDynamics, CI/CD tools, and GitLab.
Good experience in performance measurement and tuning, capacity planning and management, contingency planning, and disaster recovery.
Strong scripting knowledge and experience.
Good understanding of networking and routing.