Site Reliability Engineer

PT. TEMAS TBK

Daerah Khusus Ibukota Jakarta

IDR 300,000,000 - 400,000,000

Job description

A Site Reliability Engineer (SRE) is responsible for maintaining the reliability and performance of computer systems within an organization. The role bridges the gap between development and IT operations by applying software engineering practices to operational tasks traditionally handled by operations teams.

Responsibilities:

Design and implement highly available and scalable systems to ensure the reliability and performance of the company’s website or application.
Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
Monitor systems and applications, proactively identifying and resolving performance bottlenecks or availability issues.
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
Conduct post-incident analyses to identify root causes and implement measures that prevent recurrence.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.

Required Skills:

Strong knowledge of Linux/Unix systems and command line tools.
Proficiency in scripting languages such as Python, Shell, or Perl.
Experience with configuration management tools like Ansible, Puppet, or Chef.
Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
Knowledge of containerization and orchestration technologies (Docker, Kubernetes).
Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk.
Knowledge of performance tuning and maintenance for databases (PostgreSQL, MS SQL Server, MySQL).
Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
Excellent communication and collaboration skills to work effectively with cross-functional teams.
Strong attention to detail and ability to work in a fast-paced, dynamic environment.

Required Qualifications:

Bachelor’s degree in computer science, engineering, or a related field.
Minimum of 1 to 3 years of proven experience as a Site Reliability Engineer or in a similar role.
Solid understanding of software development methodologies and DevOps principles.
Experience with agile and iterative development processes.
Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
Experience with source control systems such as Git or SVN.
Knowledge of security best practices and experience implementing security measures in a production environment.
Ability to work independently and handle multiple projects and priorities simultaneously.
Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.