Site Reliability Engineer

DUG
Kuala Lumpur
MYR 200,000 - 250,000
3 days ago
Job description

We are a technology company at the forefront of high-performance computing (HPC) with a strong foundation in applied physics. Our innovative hardware and software solutions for the global technology and resource sectors enable our clients to leverage large and complex data sets.

We operate three world-class green supercomputer clusters, running a large suite of scientific applications. Our research and development team of mathematicians, geophysicists, and software engineers is responsible for creating and maintaining this suite of signal-processing and subsurface imaging tools.

We run a large distributed application, written in Python, C/C++ and Java, on thousands of servers. We are looking for an experienced software engineer to improve the reliability and fault tolerance of the system in cooperation with our platform and IT teams. Part of the role is maintaining and extending the graph-based task scheduler that underlies our distributed application.

When submitting your application, show your attention to detail by including ‘Shibboleth’ in your cover letter.

Responsibilities:

  • Focusing on the needs of our distributed cluster applications, identify gaps in our application and system monitoring, and implement solutions in collaboration with the platform and IT teams.
  • Further optimise and develop our graph-based task scheduler, improving the resilience and efficiency of its scheduling decisions.
  • Contribute to the design and implementation of the distributed architecture with a focus on reliability and fault tolerance.
  • Inspect and maintain software written by other members of the team.
  • Give regular, constructive feedback to your peers and receive it from them.
  • Mentor an exceptional intern or junior developer.

Requirements:

  • Demonstrable expert-level skills as a software developer.
  • Experience with at least one of our programming languages: Python, Java, C, or C++.
  • A good understanding of the components of a Linux-based HPC cluster (network, filesystem, etc.) or, alternatively, experience in site reliability engineering (SRE).
  • Excellent written and spoken business and technical English.

DISCLAIMER:

The offer is subject to pre-employment screenings that may include, but are not limited to:

  • Verification of your right to work in the respective location.
  • Provision of applicable and relevant qualifications.
  • Nationally approved criminal history check.