Sr.HPC System Engineer

Big Data Technology Solutions
Abu Dhabi
AED 200,000 - 300,000
Job description

Key Responsibilities:
Dell Servers:
Architect and deploy HPC systems using Dell PowerEdge servers, ensuring high availability and optimized performance for compute-intensive applications.
Manage server hardware lifecycle, including deployment, upgrades, and diagnostics.
Configure HPC cluster nodes for seamless integration with Kubernetes and GPU workloads.
AMD GPUs (MI210):
Deploy and optimize AMD GPU-based servers to accelerate AI/ML, HPC, and data-intensive applications.
Monitor GPU utilization, troubleshoot performance bottlenecks, and optimize workloads for GPU acceleration.
Integrate GPUs into Kubernetes environments for containerized GPU-based applications.
Pure Storage:
Design and manage Pure Storage solutions, including FlashBlade, to support HPC and data-intensive workloads.
Implement multitenancy configurations for isolated, secure, and efficient resource utilization.
Monitor storage health and ensure performance optimization for high-speed data access.
Commvault Backup:
Architect and manage enterprise-wide Commvault backup solutions, ensuring data integrity and readiness for disaster recovery.
Implement backup and retention policies for HPC environments, including containerized and GPU-accelerated workloads.
Kubernetes Container Management:
Deploy and manage Kubernetes clusters for HPC applications, ensuring scalability and fault tolerance.
Configure persistent storage for containerized workloads and integrate storage with GPUs for high-performance data processing.
Monitor cluster performance and troubleshoot HPC-specific Kubernetes challenges.
System Optimization and Monitoring:
Implement advanced monitoring solutions for servers, GPUs, storage, and Kubernetes clusters to ensure peak performance.
Develop and enforce policies for system security, resource allocation, and compliance with industry standards.
Lead capacity planning and scaling initiatives for HPC infrastructure.

Technical Skills:
Extensive experience with Dell PowerEdge servers in HPC or enterprise environments.
Proven expertise in AMD GPUs (MI210), including their integration and optimization for AI/ML and HPC workloads.
Advanced knowledge of Pure Storage systems, including multitenancy and high-performance configurations.
Expertise in Commvault backup systems, including design, deployment, and disaster recovery.
Strong proficiency in Kubernetes container orchestration, particularly for GPU-accelerated applications.
Knowledge of high-performance interconnects (e.g., RDMA, InfiniBand) and networking for HPC.

Get a free, confidential resume review.
Select file or drag and drop it
Avatar
Free online coaching
Improve your chances of getting that interview invitation!
Be the first to explore new Sr.HPC System Engineer jobs in Abu Dhabi