We are seeking a highly motivated and skilled engineer to join our team. The ideal candidate will have a strong background in managing server hardware, including network, storage, compute, and AI systems, as well as experience validating failed server hardware.
Roles and Responsibilities:
Manage and maintain a fleet of server racks from different OEMs (network, storage, compute, and AI hardware).
Administer and provide access to high-performance clustered file systems, preferably GPFS/IBM Storage Scale.
Administer FC/InfiniBand-based SANs.
Interface with OEM vendors on maintenance related to firmware and driver updates.
Support failure analysis initiatives by using available HW resources to validate rack-level, system-level, and module-level failures from different Meta datacenters.
Manage and maintain network infrastructure for the lab, including switches, routers, and firewalls.
Configure and manage network protocols, such as TCP/IP, DNS, and DHCP.
Ensure network security and compliance with company policies and industry standards.
Work with LLMs and popular frameworks such as TensorFlow or PyTorch.
Design and implement containerized applications using Docker and Kubernetes.
Manage and maintain virtual machines using popular hypervisors, such as VMware or KVM.
Support failure analysis labs, including inventory management, safety audits, and maintaining access controls to critical server hardware.
Support root cause analysis and diagnosis of hardware/software issues. Isolate failures in platform, firmware, BIOS, CPLD, and other applications.
Work with Dediprog tools (FW/BIOS debug).
Provide regular updates to the failure analysis lead and collaborate with the team on different mission-critical projects.
Qualifications:
Bachelor’s or master’s degree in Computer Science, Electrical Engineering, or a related field.
5+ years of experience in server rack management, lab infrastructure management, and/or related fields.
Experience with debugging and troubleshooting complex hardware issues, including storage, compute, and AI.
Strong experience with Linux (Red Hat, Fedora, CentOS, etc.) or Unix operating systems.
Experience with scripting languages such as Python, PowerShell, PHP, or Perl.
Experience working with containerization (Docker, Kubernetes) and virtual machine management.
Experience with failed server hardware validation, including BIOS/CPLD FW debug.
Knowledge of network protocols, including TCP/IP, DNS, and DHCP.
Strong knowledge of server hardware components, including motherboards, power distribution boards, and storage systems.
Strong problem-solving skills and the ability to work independently.