As a Senior Data Center Technician, you will play a crucial role in maintaining the optimal performance and reliability of our advanced data center infrastructure. You will be responsible for overseeing the installation, configuration, and management of cutting-edge hardware, particularly NVIDIA GPUs, which are integral to our AI solutions. A strong emphasis will be placed on troubleshooting, diagnostics, and proactive maintenance practices to ensure system uptime, performance efficiency, and security. You will collaborate closely with cross-functional teams including engineering, IT support, and data operations to ensure that our data center operations are conducted in alignment with organizational goals and best practices. This position demands a keen analytical mindset and the ability to respond swiftly to a dynamic technical environment, while also providing mentorship and guidance to junior technicians.
Job Requirements:
At least 5 years of hands-on experience in data center operations, with specific expertise in managing NVIDIA GPU hardware such as the A100, H100, and H200.
Solid understanding of data center infrastructure, including power distribution, cooling systems, and racking standards.
Demonstrable experience in performing routine maintenance and hardware upgrades in high-availability environments.
Strong foundational knowledge of networking technologies, including Ethernet, TCP/IP, and network troubleshooting techniques.
Proven ability to implement and manage monitoring tools to ensure optimal hardware performance and uptime.
Experience in executing disaster recovery and backup strategies to maintain data integrity and availability.
Exceptional problem-solving skills, with a track record of diagnosing and resolving hardware-related issues within stringent timeframes.
Knowledge of server operating systems (Windows, Linux), including installation, configuration, and system management tasks.
Strong familiarity with virtualization technologies and their application in a data center setting.
Excellent communication skills, with the ability to articulate technical information to non-technical stakeholders.
Comfortable working in high-pressure environments, demonstrating resilience and adaptability in response to incidents.
Team-oriented mindset, with experience leading and mentoring junior team members effectively.
Strong understanding of safety protocols and compliance standards relevant to data center operations.
Willingness to participate in on-call rotations and availability for after-hours support as necessary.
A commitment to continuous learning and professional development in emerging technologies relevant to data center operations.
Job Responsibilities:
Administer daily operations of the data center, ensuring that all systems function at peak performance and meet uptime SLAs.
Conduct routine inspections and maintenance checks on NVIDIA hardware and associated components, proactively identifying potential issues.
Collaborate with the engineering team to deploy new hardware installations and upgrades, ensuring minimal disruption to existing operations.
Monitor system performance using various diagnostic tools, taking immediate corrective actions as needed to mitigate risks.
Document all maintenance and repair activities, including any changes made, to keep the hardware inventory and configuration management database accurate.
Serve as the primary point of contact for incident escalations, applying expert troubleshooting methodologies to restore services swiftly.
Lead and facilitate training sessions for junior technicians focusing on best practices in data center management and equipment handling.
Participate in capacity planning and resource allocation discussions, providing insights based on observed trends in performance and usage.
Implement and reinforce compliance with safety regulations and protocols, ensuring a safe working environment for all personnel.
Develop and refine data center operating procedures and workflows, improving efficiency and productivity across teams.
Engage in collaborative troubleshooting with cross-departmental teams to address complex technical challenges impacting system performance.
Evaluate and recommend new tools and technologies that can automate and enhance data center operations.
Assist in the development of backup and disaster recovery plans, coordinating testing exercises to validate effectiveness.
Maintain accurate documentation of data center configurations, topology, and relevant operational data to enable efficient troubleshooting.
Participate in continuous improvement initiatives, contributing innovative ideas that align with organizational goals and enhance service delivery.
Required Skills:
Advanced knowledge and practical experience with NVIDIA GPUs and associated technologies.
Strong technical aptitude in diagnosing and resolving hardware, network, and system-related issues.
Expertise in system monitoring and management tools for assessing performance metrics and resource utilization.
Proficiency in scripting and automation tools for optimizing data center tasks and processes.
Exceptional organizational skills with a meticulous approach to project management and documentation.
Effective interpersonal skills with the ability to work collaboratively in a team-oriented environment.
Capacity to prioritize and manage multiple tasks in a fast-paced, high-demand environment.
Analytical mindset with a solid foundation in data analysis, performance benchmarking, and reporting.
Ability to maintain composure and productivity under pressure, particularly during critical incidents.
Commitment to professional integrity and ethical practices in all technical operations.
Enthusiastic attitude towards continuous education and adapting to new technologies in the data center ecosystem.
Customer-centric mentality, with a focus on delivering results that enhance user satisfaction and operational efficiency.
Strong familiarity with industry trends and emerging technologies in data center management.
Hands-on experience with virtualization platforms such as VMware or Hyper-V and their strategic implementations in data centers.
Knowledge of disaster recovery methodologies, planning, and execution.
Excellent writing skills for creating technical documentation, capacity reports, and incident logs.