Senior Site Reliability Engineer - Fleet

Cisco Systems, Inc.

Toronto

CAD 125,000 - 150,000

Job description

Cisco Meraki, a division of Cisco Networking, is a cloud-managed IT company and leader in cloud-controlled Wi-Fi, routing, and security. Our intuitive platform enables organizations of all sizes to deliver customer and employee experiences at scale. To provide best-in-class technologies to our customers, we’ve created an unrivaled company culture for our employees. One where diverse backgrounds, perspectives, and experiences shape our work and fuel our evolution. One that is collaborative, flexible, and inclusive and provides employees with the autonomy to develop technology that’s accessible and secure for everyone.

We are seeking a Senior Site Reliability Engineer (SRE) to join our dynamic SRE Fleet team, which is responsible for ensuring the stability, scalability, and efficiency of our infrastructure. You will play a critical role in maintaining and improving a fleet of more than 2,000 machines across a global cloud environment. This role is highly collaborative, involving close interaction with engineering and SRE teams in the UK and San Francisco to scale and optimize our infrastructure.

Responsibilities

Develop and maintain automation code for cloud maintenance processes using Ansible and Ruby.
Debug and resolve complex failure scenarios across large-scale systems, ensuring high availability and reliability.
Design, implement, and optimize GitLab CI pipelines to streamline deployment and testing workflows.
Collaborate with engineering teams to identify and address performance bottlenecks and scaling challenges.
Proactively troubleshoot issues across the fleet, using a deep understanding of Linux systems and networking.
Contribute to the creation of robust unit tests and infrastructure testing suites with RSpec.
Participate in collaborative projects to improve infrastructure efficiency, scalability, and observability.
Work cross-functionally with teams in different time zones, fostering a culture of shared ownership and reliability.
Develop and maintain automated tools for collecting infrastructure data to support compliance requirements.
Streamline compliance processes by reducing manual overhead through automation.

You are an ideal candidate if you have:

5+ years of experience in Site Reliability Engineering, DevOps, or a similar role in large-scale cloud environments.
Strong expertise in Ansible for infrastructure automation.
Proficiency in Ruby programming and testing frameworks like RSpec.
Strong Linux systems administration and troubleshooting skills.
Hands-on experience with CI/CD pipelines, particularly GitLab CI.
Demonstrated experience troubleshooting and debugging in complex distributed systems.
Experience managing and optimizing fleets of thousands of machines.
Excellent collaboration skills and the ability to work effectively across teams in multiple time zones.
Passion for automation, scalability, and infrastructure as code.
Familiarity with cloud providers (AWS, GCP, or similar).
Knowledge of monitoring and observability tools.
Experience with disaster recovery and high availability strategies.