AIS - Senior Site Reliability Engineer

Apple Inc.

London

GBP 80,000 - 100,000

Job description

Imagine what you could do here. At Apple, new ideas have a way of becoming great products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you can accomplish. We are seeking an extraordinary individual who is passionate about reliability engineering, software development, privacy, and information security, and who wants to work in hyper-scale environments. The ideal candidate will have a strong background in production monitoring, a deep understanding of both development and operations, and a proven track record of running large-scale production environments.

Description

Our team is highly collaborative, working closely with partner teams to deliver the best results for Apple. For each engineering challenge we face, we strive to find the best solution while still getting things done efficiently. Good ideas are valued and rewarded. As an SRE in Apple Information Security (AIS), you will:

Operate, monitor, and triage all aspects of our production and non-production environments.
Pioneer and implement the next-generation telemetry system for AIS services.
Establish alert-handling procedures and runbooks, and collaborate with our global security team.
Automate the deployment and orchestration of services in the cloud environment, as well as other routine processes.
Actively participate in capacity planning and disaster recovery exercises.
Interact with and support partner teams across the enterprise.
Cultivate and maintain relationships with internal partners and external third-party vendors.

Minimum Qualifications

Professional experience in Site Reliability Engineering, DevOps, or a related field.
Experience working with cloud compute environments like OpenStack, AWS, GCP, or Azure.
Experience with infrastructure as code (IaC), configuration management, CI/CD, and automation, e.g., Terraform, Pulumi, CloudFormation, Ansible, Chef, Puppet, Jenkins.
Strong programming skills in Python and/or Go.

Preferred Qualifications

Proficiency in implementing and coordinating telemetry using monitoring and observability tools like Splunk, Grafana, Prometheus, or similar.
Extensive experience administering and troubleshooting Linux systems (any distribution), including the usage of standard Linux utilities.
Troubleshooting and debugging experience.
Experience in shell scripting (e.g., bash/zsh) and system administration.
Experience with measuring, analyzing, and optimizing performance.
Experience working with Scrum/Agile development methodologies.
Strong understanding of concurrency, parallelism, and distributed system concepts.
Passion for high-quality code, tests, documentation, and production services.
Experience participating in an on-call rotation.
Experience building and operating containerized services, orchestration systems, and microservices (e.g., Docker, Kubernetes, Vagrant).
Bachelor’s degree in Computer Science or equivalent experience.