Site Reliability Engineer
Locations: Canada - Ottawa; United States - Denver, CO; United States - Shakopee, MN (GHQ)
Time type: Full time
Posted on: Yesterday
Job requisition id: R003414
Career Growth, Flexibility and Collaboration!
Entrust is an innovative leader in identity-centric security solutions, providing an integrated platform of scalable, AI-enabled security offerings. Headquartered in Minnesota, we offer our colleagues the ability to work globally, in a flexible and collaborative environment. Our team makes an impact!
Position Overview:
The Instant Financial Issuance (IFI) Cloud Service comprises a wide array of components, including web services, application servers, and databases, hosted in a hybrid cloud environment. The Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, and performant, as well as scalable, secure, and cost-effective. Ultimately, the individual will be responsible for the functional management of all IFIaaS cloud environments, applications, and networks, for scoping projects, and for resolving application and network issues.
Responsibilities:
- Monitor system health using metrics such as uptime, latency, error rate, throughput, and availability.
- Deploy and maintain monitoring and on-call tools such as Splunk, Prometheus, Grafana, PagerDuty, and Datadog.
- Create strategies to detect issues, such as setting up alerts, dashboards, and health checks.
- Address issues as they arise, using troubleshooting techniques, root cause analysis, and incident management.
- Design systems that recover automatically, using self-healing mechanisms such as auto-scaling, load balancing, failover, and mitigation run books.
- Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors.
- Implement various risk mitigation strategies, such as patching, backup, redundancy, encryption, or testing.
- Design, build, and maintain robust infrastructure on Azure and AWS, leveraging native cloud technologies such as AKS, EKS, managed SQL, and Mongo.
- Define and follow a clear incident response process, which includes roles, responsibilities, escalation, communication, and resolution.
- Use automation and orchestration tools to speed up the recovery process, such as restoring backups, rolling back changes, or deploying fixes.
- Design, implement and maintain robust CI/CD pipelines to automate the software delivery process.
- Automate configuration management tasks across multiple servers in hybrid cloud environments using tools such as Ansible and Terraform.
- Define infrastructure as code (IaC) to provision and manage cloud resources in hybrid environments (Azure, AWS, on-prem), including complete lifecycle management, scaling, and decommissioning.
- Implement best practices and standards to prevent or reduce the occurrence of emergencies, such as code reviews, testing, and monitoring.
- Implement and support a hybrid cloud environment spanning Microsoft Azure and on-premises infrastructure.
- Update incident response run books and automation, and create new templates as required.
- Manage activities with complete integrity and in accordance with the organization's policies, systems, practices, and programs.
- Collaborate with product and other teams to understand user needs, expectations, and satisfaction.
- Learn from incidents and post-mortems and implement the action items to prevent recurrence or improve response.
- Suggest and implement new solutions and technologies to enhance the system and the service, such as optimization, automation, or innovation.
- Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.
Basic Qualifications:
- Bachelor’s Degree in Computer Science, Software Engineering, or equivalent combination of education and experience.
- 5+ years of related experience as a Software Engineer, DevOps Engineer, Site Reliability Engineer, or in a similar role.
- Extensive experience working with enterprise-level microservices applications, including deployment and maintenance of the applications in distributed environments.
- Demonstrated hands-on experience and expertise with DevOps tooling (Ansible, Terraform, Jenkins, Octopus Deploy, etc.), networks, and network security, plus high-level managerial skills.
- In-depth hands-on experience with on-prem and cloud compute, storage, and networking solutions (VMware, NetApp, Azure, AWS, etc.).
Where you will be: This role is hybrid, requiring three days a week in-office at our offices in Ottawa, Canada or Denver, CO, as specified in the job description. At Entrust, we have a distributed workforce.
Entrust Corporation is an EOE/AA/Veteran/People with Disabilities employer.
NO AGENCIES, NO RELOCATION