Site Reliability Engineer

Be among the first applicants.

Entrust Corporation

Ottawa

CAD 125,000 - 150,000

Yesterday

Job description

Locations: Canada - Ottawa; United States - Denver, CO; United States - Shakopee, MN (GHQ)

Time type: Full time

Posted on: Yesterday

Job requisition id: R003414

Career Growth, Flexibility and Collaboration!

Entrust is an innovative leader in identity-centric security solutions, providing an integrated platform of scalable, AI-enabled security offerings. Headquartered in Minnesota, we offer our colleagues the ability to work globally, in a flexible and collaborative environment. Our team makes an impact!

Position Overview:

The Instant Financial Issuance (IFI) Cloud Service comprises a wide array of components, including web services, application servers, and databases hosted in a hybrid cloud environment. The Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, and performant, as well as scalable, secure, and cost-effective. Ultimately, the individual will be responsible for the functional management of all IFIaaS cloud environments, applications, and networks, for scoping projects, and for resolving application and network issues.

Responsibilities:

Monitor system issues using various metrics, such as uptime, latency, error rate, throughput, and availability.
Deploy and maintain monitoring and on-call tools, e.g., Splunk, Prometheus, Grafana, PagerDuty, Datadog, etc.
Create strategies to detect issues, such as setting up alerts, dashboards, and health checks.
Address issues as they arise, using troubleshooting techniques, root cause analysis, and incident management.
Design systems to troubleshoot automatically, using self-healing mechanisms such as auto-scaling, load balancing, failover, and mitigation run books.
Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors.
Implement various risk mitigation strategies, such as patching, backup, redundancy, encryption, or testing.
Design, build and maintain robust infrastructure on Azure and AWS, leveraging native cloud technologies, e.g., AKS, EKS, managed SQL, Mongo, etc.
Define and follow a clear incident response process, which includes roles, responsibilities, escalation, communication, and resolution.
Use automation and orchestration tools to speed up the recovery process, such as restoring backups, rolling back changes, or deploying fixes.
Design, implement and maintain robust CI/CD pipelines to automate the software delivery process.
Automate configuration management tasks across multiple servers in hybrid cloud environments using tools like Ansible, Terraform, etc.
Define IaC to provision and manage cloud resources in hybrid environments (Azure, AWS, on-prem), including complete lifecycle management, scaling, and decommissioning.
Implement best practices and standards to prevent or reduce the occurrence of emergencies, such as code reviews, testing, and monitoring.
Implement and support a hybrid cloud environment in Microsoft Azure and on-premises.
Update incident response run books and automation, and create new templates as required.
Manage activities with complete integrity and in accordance with the organization's policies, systems, practices, and programs.
Collaborate with product teams and other teams to understand user needs, expectations, and satisfaction.
Learn from incidents and post-mortems and implement the action items to prevent recurrence or improve response.
Suggest and implement new solutions and technologies to enhance the system and the service, such as optimization, automation, or innovation.
Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.

Basic Qualifications:

Bachelor’s Degree in Computer Science, Software Engineering, or equivalent combination of education and experience.
5+ years of related experience as a Software Engineer, DevOps Engineer, Site Reliability Engineer, or in a similar role.
Extensive experience working with enterprise level micro-services applications, including deployment and maintenance of the applications in distributed environments.
Demonstrated hands-on experience and expertise with DevOps tooling (Ansible, Terraform, Jenkins, Octopus Deploy, etc.), networks, and network security, along with high-level managerial skills.
In-depth hands-on experience with on-prem and cloud compute, storage and networking solutions (VMware, NetApp, Azure, AWS, etc.).

Where you will be: This role is hybrid, requiring three days a week in-office at our offices in Ottawa, Canada or Denver, CO, as specified in the job description. At Entrust, we have a distributed workforce.

Entrust Corporation is an EOE/AA/Veteran/People with Disabilities employer.

NO AGENCIES, NO RELOCATION
