Roles and responsibilities
As an Engineering Manager at Canonical, you must be technically strong, but your main responsibility is to run an effective team and develop the colleagues you manage. You will develop and review code as a leader, but know that the best way to improve the product is to ensure that the whole team is focused, productive and unblocked. You are expected to help them grow as engineers, do meaningful work, do it outstandingly well, find professional and personal satisfaction, and work well with colleagues and the community. You will also be expected to be a positive influence on culture, facilitate technical delivery, and regularly reflect with your team on strategy and execution. You will collaborate closely with other Engineering Managers, product managers, and architects, producing an engineering roadmap with ambitious and achievable goals.
We expect Engineering Managers to be fluent in the programming language, architecture, and components that their team uses. Code reviews and architectural leadership are part of the job. A commitment to healthy engineering practices, documentation, quality and performance optimisation is equally important, as are fair and clear management and the obligation to ensure a high-performing team.
Location: This role can be home based in the EMEA or Americas regions.
What your day will look like
- Manage a distributed team of engineers and its observability portfolio
- Organize and lead the team's processes in order to help it achieve its objectives
- Conduct one-on-one meetings with team members
- Identify and measure team health indicators
- Interact with a vibrant community
- Review code produced by other engineers
- Attend conferences to represent Canonical and its Observability Stack
What we are looking for in you
- An exceptional academic track record from both high school and university
- Drive and a track record of going above and beyond expectations
- A proven track record of professional software delivery
- Professional Python development experience, preferably with a track record in open source
- A proven understanding of the importance of observability and monitoring for keeping software running smoothly
- Experience designing and implementing observability solutions
- Willingness to travel up to 4 times a year for internal events
- Professional written and spoken English
- Experience with Linux (Debian or Ubuntu preferred)
- Excellent interpersonal skills, curiosity, flexibility, and accountability
- Passion, thoughtfulness, and self-motivation
- Excellent communication and presentation skills
- Results-oriented, with a personal drive to meet commitments
Additional Skills That You Might Also Bring
- Engineering management experience
- A working knowledge of Go
- Open source contribution experience
- Interest in and experience with container technologies
Desired candidate profile
- Observability: The ability to infer the internal state of a system based on the data it produces, such as logs, metrics, and traces.
- Monitoring vs. Observability:
  - Monitoring is the act of checking system health by looking at predefined metrics and indicators.
  - Observability is the broader concept, which enables proactive troubleshooting and understanding of complex systems by exploring the data and uncovering unknown issues.
- Three Pillars of Observability:
  - Metrics: Quantitative data points that provide insights into system performance, like CPU usage, memory usage, request counts, and error rates.
  - Logs: Structured or unstructured data that records events, errors, and system activity, which helps diagnose issues and understand system behavior.
  - Traces: Distributed traces that track the journey of requests or transactions across various services in a microservices architecture. They help identify bottlenecks and latencies in the system.
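The distinction between monitoring and observability described above can be made concrete with a small sketch. It assumes a hypothetical list of request events (the field names `status`, `region`, and `version` are illustrative only): the monitoring function checks one predefined indicator against a threshold, while the observability function slices the same data by whatever dimension turns out to be interesting during an investigation.

```python
from collections import Counter

# Hypothetical request events; the field names are assumptions for illustration.
EVENTS = [
    {"status": 200, "region": "eu", "version": "1.4"},
    {"status": 500, "region": "us", "version": "1.5"},
    {"status": 200, "region": "us", "version": "1.4"},
    {"status": 500, "region": "eu", "version": "1.5"},
    {"status": 200, "region": "eu", "version": "1.4"},
    {"status": 500, "region": "us", "version": "1.5"},
]


def error_rate(events):
    """Monitoring: compute one predefined health indicator."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) if events else 0.0


def is_unhealthy(events, threshold=0.25):
    """Monitoring: compare the indicator against a fixed threshold."""
    return error_rate(events) > threshold


def errors_by(events, dimension):
    """Observability: group errors by a dimension chosen after the fact."""
    return Counter(e[dimension] for e in events if e["status"] >= 500)
```

In this toy data the threshold check only says that something is wrong; grouping by `version` shows that every error comes from version 1.5, which is the kind of question that was not anticipated when the threshold was defined.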
Key Skills and Tools
a) Metrics Collection and Monitoring
- Prometheus: A popular open-source tool for gathering and querying metrics in time-series format.
- Grafana: A dashboard and visualization tool commonly paired with Prometheus to create dynamic dashboards.
- StatsD: A lightweight daemon that receives and aggregates metrics sent by services, typically over UDP.
- InfluxDB: A time-series database designed for storing metrics and events.
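As an illustration of what "metrics in time-series format" looks like on the wire, the following sketch renders a counter in the Prometheus text exposition format using only the standard library. The function name `render_counter` and the sample metric are hypothetical; a real service would normally use an official client library rather than hand-rolled formatting, and this sketch does not escape special characters in label values.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed to need no escaping (a simplification).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, for illustration only.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        {(("method", "get"), ("code", "200")): 3},
    ))
```

Each sample line is a metric name, an optional label set, and a value; a scraper turns successive readings of these lines into a time series.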
b) Logs Management
- ELK Stack (Elasticsearch, Logstash, Kibana): A suite of tools for centralizing and visualizing logs.
  - Elasticsearch: A distributed search and analytics engine.
  - Logstash: A tool for ingesting, transforming, and sending log data to Elasticsearch.
  - Kibana: A dashboard tool for visualizing log data stored in Elasticsearch.
- Fluentd: An open-source data collector for log aggregation, routing, and transformation.
- Datadog Logs: A SaaS platform offering log aggregation, real-time search, and powerful analytics.
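Log pipelines such as the ones listed above are easiest to feed with structured logs. The sketch below, which uses only Python's standard `logging` and `json` modules, emits one JSON object per log line. The `JsonFormatter` class, the `build_logger` helper, and the `context` extra attribute are hypothetical names chosen for illustration; timestamps are omitted to keep the example short.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream, name="checkout"):
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("payment failed", extra={"context": {"order_id": "A17"}})
    print(buffer.getvalue(), end="")
```

Because every field is a named key, a downstream tool can filter on `order_id` or `level` without parsing free-form text.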
c) Distributed Tracing
- Jaeger: An open-source, end-to-end distributed tracing tool for monitoring and troubleshooting microservices-based architectures.
- OpenTelemetry: An open-source framework for collecting telemetry data (traces, metrics, logs) from applications and infrastructure.
- Zipkin: Another open-source distributed tracing tool.
- Datadog APM: Application Performance Monitoring solution for collecting traces and metrics from services.
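Distributed tracing depends on each service passing a trace identifier to the next one. As a minimal sketch of that idea, the code below builds and propagates a header in the shape of the W3C Trace Context `traceparent` field (version, trace ID, parent span ID, flags). The function names are hypothetical, and the handling of invalid input is simplified; production code would normally rely on a tracing SDK instead.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent_header):
    """Keep the caller's trace ID and flags, but mint a new span ID.

    If the incoming header does not parse, start a fresh trace
    (a simplification of the rules a real SDK would apply).
    """
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print(incoming)
    print(outgoing)
```

The shared trace ID is what lets a tracing backend stitch spans from different services into one request timeline.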
d) Alerting and Incident Management
- Alertmanager: The Prometheus companion that deduplicates, groups, and routes alerts to notification channels; the alert rules themselves are defined in Prometheus.
- PagerDuty: Incident response platform for managing alerts and coordinating incident resolution.
- Opsgenie: A similar incident management tool that integrates with various monitoring and alerting tools.
- VictorOps (now Splunk On-Call): A tool for alert management and collaboration during incidents.
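One job these tools share is reducing alert noise, for example by grouping related alerts into a single notification. The sketch below is loosely modelled on that grouping idea and is not the implementation of any tool listed above; the alert structure, the label names, and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bucket alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def summarize(groups):
    """Produce one notification line per group instead of one per alert."""
    return [
        f"{'/'.join(key)}: {len(members)} firing"
        for key, members in sorted(groups.items())
    ]


if __name__ == "__main__":
    # Hypothetical alerts, for illustration only.
    alerts = [
        {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a"}},
        {"labels": {"alertname": "HighLatency", "service": "api", "instance": "b"}},
        {"labels": {"alertname": "DiskFull", "service": "db", "instance": "c"}},
    ]
    for line in summarize(group_alerts(alerts)):
        print(line)
```

Here two instances of the same latency problem collapse into one line, so the on-call engineer receives two notifications rather than three.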