Observability Site Reliability Engineer

DRW Holdings, LLC.
London
GBP 50,000 - 90,000
Job description

DRW is a diversified trading firm with over 3 decades of experience bringing sophisticated technology and exceptional people together to operate in markets around the world. We value autonomy and the ability to quickly pivot to capture opportunities, so we operate using our own capital and trade at our own risk.

Headquartered in Chicago with offices throughout the U.S., Canada, Europe, and Asia, we trade a variety of asset classes including Fixed Income, ETFs, Equities, FX, Commodities and Energy across all major global markets. We have also leveraged our expertise and technology to expand into three non-traditional strategies: real estate, venture capital and cryptoassets.

We operate with respect, curiosity and open minds. The people who thrive here share our belief that it's not just what we do that matters, but how we do it. DRW is a place of high expectations, integrity, innovation and a willingness to challenge consensus.

Our Observability team provides mission-critical support for many of the centralized logging, metrics and tracing tools used throughout the firm. The team manages the deployment and administration of these applications, ensuring multi-tenant and highly available operation. In addition, they work with other teams to help them use these tools effectively and get the most out of the data produced. It's a fast-paced, dynamic environment that constantly provides new technical challenges and demands that you learn new things daily.

What you will do in this role:

  1. Provide best-in-class support for our suite of applications
  2. Troubleshoot production system incidents and create artifacts for postmortems to ensure that similar failures in the future are avoided
  3. Develop automation to facilitate administrative tasks supporting the onboarding and maintenance of various users and groups
  4. Test and automate upgrades of our applications to remain on our vendor's latest releases
  5. Continuously improve our own logging, monitoring and alerting practices
  6. Interact with vendor support to debug and drive third-party issues to resolution
  7. Interface with other teams to be an ambassador of good observability practices
  8. Help teams identify data to ingest and how to make use of this data through dashboards and alerting

Required Experience:

  1. 5+ years of industry experience using various logging and monitoring tools
  2. Coding experience to automate repetitive tasks
  3. Familiarity with CI/CD systems and workflows
  4. Familiarity with Git or other version control systems
  5. Persistent drive to improve workflows and make things better
  6. Ability to troubleshoot complex problems
  7. Solid written and verbal communication skills
  8. Ability to work well on a team as well as independently

What will make you stand out:

  1. Experience using Splunk, Grafana, Prometheus and other observability tools
  2. Experience using Kubernetes to deploy and maintain systems
  3. Experience using Jsonnet or other templating tools to render complex YAML/JSON
  4. Familiarity with GitOps workflows
  5. Solid grasp of configuration management concepts and skills