Principal Software Developer - MLOps Platform

Autodesk

Toronto

CAD 90,000 - 150,000

7 days ago

Job description

Job Requisition ID: 25WD87574

French translation to follow!/Traduction française à suivre!

Position Overview

We are looking for an experienced Principal Software Engineer to join our platform team focusing on the AI/ML Platform (AMP). This team builds and maintains central components, such as a model development studio, feature store, model serving, and model observability, that fast-track the development of new ML/AI models. The ideal candidate has a background in MLOps, data engineering, and DevOps, with experience building high-scale deployment architectures and observability. As an important contributor to our engineering team, you will help shape the future of our AI/ML capabilities and deliver solutions that create value for our organization. You will report to a manager.

Responsibilities

System design: You will design, implement and manage software systems for the AI/ML Platform, orchestrating the full ML development lifecycle for the partner teams.
Mentoring: Spread your knowledge, share best practices, and conduct design reviews to enhance expertise at the team level.
Multi-cloud architecture: Define components that leverage strengths from multiple cloud platforms (e.g., AWS, Azure) to optimize performance, cost, and scalability.
AI/ML observability: Build systems for monitoring the performance of AI/ML models and extracting insights into the underlying data, such as drift, data fairness/bias, and anomalies.
ML Solution Deployment: Develop tools for building and deploying ML artifacts in production environments, facilitating a smooth transition from development to deployment.
Big Data Management: Automate and orchestrate tasks related to managing big data transformation and processing, building large-scale data stores for ML artifacts.
Scalable Services: Design and implement low-latency, scalable prediction and inference services to support the diverse needs of our users.
Cross-Functional Collaboration: Collaborate across diverse teams, including machine learning researchers, developers, product managers, software architects, and operations, fostering a collaborative and cohesive work environment.
End-to-end ownership: Take end-to-end ownership of components, working with other engineers on the team across design, architecture, implementation, rollout, onboarding support for partner teams, production on-call support, testing/verification, and investigations.

Minimum Qualifications

Educational Background: Bachelor's degree in Computer Science or equivalent practical experience.
Experience: Over 8 years of experience in software development and engineering, delivering production systems and services.
Prior experience working with MLOps teams at the intersection of ML model deployment, DevOps, and data engineering.
Hands-on skills: Ability to fluently translate designs into high-quality code in Golang, Python, and Java.
Knowledge of DevOps practices, containerization, and orchestration tools such as CI/CD, Terraform, Docker, Kubernetes, and GitOps.
Demonstrated knowledge of distributed data processing frameworks, orchestrators, and data lake architectures using technologies such as Spark, Airflow, and Iceberg/Parquet formats.
Prior collaboration with data science teams to deploy their models and set up ML observability for inference-level monitoring.
Exposure to building RAG-based applications in collaboration with other product teams and data scientists/AI engineers.
Demonstrated creative problem-solving skills with the ability to break down problems into manageable components.
Knowledge of AWS and/or Azure for architecting large-scale application deployments.
Excellent communication and collaboration skills, fostering teamwork and effective information exchange.

Preferred Qualifications

Experience integrating with third-party vendors.
Experience in latency optimization with the ability to diagnose, tune, and enhance the efficiency of serving systems.
Familiarity with tools and frameworks for monitoring and managing the performance of AI/ML models in production (e.g., MLflow, Kubeflow, TensorBoard).
Familiarity with distributed model training/inference pipelines using KubeRay or equivalent.
Exposure to leveraging GPU computing for AI/ML workloads, including experience with CUDA, OpenCL, or other GPU programming tools, to significantly enhance model training and inference performance.
Exposure to ML libraries such as PyTorch, TensorFlow, XGBoost, Pandas, and Scikit-Learn.