About the Role
Shopify is the commerce platform that powers millions of merchants worldwide. Behind the product experience are ML systems that drive recommendations, search, and personalization at massive scale.
We build the compute and serving layer behind these systems: multi-node GPU training clusters, real-time inference with strict latency budgets, and the performance engineering that keeps it all efficient at scale. Our models serve hundreds of millions of buyers, and the infrastructure we build directly impacts how merchants grow their businesses.
The Role
You will own the core infrastructure that ML Engineers depend on to train and serve models: GPU training clusters, real-time serving systems, and the performance and reliability layer underneath both. You'll sit between ML Engineers who need fast iteration and production systems that need to stay up during events like Black Friday/Cyber Monday, when traffic and stakes peak simultaneously.
This role carries real technical authority. You'll make architectural decisions about how we scale training and serving, set standards for infrastructure quality, and be the person the team relies on when systems need to scale by an order of magnitude. You'll mentor engineers across the team, drive alignment on infrastructure direction across multiple workstreams, and influence technical strategy beyond your immediate team. You'll also raise the engineering bar through hiring and technical reviews.
What You'll Do
Training Infrastructure
Design and operate GPU training pipelines on Kubernetes, including multi-node distributed training on GPU clusters
Own training reliability: checkpointing, fault tolerance, preemption recovery, and resource scheduling
Optimize training performance: mixed precision, kernel tuning, data loading throughput, and cluster utilization. You own compute efficiency; data correctness and freshness are owned by the operations side of the team.
Build abstractions that let ML Engineers launch and iterate on training runs with minimal friction
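To give a flavor of the reliability work above (checkpointing and preemption recovery), here is a minimal, dependency-free sketch of preemption-safe checkpoint/resume; every name here is illustrative, not Shopify's actual tooling, and a real training loop would checkpoint model and optimizer state rather than a dict:

```python
import json
import os

def save_checkpoint(path, step, state):
    """Write to a temp file, then rename: a preempted job can never
    observe a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path):
    """Resume from the last complete checkpoint, or start from scratch."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, checkpoint_every=10):
    """Toy training loop: resumes wherever the last run left off."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    save_checkpoint(path, step, state)
    return step, state
```

The atomic-rename pattern is the important part: if the scheduler preempts the pod mid-write, the previous checkpoint is still intact and the next run resumes cleanly.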
Serving Infrastructure
Build and maintain model serving infrastructure for real-time recommendation and LLM inference, with strict latency and throughput requirements
Optimize serving cost and performance: batching strategies, model compilation, GPU right-sizing, and autoscaling
Ensure serving systems meet availability and latency targets under peak traffic
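Batching is one of the cost/performance levers named above. As a toy sketch only (not our serving stack, and all names are hypothetical), here is size-and-deadline micro-batching, which trades a small amount of latency for much better GPU throughput:

```python
import time
from collections import deque

class MicroBatcher:
    """Group incoming requests into batches, bounded by batch size
    and maximum wait time."""

    def __init__(self, max_batch=8, max_wait_ms=5.0):
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue = deque()  # (arrival_time, request) pairs

    def submit(self, request):
        self.queue.append((time.monotonic(), request))

    def next_batch(self):
        """Return a batch when full, or when the oldest request has
        waited past its deadline; otherwise return nothing yet."""
        if not self.queue:
            return []
        oldest_ts, _ = self.queue[0]
        full = len(self.queue) >= self.max_batch
        expired = time.monotonic() - oldest_ts >= self.max_wait
        if full or expired:
            n = min(self.max_batch, len(self.queue))
            return [self.queue.popleft()[1] for _ in range(n)]
        return []
```

Production systems (e.g. vLLM or Triton) implement far more sophisticated variants, but the core tension is the same: the batch-size and wait-time knobs are exactly where latency budgets meet GPU utilization.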
Platform & Developer Experience
Build internal tools and platforms that accelerate the model development lifecycle
Define infrastructure patterns and best practices adopted across the team
Improve the inner loop for ML Engineers: faster iteration from code change to training result to production evaluation
Technical Leadership
Drive cross-team technical strategy for ML infrastructure - identify the next set of problems before they become blockers
Mentor and up-level engineers on the team through pairing, design reviews, and setting technical standards
Contribute to hiring: screen candidates, conduct technical interviews, and calibrate the engineering bar
Write technical proposals and RFCs that shape infrastructure direction across the organization
What We're Looking For
Required
7+ years in software engineering, with 5+ years focused on ML infrastructure or distributed systems
Deep hands-on experience with GPU training at scale: distributed training, checkpointing, fault recovery, and performance tuning. You've debugged real problems like NCCL hangs, gradient synchronization issues, or data loading bottlenecks.
Strong Kubernetes skills: pod specs, GPU scheduling, resource quotas, debugging scheduling failures, and operating stateful GPU workloads
Production model serving experience: you've built or operated serving systems behind real user traffic with latency constraints
Solid Python and systems fundamentals; comfortable reading and modifying PyTorch training code
Experience designing infrastructure abstractions used by other engineers
Demonstrated technical leadership: you've driven architecture decisions, written technical proposals, and influenced engineering direction beyond your immediate team
Track record of mentoring engineers and raising the technical bar on a team
Preferred
Experience with cloud-native ML orchestration (SkyPilot, Ray, or similar)
Hands-on with LLM serving stacks (vLLM, TensorRT-LLM, Triton, or equivalent)
Experience with model compression in production (quantization, pruning, distillation)
Experience operating recommendation or retrieval systems at scale
Track record of building internal platforms adopted by other teams
How We Work
You'll pair directly with ML Engineers. Understanding their models well enough to build the right infrastructure abstractions is part of the job.
We prefer automation over runbooks. If a process can be scripted, it should be.
On-call is shared. When you're on rotation, your scope is GPU cluster health, training failures, and serving availability - you own it end to end.
You'll profile GPU kernels, chase p99 latency regressions, and care about FLOPS utilization. This is a deeply technical infrastructure role.
Research and production are the same codebase. You'll see your infrastructure decisions reflected in real model quality and real merchant outcomes.
Shopify operates on high trust and low process. You'll have real ownership and the autonomy to make decisions, not just execute tickets.
What Success Looks Like
In 3 months: You've onboarded to training and serving infrastructure, shipped at least one meaningful improvement to reliability or performance, and can independently debug issues across the GPU stack.
In 6 months: You own a major infrastructure subsystem (training cluster or serving platform). ML Engineers are training faster or serving more reliably because of changes you've made.
In 12 months: You've shaped the technical roadmap for ML infrastructure and influenced engineering direction beyond the team. Other engineers across the organization come to you for architectural guidance. The platform scales to the next generation of models because of the systems and standards you've put in place. You've made the team stronger through hiring and mentorship.