MLOps Engineer (PyTorch)
Location: Singapore
Job Type: Full-time
About the Opportunity
Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. The role focuses on owning and evolving the on-premises infrastructure that powers their advanced PyTorch-based training workloads.
This position is an ideal fit for an engineer who cares not only about model outcomes but also about the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase stays clean, scalable, and built to last.
Key Responsibilities
- Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
- Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
- Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
- Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking (a minimal pipeline sketch follows this list).
- Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
- Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
- Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
- Monitor and debug large-scale training jobs running across multiple nodes and GPUs.
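For illustration only, here is a minimal sketch of the kind of pipeline these responsibilities describe: PyTorch DistributedDataParallel training with mixed precision and resumable checkpoints. The model, batch, step count, and checkpoint path are placeholders rather than anything from the client's codebase, and a torchrun-style launch (which sets RANK/LOCAL_RANK/WORLD_SIZE) is assumed.

    # Minimal sketch: DDP + mixed precision + resumable checkpointing.
    # Launch with, e.g.:  torchrun --nproc_per_node=8 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    CKPT_PATH = "checkpoint.pt"  # hypothetical path, for illustration only

    def main():
        # torchrun populates RANK / LOCAL_RANK / WORLD_SIZE in the environment.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        scaler = torch.cuda.amp.GradScaler()

        # Resume if a checkpoint exists, so preempted jobs pick up where they left off.
        start_step = 0
        if os.path.exists(CKPT_PATH):
            ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{local_rank}")
            model.module.load_state_dict(ckpt["model"])
            optimizer.load_state_dict(ckpt["optimizer"])
            scaler.load_state_dict(ckpt["scaler"])
            start_step = ckpt["step"]

        for step in range(start_step, 1000):
            x = torch.randn(32, 1024, device=local_rank)  # stand-in batch
            with torch.cuda.amp.autocast():
                loss = model(x).pow(2).mean()
            optimizer.zero_grad(set_to_none=True)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            # Rank 0 writes periodic checkpoints; DDP keeps the other ranks in sync.
            if step % 100 == 0 and dist.get_rank() == 0:
                torch.save(
                    {"model": model.module.state_dict(),
                     "optimizer": optimizer.state_dict(),
                     "scaler": scaler.state_dict(),
                     "step": step + 1},
                    CKPT_PATH,
                )

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

In practice, this is the skeleton onto which experiment tracking, model versioning, and the I/O and utilisation tuning mentioned above would be layered.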
Required Qualifications (You Should Have)
- Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
- Advanced programming skills in both C++ and Python.
- A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
- Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS configuration.
- A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed communication libraries such as NCCL and MPI (a debugging sketch follows this list).
- A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
- Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.
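As a purely illustrative example of the low-level debugging referenced above, the sketch below raises NCCL and torch.distributed logging, pins the network interface and InfiniBand HCA explicitly, and runs a single all_reduce as a fabric sanity check. The interface and HCA names are example values only; on a real cluster they would come from ip link and ibstat, and a torchrun-style launch is again assumed.

    # Sketch: first-pass NCCL sanity check on a multi-node cluster.
    import os
    import torch
    import torch.distributed as dist

    # Turn up NCCL and torch.distributed logging before any process group exists.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    # Pin the interconnect explicitly rather than relying on autodetection
    # (example values; the real names depend on the fabric).
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
    os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # A single all_reduce exercises the full RoCE/InfiniBand path; if ranks
        # hang or the sum is wrong, the problem is in the fabric or NCCL config,
        # not in the training code.
        t = torch.ones(1, device=local_rank)
        dist.all_reduce(t)
        if dist.get_rank() == 0:
            print(f"all_reduce sum = {t.item()} (expected {dist.get_world_size()})")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()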
Preferred Qualifications (Nice-to-Have)
- Experience with non-standard deployments, such as on-premises local clusters or edge devices (i.e., not public cloud).
- Active contributions to PyTorch or other open-source ML/HPC tools.
- Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
- Experience building out a full logging, observability, and alerting stack for training workloads.
How to Apply
Interested candidates are invited to submit a resume detailing their experience managing PyTorch workloads on bare-metal infrastructure.