Principal Machine Learning Engineer, Mobile AI Inference Optimization
Description
The opportunity
We are building the next generation of mobile game AI experiences, deploying world models on-device on mobile hardware. As our Principal Machine Learning Engineer, you will be the foremost technical authority on bringing state-of-the-art multi-modal models (transformers, diffusion networks, and JAPE-style architectures) from research to production on mobile hardware. This is a deeply hands-on, high-impact role. You will define the inference strategy, drive architectural decisions across the full mobile ML stack, and mentor a team of senior and mid-level engineers. Your work will directly determine the latency, quality, and power profile of AI-driven features experienced by billions of mobile game players.

What you'll be doing
- Technical Leadership:
  - Set the technical vision and roadmap for deploying multi-modal AI models to iOS and Android, spanning transformers, diffusion models, and JAPE-style generative architectures.
  - Make authoritative decisions on model compression, quantization, pruning, and knowledge distillation strategies to meet mobile latency and memory budgets.
  - Evaluate and select inference runtimes (e.g., CoreML, ONNX Runtime Mobile, TFLite, ExecuTorch) and drive adoption across the team.
  - Own the end-to-end optimization pipeline: from model export and graph transformation to hardware-specific kernel tuning on NPU, GPU, and CPU.
- Architecture & Research Translation:
  - Collaborate directly with research scientists to translate novel model architectures into deployable, mobile-optimized implementations.
  - Design scalable systems for multi-modal inference that process diverse inputs (images, text, primitives, and metadata) and produce pixel-level outputs with real-time performance.
  - Pioneer new approaches to dynamic resolution, token reduction, and speculative decoding tailored to mobile constraints.
  - Track and rapidly adopt breakthroughs in efficient diffusion (e.g., consistency models, flow matching) and efficient attention (e.g., FlashAttention, linear attention variants).
- Team & Cross-Functional Leadership:
  - Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and on-device benchmarking methodology.
  - Partner with platform engineers, product managers, and runtime teams to align ML capabilities with device SKU constraints and product roadmaps.
  - Champion a culture of measurement: define KPIs for latency, accuracy, memory, and power consumption and ensure the team tracks them rigorously.

What we're looking for
- 8+ years in ML engineering, with at least 3 years focused on on-device / edge inference optimization.
- Proven production deployment of transformer-based models (e.g., ViT, LLaMA, Stable Diffusion) and/or JAPE-style generative architectures on mobile or embedded hardware.
- Hands-on expertise with CoreML, TFLite, ONNX Runtime, and/or ExecuTorch; deep understanding of operator fusion, memory layout, and runtime scheduling.
- Expert-level command of INT8/INT4/FP16 quantization, weight sharing, structured/unstructured pruning, and knowledge distillation.
- Strong understanding of mobile SoC architectures (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and how to target each for peak throughput.
- Proficiency in C++ / Objective-C / Swift for runtime integration; solid Python for training-side tooling and export pipelines.
- Ability to read, imple