Position Details

Type Full-Time

Experience Senior

Exp. Years Not specified

Education Not specified

Category AI & Machine Learning

About this role

Scale AI’s Public Sector ML team deploys advanced AI systems into mission-critical government environments and builds automated evaluation frameworks for safety and governance.

Key Responsibilities

Develop and maintain automated evaluation pipelines
Design test datasets and benchmarks
Build evaluation frameworks for LLM agents
Conduct stress tests and red-teaming
Collaborate to produce evaluation datasets

Technical Overview

Focus on automated evaluation pipelines for ML models, LLM agent evaluation, stress testing, red-teaming, and regulatory compliance; Python-based ML tooling; cloud deployments.

Ideal Candidate

The ideal candidate is a senior ML engineer with production ML experience in government contexts, strong Python skills, and familiarity with ML evaluation, CV robustness, and AI safety frameworks.

Must-Have Skills

Experience in computer vision, deep learning, reinforcement learning, or NLP in production settings
Strong programming skills in Python; experience with TensorFlow or PyTorch
Background in algorithms, data structures, and object-oriented programming
Experience with LLM pipelines, simulation environments, or automated evaluation systems
Ability to convert research insights into measurable evaluation criteria

Nice-to-Have Skills

Graduate degree in CS, ML, or AI
Cloud experience (AWS, GCP) and model deployment experience
Experience with LLM evaluation, CV robustness, or RL validation
Knowledge of interpretability, adversarial robustness, or AI safety frameworks
Familiarity with ML evaluation frameworks and agentic model design
Experience in regulated, classified, or mission-critical ML domains

Tools & Platforms

Python, TensorFlow, PyTorch, AWS, Google Cloud Platform, Snowflake, BigQuery

Required Skills

Python, TensorFlow or PyTorch, NLP/Computer Vision/RL, LLM pipelines, simulation environments, automated evaluation systems, data pipelines

Hard Skills

Python, TensorFlow, PyTorch, NLP, Computer Vision, Reinforcement Learning, automated evaluation systems, LLM evaluation, stress testing, red-teaming, data pipelines, benchmark design

Soft Skills

Communication, stakeholder management, team collaboration

Industry & Role

Industry Government/Public Sector

Job Function Design and scale automated evaluation pipelines for ML models in public sector deployments

Deal Breakers

Active security clearance or the ability to obtain one; onsite work in the listed cities; production experience in government or regulated domains


Senior Machine Learning Engineer - Model Evaluations, Public Sector
