Senior Machine Learning Engineer - Model Evaluations, Public Sector

at Scale AI

📍 San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC
Onsite
Posted April 02, 2026
Type Full-Time
Experience Senior
Exp. Years Not specified
Education Not specified
Category AI & Machine Learning

Scale AI’s Public Sector ML team deploys advanced AI systems into mission-critical government environments and builds automated evaluation frameworks for safety and governance.

  • Develop and maintain automated evaluation pipelines
  • Design test datasets and benchmarks
  • Build evaluation frameworks for LLM agents
  • Conduct stress tests and red-teaming
  • Collaborate to produce evaluation datasets

The role focuses on automated evaluation pipelines for ML models, LLM agent evaluation, stress testing, red-teaming, and regulatory compliance, using Python-based ML tooling and cloud deployments.

The ideal candidate is a senior ML engineer with production ML experience in government contexts, strong Python skills, and familiarity with ML evaluation, CV robustness, and AI safety frameworks.

  • Experience in computer vision, deep learning, reinforcement learning, or NLP in production settings
  • Strong programming skills in Python; experience with TensorFlow or PyTorch
  • Background in algorithms, data structures, and object-oriented programming
  • Experience with LLM pipelines, simulation environments, or automated evaluation systems
  • Ability to convert research insights into measurable evaluation criteria

  • Graduate degree in CS, ML, or AI
  • Cloud experience (AWS, GCP) and model deployment experience
  • Experience with LLM evaluation, CV robustness, or RL validation
  • Knowledge of interpretability, adversarial robustness, or AI safety frameworks
  • Familiarity with ML evaluation frameworks and agentic model design
  • Experience in regulated, classified, or mission-critical ML domains

Python, TensorFlow, PyTorch, AWS, Google Cloud Platform, Snowflake, BigQuery
Python, TensorFlow or PyTorch, NLP/Computer Vision/RL, LLM pipelines, simulation environments, automated evaluation systems, data pipelines
Python, TensorFlow, PyTorch, NLP, Computer Vision, Reinforcement Learning, automated evaluation systems, LLM evaluation, stress testing, red-teaming, data pipelines, benchmark design
Communication, stakeholder management, team collaboration
Industry Government/Public Sector
Job Function Design and scale automated evaluation pipelines for ML models in public sector deployments

Active security clearance or ability to obtain one; onsite in listed cities; production experience in government or regulated domains
