About this role
Global Payments seeks an AI Support Engineer for AI Operations to monitor and resolve production incidents across deployed AI solutions. You will triage issues, perform root cause analysis, improve observability and runbooks, and collaborate with AI engineering, platform, and governance teams.
Key Responsibilities
- Serve as first line of defense for production AI incidents
- Monitor system health and performance for AI applications and agentic/RAG solutions
- Investigate latency, model drift, hallucinations, and broken integrations
- Collaborate to implement observability, logging, and alerting for AI services
- Build runbooks, diagnostic tools, and automated workflows to improve incident response
Technical Overview
The role centers on AI incident management for LLM and GenAI systems, including monitoring for latency, model drift, hallucinations, and broken integrations. You will use AI observability tools such as Fiddler AI, Arize AI, and IBM WatsonX.governance across AWS and GCP environments, and help build automated diagnostic workflows and postmortems.
Ideal Candidate
The ideal candidate is a mid-level engineer with 4+ years of experience in production support, SRE, DevOps, or software engineering, preferably supporting GenAI and/or ML systems. They have strong cloud and observability knowledge across AWS (Amazon Web Services) and GCP (Google Cloud Platform), and can troubleshoot LLM/GenAI issues such as latency, model drift, hallucinations, and integration failures while coordinating with AI engineering teams.
Must-Have Skills
- 4+ years of experience in production support, software engineering, site reliability engineering (SRE), or DevOps
- Production support for GenAI and/or ML systems
- Cloud infrastructure (AWS, GCP)
- AI observability tools
- LLM and GenAI systems (OpenAI, Azure OpenAI, Bedrock, Vertex AI)
Nice-to-Have Skills
Familiarity with modern orchestration and agentic frameworks
Tools & Platforms
AWS (Amazon Web Services), GCP (Google Cloud Platform), Fiddler AI, Arize AI, IBM WatsonX.governance, OpenAI, Azure OpenAI, Bedrock (Amazon Bedrock), Vertex AI (Google Cloud Vertex AI)
Required Skills
production support, site reliability engineering (SRE), DevOps, incident triage, root cause analysis, monitoring, observability, logging, alerting, runbooks, automated workflows, knowledge bases, postmortems, AWS, GCP, OpenAI, Azure OpenAI, Bedrock, Vertex AI, Fiddler AI, Arize AI, IBM WatsonX.governance, LLM, GenAI, model drift, hallucination, prompt misbehavior, broken integrations
Hard Skills
production support, incident triage, root cause analysis, production AI incidents resolution, monitoring system health, performance monitoring, latency troubleshooting, failure investigation, model drift detection, hallucination troubleshooting, prompt misbehavior, broken integrations, observability, logging, alerting, LLM and GenAI systems, OpenAI, Azure OpenAI, Bedrock, Amazon Bedrock, Vertex AI, Google Cloud Vertex AI, orchestration platforms, agentic frameworks, runbooks, automated workflows, knowledge bases, postmortems, governance and compliance incident documentation, AWS, Amazon Web Services, GCP, Google Cloud Platform
Soft Skills
first line of defense, collaboration with AI engineering and platform teams, communication and escalation, continuous improvement mindset, documentation discipline, attention to reliability and stability
Keywords for Your Resume
AI Support Engineer, AI Operations, production incidents, incident triage, root cause analysis, monitor system health, performance, latency, model drift, hallucination, prompt misbehavior, broken integrations, observability, logging, alerting, runbooks, automated workflows, knowledge bases, postmortems, AWS, Amazon Web Services, GCP, Google Cloud Platform, Fiddler AI, Arize AI, IBM WatsonX.governance, OpenAI, Azure OpenAI, Bedrock, Amazon Bedrock, Vertex AI, Google Cloud Vertex AI, GenAI, LLM, GenAI and/or ML systems, site reliability engineering (SRE)
Deal Breakers
Must be legally authorized to work for any employer in the United States without future immigration sponsorship