Senior Cloud Operations Engineer

at NVIDIA

📍 2 Locations Unknown Posted April 16, 2026
Type Full-Time
Experience Senior
Exp. Years 8+ years
Education BS/MS in Computer Science (or equivalent experience)
Category Cloud & Infrastructure

NVIDIA is seeking a Senior Cloud Operations Engineer for the NGC Cloud team to improve efficiency, reliability, and scalability of systems powering global operations. You will automate CI/CD workflows with GitLab, monitor system health with leading observability tools, and support secure and compliant operational processes.

  • Automate build, test, and deployment using GitLab CI/CD pipelines
  • Monitor system health with dashboards, alerts, and operational reporting
  • Perform secure user offboarding, access reviews, and compliance tasks
  • Maintain API performance and integration stability to meet SLAs and SLOs
  • Coordinate releases and vulnerability remediation while maintaining documentation and SOPs

The role centers on operating cloud services and automation pipelines with GitLab CI/CD, with monitoring through Prometheus, Grafana, Datadog, CloudWatch, and Splunk. You will also work across security, reliability, and ITSM processes while maintaining and debugging integrations backed by Java and data stores including Cassandra, DynamoDB, and Redis.

The ideal candidate is a senior cloud operations engineer with 8+ years of hands-on experience supporting complex services and strong automation skills in Python. They have practical monitoring experience using Prometheus, Grafana, Datadog, CloudWatch, and Splunk, and understand ITSM processes for incident, problem, and modification management. They can operate securely and compliantly, maintain GitLab CI/CD pipelines, and work confidently with Java and with RDBMS and NoSQL systems such as Cassandra, DynamoDB, and Redis.

  • 8+ years of hands-on experience building/supporting complex services
  • BS/MS in Computer Science (or equivalent experience)
  • Knowledge of Python for automation, data handling, and tool development
  • Experience with monitoring tools (such as Prometheus, Grafana, Datadog, CloudWatch, Splunk) and reporting
  • Familiarity with ITSM practices, including incident, problem, and modification processes
  • Ability to perform secure and compliant offboarding and access-related tasks
  • Knowledge of core Java: Collections API, Streams API, Concurrency, I/O
  • Knowledge of RDBMS and NoSQL (Cassandra, DynamoDB, Redis) databases

  • Experience designing or implementing automation pipelines or internal operational tools
  • Background in customer support, technical support, or customer-facing engineering roles
  • Prior work in a security-conscious or compliance-heavy environment
  • Ability to build end-to-end monitoring solutions, dashboards, and automated reporting
  • Strong documentation habits and a continuous-improvement approach
GitLab, GitLab CI/CD pipelines, Prometheus, Grafana, Datadog, CloudWatch, Splunk, ITSM, SOPs, Cassandra, DynamoDB, Redis
cloud operations, GitLab CI/CD, Python, Prometheus, Grafana, Datadog, CloudWatch, Splunk, ITSM, incident management, user offboarding, access reviews, API performance, SLAs, SLOs, security guidelines, vulnerability remediation, Java, Collections API, Streams API, Concurrency, RDBMS, Cassandra, DynamoDB, Redis
cloud operations, NGC Cloud team, automation, GitLab CI/CD pipelines, CI/CD, build, test, deployment, monitoring system health, dashboards, alerts, operational reports, user offboarding, access reviews, compliance-related tasks, API performance, integration stability, SLAs, SLOs, release coordination, security guidelines, vulnerability remediation, audits and assessments, documentation, SOPs, incident, problem, modification processes, ITSM, secure and compliant offboarding, access-related tasks, Python, automation, data handling, tool development, monitoring tools, Prometheus, Grafana, Datadog, CloudWatch, Splunk, core Java - Collections API, Streams API, Concurrency, I/O, RDBMS, NoSQL, Cassandra, DynamoDB, Redis, Java, excellent documentation, cross-team alignment, secure/compliant access management
communication skills, collaborate across multiple teams, documentation habits, problem-solving, cross-team alignment, process improvement mindset
Industry SaaS
Job Function Operate and automate NVIDIA cloud services with strong monitoring, security, and ITSM-aligned operational rigor
Role Subtype DevOps Engineer
Tech Domains Python, Java, DevOps & SRE, Cybersecurity, Kubernetes

8+ years of hands-on experience building/supporting complex services; Python automation knowledge; experience with monitoring tools (Prometheus, Grafana, Datadog, CloudWatch, Splunk); familiarity with ITSM practices, including incident, problem, and modification processes; knowledge of core Java (Collections API, Streams API, Concurrency, I/O); knowledge of RDBMS and NoSQL (Cassandra, DynamoDB, Redis)
