About this role
This role involves leading the development of distributed data collection and observability systems for AI and HPC clusters, focusing on scalable infrastructure, data visualization, and system performance optimization.
Key Responsibilities
- Architect data collection and observability systems
- Lead development of data pipelines
- Collaborate with research and engineering teams
- Optimize infrastructure performance
- Implement visualization and retrieval services
Technical Overview
The technical environment includes Apache Spark, Elasticsearch, Grafana, Prometheus, and relational and non-relational databases, with programming in Python, JavaScript, and Java. The focus is on large-scale distributed data systems and infrastructure.
Ideal Candidate
The ideal candidate is a senior data platform architect with over 15 years of experience in designing distributed observability and data collection systems. They have expertise in large-scale infrastructure, open-source monitoring tools, and collaborating with research and engineering teams.
Must-Have Skills
- Experience designing large-scale distributed observability systems
- Ability to collaborate with data scientists and engineers
- Experience with observability platforms such as Apache Spark, Elasticsearch, Grafana, and Prometheus
- Programming experience in Python, JavaScript, and Java
- Understanding of databases (relational and non-relational)
- Experience in infrastructure software and large-scale distributed computing
Nice-to-Have Skills
- Experience with AI research teams
- Experience managing datacenters
- Knowledge of open-source observability tools
- Experience in infrastructure software development
Tools & Platforms
Apache Spark, Elasticsearch, OpenSearch, Grafana, Prometheus, Relational Databases, Non-relational Databases
Required Skills
Distributed Data Platform, Observability Systems, Data Collection, Data Aggregation, Data Enrichment, Data Storage, Data Retrieval, Visualization, Apache Spark, Elasticsearch, OpenSearch, Grafana, Prometheus, Relational Databases, Non-relational Databases, Python, JavaScript, Java, Data Pipelines, Infrastructure Technologies, DevOps
Soft Skills
Collaboration, Strategic Planning, Interpersonal Skills, Problem Solving, Adaptability, Leadership
Keywords for Your Resume
Principal Data Platform Architect, Distributed Data Platform, Observability Systems, Data Collection, Data Aggregation, Data Enrichment, Data Storage, Data Retrieval, Visualization, Apache Spark, Elasticsearch, OpenSearch, Grafana, Prometheus, Relational Databases, Non-relational Databases, Python, JavaScript, Java, Data Pipelines, Infrastructure Technologies, DevOps
Deal Breakers
- Less than 15 years of experience
- Lack of experience with distributed observability systems
- No experience with Apache Spark or open-source monitoring tools
- Inability to collaborate across teams