Platforms Infrastructure Software Engineer
TS/SCI required; polygraph is a plus
Hybrid with 2-3 days of flexibility per week
Some examples of initiatives you might be a part of include:
- Re-engineering the deployment strategy to support air-gapped systems.
- Designing solutions for stress testing and benchmarking candidate versions ahead of customer release.
- Creating internal tools to track code quality metrics, static analysis results, and potential vulnerabilities.
Key Responsibilities:
- End-to-End Platform Design: Lead the design and development of highly reliable and scalable hosting platforms across both public and private cloud environments.
- Kubernetes Environments: Deploy, manage, and scale Kubernetes clusters to ensure seamless orchestration of containerized applications and services.
- Infrastructure & Performance: Ensure our infrastructure delivers the high availability and performance our customers depend on by identifying system bottlenecks and implementing optimizations.
- Monitoring & Alerting: Implement comprehensive monitoring and alerting solutions to manage system health and ensure smooth operation at scale.
- Site Reliability Engineering (SRE) Practices: Participate in and promote a culture of SRE best practices, including defining and refining service-level objectives (SLOs).
- Stability & Scalability: Work closely with cross-functional teams to drive optimization efforts and ensure system reliability across the full stack as we scale.
- Incident Response: Lead troubleshooting efforts, conduct root cause analyses, and develop preventive measures to enhance system reliability.
- Development Influence: Use performance data and SLOs to influence development roadmaps and ensure alignment with long-term goals.
- Collaboration: Collaborate across engineering teams to ensure infrastructure supports both feature development and scaling needs effectively.
Qualifications:
- Experience: 5+ years of experience in software engineering, with a strong focus on system stability, performance optimization, and infrastructure management.
- Technical Expertise: Proficiency in C++ or Go is preferred, along with familiarity with cloud platforms such as GCP or AWS. Familiarity with Kubernetes is a plus.
- Hands-On Skills: Hands-on experience with performance analysis, large-scale systems data analysis, visualization tools, or debugging.
- Monitoring Tools: Experience with monitoring and alerting tools such as Prometheus, Grafana, or similar.
- Adaptability: Proven ability to thrive in a fast-paced, high-growth environment and to work comfortably with evolving requirements.
- Problem-Solving: Strong analytical and problem-solving skills, with a track record of effectively diagnosing and resolving complex issues.
- Communication: Excellent verbal and written communication skills, with the ability to convey technical concepts to diverse audiences.