- Build and Manage the Kubernetes Platform on AWS
  - Develop, maintain, and scale the AWS and Kubernetes-based infrastructure that supports all backend applications at PayJoy.
  - Design, implement, and optimize cloud and container-based solutions to ensure high availability, resilience, and cost-effectiveness.
  - Conduct code reviews, manage infrastructure as code (IaC), and implement CI/CD pipelines to promote best practices for code quality, reliability, and security.
- Develop and Enable Code Quality Standards
  - Design and implement platform features and enhancements to meet application and developer needs, prioritizing code scalability, cost optimization, and automated testing.
  - Write and review code to ensure it meets high standards of quality, robustness, and scalability.
  - Act as a technical mentor, guiding team members on writing efficient, maintainable code through code reviews and pair programming as necessary.
- Lead CI/CD and DevOps Practices
  - Design and maintain CI/CD pipelines, including Docker image creation, Kubernetes deployment artifacts, and environment provisioning.
  - Establish and maintain logging, monitoring, and alerting systems to streamline application development and deployment processes.
  - Collaborate with teams to reduce downtime and improve deployment speed and reliability.
- Cross-Team Collaboration and Application Onboarding
  - Partner with product and engineering teams to understand new backend applications and identify their onboarding requirements for our platform.
  - Work closely with cross-functional teams to plan, architect, and implement solutions that fit within the broader platform strategy.
  - Evaluate the side effects of onboarding new applications, address any compatibility issues, and create seamless pathways for new app integration.
- On-Call Rotation and Incident Management
  - Participate in the on-call rotation, providing technical leadership in incident response and resolution.
  - Conduct thorough post-mortem analysis of incidents, document findings, and implement process improvements to prevent future issues.
  - Work alongside SREs and other engineers to troubleshoot issues, triage incidents, and perform root-cause analysis to continuously improve platform reliability.
- Optimize Developer Productivity and Resource Utilization
  - Develop tools, templates, and documentation that make it easier for developers to work on the platform.
  - Monitor and analyze platform performance to identify cost-saving opportunities and ensure efficient resource usage.
  - Drive automation initiatives to reduce manual intervention, streamline operations, and free up developer time for core product work.