Do IT Now provides High Performance Computing and Artificial Intelligence services, offering consulting, installation, optimization, and support to companies of all sizes, from SMEs to large multinationals. Our clients include Formula One teams, aerospace firms, and institutions in pharma and life sciences. We collaborate with the most innovative leaders in the high-tech industry.

Join us and be part of a technically oriented, independent team at the forefront of HPC innovation and research. Enjoy the flexibility of 100% remote work, thrive in a multinational and multicultural environment, and benefit from our strong growth. Participate in team-building activities and work with cutting-edge technologies to make a real impact in high-performance computing.

Job Definition

Join our dynamic and innovative team as a Site Reliability Engineer! Be part of our cutting-edge projects where you'll collaborate seamlessly with cross-functional teams to ensure the reliability, performance, and scalability of our infrastructure and services, with a special focus on our High-Performance Computing (HPC) environments and AI-driven applications. You'll play a crucial role in designing, implementing, and maintaining robust systems that support our company's growth and technological advancement in the realms of HPC, Cloud, and AI.

As a passionate member of our team, you'll embody a continuous improvement mindset. Embrace the ever-evolving fields of DevOps, HPC, and AI infrastructure, seizing opportunities to optimize our systems and enhance the performance of our compute-intensive workloads. Be an integral part of our journey towards operational excellence, ensuring AI models and HPC clusters run efficiently and reliably. Let your enthusiasm for cutting-edge technology, reliability, and automation propel you to new heights in a collaborative and forward-thinking environment!

Skills and Experience

Essential Skills:

Advanced degree (Master's or PhD level) in Computer Science, Information Technology, or a related field
Minimum of 3 years’ experience in SRE or DevOps roles
Proficiency in at least one scripting language (Python, Bash)
Good knowledge of Linux systems and at least one cloud platform (AWS, GCP, or Azure)
Experience with containerization technologies (Docker, Kubernetes)
Expertise in monitoring and observability tools
Solid understanding of networking concepts and protocols
Excellent problem-solving and troubleshooting skills

Preferred qualifications:

Experience with Infrastructure as Code (Terraform, Ansible)
Knowledge of CI/CD pipelines and tools (Jenkins, GitLab CI, GitHub Actions)
Familiarity with database systems and their optimization
Experience with log management and analysis tools
Understanding of security best practices in cloud environments

Language skills:

Fluency in French and English for effective communication

Personal Attributes

Team player with a proactive attitude and strong communication skills
Ability to work independently (especially if remote) and manage multiple priorities
Adaptability and eagerness to learn new technologies (mandatory)

Why work with us?

Technology-driven company culture
100% remote work opportunity
Rapid company growth and career advancement possibilities
Continuous learning and development programs
At the forefront of SRE practices
Startup and multicultural environment
Regular team-building activities

Join us in our mission to build and maintain highly reliable, scalable, and efficient systems that power our business. If you're passionate about automation, problem-solving, and creating robust infrastructure, we want to hear from you!

Conditions: Permanent – Montpellier/Remote

Remuneration: 50-60k€. We offer a competitive salary commensurate with the candidate's qualifications and experience, adjusted to the cost of living where the candidate is based.

Occasional on-call duties and business travel

Site Reliability Engineer - Full Remote

Posted by Paola TUMBARELLO on 17 July 2024
