About this role
The challenge
We’re at a pivotal stage in the evolution of our cloud platform. To continue scaling efficiently and strengthening reliability, we are expanding our Operations & SRE capabilities. Our infrastructure supports mission-critical services for our customers, and ensuring performance, stability, and continuous improvement is at the core of our vision.
As a Site Reliability Engineer / Systems Administrator, your mission will be to monitor and optimize our cloud systems, automate processes, ensure effective incident management, and help us maintain a robust, scalable and secure infrastructure. You will play a key role in minimizing downtime, improving operational efficiency, and supporting sustainable growth.
You’ll be part of a highly collaborative engineering environment, working closely with DevOps, Product and Development teams to build reliable services from the ground up, enforce good operational practices and contribute to ongoing enhancements that impact thousands of users.
Collaboration will be essential. You will support critical infrastructure decisions, lead incident response, proactively detect risks and ensure that both technology and teams can continue to scale confidently.
Requirements that are important to us
- Experience in administration of large-scale cloud or MSP infrastructures.
- Expert in Linux systems (a must).
- Design and maintenance of resilient backup and disaster recovery strategies.
- Close collaboration with Dev and Ops teams to ensure reliability from design to production.
- Experience working with critical environments requiring fast and effective incident resolution.
- Infrastructure as Code (Terraform, Ansible) to improve automation and repeatability.
- Solid networking expertise: TCP/IP, DNS, load balancing, firewalling, BGP, network virtualization.
- Experience with network storage solutions (Ceph, NFS or similar).
- Familiarity with DevOps technologies, CI/CD and agile methodologies.
- Knowledge of IaaS orchestration such as CloudStack or OpenStack.
- Database skills: MySQL, MariaDB or PostgreSQL.
- Experience with monitoring and tuning tools (Zabbix, Nagios, Prometheus, Grafana, Datadog…).
- Understanding of ITIL processes for managing incidents, problems, and changes.
Key skills and expected impact
- Strong documentation practices and contribution to operational knowledge.
- Monitoring and optimization of performance, identifying bottlenecks and preventing service interruptions.
- Ability to lead root cause analysis and prevent recurring incidents.
- Implementation of centralized log management and analysis.
- Excellent communication in Spanish and intermediate English.
Nice to have
- Hands-on experience with CI/CD pipelines.
- Container orchestration (Docker, Kubernetes…).
- Performance optimization in distributed applications.
- Experience with web servers and virtualized platforms.
- Advanced security knowledge and system hardening.
- Analytical mindset focused on operational excellence.
- Experience with ticketing systems (workflow creation, prioritization, follow-up).
Tools
- Automation & IaC: Terraform, Ansible
- Monitoring & performance: Prometheus, Grafana, Nagios, Zabbix, Datadog
- Logging: Centralized log management systems
- Databases: MySQL, MariaDB, PostgreSQL
- Orchestration: CloudStack, OpenStack, containers
- Collaboration & knowledge base: Jira, Confluence, Microsoft 365, Slack
- Ticketing & ITSM: ITIL-based tools