Date posted 24/08/2023
- Area: Reading
- Work location: Dual Location - Home & Reading Office
- Contract type: Full-time
- Shift pattern: Standard Working Week
Site Reliability Engineer - 12 Month FTC
Our people make us who we are. We’re a diverse and inclusive bunch, and it’s important you can feel you belong here. We value everybody for who they are and what they bring to the table, supporting one another as we continue to deliver for our customers.
- You will be accountable for ensuring that all services are reliable, scalable, and secure, demonstrating exceptional expertise in services engineering. We are looking for an engineer with experience in developing processes, tools, and automation for distributed systems in production environments. You will balance your time across automating operations for our growing footprint of deployments, building self-service products to empower internal customers, and increasing the reliability and scalability of our services through application-level and system-level improvements.
- Accountability for the reliability of our production assets: identify, manage and resolve dependencies and risks.
- Decide on priorities to optimize the evolution steps needed, taking into consideration demand from developers as well as from operations engineers.
- Ownership of service success throughout the lifecycle by ensuring that what needs to be monitored is monitored, and by acting upon performance indicators that reflect experience and stability.
- Practice sustainable incident response and blameless post-mortems on services, with a focus on improving the availability, scalability, latency, security and efficiency of our internal or customer-facing services.
- Engage in and improve the whole lifecycle of a platform from inception and design, through deployment, operation and refinement.
- Aid partners in maintaining systems once they are live by measuring and monitoring availability, latency and overall system health through preventative maintenance.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Demonstrated history in automating operations processes via services and tools.
- Knowledge of continuous integration, testing methodologies, and agile development methodologies applied to operations.
- Demonstrated history in troubleshooting and resolving IT incidents, or a willingness to learn.
- Knowledge of common IT infrastructure elements (servers, IP networks and the TCP/IP stack, load balancers), network services like DNS, and common system architecture patterns for resiliency.
Desirable:
- Advanced knowledge of monitoring tools such as Splunk and Dynatrace
- Knowledge of how to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure IT application and server performance
- Knowledge of the ITIL framework processes applied to IT operations (Incident Management, Change Management) and BMC ITSM
- Understanding of basic Internet/telco services