We are looking for talent
Middle Site Reliability Engineer
We’re looking for a talented Middle Site Reliability Engineer to be embedded within the product development team and manage the overall reliability and availability of its applications.

Open position:
Site Reliability Engineer
Role Description
About the project:
The team works for a leading provider of vehicle lifecycle solutions headquartered in Chicago. The customer enables the companies that build, insure, and replace vehicles to power the next generation of transportation. Its platform delivers advanced mobile, artificial intelligence, and car technologies. It connects a network of 350+ insurance companies, 24,000+ repair facilities, hundreds of parts suppliers, and dozens of third-party data and service providers. The customer’s collective solutions enhance productivity and help clients deliver better experiences for end consumers.
Requirements:
- Understanding of troubleshooting Java web applications (issue resolution, escalations)
- Proficiency in the full software delivery lifecycle
- Experience implementing AWS CloudWatch is preferred (experience with similar solutions from other cloud providers will also be considered)
- Knowledge of Kubernetes
- Background in using application monitoring tools (for example, Grafana, Prometheus, AppDynamics/Dynatrace/Datadog, or similar)
- Skills in managing deployment pipelines using tools such as Jenkins and/or GitHub
- Strong, demonstrable skills in implementing observability for large-scale enterprise web applications and microservice frameworks
- Capability to analyze and troubleshoot complicated, cross-platform issues spanning OS, networking, database (SQL), and applications in cloud-based environments
- Proven ability to dig through metrics, logs, and other available sources to triage and resolve an incident
- Capacity to document solutions, SRE architectural patterns, and best practices to ensure that teams have guidance as needed
You will:
- Monitor applications and infrastructure, and take steps to improve overall system software performance, availability, and reliability by incorporating changes through defined feedback loops within the software delivery lifecycle
- Configure and maintain the monitoring tooling as it relates to the target application
- Document tribal knowledge as you acquire it over time by creating runbooks/playbooks and ensuring critical system information is readily available to those who need it through dashboards
- Resolve NOC escalations and help prevent the recurrence of incidents by creating processes and automation
- Apply automation to any tasks/parts of the system that are performed manually
- Work closely with software developers and testers to ensure the product meets non-functional requirements such as security, performance, and availability
- Be a key part of our response to high-severity internal customer incidents, ensuring we meet all SLAs and SLOs
- Embrace failures and treat incidents as learning opportunities by conducting blameless postmortems
- Participate in product engineering stand-ups and related design activities
- Coach other team members to ensure systems are supported by following SRE best practices

Position type: Full time
Location: Budapest / Remote
Level: Mid