Pinterest is looking for an experienced site reliability engineer to build and run our large-scale distributed systems. As an SRE on the Data & Storage team, you will design, build and monitor our applications and infrastructure that handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale.
- Build and operate across a large-scale data and storage technology stack
- Own large-scale distributed systems handling petabytes of data and improve service quality
- Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
- Perform deep dives into reliability issues, partnering with software and systems engineers across the organization to produce and roll out fixes
- Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g. MySQL, Hadoop, Envoy, HAProxy, Nginx)
- Experience with technologies such as ElasticSearch, ZooKeeper, HBase, Hadoop, Memcache and Kafka with a focus on reliability, automation, operability and performance
- Experience coding in one or more programming languages (Python, Go, Java, Ruby, etc.)
- Ability to debug and optimize code and to automate routine tasks. Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- Bonus points for experience with infrastructure as code (e.g. Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker), cloud infrastructure (AWS), and distributed, service-oriented architecture