We’re looking for a Technical Lead on the Site Reliability team to design, implement, and deliver software and infrastructure solutions to improve the scalability, availability, and efficiency of Pinterest’s services. The SRE team operates the most fundamental layers of Pinterest’s global infrastructure, which handles billions of requests per month.
- Influence and create new designs, architectures, standards and methods for large-scale distributed systems with a focus on operability
- Collaborate with developers in the deployment and scaling of new product features
- Perform deep dives into reliability issues, partnering with software and systems engineers across the organization to produce and roll out fixes
- Lead and mentor multiple team members in improving efficiency, performance and availability of Pinterest's services (previous management experience a plus)
- Proficiency in scripting, Python preferred. Systems languages (Go, C) are a plus
- Strong knowledge of Linux/Unix/BSD internals and shell scripting; production experience with JVM, Python, and Golang runtimes is a plus
- Deep knowledge of a configuration management tool (e.g. Puppet, Chef, Ansible, Salt, CFEngine). Experience with containers is a plus
- Experience operating in a modern cloud environment (such as AWS, GCP, or Azure) or in large-scale data centers
- Familiarity with distributed systems including service discovery, pub/sub, search indexing, storage, and caching. We use ZooKeeper for service discovery, Kafka for pub/sub, Elasticsearch for search indexing, MySQL and HBase for storage, and Memcached for caching