Meta AR/VR Job | Software Engineer - Compute Infrastructure
Job: Software Engineer - Compute Infrastructure
Type: Engineering
City: Pittsburgh, PA
Date: 2022-02-22
Summary
Meta Reality Labs Research is looking for a Software Engineer Technical Lead to drive the requirements, software tooling, and adoption of an industry-leading, machine learning super cluster that will be used to process data for avatars in the Metaverse. The ideal candidate will be an expert in developing workflows on large compute clusters, as well as building tools and libraries to ensure researchers are highly productive at developing their own workflows. The role will require a high level of cross-functional collaboration with researchers, data center operations teams, data privacy and security teams, and other software engineering teams.
Qualifications
Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta.
5+ years of experience developing workflows for large scale AI training
Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering or related field
Communication skills, including experience driving decision-making
Experience working with cross-functional teams, including hardware, software, network, legal, privacy, and security
Python experience
Linux/shell scripting development experience
Experience developing and supporting reliable multi-stage data pipelines
Experience with containers (Docker or similar)
Quantitative reasoning skills, including experience analyzing trade-offs between different hardware and software solutions
Description
Define requirements of cluster hardware (storage system speed and size, number and type of CPUs and GPUs required, etc.)
Automation of data ingress into the cluster
Responsible for compute allocation policies and containerization technologies for the cluster
Development of data access layer using proprietary framework
Define and communicate cluster software requirements, based on research needs
Enabling adoption of the cluster for additional research use cases
Definition, design and implementation of automated testing
Serve as point of contact for hardware and software questions regarding cluster capabilities
Work with privacy and legal teams to develop data handling policies
Report on progress, presenting technical risks, challenges, and status to executive management
Additional Requirements
8+ years of experience developing workflows for large scale AI training
Understanding of deep neural network training
Experience with securing sensitive data (encryption, access control, audit logging)
Experience with HPC (High Performance Computing)
Experience with scheduling systems such as Slurm or Kubernetes
Experience with large scale object storage services (S3 or similar)
Experience in research or converting research to products
Experience with git
Experience with Conda
Experience with SQL databases
Experience with C++