Meta AR/VR Job | Codec Avatars Systems Engineer | Quest
Job: Codec Avatars Systems Engineer | Quest
Type: Artificial Intelligence
City: Pittsburgh, PA
Date posted: 2023-5-27
Summary
Reality Labs Research (RL-R) brings together a diverse and highly interdisciplinary team of researchers and engineers to create the future of augmented and virtual reality. On the Codec Avatars Infrastructure team, you’ll work on building tools, libraries, and frameworks that will help researchers collaborate with each other and empower their research towards the generation of Codec Avatars.
Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage a strong sense of ownership and embrace the ambiguity that comes with working on the frontiers of research.
In this hybrid systems engineer and software engineer role on the Codec Avatar Research Infrastructure team, you will foster our scientific explorations and generate viable paths to the consumer products that will connect people in meaningful ways for decades to come.
Qualifications
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
Experience working independently, handling large projects simultaneously, and prioritizing team roadmap and deliverables by balancing required effort with resulting impact
5+ years of experience in systems engineering
5+ years of experience automating the management of infrastructure and services
5+ years of experience coding in at least one of the following languages: Python, Ruby, PHP, Rust, or Go
Thorough understanding of Linux operating system internals
Experience managing HPC schedulers and workload orchestrators such as Slurm, Kubernetes, or LSF
Experience with Python package and environment management tools such as Conda or venv
Description
Build, scale, and secure the Linux environment within Meta's research lab HPC infrastructure, a heterogeneous environment containing diverse operating systems and applications
Work side by side with research scientists to enable the infrastructure for large-scale training jobs that explore AR/VR
Provide on-call support and lead incident root cause analysis through multiple infrastructure layers (compute, storage, network) for our research lab’s HPC clusters and act as a final escalation point
Apply modern engineering methodologies such as Infrastructure-as-Code, container orchestration, and software-defined storage to large-scale compute clusters
Collaborate in a diverse team environment across multiple scientific and engineering disciplines, making the architectural tradeoffs required to rapidly deliver software and infrastructure solutions
Find ways to leverage the scale and complexity of the larger Meta production infrastructure to solve problems for Reality Labs researchers
Provide guidance to other engineers on best practices to build mature services which are highly available, reliable, secure, and scalable
Help others around you move faster by identifying issues and driving them to resolution
Influence outcomes within your immediate team, peer engineering teams, and with cross-functional stakeholders
Additional Requirements
Prior experience in cluster on-call operations, including troubleshooting server/scheduler/storage errors, maintaining compute/storage environments/libraries/tools, helping onboard users to the cluster, and answering general questions from users.
Prior experience supporting configuration management in a multi-region environment
Prior experience building services
Prior experience building PaaS or internal clouds
Prior experience in cluster coordination and strategy planning, including collecting/understanding needs of users, developing tools to improve user experience, providing guidance on best practices, coordinating distribution of compute/storage resources, forecasting compute/storage needs, and developing long-term user experience/compute/storage strategies.
Prior experience with containerization or virtualization technologies such as Docker or virtual machines
Prior experience in developing/managing distributed network file systems
Prior experience optimizing multi-tenant HPC clusters for performance and maintenance
Prior academic or development experience with machine learning and/or deep learning
Prior experience with ML libraries such as PyTorch, TensorFlow, or cuDNN
Prior experience with computer vision libraries such as OpenCV
Prior experience in GPGPU development with CUDA, OpenCL or DirectCompute