Senior DevOps Engineer (175363)

Shangri-La Hotels - Shanghai

Job details

Permanent

Full job description

Headquartered in Hong Kong, we have over 100 hotels and resorts under four brands, nestled in key cities and beautiful beachfront locations globally. We are expanding rapidly with a strong development pipeline throughout Asia, the Middle East, Europe and Africa.

Regarded as one of the world's finest hotel ownership and management companies, Shangri-La is dedicated to delighting guests around the world with legendary service, finely tuned from over 45 years of hospitality from the heart. We have an affinity with Asian travelers and we offer them a gateway to the rest of the world, positioning us as a leading brand in luxury hospitality.

As an enviable employer with industry-leading levels of colleague engagement, we make our people our priority. Our success is only made possible through the efforts and abilities of over 42,000 colleagues worldwide. In accordance with this belief, the focused investment we make in the learning and development of our colleagues is unparalleled in the global hospitality industry. From welcoming new colleagues to best-in-class leadership development, you can be sure that potential is identified and nurtured throughout your career.

Basic Qualifications

Bachelor's degree or above in a computer-related field, with at least 5 years of hands-on DevOps experience

Full-lifecycle delivery experience on large-scale internet or cloud-native projects, with the ability to solve complex engineering problems in practice

Basic English listening and speaking skills

Core Technical Requirements

Deep expertise in Kubernetes architecture and its ecosystem, with an in-depth understanding of container networking, storage scheduling and security isolation. Experience with multiple from-scratch containerization and cluster build-out projects, and the ability to independently own the containerization architecture design and delivery of large-scale projects

Proficient in the end-to-end CI/CD process and skilled with mainstream toolchains including Jenkins, GitLab CI and Argo CD; able to lead the architecture design and iterative optimization of R&D efficiency platforms

Solid AI engineering skills: familiar with large-model deployment architectures and the characteristics of AI workloads; able to build containerized runtime environments suited to AI R&D and to support resource scheduling and operations across the full cycle of model training, fine-tuning and inference; experienced in building integrated development, testing and operations processes for AI

Able to use large-model capabilities to optimize DevOps processes and to take part in rolling out AI-assisted R&D efficiency tools (such as AI-assisted code review, intelligent alert analysis and automated root cause analysis of failures); experienced in integrating and operating AI components (model gateways, agents, knowledge base services)

Solid infrastructure-as-code skills: proficient with tools such as Terraform, Ansible and Helm, and familiar with the architecture design and best practices of mainstream public clouds (Alibaba Cloud/AWS/Tencent Cloud)

Proficient in at least one programming language (Python/Go/Java preferred), with experience in troubleshooting, performance tuning and high-availability architecture design for large distributed systems

Familiar with observability practices: able to build comprehensive monitoring, alerting and distributed tracing systems on Prometheus, Grafana, ELK and similar technologies, and to tune monitoring and elastic resource scaling strategies to the characteristics of AI tasks

Soft Skill Requirements

A strong sense of ownership for service reliability: able to lead the response to major production incidents and drive the establishment of a stable, reliable production assurance system

A clear focus on user experience: able to deliver easy-to-use, efficient DevOps tools and processes for AI R&D and business R&D teams, and to continuously improve R&D delivery efficiency and usability

Excellent technology selection and solution design skills, able to balance technical sophistication against implementation cost in light of AI project and business requirements

Strong cross-team communication and technical influence, able to drive the team's technical growth and foster a culture of AI-enabled DevOps
