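Conferences worth following for scheduling research:
- Systems software: OSDI(CCF-A), SOSP(CCF-A), ATC(CCF-A), EuroSys(CCF-B), ICDCS(CCF-B), etc.
- Computer architecture: ASPLOS(CCF-A), ISCA(CCF-A), MICRO(CCF-A), etc.
- Networking and communications: SIGCOMM(CCF-A), INFOCOM(CCF-A), NSDI(CCF-B), etc.
For collecting and organizing papers, see the article 《如何收集和整理论文》 ("How to Collect and Organize Papers") by Zhang Ying (张营), a senior labmate.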
Resource Management & Scheduling System
- 《Large-scale cluster management at Google with Borg》(EuroSys’15)pdf, Chinese translation, YouTube
- 《Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center》(NSDI’11)usenix,mesos
- 《Apache Hadoop YARN: yet another resource negotiator》(SoCC’13)acm,YARN
- 《Apollo: scalable and coordinated scheduling for cloud-scale computing》(OSDI’14)usenix
- 《Omega: flexible, scalable schedulers for large compute clusters》(EuroSys’13)acm
- 《Firmament: Fast, Centralized Cluster Scheduling at Scale》(OSDI’16)usenix,firmament
- 《Sparrow: distributed, low latency scheduling》(SOSP’13)acm
- 《Mercury: Hybrid centralized and distributed scheduling in large shared clusters》(ATC’15)usenix
- 《Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale》(VLDB’14)pdf
- 《Advancements in YARN Resource Manager》(2018)pdf, Chinese translation
- 《Multi-resource Packing for Cluster Schedulers(Tetris)》(SIGCOMM’14)pdf,YARN-2745,pptx
Improving Resource Efficiency & Scheduling Policy
- 《Altruistic Scheduling in Multi-Resource Clusters》(OSDI’16)usenix
- 《Pado: A data processing engine for harnessing transient resources in datacenters》(EuroSys’17)acm
- 《GRAPHENE: Packing and Dependency-aware Scheduling for Data-Parallel Clusters》(OSDI’16)usenix
- 《3sigma: distribution-based cluster scheduling for runtime uncertainty》(EuroSys’18)acm
- 《Efficient Queue Management for Cluster Scheduling》(EuroSys’16)acm
- 《Dominant Resource Fairness: Fair Allocation of Multiple Resource Types》(NSDI’11)usenix
- 《Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks》(SOSP’17)acm
- 《ROSE: Cluster Resource Scheduling via Speculative Over-subscription》(ICDCS’18)eprints
Colocating Heterogeneous Workloads
- 《Heracles: improving resource efficiency at scale》(ISCA’15)acm
- 《Don’t cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling》(ATC’17)usenix, DSS
- 《Bolt: I Know What You Did Last Summer… In the Cloud》(ASPLOS’17)acm
- 《Quasar: resource-efficient and QoS-aware cluster management》(ASPLOS’14)acm
- 《Interference management for distributed parallel applications in consolidated clusters》(ASPLOS’16)acm
- 《Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems》(ASPLOS’16)acm
- 《CPI²: CPU performance isolation for shared compute clusters》(EuroSys’13)Google Pub
Scheduling for Deep Learning Clusters
- 《Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads》 (CoRR’19)arXiv
- 《Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools》 (CoRR’19)arXiv
- 《Tiresias: A GPU Cluster Manager for Distributed Deep Learning》(NSDI’19)usenix, slides
- 《Optimus: an efficient dynamic resource scheduler for deep learning clusters》(EuroSys’18)acm, Optimus Src
- 《Litz: An Elastic Framework for High-Performance Distributed Machine Learning》(CMU-PDL-17-103,ATC’18)CMU PDL TR List
- 《Gandiva: Introspective Cluster Scheduling for Deep Learning》(OSDI’18)usenix
- 《Topology-aware GPU scheduling for learning workloads in cloud environments》(SC’17)acm
- 《Project Adam: Building an Efficient and Scalable Deep Learning Training System》(OSDI’14)usenix
- 《Scaling Distributed Machine Learning with the Parameter Server》(OSDI’14)usenix
- A List
Miscellaneous
- 《The Tail at Scale》(CACM’13)Google Pub
- 《Cluster Scheduling for Datacenters》(CACM’18)ACM
- 《Resource Management with Deep Reinforcement Learning》(HotNets’16)pdf
- 《Reconciling high server utilization and sub-millisecond quality-of-service》(EuroSys’14)pdf
- 《Heterogeneity and dynamicity of clouds at scale - Google trace analysis》(SoCC’12)pdf
- 《Job Scheduling without Prior Information in Big Data Processing Systems》(ICDCS’17)
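- 《The evolution of cluster scheduler architectures》blog, Chinese translation
- 《阿里巴巴云化架构创新之路》 ("Alibaba's Path to Cloud Architecture Innovation")(ArchSummit’17)InfoQ CN
- 《阿里巴巴调度与集群管理系统Sigma》 ("Sigma: Alibaba's Scheduling and Cluster Management System")(ArchSummit’17)InfoQ CN
Research Groups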
- Cluster Resource Management - Microsoft Research MSR
- Cloud Efficiency - Microsoft Research MSR
- Distributed Systems and Parallel Computing from Google Pubs Google Pubs
- LABOS from EPFL EPFL
- Parallel Data Lab Project - Cloud Scheduling (TetriSched) from CMU CMU
- Christos Kozyrakis Stanford, Christina Delimitrou Cornell
- Malte Schwarzkopf MIT
- Shivaram Venkataraman UC Berkeley, Kay Ousterhout UC Berkeley
- Robert Grandl Wisc