Conferences worth following for scheduling research:

  • Systems software: OSDI (CCF-A), SOSP (CCF-A), ATC (CCF-A), EuroSys (CCF-B), ICDCS (CCF-B), etc.
  • Architecture: ASPLOS (CCF-A), ISCA (CCF-A), MICRO (CCF-A), etc.
  • Networking: SIGCOMM (CCF-A), INFOCOM (CCF-A), NSDI (CCF-B), etc.

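For collecting and organizing papers, see the article by senior labmate Zhang Ying (张营): 《如何收集和整理论文》 (How to Collect and Organize Papers)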

Resource Management & Scheduling Systems

  • 《Large-scale cluster management at Google with Borg》 (EuroSys’15) pdf, translation, YouTube
  • 《Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center》 (NSDI’11) usenix, mesos
  • 《Apache Hadoop YARN: yet another resource negotiator》 (SoCC’13) acm, YARN
  • 《Apollo: scalable and coordinated scheduling for cloud-scale computing》 (OSDI’14) usenix
  • 《Omega: flexible, scalable schedulers for large compute clusters》 (EuroSys’13) acm
  • 《Firmament: Fast, Centralized Cluster Scheduling at Scale》 (OSDI’16) usenix, firmament
  • 《Sparrow: distributed, low latency scheduling》 (SOSP’13) acm
  • 《Mercury: Hybrid centralized and distributed scheduling in large shared clusters》 (ATC’15) usenix
  • 《Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale》 (VLDB’14) pdf
  • 《Advancements in YARN Resource Manager》 (2018) pdf, translation
  • 《Multi-resource Packing for Cluster Schedulers (Tetris)》 (SIGCOMM’14) pdf, YARN-2745, pptx

Improving Resource Efficiency & Scheduling Policy

  • 《Altruistic Scheduling in Multi-Resource Clusters》 (OSDI’16) usenix
  • 《Pado: A data processing engine for harnessing transient resources in datacenters》 (EuroSys’17) acm
  • 《GRAPHENE: Packing and Dependency-aware Scheduling for Data-Parallel Clusters》 (OSDI’16) usenix
  • 《3sigma: distribution-based cluster scheduling for runtime uncertainty》 (EuroSys’18) acm
  • 《Efficient Queue Management for Cluster Scheduling》 (EuroSys’16) acm
  • 《Dominant Resource Fairness: Fair Allocation of Multiple Resource Types》 (NSDI’11) usenix
  • 《Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks》 (SOSP’17) acm
  • 《ROSE: Cluster Resource Scheduling via Speculative Over-subscription》 (ICDCS’18) eprints

Colocating Heterogeneous Workloads

  • 《Heracles: improving resource efficiency at scale》 (ISCA’15) acm
  • 《Don’t cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling》 (ATC’17) usenix, DSS
  • 《Bolt: I Know What You Did Last Summer… In the Cloud》 (ASPLOS’17) acm
  • 《Quasar: resource-efficient and QoS-aware cluster management》 (ASPLOS’14) acm
  • 《Interference management for distributed parallel applications in consolidated clusters》 (ASPLOS’16) acm
  • 《Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems》 (ASPLOS’16) acm
  • 《CPI2: CPU performance isolation for shared compute clusters》 (EuroSys’13) Google Pub

Scheduling for Deep Learning Clusters

  • 《Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads》 (CoRR’19) arXiv
  • 《Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools》 (CoRR’19) arXiv
  • 《Tiresias: A GPU Cluster Manager for Distributed Deep Learning》 (NSDI’19) usenix, slides
  • 《Optimus: an efficient dynamic resource scheduler for deep learning clusters》 (EuroSys’18) acm, Optimus Src
  • 《Litz: An Elastic Framework for High-Performance Distributed Machine Learning》 (CMU-PDL-17-103, ATC’18) CMU PDL TR List
  • 《Gandiva: Introspective Cluster Scheduling for Deep Learning》 (OSDI’18) usenix
  • 《Topology-aware GPU scheduling for learning workloads in cloud environments》 (SC’17) acm
  • 《Project Adam: Building an Efficient and Scalable Deep Learning Training System》 (OSDI’14) usenix
  • 《Scaling Distributed Machine Learning with the Parameter Server》 (OSDI’14) usenix
  • A List

Miscellaneous

  • 《The Tail at Scale》 (CACM’13) Google Pub
  • 《Cluster Scheduling for Datacenters》 (CACM’18) ACM
  • 《Resource Management with Deep Reinforcement Learning》 (HotNets’16) pdf
  • 《Reconciling high server utilization and sub-millisecond quality-of-service》 (EuroSys’14) pdf
  • 《Heterogeneity and dynamicity of clouds at scale - Google trace analysis》 (SoCC’12) pdf
  • 《Job Scheduling without Prior Information in Big Data Processing Systems》 (ICDCS’17)
  • 《The evolution of cluster scheduler architectures》 blog, translation
  • 《阿里巴巴云化架构创新之路》 (Alibaba's Road to Cloud-Based Architecture Innovation) (ArchSummit’17) InfoQ CN
  • 《阿里巴巴调度与集群管理系统Sigma》 (Sigma: Alibaba's Scheduling and Cluster Management System) (ArchSummit’17) InfoQ CN

Research Groups

  • Cluster Resource Management - Microsoft Research MSR
  • Cloud Efficiency - Microsoft Research MSR
  • Distributed Systems and Parallel Computing from Google Pubs Google Pubs
  • LABOS from EPFL EPFL
  • Parallel Data Lab Project - Cloud Scheduling (TetriSched) from CMU CMU
  • Christos Kozyrakis (Stanford), Christina Delimitrou (Cornell)
  • Malte Schwarzkopf (MIT)
  • Shivaram Venkataraman (UC Berkeley), Kay Ousterhout (UC Berkeley)
  • Robert Grandl (Wisc)