Reading List | sigcre

Conferences to follow for scheduling research:

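For collecting and organizing papers, see the article "How to Collect and Organize Papers" by Zhang Ying (张营), a senior labmate.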

《Large-scale cluster management at Google with Borg》（Eurosys’15）pdf，translation，YouTube
《Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center》（NSDI’11）usenix，mesos
《Apache Hadoop YARN: yet another resource negotiator》（SoCC’13）acm，YARN
《Apollo: scalable and coordinated scheduling for cloud-scale computing》（OSDI’14）usenix
《Omega: flexible, scalable schedulers for large compute clusters》（Eurosys’13）acm
《Firmament: Fast, Centralized Cluster Scheduling at Scale》（OSDI’16）usenix，firmament
《Sparrow: distributed, low latency scheduling》（SOSP’13）acm
《Mercury: Hybrid centralized and distributed scheduling in large shared clusters》（ATC’15）usenix
《Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale》（VLDB’14）pdf
《Advancements in YARN Resource Manager》（2018）pdf，translation
《Multi-resource Packing for Cluster Schedulers（Tetris）》（SigComm’14）pdf，YARN-2745，pptx

《Altruistic Scheduling in Multi-Resource Clusters》（OSDI’16）usenix
《Pado: A data processing engine for harnessing transient resources in datacenters》（Eurosys’17）acm
《GRAPHENE: Packing and Dependency-aware Scheduling for Data-Parallel Clusters》（OSDI’16）usenix
《3sigma: distribution-based cluster scheduling for runtime uncertainty》（Eurosys’18）acm
《Efficient Queue Management for Cluster Scheduling》（Eurosys’16）acm
《Dominant Resource Fairness: Fair Allocation of Multiple Resource Types》（NSDI’11）usenix
《Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks》（SOSP’17）acm
《ROSE: Cluster Resource Scheduling via Speculative Over-subscription》（ICDCS’18）eprints

《Heracles: improving resource efficiency at scale》（ISCA’15）acm
《Don’t cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling》（ATC’17）usenix DSS
《Bolt: I Know What You Did Last Summer… In the Cloud》（ASPLOS’17）acm
《Quasar: resource-efficient and QoS-aware cluster management》（ASPLOS’14）acm
《Interference management for distributed parallel applications in consolidated clusters》（ASPLOS’16）acm
《Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems》（ASPLOS’16）acm
《CPI²: CPU performance isolation for shared compute clusters》（EuroSys’13）Google Pub

《Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads》（CoRR’19）arXiv
《Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools》（CoRR’19）arXiv
《Tiresias: A GPU Cluster Manager for Distributed Deep Learning》（NSDI’19）usenix slides
《Optimus: an efficient dynamic resource scheduler for deep learning clusters》（Eurosys’18）acm Optimus Src
《Litz: An Elastic Framework for High-Performance Distributed Machine Learning》（CMU-PDL-17-103，ATC’18）CMU PDL TR List
《Gandiva: Introspective Cluster Scheduling for Deep Learning》（OSDI’18）usenix
《Topology-aware GPU scheduling for learning workloads in cloud environments》（SC’17）acm
《Project Adam: Building an Efficient and Scalable Deep Learning Training System》（OSDI’14）usenix
《Scaling Distributed Machine Learning with the Parameter Server》（OSDI’14）usenix
A List

《The Tail at Scale》（CACM’13）Google Pub
《Cluster Scheduling for Datacenters》（CACM’18）ACM
《Resource Management with Deep Reinforcement Learning》（HotNets’16）pdf
《Reconciling high server utilization and sub-millisecond quality-of-service》（EuroSys’14）pdf
《Heterogeneity and dynamicity of clouds at scale - Google trace analysis》（SoCC’12）pdf
《Job Scheduling without Prior Information in Big Data Processing Systems》（ICDCS’17）
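《The evolution of cluster scheduler architectures》blog，translation
《阿里巴巴云化架构创新之路》 (Alibaba's Road to Innovation in Cloud-Based Architecture)（ArchSummit’17）Infoq CN
《阿里巴巴调度与集群管理系统Sigma》 (Sigma: Alibaba's Scheduling and Cluster Management System)（ArchSummit’17）Infoq CN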