Data centers consume about 1% of worldwide electricity, and recent studies show that their electricity consumption is still increasing, so there is an urgent need to reduce the energy consumption of data centers. Maintaining large amounts of data and providing powerful computing services require data centers to distribute their workloads effectively. Load balancing is a method of distributing network traffic across servers in data centers. Hardware load balancers are generally inefficient and designed for the worst case, whereas software load balancers scale with network traffic and have access to real-time network and diagnostic data. Existing thermal-aware algorithms did not address cooling-energy reduction at the application level. Our research seeks to reduce cooling costs in data centers by designing temperature-aware workload-balancing algorithms. We used GpuCloudSim Plus to simulate a data center distributing GPU-intensive applications under different workloads and utilizations, and we integrated machine learning models to predict temperatures and evaluate the performance of our algorithm.
We studied existing machine learning algorithms and methods used in thermal prediction for data centers and compared their performance. To investigate the impact of CPU and GPU activities on temperature and energy consumption, we designed a set of experiments and developed programs to extract temperature, energy, and performance data from the experimental results. We then applied several regression models to estimate temperature and energy consumption and compared the performance of these methods.
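To illustrate the comparison step, the following is a minimal Python sketch of fitting a few regression models on extracted measurements and comparing their test RMSE; the file name, feature columns, target column, and chosen models are illustrative assumptions rather than the exact experimental setup.

```python
# Hypothetical sketch of the model-comparison step: fit a few regression
# models on extracted measurements and compare their test RMSE.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("node_measurements.csv")        # assumed measurement log
features = ["cpu_util", "gpu_util", "mem_util"]  # assumed predictor columns
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["cpu_temp"], test_size=0.2, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.2f}")
```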
Built-in temperature sensors were used to track the temperature of the air entering and exiting each computing node and the ambient temperature of the cluster room. Interior temperature sensors were used to measure the temperatures of the CPU, disk, memory, and GPU. We developed scripts to track the utilizations and temperatures of these key components (CPU, disk, memory, and GPU) in the cluster servers. Whetstone [5], a synthetic benchmark program, was used to generate CPU-intensive workloads for studying the thermal impact of the CPU. By applying regression models and the XGBoost machine learning model, we observed that XGBoost performs better at predicting CPU temperature.
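As a concrete example of this modeling step, the sketch below fits an XGBoost regressor on logged utilization and temperature data; the log file name, feature columns, and hyperparameters are assumptions made for illustration only.

```python
# Hypothetical sketch of an XGBoost CPU-temperature model; file name,
# features, and hyperparameters are illustrative assumptions.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("whetstone_runs.csv")             # assumed script output
X = df[["cpu_util", "ambient_temp", "disk_util"]]  # assumed predictors
y = df["cpu_temp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"XGBoost CPU-temperature RMSE: {rmse:.2f}")
```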
We applied machine learning models to estimate the GPU temperature and evaluated them on a recursive multi-step forecasting task, which reflects the models' expected performance within the simulation. After initial modeling, we adjusted the training process with scheduled sampling to reduce the discrepancy between the training error and the error of recursive multi-step predictions. Our best-performing GPU temperature prediction model, a hybrid CNN-LSTM, leveraged sequential continuity and reduced RMSE from 2.54 to 0.73 compared to the baseline model.
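In recursive multi-step forecasting, a one-step model repeatedly consumes its own predictions. The sketch below shows that loop, assuming a Keras-style one-step model over a univariate temperature series; the window length and model interface are illustrative assumptions, not the exact setup used here.

```python
# Hypothetical sketch of recursive multi-step forecasting with a one-step
# model that maps a (batch, window, 1) input to a single next temperature.
import numpy as np

def recursive_forecast(model, history, horizon, window=16):
    """Predict `horizon` future temperatures from the last `window` samples."""
    buf = list(history[-window:])
    preds = []
    for _ in range(horizon):
        x = np.asarray(buf[-window:], dtype=np.float32).reshape(1, window, 1)
        next_temp = float(model.predict(x, verbose=0)[0, 0])
        preds.append(next_temp)
        buf.append(next_temp)  # feed the prediction back into the window
    return preds
```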
We proposed a thermal-aware scheduling algorithm, ThermalAwareGPU, which distributes GPU-intensive workloads to reduce temperatures in data centers.
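At a high level, the policy favors the coolest host that can still fit the request. A simplified Python sketch of this idea follows; the Host fields and selection rule are assumptions, and the actual ThermalAwareGPU implementation in the simulator may differ in detail.

```python
# Hypothetical sketch of thermal-aware host selection: among hosts that can
# fit the requested GPU cores, pick the one with the lowest predicted
# temperature (the real ThermalAwareGPU policy may differ in detail).
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_gpu_cores: int
    predicted_temp: float  # output of the ML temperature model

def thermal_aware_select(hosts, requested_gpu_cores):
    candidates = [h for h in hosts if h.free_gpu_cores >= requested_gpu_cores]
    if not candidates:
        return None  # no host can accommodate the request
    return min(candidates, key=lambda h: h.predicted_temp)
```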
To evaluate our algorithm's performance, we ran experiments with both the built-in load-balancing algorithms in (GPU)CloudSim Plus and an algorithm adapted from existing research. The following four scheduling algorithms were compared with our proposed algorithm (simplified sketches of their selection policies follow the list):
FirstFit: selects the first host with sufficient resources
BestFit: selects the host with the smallest amount of available resources that still satisfies the request
RoundRobin: hosts are selected in a cyclic order
ThermalAwareBaseline: BestFit on temperature and FirstFit on resources, adapted for cloudlet workload balancing
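The sketches below reuse the hypothetical Host structure from the earlier sketch to outline the first three baseline policies; the simulator's actual (GPU)CloudSim Plus implementations are more involved.

```python
# Hypothetical sketches of the baseline selection policies (simplified).
import itertools

def first_fit(hosts, requested_gpu_cores):
    # First host with enough free GPU cores, or None.
    return next((h for h in hosts if h.free_gpu_cores >= requested_gpu_cores), None)

def best_fit(hosts, requested_gpu_cores):
    # Host whose free capacity is smallest while still fitting the request.
    fits = [h for h in hosts if h.free_gpu_cores >= requested_gpu_cores]
    return min(fits, key=lambda h: h.free_gpu_cores, default=None)

def make_round_robin(hosts):
    # Hosts are handed out in cyclic order across successive requests.
    cycle = itertools.cycle(hosts)
    return lambda requested_gpu_cores=None: next(cycle)
```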
To replicate real-world workloads found in data centers, we used three common patterns of data traffic: random, intermittent (bursty), and periodic.
We simulated a data center of four hosts, with each host having 20 CPU cores and 500 GPU cores.
GpuCloudlets were sent in batches every second by a workload generator. Three machine learning models were used to predict the power consumption (computing cost) of the GPU servers.
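For illustration, the sketch below generates per-second batch sizes for the three traffic patterns; the distributions, base rate, and period are illustrative assumptions, not the generator settings used in the experiments.

```python
# Hypothetical sketch of per-second batch-size generation for the three
# traffic patterns used in the simulation.
import math
import random

def batch_size(pattern, t, base=10):
    if pattern == "random":
        return random.randint(0, 2 * base)
    if pattern == "bursty":      # intermittent: mostly quiet, occasional spikes
        return 5 * base if random.random() < 0.1 else 1
    if pattern == "periodic":    # smooth cycle with a 60-second period
        return int(base * (1 + math.sin(2 * math.pi * t / 60)))
    raise ValueError(f"unknown pattern: {pattern}")

# Example: first five batch sizes under the periodic pattern
print([batch_size("periodic", t) for t in range(5)])
```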
This project is supported by the following programs:
(1) REU Site: Applying Data Science on Energy-efficient Cluster Systems and Applications, funded by National Science Foundation Grant CNS-2244391;
(2) SECURE For Student Success (SfS2) Program, funded by the United States Department of Education FY 2023 Title V, Part A, Developing Hispanic-Serving Institutions Program five-year grant, Award Number P31S0230232, CFDA Number 84.031S;
(3) The Louis Stokes Alliance for Minority Participation (LSAMP) Program, funded by the National Science Foundation and the California State University System.