Data centers consume about 1% of worldwide electricity, and recent studies show that their electricity consumption is still increasing, so there is an urgent need to reduce the energy consumption of data centers. Maintaining large amounts of data and providing powerful computing services require data centers to distribute their workloads effectively. Load balancing is a method of distributing network traffic across servers in data centers. Hardware load balancers are generally inefficient and designed for the worst case, whereas software load balancers scale with network traffic and have access to real-time network and diagnostic data. Existing thermal-aware algorithms did not address cooling-energy reduction at the application level. Our research seeks to reduce cooling costs in data centers by designing temperature-aware workload-balancing algorithms. We used GpuCloudSim Plus to simulate a data center distributing GPU-intensive applications under different workloads and utilizations, and we integrated machine learning models to predict temperatures and evaluate the performance of our algorithm.
We studied existing machine learning algorithms and methods used in thermal prediction for data centers and compared their performance. To investigate the impact of CPU and GPU activities on temperature and energy consumption, we designed a set of experiments and developed programs to extract temperature, energy, and performance data from the experimental results. We then applied several regression models to estimate temperature and energy consumption and compared the performance of these methods.
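To illustrate the comparison step, the following is a minimal Python sketch of fitting a few regression models on extracted measurements and comparing their test RMSE; the file name, feature columns, target column, and chosen models are illustrative assumptions rather than the exact experimental setup.

```python
# Hypothetical sketch of the model-comparison step: fit a few regression
# models on extracted measurements and compare their test RMSE.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("node_measurements.csv")        # assumed measurement log
features = ["cpu_util", "gpu_util", "mem_util"]  # assumed predictor columns
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["cpu_temp"], test_size=0.2, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.2f}")
```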
Built-in temperature sensors were used to track the temperature of the air entering and exiting each computing node and the ambient temperature of the cluster room. Interior temperature sensors were used to measure the temperatures of the CPU, disk, memory, and GPU. We developed scripts to track the utilizations and temperatures of these key components (CPU, disk, memory, and GPU) in the cluster servers. Whetstone [5], a synthetic benchmark program, was used to generate CPU-intensive workloads for studying the thermal impact of the CPU. By applying regression models and the XGBoost machine learning model, we observed that XGBoost performs better at predicting CPU temperature.
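As a concrete example of this modeling step, the sketch below fits an XGBoost regressor on logged utilization and temperature data; the log file name, feature columns, and hyperparameters are assumptions made for illustration only.

```python
# Hypothetical sketch of an XGBoost CPU-temperature model; file name,
# features, and hyperparameters are illustrative assumptions.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("whetstone_runs.csv")             # assumed script output
X = df[["cpu_util", "ambient_temp", "disk_util"]]  # assumed predictors
y = df["cpu_temp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"XGBoost CPU-temperature RMSE: {rmse:.2f}")
```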
We applied machine learning models to estimate the GPU temperature and evaluated them on a recursive multi-step forecasting task, which reflects the models' expected performance within the simulation. After initial modeling, we adjusted the training process with scheduled sampling to reduce the discrepancy between the training error and the error of recursive multi-step predictions. Our best-performing GPU temperature prediction model, a hybrid CNN-LSTM, leveraged sequential continuity and reduced RMSE from 2.54 to 0.73 compared to the baseline model.
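In recursive multi-step forecasting, a one-step model repeatedly consumes its own predictions. The sketch below shows that loop, assuming a Keras-style one-step model over a univariate temperature series; the window length and model interface are illustrative assumptions, not the exact setup used here.

```python
# Hypothetical sketch of recursive multi-step forecasting with a one-step
# model that maps a (batch, window, 1) input to a single next temperature.
import numpy as np

def recursive_forecast(model, history, horizon, window=16):
    """Predict `horizon` future temperatures from the last `window` samples."""
    buf = list(history[-window:])
    preds = []
    for _ in range(horizon):
        x = np.asarray(buf[-window:], dtype=np.float32).reshape(1, window, 1)
        next_temp = float(model.predict(x, verbose=0)[0, 0])
        preds.append(next_temp)
        buf.append(next_temp)  # feed the prediction back into the window
    return preds
```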
We proposed a thermal-aware scheduling algorithm, ThermalAwareGPU, which distributes GPU-intensive workloads to reduce temperatures in data centers.
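At a high level, the policy favors the coolest host that can still fit the request. A simplified Python sketch of this idea follows; the Host fields and selection rule are assumptions, and the actual ThermalAwareGPU implementation in the simulator may differ in detail.

```python
# Hypothetical sketch of thermal-aware host selection: among hosts that can
# fit the requested GPU cores, pick the one with the lowest predicted
# temperature (the real ThermalAwareGPU policy may differ in detail).
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_gpu_cores: int
    predicted_temp: float  # output of the ML temperature model

def thermal_aware_select(hosts, requested_gpu_cores):
    candidates = [h for h in hosts if h.free_gpu_cores >= requested_gpu_cores]
    if not candidates:
        return None  # no host can accommodate the request
    return min(candidates, key=lambda h: h.predicted_temp)
```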
To evaluate our algorithm's performance, we ran experiments with both the built-in load-balancing algorithms in (GPU)CloudSim Plus and an algorithm adapted from existing research. The following four scheduling algorithms were compared with our proposed algorithm (simplified sketches of their selection policies follow the list):
FirstFit: selects the first host with sufficient resources
BestFit: selects the host with the smallest amount of available resources that still satisfies the request
RoundRobin: hosts are selected in a cyclic order
ThermalAwareBaseline: BestFit on temperature and FirstFit on resources, adapted for cloudlet workload balancing
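The sketches below reuse the hypothetical Host structure from the earlier sketch to outline the first three baseline policies; the simulator's actual (GPU)CloudSim Plus implementations are more involved.

```python
# Hypothetical sketches of the baseline selection policies (simplified).
import itertools

def first_fit(hosts, requested_gpu_cores):
    # First host with enough free GPU cores, or None.
    return next((h for h in hosts if h.free_gpu_cores >= requested_gpu_cores), None)

def best_fit(hosts, requested_gpu_cores):
    # Host whose free capacity is smallest while still fitting the request.
    fits = [h for h in hosts if h.free_gpu_cores >= requested_gpu_cores]
    return min(fits, key=lambda h: h.free_gpu_cores, default=None)

def make_round_robin(hosts):
    # Hosts are handed out in cyclic order across successive requests.
    cycle = itertools.cycle(hosts)
    return lambda requested_gpu_cores=None: next(cycle)
```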
To replicate real-world workloads found in data centers, we used three common patterns of data traffic: random, intermittent (bursty), and periodic.
We simulated a data center of four hosts, with each host having 20 CPU cores and 500 GPU cores.
GpuCloudlets were sent in batches every second by a workload generator. Three machine learning models were used to predict the power consumption (computing cost) of the GPU servers.
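For illustration, the sketch below generates per-second batch sizes for the three traffic patterns; the distributions, base rate, and period are illustrative assumptions, not the generator settings used in the experiments.

```python
# Hypothetical sketch of per-second batch-size generation for the three
# traffic patterns used in the simulation.
import math
import random

def batch_size(pattern, t, base=10):
    if pattern == "random":
        return random.randint(0, 2 * base)
    if pattern == "bursty":      # intermittent: mostly quiet, occasional spikes
        return 5 * base if random.random() < 0.1 else 1
    if pattern == "periodic":    # smooth cycle with a 60-second period
        return int(base * (1 + math.sin(2 * math.pi * t / 60)))
    raise ValueError(f"unknown pattern: {pattern}")

# Example: first five batch sizes under the periodic pattern
print([batch_size("periodic", t) for t in range(5)])
```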
This project is supported by the following programs:
(1) REU Site: Applying Data Science on Energy-efficient Cluster Systems and Applications, funded by National Science Foundation Grant CNS-2244391;
(2) SECURE For Student Success (SfS2) Program, funded by the United States Department of Education FY 2023 Title V, Part A, Developing Hispanic-Serving Institutions Program five-year grant, Award Number P31S0230232, CFDA Number 84.031S;
(3) The Louis Stokes Alliance for Minority Participation (LSAMP) Program, funded by the National Science Foundation and the California State University System.