Intelligent AVX Tuning Framework

Problem Statement

Scientific and engineering research, as well as many high-tech applications, requires highly efficient floating-point computation, and the AVX512 instruction set was introduced to meet this need. In practice, however, a problem has been observed: the throughput of Non-AVX tasks drops unexpectedly when AVX512 tasks run on the same server. This project aims to prevent that throughput loss for Non-AVX tasks so that the server achieves the best overall efficiency.

Concept Generation

The whole design consists of four parts:

Task simulation: Generate several AVX512 and Non-AVX workloads in C++ to simulate a real environment.

System data extraction: Use Linux commands to collect the required CPU status, including core frequency, throughput, and the number of active cores.

Demo: Show the current CPU status and the training/testing learning curves in a real-time user interface written in Python.

Operation generation: Use a DQN (Deep Q-Network) to generate the optimal CPU operation, which migrates an active workload to an idle CPU based on the current observation of the system.
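The data extraction part can be illustrated with a minimal sketch. The project does not state which Linux commands it uses, so this example assumes the per-core frequency is taken from text in the format of /proc/cpuinfo. The function names and the sample values are hypothetical.

```python
def parse_core_frequencies(cpuinfo_text):
    """Return {processor_id: frequency_in_MHz} from /proc/cpuinfo-style text."""
    freqs = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)          # start of a new per-core block
        elif key == "cpu MHz" and current is not None:
            freqs[current] = float(value)
    return freqs


def average_frequency_ghz(freqs):
    """Average core frequency in GHz; 0.0 if no core was parsed."""
    if not freqs:
        return 0.0
    return sum(freqs.values()) / len(freqs) / 1000.0


# Hypothetical sample; on a Linux host one could pass
# open("/proc/cpuinfo").read() instead.
SAMPLE = "processor\t: 0\ncpu MHz\t\t: 3000.000\n\nprocessor\t: 1\ncpu MHz\t\t: 2800.500\n"
frequencies = parse_core_frequencies(SAMPLE)
```

A value computed this way could be compared against a target such as an average core frequency of at least 3 GHz.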

Design Description

The design consists of three parts: hardware, the reinforcement learning algorithm, and the user interface. As the number of active cores increases, the corresponding core frequency and throughput decrease. Throughput also differs between AVX512 and Non-AVX workloads.

The CPU status data is reformatted into a state table that records the original throughput and the updated throughput. The gym environment generates an action to migrate a workload and calculates the reward. Applying this action updates the CPU status. The user interface displays the CPU status and the learning curve.
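The loop described above (migrate a workload, record original and updated throughput, compute a reward) might look like the following toy sketch. It mimics a gym-style reset/step interface in plain Python and is not the project's environment. The throughput model, GROUP_SIZE, NON_AVX_PENALTY, and the choice of reward as the throughput difference are all assumptions made for illustration, not measured behavior.

```python
GROUP_SIZE = 2         # hypothetical: cores that share one frequency domain
NON_AVX_PENALTY = 0.7  # hypothetical slowdown factor, not a measured value


def total_throughput(cores):
    """Toy model: a non-AVX task slows down if its group also runs AVX512."""
    total = 0.0
    for i, task in enumerate(cores):
        if task is None:
            continue  # idle core contributes nothing
        start = (i // GROUP_SIZE) * GROUP_SIZE
        group = cores[start:start + GROUP_SIZE]
        if task == "non_avx" and "avx" in group:
            total += NON_AVX_PENALTY
        else:
            total += 1.0
    return total


class MigrationEnv:
    """Each core holds "avx", "non_avx", or None (idle)."""

    def __init__(self, cores):
        self._initial = list(cores)
        self.cores = list(cores)

    def reset(self):
        self.cores = list(self._initial)
        return tuple(self.cores)

    def step(self, action):
        """action = (src, dst): move the task on src to dst if dst is idle."""
        src, dst = action
        original = total_throughput(self.cores)
        if self.cores[src] is not None and self.cores[dst] is None:
            self.cores[dst], self.cores[src] = self.cores[src], None
        updated = total_throughput(self.cores)
        state_row = {"original": original, "updated": updated}
        reward = updated - original  # assumed reward: throughput gained
        return tuple(self.cores), reward, state_row
```

In this toy model, moving a non-AVX task out of a group that also runs an AVX512 task yields a positive reward, while an invalid move leaves the state unchanged and yields zero.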

Modeling and Analysis

The main factors that influence the convergence of training are the action space, the reward function, and the observation space. Many attempts were made to adjust the model along these three dimensions.
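To make the notions of action space and observation space concrete, here is one possible encoding, given purely as an assumption. The project's actual spaces are not described in this summary. The sketch enumerates every (source, destination) migration plus a no-op, and maps each core to a small integer code. The function names and task labels are hypothetical.

```python
def build_action_table(num_cores):
    """List a no-op (None) followed by every (src, dst) migration pair."""
    actions = [None]  # index 0: leave all workloads in place
    for src in range(num_cores):
        for dst in range(num_cores):
            if src != dst:
                actions.append((src, dst))
    return actions


def encode_observation(cores):
    """Map each core to 0 (idle), 1 (Non-AVX task), or 2 (AVX512 task)."""
    codes = {None: 0, "non_avx": 1, "avx": 2}
    return [codes[task] for task in cores]
```

With this encoding the number of discrete actions is n * (n - 1) + 1 for n cores, so the action space grows quadratically with the core count. This is one reason the choice of action space can affect how quickly training converges.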

Average core frequency >= 3 GHz

Response time of tuning <= 800 ms

Exceptional handling <= 3%

Decoupling extent: 3 separate modules

Application Programming Interface >= 2

Time of learning <= 5 hours

Conclusion

The intelligent tuning framework with reinforcement learning (RL) allows customers to optimize their servers through general APIs or a user-friendly GUI. The key to achieving this goal is to construct a uniform, stable, and robust framework and to abstract server tuning problems as a universal RL environment.

Jiecheng Shi

Intelligent Server Performance Tuning Framework

Undergraduate Capstone Project @ UM-SJTU JI
