Development of Fast and Automatic Service Recovery and Transition Software in Hybrid Cloud Environment

The StarLab project aims to secure university-driven, globally competitive source technologies by selecting and supporting outstanding laboratories that have accumulated fundamental and source technologies in each software field and are capable of practical research. Our lab was selected in the cloud field and is carrying out "Development of Fast and Automatic Service Recovery and Transition Software in Hybrid Cloud Environment". The project started in 2020 and will end in 2027.

For this project, the lab is developing the HyFAST system in-house. The project is carried out in three stages; in the first stage (through 2023), various management and maintenance features are being developed for a hybrid-cloud environment (OpenStack and Kubernetes) in which VM and container infrastructures are mixed.

The three stages are as follows:

  • Stage 1 – Development of HyFAST in a hybrid cloud environment
    • Fast and automatic service recovery and transition software based on multilayer monitoring
    • Automation workflow and integrated orchestration software
    • Situational data management (e.g., events, logs, multilayer state)
    • ML-based analysis and prediction features
  • Stage 2 – Development of HyFAST in multi-cloud and heterogeneous cloud environments
    • Extending orchestration features to support heterogeneous cloud environments
    • Extending orchestration features to support multi-site environments such as distributed edge clouds
  • Stage 3 – Demonstration and optimization of HyFAST
    • Demonstration of HyFAST applied to private and edge clouds
    • Optimization of automation features

Research and Development Achievements

Offline Events (KubeCon, IETF, …)

KPI (Stage 1)

Targets

Evaluation Item            | Target Value | Description
Failure Detection Time     | ≤ 0.9 sec    | Time from the occurrence of a failure (caused by various factors) until the developed software detects it
Service Recovery Time      | ≤ 6 sec      | Time from the detection of a system or service failure to its complete recovery
Service Availability Level | Level 1      | Service availability level of the cloud infrastructure

Results

Evaluation Item (Key Performance Indicator) | Unit  | R&D Result / Target Value | Achievement Level (%)
Failure Detection Time                      | sec   | 0.8 / 0.9                 | 100
Service Recovery Time                       | sec   | 4.3 / 6                   | 100
Service Availability Level                  | Level | 1 / 1                     | 100

Key Activities

  • Design and Development of an Automated Hybrid Cloud Provisioning System (a code sketch of such a Cluster resource follows this item)
    (Figure: Automated Hybrid Cloud Provisioning Architecture)
    • Developed a remote automated provisioning system for Logical Kubernetes Clusters across multi-cloud environments (including distributed edge), based on Kubernetes design principles.
    • Each Cluster resource contains configuration details such as infrastructure provider, network settings, middleware, and required software; the system drives the provisioning process until the declared desired state is reached.
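
To make the declarative provisioning model concrete, here is a minimal sketch of what such a Cluster resource could look like as Go types. All type and field names (ClusterSpec, InfraProvider, NetworkSpec, and so on) are illustrative assumptions, not the actual HyFAST API, and the usual Kubernetes TypeMeta/ObjectMeta boilerplate is omitted.

```go
// Illustrative sketch of a declarative "Cluster" resource for logical Kubernetes
// cluster provisioning. All type and field names are assumptions, not the actual
// HyFAST API; Kubernetes TypeMeta/ObjectMeta boilerplate is omitted for brevity.
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkSpec captures basic connectivity settings for the provisioned cluster.
type NetworkSpec struct {
	PodCIDR     string `json:"podCIDR"`
	ServiceCIDR string `json:"serviceCIDR"`
}

// ClusterSpec declares the desired state the provisioning system should reach.
type ClusterSpec struct {
	InfraProvider    string      `json:"infraProvider"`              // e.g. "openstack" or an edge site
	Network          NetworkSpec `json:"network"`                    // network settings
	Middleware       []string    `json:"middleware,omitempty"`       // middleware to install after bring-up
	RequiredSoftware []string    `json:"requiredSoftware,omitempty"` // additional required software
}

// ClusterStatus reports how far provisioning has progressed toward the desired state.
type ClusterStatus struct {
	Phase string `json:"phase,omitempty"` // e.g. "Provisioning", "Ready", "Failed"
}

// Cluster bundles the declared spec and the observed status, in the style of a
// Kubernetes custom resource that a provisioning controller reconciles.
type Cluster struct {
	Name   string        `json:"name"`
	Spec   ClusterSpec   `json:"spec"`
	Status ClusterStatus `json:"status"`
}

func main() {
	c := Cluster{
		Name: "edge-site-3-cluster",
		Spec: ClusterSpec{
			InfraProvider:    "openstack",
			Network:          NetworkSpec{PodCIDR: "10.244.0.0/16", ServiceCIDR: "10.96.0.0/12"},
			Middleware:       []string{"ingress-nginx", "prometheus"},
			RequiredSoftware: []string{"velero"},
		},
	}
	out, _ := json.MarshalIndent(c, "", "  ")
	fmt.Println(string(out)) // the declarative input the provisioning workflow acts on
}
```
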
  • Development of an ML-Based Multi-Context Data Collection System and Root Cause Analysis (RCA) Framework for Failure Detection (a simplified anomaly-scoring sketch follows this item)
    (Figure: Architecture of the RCA System)
    • Developed real-time monitoring and machine-learning-based prediction for system-wide data as well as for specific infrastructure layers and their characteristics.
    • Implemented multi-anomaly prediction and root cause analysis (RCA) to identify the primary failure source once an anomaly is detected.
    • Modularized the entire system into Orchestration, Model Management, Model Inference, Data Management, Data Ingestion, and Visualization components.
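
The following is a minimal, self-contained sketch of the kind of anomaly scoring and root-cause ranking an RCA pipeline performs, using a simple z-score over a metric window in place of the project's actual ML models. The component names, metrics, and score-based ranking are assumptions for illustration only.

```go
// Simplified sketch of anomaly scoring plus naive root-cause ranking: each component's
// latest metric sample is scored against its recent history, and components are ranked
// by anomaly score. A z-score stands in for the project's ML-based predictors.
package main

import (
	"fmt"
	"math"
	"sort"
)

// zScore returns how many standard deviations the latest sample deviates
// from the mean of the preceding window.
func zScore(window []float64, latest float64) float64 {
	var sum, sqSum float64
	for _, v := range window {
		sum += v
	}
	mean := sum / float64(len(window))
	for _, v := range window {
		sqSum += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sqSum / float64(len(window)))
	if std == 0 {
		return 0
	}
	return math.Abs(latest-mean) / std
}

type finding struct {
	Component string
	Score     float64
}

func main() {
	// Hypothetical per-component metric histories (e.g. CPU, latency, packet loss).
	history := map[string][]float64{
		"node-1/cpu":        {0.31, 0.30, 0.29, 0.33, 0.32},
		"pod-a/latency_ms":  {12, 11, 13, 12, 12},
		"switch-2/pkt_loss": {0.01, 0.02, 0.01, 0.01, 0.02},
	}
	latest := map[string]float64{
		"node-1/cpu":        0.34,
		"pod-a/latency_ms":  95, // strong outlier -> likely root-cause candidate
		"switch-2/pkt_loss": 0.02,
	}

	var findings []finding
	for comp, window := range history {
		findings = append(findings, finding{comp, zScore(window, latest[comp])})
	}
	// Rank components by anomaly score; the top entry is the naive RCA verdict.
	sort.Slice(findings, func(i, j int) bool { return findings[i].Score > findings[j].Score })
	for _, f := range findings {
		fmt.Printf("%-18s anomaly score %.1f\n", f.Component, f.Score)
	}
}
```
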
  • Design and Implementation of a Stateful Workload Migration Mechanism (a sketch of the migration phases follows this item)
    (Figure: Kubernetes Cluster Architecture with Velero Integration and Extension)
    (Figure: Kubernetes Cluster Architecture with Pod Migration Controller)
    • Designed the system to enable seamless Pod migration without service disruption by capturing the state of the running workload when a Pod Migration request is triggered by RCA.
    • Developed a custom Pod Migration resource and a dedicated controller to manage the migration process.
    • Extended the existing research by integrating CRIU and Velero, enabling the migration of workloads from Day2 clusters to Day1 clusters.
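
The sketch below outlines a hypothetical PodMigration resource and the phase sequence a migration controller might drive: checkpoint the running container with CRIU, back the checkpoint up with Velero, restore it on the target cluster, then redirect traffic. Field names, phase names, and the stub functions are assumptions; the real system would call CRIU, Velero, and the source and target cluster APIs at those points.

```go
// Illustrative-only sketch of a custom PodMigration resource and the phase sequence a
// dedicated controller could follow. Names, fields, and phases are assumptions, not the
// project's actual API; external calls are represented by stubs.
package main

import "fmt"

// PodMigrationSpec names the workload to move and where to move it.
type PodMigrationSpec struct {
	SourceNamespace string
	SourcePod       string
	TargetCluster   string // e.g. a Day1 cluster registered with the orchestrator
}

// PodMigrationStatus tracks progress through the migration phases.
type PodMigrationStatus struct {
	Phase string // Pending -> Checkpointed -> BackedUp -> Restored -> Completed
}

type PodMigration struct {
	Spec   PodMigrationSpec
	Status PodMigrationStatus
}

// migrate walks one PodMigration through the phases a controller would drive.
// Each step is a stub standing in for the real CRIU / Velero / cluster API calls.
func migrate(m *PodMigration) error {
	steps := []struct {
		phase string
		run   func(*PodMigration) error
	}{
		{"Checkpointed", checkpointWithCRIU}, // freeze and dump the running container state
		{"BackedUp", backupWithVelero},       // ship the checkpoint and Pod resources off-cluster
		{"Restored", restoreOnTarget},        // recreate the Pod from the checkpoint on the target
		{"Completed", redirectTraffic},       // switch traffic once the restored Pod is ready
	}
	for _, s := range steps {
		if err := s.run(m); err != nil {
			return fmt.Errorf("phase %s failed: %w", s.phase, err)
		}
		m.Status.Phase = s.phase
	}
	return nil
}

// Stubs for the external integrations used in this sketch.
func checkpointWithCRIU(m *PodMigration) error { return nil }
func backupWithVelero(m *PodMigration) error   { return nil }
func restoreOnTarget(m *PodMigration) error    { return nil }
func redirectTraffic(m *PodMigration) error    { return nil }

func main() {
	m := &PodMigration{Spec: PodMigrationSpec{SourceNamespace: "default", SourcePod: "web-0", TargetCluster: "day1-cluster"}}
	if err := migrate(m); err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	fmt.Println("migration finished in phase:", m.Status.Phase)
}
```
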
  • Development of Machine Learning-Based Autoscaling for QoE Optimization (a simplified scaling-decision sketch follows this item)
    (Figure: Hybrid Scaler Architecture for QoE Optimization)
    • Enhanced the existing Kubernetes autoscaling components, the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), with machine-learning techniques to ensure Quality of Experience (QoE).
    • Extended the research to the serverless open-source project Knative, achieving better resource efficiency and QoE than the project's default high-performance options.
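
To illustrate how a predicted load signal can drive both horizontal and vertical scaling decisions for a QoE target, here is a simplified sketch. The moving-average predictor, the per-pod capacity figure, the 20% headroom factor, and all numbers are placeholders for the project's ML models and tuned parameters.

```go
// Simplified sketch of a hybrid (horizontal + vertical) scaling decision driven by a
// predicted load signal. A weighted moving average stands in for the ML predictor;
// all capacities and factors are illustrative assumptions.
package main

import (
	"fmt"
	"math"
)

// predictLoad stands in for the ML predictor: a weighted moving average of recent
// requests-per-second samples, biased toward the newest observations.
func predictLoad(samples []float64) float64 {
	var sum, weight float64
	for i, v := range samples {
		w := float64(i + 1) // newer samples get larger weights
		sum += v * w
		weight += w
	}
	return sum / weight
}

// decideScale turns the predicted load into a replica count (horizontal) and a per-pod
// CPU request in millicores (vertical), sized with headroom so latency-based QoE holds.
func decideScale(predictedRPS, perPodCapacityRPS float64, baseCPUMillicores int) (replicas, cpuMillicores int) {
	// Horizontal: enough replicas to absorb the predicted load with ~20% headroom.
	replicas = int(math.Ceil(predictedRPS * 1.2 / perPodCapacityRPS))
	if replicas < 1 {
		replicas = 1
	}
	// Vertical: grow the per-pod CPU request in proportion to the per-replica load.
	perReplicaRPS := predictedRPS / float64(replicas)
	cpuMillicores = int(float64(baseCPUMillicores) * (1 + perReplicaRPS/perPodCapacityRPS))
	return replicas, cpuMillicores
}

func main() {
	recentRPS := []float64{120, 150, 180, 240, 300} // hypothetical traffic ramp-up
	predicted := predictLoad(recentRPS)
	replicas, cpu := decideScale(predicted, 100, 250)
	fmt.Printf("predicted load: %.0f req/s -> %d replicas, %dm CPU per pod\n", predicted, replicas, cpu)
}
```
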

Link to project-related software