Infrafit | node right-sizing
InfraFit is an advanced feature that provides a granular view of resource allocation across the various nodes supporting your Kubernetes clusters.
Infrafit is only fully available for Expert Plan users. To upgrade your subscription level, contact support@perfectscale.io or use the PerfectScale Slack Community.
Infrafit Overview
Infrafit is an advanced feature that provides comprehensive visibility of the entire Kubernetes environment at the infrastructure level. It helps in understanding the behavior of specific node groups and node types, optimizing the underlying infrastructure of the workloads. Infrafit offers a clear historical view of node utilization across the entire environment, enhancing the optimization process.
This level of visibility allows you to identify areas with idle space and gain insights into adjusting node sizes based on actual resource utilization. Additionally, Infrafit provides a unique view into scheduling results, showing what workloads are being scheduled to each node group, what is preventing scaling down, and what causes overcommitment over time.
Infrafit & Podfit Connectivity
Infrafit and Podfit together offer a comprehensive solution that provides a multidimensional approach to reduce wasted resources in your nodes. When drilling down into individual node types or node groups, InfraFit provides a unique view into scheduling results, showing which workloads are being scheduled to the node group, what is preventing scaling down, and what causes overcommitting over time.
These insights can help you bin-pack and size your nodes more efficiently to maximize resource utilization. They also help you lower your overall node count while improving the efficiency of your node autoscalers (like Karpenter or Cluster Autoscaler) to keep your costs optimized.
By combining the capabilities of both features, you gain a comprehensive understanding of your infrastructure and workloads, enabling more efficient management and cost reduction.
Key Benefits
Integrated Insights: Infrafit provides detailed visibility into node groups and node types, while Podfit offers insights into workload scheduling and resource allocation. Together, they give you a holistic view of your environment.
Optimized Resource Allocation: By analyzing the data from Infrafit and Podfit, you can make informed decisions on adjusting node sizes, improving bin-packing, and optimizing the performance of node-autoscalers (e.g., Karpenter or Cluster Autoscaler).
Enhanced Efficiency: Identify workloads that prevent scaling down or cause overcommitment and adjust resource allocations accordingly. This helps in lowering the overall node count while maximizing resource utilization.
Cost Reduction: With a clear understanding of resource usage and waste, you can implement strategies to reduce costs and improve the efficiency of your infrastructure.
How It Works
Workload Scheduling Visibility: Infrafit provides a unique view into scheduling results, showing which workloads are being scheduled to each node group, what prevents scaling down, and what causes overcommitment over time.
Resource Utilization Insights: Podfit complements this by offering detailed insights into workload runtime and resource usage, helping you identify areas of improvement.
Data-Driven Recommendations: Together, Infrafit and Podfit provide actionable recommendations to optimize your Kubernetes environment, ensuring that resources are allocated efficiently and costs are minimized.
By leveraging the combined power of Infrafit and Podfit, you can achieve a more efficient, cost-effective, and well-optimized Kubernetes infrastructure. We will explore how to leverage these capabilities throughout the documentation.
Upper Panel
1. Tenant name
Displays the name of the account and enables you to switch between different accounts quickly.
2. Clusters drop-down
This menu allows you to switch between clusters and displays the associated data.
3. In-app path
Shows your current location within the app, helping you easily navigate and understand where you are in the interface.
4. Optimization Policy
Displays the cluster's optimization policy. Optimization policy allows you to specify how your resources should be allocated in order to support the individual needs of your workloads. Define the policies that best suit your environment and business goals, depending on whether you want to maximize cost savings or provide extra headroom to maintain the resilience of mission-critical services.
MaxSavings - maximum cost savings, the best for non-production environments
Balanced (default) - optimally balances cost and resiliency
ExtraHeadroom - the best fit for latency-sensitive environments
MaxHeadroom - keeps the environment above the highest spikes
Discover more about customizing the Optimization policy here.
5. Timeframe
Allows you to check the data for a specific time period: click on the drop-down list in the upper right corner and select one of the options.
6. Export
This feature lets you export your data to a .csv file for analysis and sharing. Click the Export button, and the data will be exported to your local machine in a few seconds.
Nodes Resource Utilization Over Time Panel
1. Node Groups/Node Types selector
This selector allows easy switching between node groups and node types, giving you different perspectives of your infrastructure for more comprehensive analysis.
2. Utilization
A drop-down selector allows you to choose the resource utilization percentile (affects graphs as well as the Utilization chart in the table below), helping you better understand the resource utilization patterns.
3. Cost per Node Group/Node Type
Displays the cost of the Node Groups or Node Types over the selected timeframe.
4. CPU Over Time
Displays the allocated, requested, and used amount of CPU (cores) with the selected utilization percentile over the selected timeframe.
5. Memory Over Time
Displays the allocated, requested, and used amount of Memory (GB) with the selected utilization percentile over the selected timeframe.
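To make the effect of the Utilization percentile selector concrete, here is a minimal sketch of how a percentile summarizes a series of usage samples. It uses the nearest-rank method, which is an assumption: PerfectScale's exact calculation is not described here, and the function name and sample values are hypothetical.

```python
import math


def utilization_percentile(samples, percentile):
    """Return the nearest-rank percentile of a list of usage samples.

    Illustrative only: the nearest-rank method is an assumption,
    not necessarily the method the product uses.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value that covers `percentile` percent of samples.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical CPU usage samples (cores) with one short spike.
cpu_cores_used = [2.0, 2.5, 3.0, 3.5, 9.0]

print(utilization_percentile(cpu_cores_used, 100))  # the spike dominates
print(utilization_percentile(cpu_cores_used, 80))   # the spike is excluded
```

A lower percentile hides short spikes and shows typical usage, while the 100th percentile reflects the single highest observation.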
Data Table Display Options
Use the Node Groups/Node Types switcher to change the desired view easily.
View by Node Group
Gain insights and identify the most impactful optimization opportunities within the node groups with the Node Groups view.
Node Group
Displays the Node Group Name. Click on a column title or use a drop-down list to sort or filter data.
Architecture
Displays the Node Architecture (ARM, x86). Click on a column title or use a drop-down list to sort or filter data.
Nodes (avg & max)
Displays the average and maximum number of nodes in a specific node group. Click on a column title to sort data.
Reservation
Displays the reservation type of nodes in the group. Click on a column title or use a drop-down list to sort or filter data.
Avg Cost/h
Displays the average node group cost per hour. Click on a column title to sort data.
Total Cost
Displays the total node group cost. Click on a column title to sort data.
Idle Cost
Displays the cost of the space in a node group that has never been used. Click on a column title to sort the data.
Last Seen
Displays the last time PerfectScale observed the Node Group.
Utilization
A visual representation of CPU and Memory Utilization (allocation, request, and usage) based on the selected usage percentile. Use a drop-down list to filter data with the needed value.
Node Groups Total
Shows the total cost and total idle cost of the nodes.
Node Group Details
Clicking on any node group opens the Node Group Details screen, where you can view information about the workloads running in this group or review the breakdown of node types, accompanied by related data for further analysis.
After clicking on a specific Node Group, you will be navigated to its detailed view, where you can easily dive into the data with three different levels of granularity:
Node Types
Upper Panel
Node Group indicates the name of the node group associated with the displayed data.
Timeframe allows you to check the data for a specific time period: click on the drop-down list in the upper right corner and select one of the existing options.
Export saves your data to a .csv file for sharing and further analysis.
Node Group Resources Utilization
Cost per Node Type displays the cost trend of the Node Types over the selected timeframe.
CPU displays the allocated, requested, and used amount of CPU (cores) with the selected usage percentile in the group over time.
Memory displays the allocated, requested, and used amount of Memory (GB) with the selected usage percentile in the group over time.
Node Group Data Table
Node Type
Instance Type Name. Click on a column title or use a drop-down list to sort or filter data.
Architecture
Node Architecture. Click on a column title or use a drop-down list to sort or filter data.
Reservation
Node reservation type. Click on a column title or use a drop-down list to sort or filter data.
CPU/Mem (node)
Node size. Click on a column title to sort data.
Nodes avg/max
Average and maximum number of nodes with a specific type. Click on a column title to sort data.
Running Hours
Total instance running hours.
Avg Cost/h
Average cost per hour of the instance with the specific type. Click on a column title to sort data.
Total Cost
Total cost of nodes with the specific type. Click on a column title to sort data.
Idle Cost
Cost of the space in nodes of the specific type that has never been used. Click on a column title to sort the data.
Last Seen
Last time PerfectScale observed a node of the specific type.
Utilization
Workloads
Diving into the workloads running on a specific node enables you to seamlessly identify which workloads contribute the most to resource waste due to over-provisioning and adjust resource allocations with data-driven recommendations, creating new opportunities to optimize your underlying Kubernetes infrastructure.
Workload
Indicates the name of the workload.
Automation
Displays the automation status of a particular workload. You can easily sort the data by automation status to focus on the most relevant information for further investigation.
Type
Indicates the type of the workload.
Namespace
Indicates the workload's namespace.
Running Hours
The workload running hours.
Total Cost
The total cost of allocated resources of the workload.
Pod Waste
The total cost of reducible workload resources.
Container
Indicates the container of the workload.
Labels and Policies view
Optimization Policy
Displays the Optimization policy associated with the workload:
MaxSavings - maximum cost savings, the best for non-production environments
Balanced (default) - optimally balances cost and resiliency
ExtraHeadroom - the best fit for latency-sensitive environments
MaxHeadroom - keeps the environment above the highest spikes
Labels
Displays the label associated with the workload. You can select up to two labels to display.
The Workload labels have higher precedence than Namespace labels. If the Workload label and Namespace label have the same name, only the Workload label will display.
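The precedence rule above can be sketched as a simple dictionary merge. This is only an illustration of the described behavior; the function name and label values are hypothetical and do not come from PerfectScale.

```python
def effective_labels(namespace_labels, workload_labels):
    """Merge labels so that workload labels win on name collisions."""
    merged = dict(namespace_labels)  # start from the namespace labels
    merged.update(workload_labels)   # workload labels override same-named keys
    return merged


# Hypothetical labels: both levels define "team".
namespace_labels = {"team": "platform", "env": "prod"}
workload_labels = {"team": "payments"}

print(effective_labels(namespace_labels, workload_labels))
```

In this example only the workload's `team` value is kept, while the `env` label, which exists only on the namespace, is still shown.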
HPA view
The HPA view provides a clear overview of workloads that use the Horizontal Pod Autoscaler (HPA). It lets you quickly identify the workloads where HPA has been introduced and adjust HPA thresholds using tooltips that offer tailored recommendations. These recommendations are particularly helpful in optimizing scaling decisions, minimizing resource waste, and ensuring efficient operation of workloads.
HPA
Indicates whether HPA has been introduced for the workload. You can easily sort the column by clicking the header or apply specific filters.
CPU (%)
Displays the trigger for HPA by CPU. For insights on threshold recommendations, simply hover over the warning tooltip. You can easily sort the column by its values by clicking the header.
There are two types of indicators to be aware of:
A red indication signifies that the threshold is below 60%, indicating potential significant CPU waste.
A yellow indication suggests that the threshold falls between 60% and 80%, pointing to potential moderate CPU waste.
Memory (%)
Displays the trigger for HPA by Memory. For insights on threshold recommendations, simply hover over the warning tooltip. You can easily sort the column by its values by clicking on the header.
There are two types of indicators to be aware of:
A red indication signifies that the threshold is below 60%, indicating potential significant Memory waste.
A yellow indication suggests that the threshold falls between 60% and 80%, pointing to potential moderate Memory waste.
Custom metric
Indicates whether a Custom metric has been detected. You can easily sort the column by clicking the header or apply specific filters.
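The red and yellow indicators described for the CPU (%) and Memory (%) columns can be sketched as a small classification function. The function name is hypothetical, and the handling of the exact 60% and 80% boundaries is an assumption, since the text only says "below 60%" and "between 60% and 80%".

```python
def hpa_threshold_indicator(threshold_percent):
    """Classify an HPA target threshold into the indicator colors described above.

    Returns "red", "yellow", or None (no warning). Treating exactly 60 and
    exactly 80 as "yellow" is an assumption made for this sketch.
    """
    if threshold_percent is None:
        return None  # no HPA trigger configured for this resource
    if threshold_percent < 60:
        return "red"  # potential significant waste
    if threshold_percent <= 80:
        return "yellow"  # potential moderate waste
    return None


for threshold in (50, 70, 85):
    print(threshold, hpa_threshold_indicator(threshold))
```

The same function applies to both the CPU and the Memory trigger, because the two columns use identical thresholds.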
Workloads Chart
The workloads trend chart helps you identify waste and cost trends within the Node Group, directing your focus to the most critical and valuable aspects.
1. Scope
Select whether you want to display the workload data by waste or cost using this selector.
2. Workload Type
Filter the data by the workload type.
3. Interval
Select the interval for data display based on your desired data granularity.
4. Legend
Use the workloads legend to include or exclude the particular workloads from the chart.
5. Limit Selector
Choose a limit from the drop-down (from 1 to 15) to display only the top N entities.
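As a rough illustration of what the Limit Selector does to the chart, the sketch below keeps only the top N workloads by the selected scope (waste or cost). The function name, workload names, and values are hypothetical, and ranking by total value over the timeframe is an assumption.

```python
def top_n_workloads(value_by_workload, limit):
    """Return the `limit` workloads with the highest value, largest first."""
    if not 1 <= limit <= 15:
        raise ValueError("limit must be between 1 and 15")
    ranked = sorted(value_by_workload.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical waste per workload (in dollars) over the selected timeframe.
waste_by_workload = {"checkout": 120.0, "search": 45.5, "auth": 310.2, "batch": 12.0}

print(top_n_workloads(waste_by_workload, 2))
```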
View by Node Type
Identify wasted node resources and gain optimization insights to select the right nodes to better serve your environment's needs with the Node Types view.
Node Type
Displays the Instance Type Name. Click on a column title or use a drop-down list to sort or filter data.
Architecture
Displays the Node Architecture. Click on a column title or use a drop-down list to sort or filter data.
Reservation
Displays the reservation type of nodes. Click on a column title or use a drop-down list to sort or filter data.
CPU/Mem (node)
Refers to the size of the instance type (memory and CPU).
Nodes (avg & max)
Displays the average and maximum number of nodes with the specific instance type. Click on a column title to sort data.
Node Group
Displays which group nodes with a specific instance type belong to. Click on a column title or use a drop-down list to sort or filter data.
Running Hours
Displays the total instance running hours.
Avg Cost/h
Displays the average cost per hour of the instance with the specific type. Click on a column title to sort data.
Total Cost
Displays the total cost of nodes with the specific type. Click on a column title to sort data.
Idle Cost
Displays the cost of the space in nodes of the specific type that has never been used. Click on a column title to sort the data.
Last Seen
Displays the last time PerfectScale observed the node with the specific instance type.
Utilization
A visual representation of CPU and Memory Utilization (allocation, request, and usage) based on the selected usage percentile. Use a drop-down list to filter data with the needed value.