Podfit | vertical pod right-sizing
Explore a granular, comprehensive view of your clusters' health and costs, and identify and prioritize areas that need attention while autonomously optimizing workloads.
PerfectScale Podfit provides comprehensive insights into the health and costs of your cluster and its components, helping you quickly pinpoint areas requiring attention along with data-driven, actionable recommendations to streamline and enhance your optimization process.
The overview and telemetry section delivers a comprehensive summary of performance risks, costs, and waste insights for the selected cluster, along with identified optimization opportunities you can quickly achieve with PerfectScale. This view enables a quick evaluation of your cluster's overall health and efficiency, pinpointing configuration issues and empowering you to streamline and enhance your optimization process effectively.
Cluster selector allows for dynamic switching between clusters, enabling seamless management and monitoring of a multi-cluster environment.
Tenant - the account name (PerfectScale in the example above).
Optimization Policy - displays the optimization policy of the selected cluster. Optimization Policy allows you to specify how your resources should be allocated in order to support the individual needs of your workloads. Define the policies that best suit your environment and business goals, depending on whether you want to maximize cost savings or provide extra headroom to maintain the resilience of mission-critical services.
MaxSavings - maximum cost savings, the best for non-production environments
Balanced (default) - optimally balances cost and resiliency
ExtraHeadroom - the best fit for latency-sensitive environments
MaxHeadroom - keeps the environment above the highest spikes
Timeframe allows you to adjust the period for reviewing metrics, enabling a focused analysis for a specific time range.
Export allows you to easily download your data as a .csv file, enabling smooth analysis and effortless sharing.
The telemetry section provides a comprehensive overview of aggregated data for the selected cluster, offering key insights into the cluster's health and efficiency. This helps you evaluate the cluster's performance easily and identifies opportunities for optimization, giving you a clear view of its overall status.
Unused Resources provides insights into the resources within the cluster that are not being effectively utilized:
Pod Waste displays the total cost of wasted resources within the cluster. Clicking on this metric will direct you to the workload waste report, offering a detailed visual breakdown of the workloads contributing to the waste. This allows you to quickly identify the most impactful areas requiring attention, enabling more efficient optimization.
Node Idle indicates the total cost of unutilized node space. By clicking on this metric, you'll be navigated to a comprehensive view of the cluster at the infrastructure level. This view provides valuable insights into the behavior of different node groups and types, enabling you to optimize the underlying infrastructure for your workloads effectively.
Potential Savings is a powerful widget that offers insights into the total costs incurred compared to the actual resource utilization. This information helps you evaluate whether the cluster is well-balanced, over-provisioned, or under-provisioned. Additionally, the widget provides a Recommended Cost, reflecting the potential savings achievable through PerfectScale's recommendations, ensuring your cluster operates efficiently and cost-effectively. Clicking on this metric will direct you to the cluster cost report for further investigation.
CPU/Memory Utilization Over Time provides a comprehensive visual representation of resource allocation, requests, and usage trends within your cluster. Tracking these metrics over a specified timeframe allows you to analyze historical data to understand how resource dynamics have changed, compare actual usage with allocated and requested resources, and identify utilization patterns.
Used - p99 of utilization
Requested - p99 of the combined requests of all the workloads
Allocated - p99 of available cluster compute
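As a rough illustration of what a p99 value means for the Used, Requested, and Allocated series, the sketch below computes a nearest-rank percentile over a list of samples. How PerfectScale actually calculates its percentiles is not described here, so the nearest-rank method, the `percentile` helper, and the sample values are all assumptions made for the example.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample that has at least
    q percent of all samples at or below it (assumed method)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU usage samples (in cores) collected over the selected timeframe.
used_cores = [1.2, 1.4, 1.1, 3.8, 1.3, 1.5, 1.2, 1.6, 1.4, 1.3]

used_p99 = percentile(used_cores, 99)
print(f"Used (p99): {used_p99} cores")
```

Because p99 sits close to the observed peak, a single short spike in the samples is enough to raise the value, which is why the three series are useful for comparing peak demand against requests and allocated capacity.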
The Workload table provides a detailed overview of all the workloads running in your cluster. Each row represents a specific workload and its containers, including critical metrics like cost, waste, and potential cost increase due to under-provisioned resources. This view will help you quickly identify workloads that are misaligned with resource demands, highlighting optimization opportunities and areas at risk that require attention. With dynamic filtering and sorting options, you can easily focus on specific namespaces, labels, or workloads, making it easier to prioritize optimization tasks and run clusters efficiently.
Workloads are sets of pods of a Deployment, StatefulSet, DaemonSet, Job, or custom resource CRD (for example, Runner, SparkJob, etc.).
The dot count is a visual indicator of risk levels, with three levels: Low, Medium, and High (three dots represent the High-risk level).
Hollow dots indicate a muted workload, while shaded dots indicate the presence of a workflow ticket in progress.
This column shows the current automation status of each workload. You can quickly filter the data by automation status, prioritizing and focusing on the most relevant workloads for further investigation.
This column identifies the workload type (e.g., Deployment, StatefulSet). You can use filtering, sorting, and multi-select options to tailor the data display, making it easier to focus on specific workload types.
The namespace column shows the namespace of each workload. You can apply filtering, sorting, and multi-select options to customize the data display, allowing you to focus on specific namespaces.
The workload running hours column indicates the total duration each workload, including its replicas, has been actively running in the cluster during the selected period. You can use the sorting option to arrange the data in your preferred order.
The workload cost per hour column indicates the total hourly expense of the workload. You can use the sorting option to arrange the data in your preferred order.
The workload total cost column shows the total expense of the workload for the selected period, considering both its hourly cost and the duration it has been actively running. You can easily identify the most costly workloads in the cluster with a single click using the sorting option.
The increase needed column shows the projected rise in workload cost based on PerfectScale’s recommendations, indicating that the workload is under-provisioned. This helps you predict the cost adjustments required to maintain cluster stability.
The workload waste column indicates the cost of over-provisioned resources allocated to a workload and represents the potential savings achievable through PerfectScale's recommendations. You can easily identify the most wasteful workloads in the cluster with a single click using the sorting option.
The container column lists the containers associated with each workload. You can use filtering options to display the data for specific containers. Multi-select is available.
To customize your recommendations view, use the Resource Change View drop-down menu.
Detailed - to display the changes made to resources (shows both the previous and new values).
Total Impact in Units - to display changes made to resources as an absolute number, factoring in replica count.
Single Instance Impact in Units - to display changes made to resources as an absolute number.
Single Instance Impact in % - to display changes made to resources in a percentage format.
When the recommendation view is set to Total Impact in Units, the resource change impact summary is available. This view provides a clear understanding of the effect of total resource adjustments, enabling seamless evaluation of the optimization process.
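The four Resource Change View options can be thought of as different presentations of the same underlying change. The following is a minimal sketch of that arithmetic, assuming a single resource value per container (for example, CPU in millicores) and a known replica count; the function name, dictionary keys, and numbers are hypothetical and are not taken from PerfectScale.

```python
def resource_change_views(current, recommended, replicas):
    """Express one resource change in the four view formats (illustrative only)."""
    delta = recommended - current
    return {
        # Detailed: both the previous and the new value.
        "detailed": (current, recommended),
        # Total Impact in Units: absolute change multiplied by the replica count.
        "total_impact_units": delta * replicas,
        # Single Instance Impact in Units: absolute change for one replica.
        "single_instance_units": delta,
        # Single Instance Impact in %: change relative to the current value.
        "single_instance_percent": delta * 100 / current if current else None,
    }


# Hypothetical example: CPU request lowered from 500m to 300m across 4 replicas.
views = resource_change_views(current=500, recommended=300, replicas=4)
print(views)
```

In this example a 200m reduction per replica becomes an 800m reduction in total, which is the kind of aggregate figure the impact summary is meant to surface.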
Using your existing labels can help you manage the workloads more effectively by allowing you to focus on the most important ones.
PerfectScale collects and supports Workloads and Namespaces labels.
To customize the Labels View, PerfectScale allows you to choose two label keys. Each column in the Labels Table corresponds to a selected key and displays its relevant data for each workload.
To configure the label, click on the gear button. Then, choose the desired labels to be displayed and click the Apply button. Once the changes are applied, the values that correspond to the selected keys for workloads will be displayed.
When configuring the label view, it is possible to operate with the labels of Workloads and Namespaces. All the labels appear in the same list.
The Workload labels have higher precedence than Namespace labels. If the Workload label and Namespace label have the same name, only the Workload label will be displayed.
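The precedence rule above behaves like a dictionary merge in which workload labels are applied last. A minimal sketch, assuming labels are plain key/value mappings; the `effective_labels` helper and the sample labels are hypothetical.

```python
def effective_labels(namespace_labels, workload_labels):
    """Merge labels so that a workload label wins over a namespace label
    with the same key."""
    merged = dict(namespace_labels)
    merged.update(workload_labels)  # workload values overwrite namespace values
    return merged


# Hypothetical labels: both levels define "team".
namespace_labels = {"team": "platform", "env": "prod"}
workload_labels = {"team": "payments", "app": "checkout"}

labels = effective_labels(namespace_labels, workload_labels)
print(labels)
```

Here the displayed value for the `team` key would be the workload's value, while `env` is still inherited from the namespace.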
The HPA view provides a clear overview of workloads utilizing Horizontal Pod Autoscaler (HPA). This feature enables users to quickly identify the workloads where HPA has been introduced and adjust HPA thresholds using informative tooltips that offer tailored recommendations. These recommendations are particularly helpful in optimizing scaling decisions, minimizing resource waste, and ensuring efficient operation of workloads.
HPA
Indicates whether HPA has been introduced for the workload. You can easily sort the column by clicking the header or apply specific filters.
CPU (%)
Displays the trigger for HPA by CPU. For insights on threshold recommendations, simply hover over the warning tooltip. You can easily sort the column by its values by clicking the header.
There are two types of indicators to be aware of:
A red indication signifies that the threshold is below 60%, indicating potential significant CPU waste.
A yellow indication suggests that the threshold falls between 60% and 80%, pointing to potential moderate CPU waste.
Memory (%)
Displays the trigger for HPA by Memory. For insights on threshold recommendations, simply hover over the warning tooltip. You can easily sort the column by its values by clicking on the header.
There are two types of indicators to be aware of:
A red indication signifies that the threshold is below 60%, indicating potential significant Memory waste.
A yellow indication suggests that the threshold falls between 60% and 80%, pointing to potential moderate Memory waste.
Custom metric
Indicates whether a Custom metric has been detected. You can easily sort the column by clicking the header or apply specific filters.
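The CPU and Memory threshold indicators described above follow the same two-band rule, which can be sketched as a small classifier. This is only an illustration: the function name is hypothetical, and treating exactly 80% as yellow is an assumption, since the text does not say whether the upper bound is inclusive.

```python
def hpa_threshold_indicator(threshold_percent):
    """Return the indicator colour for an HPA CPU or Memory trigger threshold."""
    if threshold_percent < 60:
        return "red"     # below 60%: potential significant waste
    if threshold_percent <= 80:
        return "yellow"  # 60% to 80%: potential moderate waste (80 assumed inclusive)
    return None          # above 80%: no warning indicator


for threshold in (50, 70, 90):
    print(threshold, hpa_threshold_indicator(threshold))
```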
The zoom-in window provides comprehensive details of the workload's current state and behavior, along with historical data over time, delivering detailed metrics and unmatched visibility on resource utilization efficiency and performance risks. It provides actionable recommendations for adjusting resource allocations to enhance performance and minimize waste, and emphasizes the impact once they are implemented. Additionally, users can explore the Revision History, which displays all updates and changes, including automated or manual adjustments made to the workload, simplifying further analysis and helping track optimization progress over time.
By clicking on the workload, you will be directed to its zoom-in window:
The top panel displays the currently selected workload and offers easy access to the Workload Optimization Policy management menu as well as the actions menu.
This displays the optimization policy of the selected workload. The optimization policy specifies how resources should be allocated to achieve the desired level of resiliency and meet application demand. This ensures that your system maintains optimal performance and stability according to your predefined standards.
MaxEconomy - the best fit for non-production environments (Low Resiliency)
Balanced (default) - optimally balances cost and resiliency (Medium Resiliency)
ExtraHeadroom - the best fit for latency-sensitive environments (High Resiliency)
MaxHeadroom - keeps the environment above the highest spikes (Highest Resiliency)
To change the policy for the workload, select the desired one from the drop-down list and click the Save button to apply the changes.
The actions menu provides quick access to various tasks for streamlined workload management.
This panel provides a comprehensive overview of key cost metrics, highlighting potential savings and identifying existing performance risks. Additionally, it shows the average number of observed workload replicas and indicates whether HPA has been introduced, along with its associated thresholds.
At the top of the panel, you can see the type of the selected workload, along with the corresponding namespace and cluster. Running hours (Running Hrs) refers to the total duration the workload, including its replicas, has been actively running in the cluster over the last 30 days.
Cost reflects the total expenses associated with the workload over the past 30 days.
Waste reflects the total price of unutilized resources associated with the workload over the past 30 days.
Potential savings shows the reducible workload cost through PerfectScale's recommendations, all while maintaining peak performance.
The revision section shows the ID of the selected revision (the current by default), its start and end time, and the associated risks overview:
The risk level associated with the revision performance issues, based on their potential impact
The count of the most pressing risks associated with the revision, such as evictions, observed restarts, or when the maximum configured limit of replicas for HPA is reached, if detected.
Comprehensive overview of all risks per container associated with the particular revision. Click View All Risks to access the full list of the risks.
This view is particularly helpful for quickly evaluating the workload's health at any given moment over the past 30 days.
Replicas displays the average number of pod replicas.
The HPA section shows whether HPA has been introduced in the workload, displays its associated thresholds, and indicates if the custom metric has been detected.
The workload details panel offers unparalleled visibility into workload resource utilization at any moment over the past 30 days, identifying inefficiencies in real-time and uncovering new optimization opportunities. Providing data-driven optimization recommendations empowers you to take action proactively, enhance efficiency, and reduce costs while maintaining high performance.
The recommendations are only available for the current revision.
To apply the recommendations manually, click the View as YAML button and deploy the recommendations to your cluster.
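The exact YAML that PerfectScale generates is not shown here. As a hedged illustration of what a manual right-sizing change amounts to on the Kubernetes side, the sketch below builds a strategic merge patch that sets the standard `resources.requests` and `resources.limits` fields for one container of a Deployment. The container name and the resource values are hypothetical placeholders, not real recommendations.

```python
import json


def build_resources_patch(container, requests, limits):
    """Build a strategic merge patch that updates one container's resources."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name in a strategic merge patch.
                            "name": container,
                            "resources": {"requests": requests, "limits": limits},
                        }
                    ]
                }
            }
        }
    }


# Hypothetical values standing in for the recommended requests and limits.
patch = build_resources_patch(
    container="api",
    requests={"cpu": "250m", "memory": "256Mi"},
    limits={"memory": "512Mi"},
)
patch_json = json.dumps(patch)
print(patch_json)
```

A patch like this could then be applied with `kubectl patch deployment <name> -n <namespace> -p '<patch_json>'`, although in practice you would more likely update the manifest in source control so the change survives the next deployment.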
When one or more resources have reached their CRD-defined size constraints, the recommendations cannot be executed by automation. In this case, the relevant indicator will be displayed on the recommendations panel.
To get quick access to the automation configuration associated with the workload, click the View Config CR button.
Historical versions of specific Config CRs can be easily accessed, allowing for a comprehensive review of their changes over time.
Clicking Show History will open an additional panel with the list of historical CR versions, enabling you to review all the changes that were made to the CR configuration over time. Select any previous version to preview it with highlighted changes, showing the differences between the selected version and the current one.
These widgets provide a granular historical view of CPU and Memory utilization per container over the past 30 days, including the p90, p95, and p99.9 utilization percentiles, enabling you to seamlessly evaluate the efficiency of resource distribution by leveraging a comprehensive visual comparison of these values with set resource requests and limits.
Container selector allows you to easily switch between containers to display the data for the specific container. Click the drop-down menu and select the needed container from the list.
Control panel includes toggles that allow you to easily add or remove quantile lines from the chart. This feature is especially useful for managing data display, enabling smooth workload analysis with just a few clicks. Clicking on the toggles will either include or exclude the corresponding quantile lines from the chart.
This widget provides a comprehensive historical cost and waste overview across all containers within the workload. It is particularly helpful for understanding cost and waste trends as well as identifying anomalies and spikes.
This widget provides a comprehensive view of workload replicas, allowing you to track scaling trends over time. This view supports both scaling scenarios: a static replica count when HPA is not used and dynamic scaling when HPA has been introduced.
When the number of replicas is static, PerfectScale shows the average number of replicas captured.
When HPA is enabled for a workload, the replicas widget provides a comprehensive view of all key configured parameters alongside the actual scaling values over time. On the right side of the widget, you’ll also find the configured HPA triggers and an indicator showing whether the maximum replica count has been reached, helping you monitor scaling behavior, detect spikes, and ensure efficient resource allocation.
Effective optimization of the environment requires understanding the release content and its impact on cost and resilience. PerfectScale built an advanced solution to address such issues and provide users with a comprehensive breakdown of every revision for each container. This enables easy comparison of versions to track issues and remediation effectiveness.
This view is particularly helpful for tracking when, why, and how resource allocations have changed over time, whether the change has been propagated manually or through automation, enabling users to streamline their investigations and remain aware of the impact of those changes.
Revisions History Timeline displays all the revisions for the last 30 days, where each block corresponds to a particular revision. Hover on the revision to see its details:
Revision ID
Date
Risks
The number of Restarts
The Optimization Policy can be set for the entire cluster and for a specific workload. The workload's Optimization Policy takes precedence and will override the value defined at the cluster level.
Discover more about customizing the Optimization policy.
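The precedence between the two levels can be sketched as a simple fallback: use the workload's policy when one is set, otherwise inherit the cluster's. The `effective_policy` helper below is hypothetical and only illustrates the rule; the policy names are the ones listed for workloads earlier on this page.

```python
def effective_policy(cluster_policy, workload_policy=None):
    """Return the policy that applies to a workload.

    A policy set on the workload overrides the cluster-level value;
    otherwise the cluster-level value is inherited.
    """
    return workload_policy if workload_policy is not None else cluster_policy


# Hypothetical cluster running the default policy.
print(effective_policy("Balanced"))                   # inherits the cluster policy
print(effective_policy("Balanced", "ExtraHeadroom"))  # workload override wins
```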
Current Risks shows the total identified within the cluster for the selected period. This value is dynamic and updates based on the filters applied in the .
Status indicates workloads at risk. Workloads could be easily filtered by the resiliency risk level or particular .
Risk indicators are dynamic, i.e., the presence of the OOM indicator in the list means that at least one workload experienced an out-of-memory event in the given timeframe.
Easily jump between , , and views using the switcher above the table.
The Recommendations Table offers clear insights into necessary workload resource adjustments to maintain the cluster's stability and cost efficiency. To access more information, click on the workload. This will open up a zoom-in window that provides a comprehensive breakdown.
If one or more resources have reached their CRD-defined size constraints, the recommendations will not be executed. In this case, the Limited by Rule indicator, along with an explanatory tooltip, will be displayed near the recommendations.
Learn more about resource allocation constraints.
Podfit Labels Profile enables users to create and save sets of labels, which can then be applied to clusters. The label set listed in the Podfit Labels Profile will be applied to the clusters attached to this profile. Learn how to configure the profile.
Optimization Policy outlines how resources should be allocated to meet the unique requirements of each workload. The Optimization Policy can be set for the entire cluster and for a specific workload.
Discover more about customizing the Optimization policy.
The previous version of the Zoom-in window is accessible. To change to this version, click the Switch to Legacy UI button in the Actions menu.
Clicking View in Observability will direct you to the observability tool connected to the cluster. Learn more about how to integrate your preferred observability tool and receive insights from PerfectScale directly in your dashboard.
Create a ticket with all the details about needed changes in the defined project and assign it to the relevant engineer (team) automatically by clicking Create Ticket. Learn how to integrate your Jira with PerfectScale smoothly.
If the ticket already exists, you can use one of the following options: View Task or Delete Task.
Mute Workload is a useful feature when you want to stop receiving notifications for a specific workload. By muting it, you'll no longer get alerts related to that workload, even if there’s an linked to it. If you want to start receiving alerts for the previously muted workload, click Un-Mute Workload in the same menu.
Clicking Revert to Default Layout will reset the order of the widgets in the workload details panel.
You can easily manage the order of widgets in the workload details panel. To move a widget, click its name and drag it up or down to the desired place on the panel.
To reset the widgets to the default order, select Revert to Default Layout from the actions menu.
The recommendations widget provides recommendations for workload right-sizing. With this comprehensive view, you can effortlessly review current resource requests and limits per container, followed by the recommended values based on the actual resource consumption.
You can seamlessly configure autonomous workload optimization to actively maintain your environment in prime condition and ensure peak K8s performance at minimal cost. Check the in the top right corner (learn more about automation statuses ).
Learn more about resource allocation constraints.
Use the gear button to define the custom percentile.
Recommendations section displays the resource requests and limits recommendations for the selected container compared to the current values.
Click the controls above the graph to display or conceal specific parameters.
Clicking the revision will highlight this revision on the charts and display the corresponding data on the , enabling you to access all the needed data and streamline the analysis with a single click.
Active
Once the configuration is completed, automation will be indicated as successfully enabled.
Limited by Rule
When one or more resources have reached your configured size constraint in CRD, the recommendations can't be executed. The indicator will also be displayed in the . Learn more about resource allocation constraints .
Delayed
If the defined CRD causes time constraints, the execution of recommendations will be postponed.
Disabled
The merged CRD will disable automation for the workload. For example, if the cluster-level configuration enables automation while the namespace-level configuration disables it, the namespace-level configuration takes precedence, resulting in disabled automation for the particular workloads within the cluster.
Stopped
PerfectScale will forcibly stop the automation when needed, for example, to protect your environment from recursive resource increases such as those resulting from memory leaks.