Improve verification performance, effectiveness and efficiency
Introduction
Understanding what your verification campaign looks like in terms of the cost of product development, the schedule for delivery, and the quality of the end product is critical to planning and delivering your product. You need to understand how to change or influence these outcomes in a way that protects the ROI of your product and ensures that the delivered quality meets the end user's requirements.
To do this well you need to visualise the important data elements arising from the product development process, and use this data to plan your product delivery schedule and your resource allocations in a way that ensures on-time delivery, within cost constraints, while meeting quality goals.
For the purposes of this article we are going to focus on simulation verification data, since in most cases simulation is the dominant cost in terms of tools and compute, and the primary means of determining product quality. Some teams may be exploiting formal verification to a greater or lesser extent. Some teams may be using hardware emulation or FPGA prototyping as a significant part of their overall strategy, and the associated costs for these can also be very high.
Simulation is usually the first line of defence for any verification campaign, and it is the verification paradigm that offers the most controllability over stimulus generation, the best observability through code and functional coverage, and the most effective debug environment. That said, it can also be the paradigm with the most scope for runaway costs, thanks to modern constrained-random testing techniques that facilitate a near-infinite scope of test generation.
Simulation data, along with bug data, can inform on the efficiency, effectiveness and total cost of the verification campaign, and help development teams to refine their approach over the course of multiple projects. Analysing and understanding the effort, the costs and the overall effectiveness of each method in finding bugs and improving product quality is key to ensuring that your product's ROI is improved, or at least protected from runaway development costs or escalating post-release rework costs.
For those teams starting out today with no historical data, a modelling approach can be used to speculate on what the timeline and resource utilisation curves will look like. These prediction models can be refined over time with feedback from actuals, so that prediction modelling becomes more accurate and teams can rely on it for planning and delivering product roadmaps.
In the following sections of this article we show some representative data profiles that we have synthetically generated. The numbers shown are representative and realistic, but your numbers are likely to look quite different for your own product development campaign. Scale the numbers to match your individual experience.
The following data is modelled and is not sourced from real projects (the heuristics are founded on many years of experience).
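As a flavour of what such a prediction model can look like, here is a minimal Python sketch. The phase names, durations, weekly test volumes, per-test runtimes and the cost rate in it are purely hypothetical placeholders; the point is the structure, which you can re-fit against your own actuals as they arrive.

```python
# A minimal prediction-model sketch for a verification campaign timeline.
# All phase names, durations, weekly test volumes, per-test runtimes and the
# cost rate are hypothetical placeholders - substitute your own numbers.

COST_PER_SLOT_HOUR = 0.10  # assumed blended $ rate per CPU slot hour

# (phase name, duration in weeks, tests per week, avg wall-clock seconds per test)
phases = [
    ("bring-up",  8,   200_000, 34),
    ("main dev", 16, 1_500_000, 34),
    ("beta",      8, 3_000_000, 34),
    ("sign-off",  6, 1_000_000, 34),
]

def weekly_profile(phases):
    """Yield (week, phase, slot_hours, cost) for each week of the campaign."""
    week = 0
    for name, weeks, tests_per_week, secs_per_test in phases:
        for _ in range(weeks):
            week += 1
            slot_hours = tests_per_week * secs_per_test / 3600
            yield week, name, slot_hours, slot_hours * COST_PER_SLOT_HOUR

total_tests = sum(weeks * tests for _, weeks, tests, _ in phases)
total_cost  = sum(cost for *_, cost in weekly_profile(phases))
print(f"predicted volume: {total_tests / 1e6:.0f}M tests, "
      f"predicted cost: ${total_cost:,.0f}")
```

The weekly profile is what you would plot as the resource utilisation curve; once real regression data starts to flow, the per-phase rates and runtimes can be replaced with measured values.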
Efficiency and Effectiveness recap
We have described what we mean by verification efficiency and verification effectiveness in earlier articles. Just to recap: effectiveness is the ability of a verification process to either find bugs, or demonstrate an absence of bugs while increasing coverage (both structural and functional). If the verification environment is not achieving either of these aims, then its value is questionable, and running endless cycles may be futile.
That’s not to say verification is done, of course; further analysis of the verification environment may expose shortfalls that can be addressed, such that further verification value is possible. Effectiveness needs to be understood both at the individual test or testbench level and at the regression level. Intelligent regressions should minimise repeated testing by avoiding the re-running of tests that are known not to exercise the parts of the design that have changed and need to be regressed. The “just re-run everything” approach can be very wasteful when resources are costly and limited.
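One simple way to picture this, as a sketch only: if you keep a record of which design files each test exercised on its last run (harvested from coverage or profiling data), a regression can be trimmed to the tests whose footprint intersects the current change set. The test names, file names and data structure below are hypothetical; a real flow would typically derive this from coverage databases or a commercial test-selection tool.

```python
# Illustrative coverage-driven regression trimming (hypothetical data).
# test_footprint: design files each test touched on its last passing run.
test_footprint = {
    "riscv_rand_arith_test": {"alu.sv", "decode.sv"},
    "riscv_rand_mem_test":   {"lsu.sv", "dcache.sv"},
    "riscv_irq_test":        {"csr.sv", "decode.sv"},
}

def select_regression(changed_files, footprint):
    """Return only the tests whose last-known footprint overlaps the change set."""
    changed = set(changed_files)
    return [test for test, files in footprint.items() if files & changed]

# Example: a change to the decoder re-runs two of the three tests, not all of them.
print(select_regression(["decode.sv"], test_footprint))
# -> ['riscv_rand_arith_test', 'riscv_irq_test']
```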
Efficiency mainly pertains to the performance of the verification environment. If little or no consideration has been given to performance when developing it, you may end up with an inefficient environment that significantly impacts the cost and schedule of achieving the desired levels of assurance from verification.
Sometimes, inefficiencies can be improved quickly with small efforts; there may be low-hanging fruit. Analytics is the key to identifying the critical areas that will benefit the most from the application of engineering time and effort.
You don’t have to fix every problem, but be sure to identify the larger ones and address them. Follow the 80:20 rule!
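Finding that vital 20% can be as simple as ranking testbenches (or test groups) by their share of total CPU slot hours, for example as below. The testbench names and figures are made up for illustration; the same ranking could equally be done in a spreadsheet or your regression dashboard.

```python
# Rank testbenches by CPU slot hours to find the few that dominate cost
# (testbench names and figures are illustrative only).
cpu_hours = {"tb_core": 410_000, "tb_fabric": 180_000, "tb_ddr": 95_000,
             "tb_pcie": 60_000, "tb_gpio": 5_000}

total = sum(cpu_hours.values())
running = 0.0
for tb, hours in sorted(cpu_hours.items(), key=lambda kv: kv[1], reverse=True):
    running += hours
    print(f"{tb:10s} {hours:>8,}h  {hours/total:5.1%}  cumulative {running/total:5.1%}")
```

In this made-up example the top two testbenches account for almost 80% of the compute, so that is where optimisation effort pays back first.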
Scenario 1 - "The Baseline"
Let’s start with what we are calling the “baseline” scenario, as shown in figure 1 below. Here you can see a timeline for a project divided into the typical overlapping development phases, which usually result in a milestone or product release point where the product is delivered to the end user: perhaps first as an interim delivery such as a beta-quality product, and then finally as the first full release-quality product.
The quality of the released product will depend on the depth and breadth of verification achieved up to this milestone point. Were all the critical bugs flushed out by the verification campaign? If not, you can expect to see bug escapes post-release, with further cost and effort to rework the product, fix the bugs and extend the verification campaign to increase assurance levels and final product quality. Reputations can be at risk, with consequential losses for future sales, late delivery, delays to the development of other new products, and loss of market share.
In figure 1 we observe a common scenario where there is a ramp-up of testing effort towards the latter stages of the product development lifecycle.
Too much of the verification effort is back-ended and this increases the risk of post-release bug escapes. Chances are high that when verification efforts are suddenly terminated, there are still more bugs to find. We don’t see a plausible stabilisation period before final release sign-off. We also observe in the corresponding bug rate chart that there is a spike in bugs being found in the late phase of the project, which is another red flag pointing at high risk of post-release bug escapes.
In this example we see that the delivery time is >260 days for a total testing volume of 120M tests/seeds at a total cost of $122K. At the peak of this curve, around 1,200 slots are being consumed on average per week.
Figure 1: The baseline CPU consumption and bug rate timeline.
The y-axis represents the sum of wall clock runtime in hours per week. We choose wall clock runtime because it is a good proxy for cost: it informs how many CPU slot hours are consumed (a CPU slot being consumed either from your on-prem estate or from your cloud provider, plus any necessary EDA tool licenses, such as RTL simulation licenses in this case). We used $0.10 per CPU slot hour as a rough cost rate. This figure is arrived at from current cloud service provider costs, our experience of the operating costs of large on-prem estates, plus approximate loadings for EDA licenses, storage costs and operational costs. Your particular set-up may yield a different cost rate.
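As a rough back-of-envelope check on the baseline numbers (borrowing the approximately 34-second average test runtime quoted for the baseline in the next scenario, and treating licence and storage loadings as folded into the rate):

```python
# Back-of-envelope cost check for the baseline scenario (illustrative only).
tests         = 120_000_000   # total tests/seeds in the campaign
secs_per_test = 34            # approximate average wall-clock runtime per test
rate          = 0.10          # assumed blended $ per CPU slot hour

slot_hours = tests * secs_per_test / 3600
print(f"{slot_hours/1e6:.2f}M slot hours  ->  ~${slot_hours * rate:,.0f}")
# ~1.13M slot hours -> ~$113,333, in the same ballpark as the quoted $122K
# (the gap is plausibly build and overhead time not captured by test runtime alone).
```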
Scenario 2: Shift-left, shift-down
In scenario 2 we show a revised verification campaign. Note that the overall shape of the campaign has altered and the focus of the verification effort has shifted left, into the earlier product development phases. There is much less effort applied to the final sign-off phase of the project because the earlier testing has yielded a more stable design sooner, thus reducing the risk of post-release bug escapes. The quality of the delivered product at the final release stage is likely to be higher.
Note also from this scenario that, although the volume of testing is the same as above, 120M tests in total, the cost is reduced and the delivery timeframe is shortened. This is because, in this project, effort has been applied to improve testbench performance from an average of 34 seconds per test to an average of 26 seconds per test. The effectiveness of the verification may be the same as before, but the efficiency has improved by almost 25%. This may have been achieved by some code refactoring of the SystemVerilog testbenches or the RTL code itself, or it may be down to some re-architecting of the RTL and/or testbenches.
Alternatively, there could have been a performance improvement in the underlying compute platform (faster CPUs, faster storage), or the EDA tools themselves may have improved. Either way, the benefits of this efficiency gain are clear to see. Cost has been saved in achieving the same level of verification testing, the schedule has been reduced so we get our product to market more quickly, and the peak number of slots required is lower. The time and effort invested in improving verification efficiency can therefore be quickly realised and justified.
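Reusing the same back-of-envelope arithmetic, the effect of trimming the average test runtime from 34 to 26 seconds is easy to see (figures illustrative, same assumed cost rate as before):

```python
# Illustrative impact of the 34s -> 26s per-test improvement on the same 120M tests.
tests, rate = 120_000_000, 0.10          # tests/seeds, assumed $ per CPU slot hour
for secs in (34, 26):
    hours = tests * secs / 3600
    print(f"{secs}s/test: {hours/1e6:.2f}M slot hours  ~${hours * rate:,.0f}")
# 34s/test: 1.13M slot hours ~ $113,333
# 26s/test: 0.87M slot hours ~ $86,667  (~24% less compute for the same testing)
```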
Figure 2: The Shift-Left, Shift-Down scenario.
Further, the final quality is likely to have been improved thanks to the shift-left in the verification campaign.
In other words, the ROI of this product has been improved.
Scenario 3: Improve Quality
In the next scenario we show what happens if you choose to keep the cost and the timescale the same as the baseline, but take advantage of testbench efficiency gains to significantly increase the volume of testing achieved, i.e. for the same cost and effort, a higher degree of verification is achieved, which in turn should lead to a higher quality level (fewer post-release bug escapes) in the final product.
This can be seen in the bug rate data, where there is a significant peak of bug finding in the beta phase, but far fewer bugs found in the final phase and evidence of a much more stable design. The overall cost for this scenario is roughly the same as the baseline, but the number of tests run and the total number of cycles achieved has increased by 2.3X, and, overall, more bugs have been found, thus reducing the risk of post-release bug escapes.
To achieve this, the engineering team have invested even more effort in verification efficiency improvements, achieving an overall improvement of 2.5X over the baseline case. If your priority is product quality, and there is concern that insufficient verification testing cycles have been performed to achieve the required quality level, this might be the choice you make.
In other words, do significantly more with the same resources and the same delivery constraints.
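The relationship between the 2.5X efficiency gain and the 2.3X increase in test volume follows directly from the same arithmetic: with tests running 2.5X faster, 2.3X the tests consume slightly less compute than the baseline did. A rough check, using the same illustrative figures as above:

```python
# Rough check: 2.3X the tests at 2.5X the efficiency costs slightly less than baseline.
baseline_tests, baseline_secs, rate = 120_000_000, 34, 0.10   # illustrative figures
baseline_cost  = baseline_tests * baseline_secs / 3600 * rate
scenario3_cost = (baseline_tests * 2.3) * (baseline_secs / 2.5) / 3600 * rate
print(f"baseline ~${baseline_cost:,.0f}   scenario 3 ~${scenario3_cost:,.0f}")
# baseline ~$113,333   scenario 3 ~$104,267 - roughly the same spend, 2.3X the testing
```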
Figure 3: The Morecycles scenario - more cycles within the same cost/time constraints.
Scenario 4: Invest more to improve quality
Finally, in scenario 4, we show the case for increased investment. If your primary concern is that you need to significantly increase the quality of the end product and deliver in a shorter timescale, then you may choose to invest in more platform resources to achieve this. Here the volume of testing has increased by 4X, and that comes with a cost uplift of 50% over the original baseline scenario, with the implication that more slots are needed to meet peak demand. However, even more bugs are being found pre-release, so the overall risk of post-release bug escapes is further mitigated.
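Again as a rough, illustrative check using the earlier figures: a 4X test volume for a 50% cost uplift implies the cost per test has fallen by a factor of roughly 2.7 relative to the baseline, the combined effect of the efficiency gains already described and spending more in absolute terms.

```python
# Rough check on scenario 4 using the earlier illustrative figures.
baseline_tests, baseline_cost = 120_000_000, 122_000
s4_tests, s4_cost = baseline_tests * 4, baseline_cost * 1.5   # 4X tests, +50% cost
print(f"baseline:   ${baseline_cost / (baseline_tests / 1e6):,.0f} per 1M tests")
print(f"scenario 4: ${s4_cost / (s4_tests / 1e6):,.0f} per 1M tests")
# baseline ~$1,017 per 1M tests vs scenario 4 ~$381 - roughly 2.7X better cost per test
```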
Figure 4: The Investmore scenario - achieve more testing with more resources.
Conclusion
It is often the case that product development teams do not know or fully understand what resources are required to develop their product, or what the verification will cost. They consume resources as and when they are available, and may only be limited by resource availability. When asked how many more resources are needed, they don’t always know, but readily accept whatever additional resources are made available to them.
Constraining your verification campaign to fit within limited availability might lead to under-achieving on verification levels. Over-consuming resources in an ineffective and inefficient way might be eroding your product’s ROI. We advocate for better planning of verification campaigns, the use of prediction modelling, and the analysis of historical data to build understanding and insight into the art of verification.
The hypothetical quality improvements mentioned above depend on both the efficiency and the effectiveness of your verification campaign. Running more and more cycles, or running faster cycles for testbenches that are badly architected, will not necessarily improve the ability to find bugs. Engineering teams must also invest in testing effectiveness improvements, so that bugs can be found with less testing effort, thanks to smarter approaches to testing and intelligent execution of regressions. We will leave those challenges for another time. For now, we hope that you see the power of data, and how you can begin to reason about your overall verification campaign when you have appropriate analytics to hand.
Please talk to us if you want to discuss your particular verification campaign challenges, or could benefit from some independent thought leadership, backed up with years of experience and insights into how to model your scenarios, exploit your data, and drive predictable delivery in your organisation.
Please contact bryan.dickman@siliconinsights.co.uk, joe.convey@siliconinsights.co.uk
Copyright © 2024 Silicon Insights Ltd. All rights reserved.