Last month I was listening to a recent installment of the always-excellent Software Engineering Radio podcast on the topic of ‘Performance Engineering’. One of the points the guest made was that he has been involved in many projects where the process of ‘performance engineering’ (the act of gathering performance metrics, determining acceptability, and adjusting the software solution accordingly to meet performance goals) wasn’t considered until entirely too late in the development process to allow the teams to course-correct without changing significant architectural underpinnings of the design.
It began to dawn on me that this conundrum is a very real risk in the kind of ‘emergent design’ scenarios that Agile, iterative, test-driven, and other practices I strongly believe in tend to encourage and reinforce.
BDUF != NDUF
One of the frequent challenges to so-called ‘Agile’ approaches to software engineering is the question…
“If we just start coding blindly, how will we know where we’re going?”
The trouble with this question is that several aspects of it indicate the speaker doesn’t properly understand the underpinnings of Agile at all. It is really another instance of the drastic misinterpretation of Agile as ‘cowboy coding’: the speaker has failed to grasp that while Agile does mean no ‘Big Design Up Front’ (BDUF), it doesn’t follow that Agile also means ‘No Design Up Front’ (NDUF).
There is a Danger to Acknowledge
But even as we acknowledge that effective Agile development probably does require at least some over-arching agreement as to the general design of the overall system architecture, the aspects of Agile that lead us toward incremental, emergent design of the details of that architecture’s implementation do have at least the potential to leave us exposed to designing ourselves into a performance-challenged corner. The only way out of that corner may be a radical reversal of course, incurring significant impact to the project’s budget and schedule.
To be clear, I am not arguing for premature optimization, considered by some to be the root of all evil in computer science. I am instead arguing that just as no-Big-Design-Up-Front doesn’t imply no-Design-Up-Front at all, so too I believe that no-premature-optimization doesn’t imply that performance considerations should be ignored in the design of one’s software systems…whether that design is ’emergent’ in the Agile sense or pre-defined in the Waterfall sense.
In essence I am suggesting that we pay very close attention to the word premature in the phrase premature optimization (a distinction that I think is often lost in translation).
Non-Functional Spec Doesn’t Mean “no business value”
One of the very real troubles I see projects get into with their mono-focus on delivering ‘business value’ functionality is that this is often taken to mean Business Value = Functional Specification. That approach tends to make non-functional requirements (like, say, adequate performance) take a back seat to the functional specification. The fact that there are two categories of specifications for a project (functional, describing the business behaviors, and non-functional, describing the technical behaviors) was never intended to suggest two levels of priority between them, as has often been implied. There may indeed be a hierarchy of importance within and between all of the items in both specifications, but depending on the context of the project, either category might receive primacy of importance for the business.
Certainly, a well-performing application that fails to provide the needed functionality to its users cannot be considered a success by any measure. But the converse is equally true: a properly-featured application that does everything the business requested but fails to meet performance, scalability, or other non-functional goals must be considered an equal failure.
Succeeding in either category without succeeding in the other is a failure to deliver on the needed business value. In essence, Business Value = Functional Specification + Non-functional Specification. Period.
Executable Specification for Non-Functional Requirements
So given all of this, I have started to think that it’s incumbent upon us to encode the non-functional specifications into our unit tests just as we do the functional specifications. The same reasoning that leads us to write a test that says “a customer cannot place an order larger than their credit limit” should also lead us to write a test that says “when the customer’s order is saved, it should take no longer than 500 milliseconds”. And we should write both of these tests at the same time: right as we are writing the production code that makes them pass.
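The functional half of that pairing is familiar territory; a minimal sketch might look something like the following (the Customer, Order, OrderService, and CreditLimitExceededException names here are hypothetical stand-ins for illustration, not types from my actual project):

using NUnit.Framework;

[TestFixture]
public class WhenPlacingAnOrder
{
    [Test]
    [ExpectedException(typeof(CreditLimitExceededException))]
    public void CannotExceedTheCustomerCreditLimit()
    {
        // hypothetical domain objects, just to illustrate the shape of the test
        Customer customer = new Customer { CreditLimit = 1000m };
        Order order = new Order(customer) { Total = 1500m };

        // the functional spec says this must be rejected
        new OrderService().PlaceOrder(order);
    }
}

It is the non-functional half of the pairing that we rarely bother to write down as an executable test.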
So I have begun writing test fixtures like this one:
[TestFixture] [Category("Integration")] [Category("Performance")] public class WhenRunningBenchmarkQuery { [Test] public void ReturnsWithinAcceptableLimits() { Stopwatch sw = new Stopwatch(); sw.Start(); CustomerRepository customerRepository = new CustomerRepository(); IList<Customer> customers = customerRepository.GetByCriteria("criteria"); if (sw.ElapsedMilliseconds > 200) throw new RunningLikeCrapException("The Query took too long!"); } }
This test will throw an exception if the query doesn’t return within 200 milliseconds, the present target for this behavior of the system.
But this won’t work because…
I can already feel you starting to reach for the keyboard and comment about all the reasons this can’t possibly be reliable:
- it’s fickle and will pass or fail variably depending on the DB server specs, whether the DB is local or over-the-wire, etc.
- the query performance itself probably isn’t in the spec as an isolated performance target; instead the spec is more likely to say something like “the query screen needs to return results to the user in 500 milliseconds”, and in the real system this also involves UI rendering time, etc. that isn’t captured here
- this is an integration test rather than a unit test, and will unacceptably slow down every dev who runs it in his/her test suite
- if you run this from the CI server, it will significantly delay the feedback from the build server about the state of the system so that devs will find themselves ‘waiting’ for the GREEN too long after check-in
- it might pass with 10 records in the DB but fail when there are 10,000 records in the DB
- etc.
Those (and more!) are all valid concerns, but here’s what I’m doing to mitigate most of these…
- The test fixture is flagged with Category attributes that identify these tests as both “Integration” and “Performance” tests. This is a cue to our test runners that it’s to be run only on the CI server, and even then only during the single ‘full’ build each night (when nobody will feel the pain of its duration) rather than locally on the working dev’s PC or on each check-in to the SCM repository (since our work process is agile, we check in many times during the day, so it’s important not to run this on every check-in!)
- the DB state is controlled by ensuring that the same sample data (data whose size is representative of the eventual production system) is loaded into the target DB before the test is run each time; this provides a meaningful result for the test (see the sketch after this list)
- the DB is deployed on a different server than the one the CI test process runs on, simulating the need to access the DB over-the-wire as will be the case in the final production system
- since the real-world, user-perceptible performance of this query will of course include things like network latency, UI rendering from IIS, etc., we take the stated performance target (example: 500 ms) and reduce it to a shorter allowable duration in this test (example: 200 ms) to reflect our best guess at the budget for just the part of the overall behavior that’s actually exercised by the test
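To make the first two of these mitigations a bit more concrete, here is roughly how the benchmark fixture shown earlier might look with the data-seeding step and the reduced budget folded in. This is only a sketch: SampleDataLoader and the “PerformanceTestDb” name are hypothetical stand-ins for whatever mechanism your project already uses to reset the target database to its representative data set.

using System.Collections.Generic;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
[Category("Integration")]
[Category("Performance")]
public class WhenRunningBenchmarkQuery
{
    // reduced budget for just the data-access slice of the 500 ms user-facing target
    private const long AcceptableQueryMilliseconds = 200;

    [TestFixtureSetUp]
    public void LoadRepresentativeSampleData()
    {
        // hypothetical helper: restores the remote test database to a data set
        // sized to approximate the expected production volume before the benchmark runs
        SampleDataLoader.Reset("PerformanceTestDb");
    }

    [Test]
    public void ReturnsWithinAcceptableLimits()
    {
        Stopwatch sw = Stopwatch.StartNew();

        CustomerRepository customerRepository = new CustomerRepository();
        IList<Customer> customers = customerRepository.GetByCriteria("criteria");

        sw.Stop();

        if (sw.ElapsedMilliseconds > AcceptableQueryMilliseconds)
            throw new RunningLikeCrapException("The query took too long!");
    }
}

The nightly CI job includes the “Integration” and “Performance” categories; the per-check-in build and local test runs exclude them, so only the nightly run pays the cost of the seed-and-measure cycle.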
What (really) can this test tell us?
Given all of these assumptions, constraints, etc., one might reasonably ask
What exactly does this kind of a test really tell us when it passes or fails?
Well, one thing we can say with near-certainty is this: when this test fails, it’s nearly certain to fail in production too, since the production environment probably has more records in the DB, is probably experiencing additional load from multiple concurrent requests, etc. So a failure of this test in the test environment leads us to conclude that we are nearly certain to have a problem in the production environment as well.
However, if this test passes, we are less certain about what we can conclude about how our system will behave in a production environment. For all the reasons listed above, this kind of test can clearly pass in our CI/test context yet fail in our production context, leading one to wonder whether it is even a valuable enough test to bother with.
Negative Predictors are NOT Useless Metrics
So this kind of a test is what we call a negative-predictor of the success of our system: a failure of the test correctly predicts a failure of our system, but a success of the test doesn’t necessarily predict the success of our system but instead merely predicts the possibility of its success. While we might be tempted to discount such negative predictors, the truth is that we already routinely make use of such negative predictors all the time in our daily work:
UNIT TEST CODE COVERAGE IS A NEGATIVE PREDICTOR OF SUCCESS.
As developers writing tests, we are (hopefully!) routinely using unit test code coverage metrics as at least one way to gauge the effectiveness of our unit-test-writing efforts. Just as hopefully, I am praying that it’s not the only metric we’re relying upon.
Consider this: if you have 100% code coverage from your unit tests and all of those tests pass, it merely means that you have found some combination of inputs that results in all of your tests passing. It doesn’t (necessarily) mean that those same inputs will actually appear in production when users enter the equation. Users may feed your code any combination of unexpected inputs that you never bothered to test for, and your system may very well crash because of that even if you have achieved 100% test coverage.
As such, unit test code coverage is similarly a negative predictor of system success: if you have 100% code coverage and even one of your tests fails, your production system is almost certain to fail in a similar fashion. But if you have 100% code coverage and all of your tests pass, the best you can say is that you might have a working system. And even this (of course) assumes that you have achieved 100% code coverage in the first place, often an unrealistic goal with dubious ROI for the time and effort needed.
This is actually one of the reasons I spend less time shooting for 100% code coverage in my work and more time ensuring that all the possible combinations of input conditions are addressed in my unit tests. Even when the additional tests exercise the exact same path through the code as one already taken (i.e., they don’t improve code coverage by even 1%), I still focus my effort there rather than chasing code-coverage numbers: code coverage is a negative-predictor metric, so it’s just not worth the effort to achieve 100% in most cases. YMMV of course, but that’s been my experience on the matter.
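As a contrived illustration of that last point (the method and values here are hypothetical, not code from my project), both of the tests below exercise the exact same straight-line path through AverageOf, so the second one adds nothing to the coverage number, and yet only the second one exposes the defect lurking on that path:

using NUnit.Framework;

// the production code under test: a single straight-line path,
// so any one passing test already yields 100% coverage of it
public static class MathUtils
{
    public static int AverageOf(int a, int b)
    {
        return (a + b) / 2; // overflows silently for large inputs
    }
}

[TestFixture]
public class WhenAveragingTwoNumbers
{
    [Test]
    public void ReturnsMidpointForSmallValues()
    {
        // passes, and already takes coverage of AverageOf to 100%
        Assert.AreEqual(3, MathUtils.AverageOf(2, 4));
    }

    [Test]
    public void ReturnsMidpointForLargeValues()
    {
        // exercises the exact same path (zero coverage gain),
        // yet fails because (a + b) overflows before the division
        Assert.AreEqual(int.MaxValue, MathUtils.AverageOf(int.MaxValue, int.MaxValue));
    }
}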
Given that we already rely on negative predictors in our work (like code-coverage numbers and others), the idea that codifying the non-functional business requirements in executable specifications (e.g., unit tests) isn’t valuable simply because it’s a negative predictor seems like a non-sequitur to me.
And so this is how I’m working now; time will tell its value to my project. Stay tuned for updates.
I would be very interested to receive input from others as to the viability of this approach, especially if anyone has gone down this road and found it to be a waste of time/effort/energy.
Interesting read, Steve. The way you’ve limited the context in which these tests run, it would seem that you’ve addressed the anticipated concerns. And I agree: a test whose failure accurately predicts a production failure that would otherwise go unnoticed until much later is worth the effort. I wonder what other “non-functional” contexts should be tested in addition to data access performance.
Good post.
Btw, should not BDUF != NDUF be actually !BDUF != NDUF ?