Why are our estimates always too low?

At last week’s Test Management Forum, Susan Windsor introduced a lively session on estimation – from the top down. All good stuff. But during the discussion, I was reminded of a funny story (well I thought it was funny at the time).

Maybe twenty years ago (my memory isn’t as good as it used to be), I was working at a telecoms company as a development team leader. Around 7pm one evening, I was sat opposite my old friend Hugh. The office was quiet; we were the only people still there. He was tidying up some documentation, I was trying to get some stubborn bug fixed (I’m guessing here). Anyway. Along came the IT director. He was going home, and he paused at our desks to say hello, how’s it going, etc.

Hugh gave him a brief review of progress and said in closing, “we go live a week on Friday – two weeks early”. Our IT director was pleased but then highly perplexed. His response was, “this project is seriously ahead of schedule”. Off he went scratching his head. As the lift doors closed, Hugh and I burst out laughing. This situation had never arisen before. What a problem to dump on him! How would he deal with this challenge? What could he possibly tell the business? It could be the end of his career! Delivering early? Unheard of!

It’s a true story, honestly. But it also reminded me that if estimation is an approximate process, our errors in estimation in the long run (over- or under-estimation), expressed as a percentage over or under, should balance statistically around a mean of zero – and that mean would represent the average actual time or cost our projects took to deliver.

Statistically, if we are dealing with projects that are delayed (or advanced!) by unpredictable, unplanned events, we should be overestimating as much as we underestimate, shouldn’t we? But clearly this isn’t the case. Overestimating, and delivering early, is so rare it’s almost unheard of. Why is this? Here’s a stab at a few reasons why we consistently ‘underestimate’.
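The statistical claim can be illustrated with a toy simulation (entirely hypothetical numbers – the function name and parameters are my own invention): if estimation error really were unbiased, the signed errors would average out to roughly zero over many projects.

```python
import random

def mean_signed_error(n_projects: int, seed: int = 42) -> float:
    """Simulate unbiased estimation: each project's actual duration
    deviates from its estimate by a symmetric random percentage."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_projects):
        estimate = rng.uniform(10, 100)          # estimated weeks
        actual = estimate * rng.gauss(1.0, 0.2)  # symmetric noise around the estimate
        errors.append((actual - estimate) / estimate)  # signed fractional error
    return sum(errors) / len(errors)

# with unbiased errors the mean converges towards zero
print(round(mean_signed_error(100_000), 3))
```

If real projects behaved like this, we would deliver early about as often as late – which, as the rest of this post argues, is not what we see.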

First (and possibly foremost), we don’t underestimate at all. Our estimates are reasonably accurate, but we consistently get squeezed to fit pre-defined timescales or budgets. We ask for six people for eight weeks, but we get four people for four weeks. How does this happen? If we’ve been honest in our estimates, surely we should negotiate a scope reduction if our bid for resources or time is rejected? Whether we descope a selection of tests or not, when the time comes to deliver, our testing is unfinished. Of course, go-live is a bumpy period – production is where the remaining bugs are encountered and fixed in a desperate phase of recovery. To achieve a reasonable level of stability takes as long as we predicted. We just delivered too early.

Secondly, we are forced to estimate optimistically. Breakthroughs, which are few and far between, are assumed to be certainties. Of course, the last project, which was so troublesome, was an anomaly and it will always be better next time. This is nonsense. One definition of madness is to expect a different outcome from the same situation and inputs.

Thirdly, our estimates are irrelevant. Unless the project can deliver within some mysterious, predetermined time and cost constraints, it won’t happen at all. Where the vested interests of individuals dominate, it could conceivably be better for a supplier to overcommit and live with a loss-making, troublesome post-go-live situation. In the same vein, the customer may actually decide to proceed with a no-hoper project because certain individuals’ reputations, credibility and perhaps jobs depend on the go-live dates. Remarkable as it may seem, individuals within customer and supplier companies may actually collude to stage a doomed project that doesn’t benefit the customer and loses the supplier money. Just call me cynical.

Assuming project teams aren’t actually incompetent, it’s reasonable to assume that project execution is never ‘wrong’ – execution just takes as long as it takes. There are only errors in estimation. Unfortunately, estimators are suppressed, overruled and pressured into aligning their activities with imposed budgets and timescales – and so they appear to have been wrong.

non-functional is a non-event

The raw materials of real engineering – steel, concrete, water, air, soil, electromagnetic waves, electricity – obey the laws of physics.

Software of course, does not.

Engineering is primarily about meeting trivial functional requirements and complex technical requirements using materials that obey the laws of physics.

I was asked recently whether the definitions – Functional and Non-Functional – are useful.

My conclusion was that, at the least, they aren’t helpful. At worst, they’re debilitating. There are probably half a dozen other themes in the initial statement but I’ll stick to this one.

There is a simple way of looking at F v NF requirements. FRs define what the system must do. NFRs define HOW the system delivers that functionality – e.g. is it secure, responsive, usable, etc.?

To call anything ‘not something else’ can never be intuitively correct, I would suggest, if you need that definition to understand the nature of the concept in hand. It’s a different dimension, perhaps. And ‘non-functional’ means ‘not working’, doesn’t it?

Imagine calling something long, “not heavy”. It’s the same idea and it’s not helpful. It’s not heavy because you are describing a different attribute.

So, to understand the nature of Non-Functional Requirements, it’s generally easier to call them technical requirements and have done with it.

Some TRs are functional, of course, and that’s another confusion. Access control to data and function is a what, not a how. Security vulnerabilities are, in effect, functional defects. The system does something we would rather it didn’t. Pen testing is functional testing. Security invulnerability is a functional requirement – it’s just that most folk are overcome by the potential variety of threats. Pen tests use a lot of automation with specialised tools. But they are specialised, not non-functional.

These are functional requirements just like the stuff the users actually want. Installability, documentation, procedure and maintainability are ALL functional requirements, and functionally tested.

The other confusion is that functional behaviour is Boolean. It works or it doesn’t work. Of course, you can count the number of trues and falses, but that is meaningless. 875 out of 1000 test conditions pass. It could be expressed as a percentage, but what exactly does that mean? Not much, until you look into the detail of the requirements themselves. One single condition could be several orders of magnitude more important than another. Apples and oranges? Forget it. Grapes and vineyards!
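To make the point concrete, here’s a small sketch (the condition names and importance weights are entirely hypothetical) showing how a raw pass percentage can look healthy while a weighted view tells the opposite story:

```python
def raw_pass_rate(results):
    """Fraction of test conditions that passed, ignoring importance."""
    return sum(1 for _, ok in results if ok) / len(results)

def weighted_pass_rate(results, weights):
    """Pass rate where each condition carries a business-importance weight."""
    total = sum(weights[name] for name, _ in results)
    return sum(weights[name] for name, ok in results if ok) / total

# hypothetical conditions: one critical failure, three cosmetic passes
results = [("login", False), ("logo_colour", True),
           ("help_text", True), ("tab_order", True)]
weights = {"login": 1000, "logo_colour": 1, "help_text": 1, "tab_order": 1}

print(raw_pass_rate(results))                          # 0.75 -- looks healthy
print(round(weighted_pass_rate(results, weights), 3))  # 0.003 -- it's unusable
```

Same test results, wildly different pictures – which is why a bare percentage of passing conditions tells you almost nothing.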

Technical behaviour is usually measurable on a linear scale. Performance and reliability, for example (if you have enough empirical data to be significant), are measured numerically. (OK, you can say meets v doesn’t meet requirements is a Boolean, but you know what I mean.)
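As a sketch of what ‘measurable on a linear scale’ means in practice, here’s a minimal nearest-rank percentile calculation over some invented response-time samples – the kind of statistic a performance requirement is actually tested against:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of the measurements are at or below it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[k]

# hypothetical response times in milliseconds
times = [120, 135, 150, 180, 200, 210, 250, 300, 900, 1400]
print(percentile(times, 50))  # median: 200
print(percentile(times, 90))  # 90th percentile: 900
```

Note how the two outliers barely move the median but dominate the 90th percentile – a nuance that a pass/fail Boolean simply cannot express.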

Which brings me to the point.

In proper engineering, say civil/structural… (And betraying a prejudice, structural is engineering, civil includes all sorts of stuff that isn’t…)

In structural engineering, for example, the functional requirements are very straightforward. With a bridge – say the Forth Bridge or the Golden Gate, built a long, long time ago – the functional requirements are trivial: “Support two railway lines/four lanes of traffic travelling in both directions (and a footbridge for maintenance).”

The technical requirements are much more complex. 100% of the engineering discipline is focused on technical requirements. Masses of steel, cross-sections, moments, stresses and strains. Everything is underpinned by the science of materials (which are extensively tested in laboratories, with safety factors applied) and set out in blue or green books full of tabulated cross-sectional areas, beam lengths, cement/water ratios and so on. All these properties are calculated from thousands of laboratory experiments, with statistical techniques applied to come up with factors of safety. Most dams, for example, are not 100% safe for all time. They are typically designed to withstand 1-in-200-year floods. And they fail safely, because one guy in the design office is asked to explore the consequences of failure – which in the main are predictable.

Software does not obey the laws of physics.

Software development is primarily about meeting immensely complex functional requirements and relatively simple technical requirements using some ethereal stuff called software that very definitely does not obey laws at all. (Name one? Please?)

Functional testing is easy; meeting functional requirements is not. Technical testing is also easy, and meeting technical requirements is (comparatively) easy.

This post isn’t about “non-functional requirements versus functional requirements”. It’s an argument that ALL requirements are hard to articulate and meet. So there.

Automation below the GUI

A couple of weeks ago, after the BCS SIGiST meeting I was chatting to Martin Jamieson (of BT) about tools that test ‘beneath the GUI’. A while later, he emailed a question…

At the recent SIGIST Johanna Rothman remarked that automation should be done below the level of the GUI. You then stood up and said you’re working on a tool to do this. I was explaining to Duncan Brigginshaw (www.odin.co.uk) yesterday that things are much more likely to change at the UI level than at the API level. I gave him your example of programs deliberately changing the html in order to prevent hacking – is that right? However, Duncan tells me that he recommends automating at the UI. He says that the commercial tools have ways of capturing the inputs which shield the tester from future changes e.g. as objects. I think you’ve had a discussion with Duncan and am just wondering what your views are. Is it necessary to understand the presentation layers for example?

Ultimately, all GUI interfaces need testing, of course. The rendering and presentation of HTML and the execution of JavaScript, ActiveX and Java objects obviously need a browser involved for a user to validate their behaviour. But Java/ActiveX objects can be tested through drivers written by programmers (and many are).

Typically, JavaScript isn’t directly accessible to GUI tools anyway (as it is typically used for field validation and manipulation, screen formatting and window management) – though you can write whole applications in JavaScript if you wish.

But note that I’m saying a browser is essential for a user to validate layout and presentation. If you go down the route of using a tool to automate testing of the entire application from the GUI through to server-based code, you need quite sophisticated tools, with difficult-to-use scripting (programming) languages. And lo and behold, to make these tools more usable/accessible to non-programmers, you need tools like AXE to reduce (sometimes dramatically) the complexity of the scripting language required to drive automated tests.

Now, one of the huge benefits of these kinds of testing frameworks, coupled with ‘traditional’ GUI test tools, is that they allow less technical testers to create, manage and execute automated tests. But if you were to buy a Mercury WinRunner or QTP licence plus an AXE licence, you’d be paying £6k or £7k PER SEAT, before discounts. This is hugely expensive if you think about what most automated tools are actually used for – compared with a free tool that can execute tests of server-based code directly.

Most automated tools are used to automate regression tests. Full stop. I’ve hardly ever met a system tester who actually set out to find bugs with tools. (I know ‘top US consultants’ talk about such people, but they seem to be a small minority.) What usually happens is that the tester needs to get a regression test together. Manual tests are run and, when the software is stable and tests pass, they get handed over to the automation folk. I know, I know – AXE and tools like it allow testers to create automated test runs. However, with buggy software, you never get past the first run of a new test. So much for running the other 99 using the tool – why bother?

Until you can run the other 99, you don’t know whether they’ll find bugs anyway. So folk resort to running them manually, because you need a human being checking results and anomalies, not a dumb tool. The other angle is that most bugs aren’t what you expect – by definition. E.g. checking a calculation result might be useful, but the tab order, screen validation, navigation, window management/consistency, usability and accessibility AREN’T in your prepared test plan anyway. So much for finding bugs proactively using automation. (Although bear in mind that free/cheap tools exist to validate HTML, accessibility, navigation and validation.)

And after all this, be reminded, the calculated field is actually generated by the server based code. The expected result is a simple number, state variable or message. The position, font, font size and 57 other attributes of the field it appears in are completely irrelevant to the test case. The automated tool is, in effect, instructed by the framework tool to ignore these things, and focus on the tester’s predicted result.

It’s interesting (to me anyway) that the paper most downloaded from the Gerrard Consulting website is my paper on GUI testing: gerrardconsulting.com/GUI/TestGui.html. It was written in 1997. It gets downloaded between 150 and 250 times a month. Why is that, for heaven’s sake – it’s nine years old! The web isn’t even mentioned in the paper! I can only think people are obsessed with the GUI and haven’t got a good understanding of how you ‘divide and conquer’ the complex task of testing a GUI into simpler tasks: some that can be automated beneath the GUI, some that can be automated using tools other than GUI test-running tools, some that can be automated using GUI test-running tools, and some that just can’t be automated. I’m sure most folk with tools are struggling to meet higher-than-realistic expectations.

So, what we are left with is an extremely complex product (the browser), being tested by a comparably (probably more) complex product, being controlled by another complex product, to make the creation, execution and evaluation of tests of (mainly) server-based software an easy task. Although it isn’t, of course. Frameworks work best with pretty standard websites or GUI apps built on standard technologies. Once you go off the beaten track, the browser vendor, the GUI tool vendor and the framework vendor all need to work hard to make their tools compatible. But all must stick to the HTTP protocol – HTTP/1.0 is ten years old now. How many projects set themselves up to use bleeding-edge technology, and screw the testers as a consequence? Most, I think.

So. There we have it. If you are fool enough to spend £4-5,000 per seat on a GUI tool, you then need to be smart enough to spend another £2,000 or so on a framework (PER USER).

Consider the alternative.

Suppose you knew a little about HTML/HTTP etc. Suppose you had a tool that allowed you to get web pages, interpret the HTML, insert values to fields, submit the form, execute the server based form handler, receive the generated form, validate the form in terms of new field values, save copies of the received forms on your PC, compare those forms with previously received forms, and deal with the vagaries of secure HTTPS, and ignore the complexities of the user interface. The tool could have a simple script language, based on keywords/commands, stored in CSV files, managed by Excel.
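A minimal sketch of such a keyword interpreter, in Python – the keywords, the CSV layout and the stubbed-out server are all my own invention, not any real tool’s:

```python
import csv
import io
import re

def run_script(csv_text, fetch):
    """Interpret a tiny keyword-driven test script held as CSV text.
    `fetch(url, data)` stands in for the HTTP POST to the server's form handler."""
    values, failures, page = {}, [], ""
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        keyword, *args = [cell.strip() for cell in row]
        if keyword == "set":            # set,<field>,<value>
            values[args[0]] = args[1]
        elif keyword == "submit":       # submit,<url> : post the fields, keep the reply
            page = fetch(args[0], dict(values))
        elif keyword == "check":        # check,<regex> : pattern must match the reply
            if not re.search(args[0], page):
                failures.append(args[0])
    return failures

SCRIPT = r"""set,account,12345
set,amount,99.50
submit,/transfer
check,Transfer complete
check,balance=\d+\.\d\d
"""

def fake_server(url, data):  # stand-in for the real server-based form handler
    return "<p>Transfer complete</p><p>balance=900.50</p>"

print(run_script(SCRIPT, fake_server))  # [] -> every check passed
```

The script rows could live in a CSV file managed by Excel, exactly as described above; swapping `fake_server` for a real HTTP call is the only part that touches the network.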

If the tool could scan a form, not yet tested, and generate the script code to set the values for each field in the form, you’d have a basic but effective script capture facility. Cut and paste into your CSV file and you have a pretty effective tool. Capture the form map (not gui – you don’t need all that complexity, of course) and use the code to drive new test transactions.
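A sketch of that capture step using Python’s standard HTML parser – the `set,<field>,<value>` row format it emits is purely illustrative:

```python
from html.parser import HTMLParser

class FormScanner(HTMLParser):
    """Collect the names of a form's input fields so that skeleton
    'set' rows for a keyword-driven script can be generated."""
    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            # skip submit buttons; keep only named data-entry fields
            if d.get("type", "text") != "submit" and d.get("name"):
                self.fields.append(d["name"])

def generate_set_rows(html):
    scanner = FormScanner()
    scanner.feed(html)
    return [f"set,{name}," for name in scanner.fields]  # value left blank for the tester

FORM = ('<form action="/pay"><input name="account">'
        '<input name="amount"><input type="submit" value="Go"></form>')
print(generate_set_rows(FORM))  # ['set,account,', 'set,amount,']
```

Cut and paste the generated rows into the CSV file, fill in the values, and you have the basic script-capture facility described above.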

That’s all pretty easy. The tool I’ve built does 75-80% of this now. My old Systeme Evolutif website (including online training) had 17,000 pages, with around 100 Active Server Pages script files. As far as I know, there’s nothing the tool cannot test in those 17,000 pages. Of course most are relatively simple, but they are only simple in that they use a single technology. There are thousands of lines of server-based code. If/as/when I create a regression test pack for the site, I can (because the tool is run on the command line) run that test every hour against the live site. (Try doing that with QTP.) If there is a single discrepancy in the HTML that is returned, the tool will spot it. I don’t need to use the GUI to do that. (One has to assume the GUI/browser behaves reliably though.)

Beyond that, a regression test based on the GUI appearance would never spot things in the HTML unless you wrote code specifically to do that. Programmers often place data in hidden fields. By definition, hidden fields never appear on the GUI, so GUI tools would never spot a problem – unless you wrote code to validate the HTML specifically. Regression tests focus on results generated by server-based code. Requirements specify outcomes that usually do not involve the user interface. In most cases, the user interface is entirely irrelevant to the successful outcome of a functional test. So, a test tool that validates the HTML content is actually better than a GUI tool (please note). By the way, GUI tools don’t usually have very good partial-matching facilities. With code-based tools, you can use regular expressions (regexes) – much better control for the tester than GUI tools give.
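For instance (the HTML fragment and field names here are invented), a couple of regular expressions over the raw response can check both a hidden field that a GUI tool would never see and a displayed value, without caring about the surrounding markup:

```python
import re

# hypothetical fragment of a page returned by server-based code
RESPONSE = ('<input type="hidden" name="session_state" value="COMMITTED">'
            '<p>Order total: &pound;149.99</p>')

# the hidden field never appears on screen, but the raw HTML exposes it
state_ok = re.search(r'name="session_state"\s+value="COMMITTED"', RESPONSE)
# partial match on the displayed total: the amount format matters, the markup doesn't
total_ok = re.search(r'Order total: &pound;\d+\.\d\d', RESPONSE)

print(bool(state_ok), bool(total_ok))  # True True
```

The second pattern is exactly the partial-matching control that regexes give a code-based tool: change the font, the layout or the surrounding tags and the check still passes; change the amount format and it fails.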

Finally. If you use a tool to validate returned messages/HTML, you can get the programmer to write code that syncs with the test tool. A GUI with testability! For example, the programmer can work with the tester to provide the ‘expected result’ in hidden fields. Encrypt them if you must. The developer can ‘communicate’ directly with the tester. This is impossible if you focus on the GUI. It’s really quite hard to pass technical messages in the GUI without the user being aware.

So. A tool that drives server-based code is more useful (to programmers in particular) because you don’t have the unnecessary complexities of the GUI. Such tools work directly on the functionality to be tested – the server-based code. They are simpler to use. They are faster (there’s no browser/GUI and test tool in the way). They are free. AND they are more effective in many (more than 50%?) cases.

Where such a tool COULD be used effectively, who in their right mind would choose to spend £6,000-7,000 per tester on LESS EFFECTIVE products?

Oh, and did I say, the same tool could test the other web protocols – mail, FTP etc. – and could easily be enhanced to cover web services (SOAP, WSDL, blah blah) – the next big thing – but those are services WITHOUT a user interface! Please don’t get me wrong, I’m definitely not saying that GUI automation is a waste of time!

In anything but really simple environments, you have to do GUI automation to achieve coverage (whatever that means) of an application. However, there are aspects of the underlying functionality that can be tested beneath the GUI, and sometimes it can be more effective to do that – but only IF there aren’t complicated technical issues in the way (issues that would be hidden behind the GUI and that the GUI tool ignores).

What’s missing in all this is a general method that guides testers towards manual testing, or automation above or below the GUI. Have you ever seen anything like that? One of the main reasons people get into trouble with automation is that they have too-high expectations and are overambitious. It’s the old 80/20 rule. 20% of the functionality dominates the testing (but could be automated). Too often, people try to automate everything. Then 80% of the automation effort goes on fixing the tool to run the least important 20% of tests. Or something like that. You know what I mean.
The beauty of frameworks is they hide the automation implementation details from the tester. Wouldn’t it be nice if the framework SELECTED the optimum automation method as well? I guess this should depend on the objective of a test. If the test objective doesn’t require use of the GUI – don’t use the GUI tool! Current frameworks have ‘modes’ based on the interfaces to the tools. Either they do GUI stuff, or they do web services stuff, or… But a framework ought to be able to deal with GUI, under-the-GUI, web services, command-line stuff etc. etc. Just a thought.
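A toy dispatcher along those lines – the objective tags and the preference order are pure assumption on my part, not a description of any real framework:

```python
def choose_interface(objective_tags):
    """Pick the cheapest interface that can satisfy a test's objective,
    only falling back to the GUI when the objective actually demands it."""
    if objective_tags & {"layout", "usability", "presentation"}:
        return "gui"           # only a browser can verify how things look
    if "batch" in objective_tags:
        return "command-line"
    if objective_tags & {"protocol", "message"}:
        return "web-services"
    return "http"              # default: drive the server code below the GUI

for tags in [{"layout"}, {"message"}, {"batch"}, {"calculation"}]:
    print(sorted(tags), "->", choose_interface(tags))
```

The point isn’t the particular rules – it’s that the decision is driven by the test objective, so the expensive GUI mode is reserved for the tests that genuinely need it.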

I feel a paper coming on. Maybe I should update the 1997 article I wrote!

Thanks for your patience and for triggering some thoughts. Writing the email was an interesting way to spend a couple of hours, sat in a dreary hotel room/pub.

Posted by Paul Gerrard on July 4, 2006 03:08 PM


Good points. In my experience the GUI does change more often than the underlying API.
But often, using the ability of LR to record transactions is still quicker than hoping I’ve reverse-engineered the API correctly. More than once I’ve had to do it without any help from developers or architects. 😉

Chris http://amateureconblog.blogspot.com/

Paul responds:

Thanks for that. I’m interested to hear you mention LR (I assume you mean LoadRunner). LoadRunner can obviously be used as an under-the-bonnet test tool. And quite effective it is too. But one of the reasons for going under the bonnet is to make life simpler and, as a consequence, a LOT cheaper.

There are plenty of free tools (and scripting languages with neat features) that can be perhaps just as effective as LR in executing basic transactions – and that’s the point. Why pay for incredibly sophisticated tools that compensate for each other’s complexity, when a free, simple tool can give you 60, 70, 80% of what you need as a functional tester?

Now, LR provides the facilities, but I wouldn’t recommend LR as a cheap tool! What’s the going rate for an LR licence nowadays? $20k, $30k?

Thanks. Paul.