There's a combinatorial explosion at the heart of writing tests: the more coarse-grained the test, the more possible code paths to test, and the harder it gets to cover every corner case. In response, conventional wisdom is to test behavior at as fine a granularity as possible. The customary divide between 'unit' and 'integration' tests exists for this reason. Integration tests operate on the external interface to a program, while unit tests directly invoke different sub-components.
But such fine-grained tests have a limitation: they make it harder to move function boundaries around, whether it's splitting a helper out of its original call-site or coalescing a helper function into its caller. Such transformations quickly outgrow the build/refactor partition at the heart of modern test-based development; you end up creating functions without tests, throwing away tests for functions that no longer exist, or manually stitching tests to a new call-site. All these operations are error-prone and stress-inducing. Does this function need to be test-driven from scratch? Am I losing something valuable in those obsolete tests? In practice, the emphasis on alternating phases of building (writing tests) and refactoring (holding tests unchanged) causes certain kinds of global reorganization to never happen. In the face of gradually shifting requirements and emphasis, codebases sink deeper and deeper into a locally optimal architecture that often has more to do with historical accident than thoughtful design.
I've been experimenting with a new approach to keep the organization of code more fluid, and to keep tests from ossifying it. Rather than pass in specific inputs and make assertions on the outputs, I modify code to judiciously print to a trace and make assertions on the trace at the end of a run. As a result, tests no longer need to call fine-grained helpers directly.
An utterly contrived and simplistic code example and test:
int foo() {
  return 34;
}

void test_foo() {
  check(foo() == 34);
}
With traces, I would write this as:
int foo() {
  trace << "foo: 34";
  return 34;
}

void test_foo() {
  foo();
  check_trace_contents("foo: 34");
}
The call to trace is conceptually just a print or logging statement. And the call to check_trace_contents ensures that the 'log' for the test contains a specific line of text:
foo: 34
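For concreteness, here's a minimal sketch of one way such a harness could be built. This is my illustration of the shape, not the actual implementation; clear_trace is a helper that will come up again shortly:

#include <iostream>
#include <sstream>
#include <string>

std::ostringstream trace;  // accumulates everything traced during one test

void clear_trace() { trace.str(""); }  // reset, e.g. between tests

void check_trace_contents(const std::string& expected) {
  if (trace.str().find(expected) == std::string::npos)
    std::cerr << "F: trace does not contain " << expected << "\n";
}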
That's the basic flow: create side-effects to check for rather than checking return values directly. At this point it probably seems utterly redundant. Here's a more realistic example, this time from my toy lisp interpreter. Before:
void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  cell* result = eval("(f 2 :do 1 3)");
  // result should be (1 3)
  check(is_cons(result));
  check(car(result) == new_num(1));
  check(car(cdr(result)) == new_num(3));
}
After:

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  run("(f 2 :do 1 3)");
  check_trace_contents("(1 3)");
}
This example shows the key benefit of this approach. Instead of calling eval directly, we're now calling the top-level run function. Since we only care about a side-effect, we don't need access to the value returned by eval. If we refactored eval in the future, we wouldn't need to change this test at all; we'd just need to ensure that the result of evaluation still gets traced somewhere in the program.
As I've gone through and 'tracified' all my tests, they've taken on a common structure: first run some expressions to set up preconditions, then run the expression under test and inspect the trace. Sometimes a check could be satisfied by something the setup expressions emitted, so the trace has to be cleared after setup to avoid contamination.
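In code, the common shape is the earlier test with an explicit clearing step (clear_trace is the helper from the sketch above; the exact placement here is illustrative):

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");  // setup: define f
  clear_trace();                            // discard anything the setup emitted
  run("(f 2 :do 1 3)");                     // the expression under test
  check_trace_contents("(1 3)");
}

Over time, different parts of the program get namespaced with labels to avoid accidental conflict: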
check_trace_contents("eval", "=> (1 3)");
This call now says, "look for this line only among lines in the trace tagged with the label eval." Other tests may run the same code but test other aspects of it, such as tokenization, or parsing. Labels allow me to verify behavior of different subsystems in an arbitrarily fine-grained manner without needing to know how to invoke them.
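One way to support labels, revising the earlier sketch (again an illustration, not the actual harness): store the trace as a list of (label, contents) pairs rather than a flat stream, and have the two-argument check filter by label.

#include <string>
#include <vector>

struct trace_line {
  std::string label;     // subsystem tag, e.g. "eval" or "parse"
  std::string contents;  // the traced text
};

std::vector<trace_line> trace_stream;

void trace(const std::string& label, const std::string& contents) {
  trace_stream.push_back({label, contents});
}

// Passes only if some line tagged 'label' contains 'expected'.
bool check_trace_contents(const std::string& label, const std::string& expected) {
  for (const trace_line& line : trace_stream)
    if (line.label == label && line.contents.find(expected) != std::string::npos)
      return true;
  return false;  // a real harness would report a failure here
}

Because the trace lives in memory rather than in a file, checks like this can slice it by label or by position without any parsing.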
Other codebases will have a different common structure. They may call a different top-level than run, and may pass in inputs differently. But they'll all need labels to isolate design concerns.
The payoff of these changes: all my tests are now oblivious to internal details like tokenization, parsing and evaluation. The trace checks that the program correctly computed a specific fact, while remaining agnostic about how it was computed: whether synchronously or asynchronously, serially or in parallel, whether the result was returned in a callback or a global, etc. The hypothesis is that this will make high-level reorganizations easier in future, and therefore more likely to occur.
Worries
As I program in this style, I've been keeping a list of anxieties, potentially-fatal objections to it:
- Are the new tests more brittle? I've had a couple of spurious failures
from subtly different whitespace, but they haven't taken long to diagnose.
I've also been gradually growing a vocabulary of possible checks on the trace.
Even though it's conceptually like logging, the trace doesn't have to be
stored in a file on disk. It's a random-access in-memory structure that can be
sliced and diced in various ways. I've already switched implementations a
couple of times as I added labels to namespace different subsystems/concerns,
and a notion of frames for distinguishing recursive calls.
- Are we testing what we think we're testing? The trace adds a level of
indirection, and it takes a little care to avoid false successes. So far it
hasn't been more effort than conventional tests.
- Will they lead to crappier architecture? Arguably the biggest
benefit of TDD is that it makes functions more testable all across a large
program. Tracing makes it possible to keep such interfaces crappier and more
tangled. On the other hand, the complexities of flow control, concurrency and
error management often cause interface complexity anyway. My weak sense so far
is that tests are like training wheels for inexperienced designers. After some
experience, I hope people will continue to design tasteful interfaces even if
they aren't forced to do so by their tests.
- Am I just reinventing mocks? I hope not, because I hate mocks.
The big difference to my mind is that traces should output and verify
domain-specific knowledge rather than implementation details, and that it's
more convenient with traces to selectively check specific states in specific
tests, without requiring a lot of setup in each test. Indeed, one way to view
this whole approach is as test-specific assertions that can be easily turned
on and off from one test to the next.
- Avoiding side-effects is arguably the most valuable rule we know about
good design. Could this whole approach be a dead-end simply because of its
extreme use of side-effects? I think these side-effects are ok, because they
don't break referential transparency.
The trace is purely part of the test harness, something the program can be
oblivious to in production runs.
The future
I'm going to monitor those worries, but I feel very optimistic about this idea. Traces could enable tests that have so far been hard to create: for performance, fault-tolerance, synchronization, and so on. Traces could be a unifying source of knowledge about a codebase. I've been experimenting with a collapsing interface for rendering traces that would help newcomers visualize a new codebase, or veterans more quickly connect errors to causes. More on these ideas anon.
comments
Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.
So now you not only have to deal with the effort of refactoring code, but *also* refactoring tests! I can't stand that constant overhead of extra effort. But your approach seems like it would allow me to change the unit tests much less frequently, thereby increasing the benefit/cost ratio, thereby making them actually worth it.
I don't see any connection at all to mocks. The point of mocks is that if you have something that's stateful (like a database or whatever), you don't test the database directly; instead you create a fake database and test that.
I'd love to hear if you try something like this in any of your projects. Some sort of tracing infrastructure is super simple in any language.
One thing I try to do with my tests is assert completeness at the end. So for example, if you trace (in doctest.js parlance, "print") three things: "a", {"b": 1} and 4, then if you assert "a", that object is popped from the pile of objects that have been traced. This way, at the end you can do: assert len(traces) == 0. This is pretty cool in that you assert both the positive _and negative_. I use this type of thinking a lot.
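(A hypothetical version of this consume-as-you-check idea, grafted onto the labeled trace_stream sketched earlier:)

#include <cassert>

// Checks remove the lines they match, so anything left at the end
// was traced but never asserted on.
bool check_and_consume(const std::string& expected) {
  for (auto it = trace_stream.begin(); it != trace_stream.end(); ++it)
    if (it->contents.find(expected) != std::string::npos) {
      trace_stream.erase(it);  // pop the matched line off the pile
      return true;
    }
  return false;
}

void check_trace_empty() {
  assert(trace_stream.empty());  // the negative assertion at the end of a test
}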
The piece you're missing at the moment: generation of the trace should be automatic, or mostly so. Might be worthwhile to use a preprocessor, and maybe hot-comments, to auto-inject the trace.
"The design integrity of your system is far more important than being able to test it any particular layer. Stop obsessing about unit tests, embrace backfilling of tests when you're happy with the design, and strive for overall system clarity as your principle pursuit." -- David Heinemeier Hansson
Yet the trace describing what happened would be different.
To account for this kind of thing, the test would have to do some kind of normalisation of the trace...