There's a quiet assumption in most agentic coding setups: the agent reads your code, reads the error, and figures it out. That works fine when the bug is syntactic or when the failure is shallow. It falls apart the moment the agent has to reason about runtime behavior across multiple systems.

Think about what happens when an e2e test fails. The test says "expected 201 but got 500." The agent sees the assertion, opens the controller, reads the service layer, maybe checks the repository. It forms a theory, makes a change, runs the test again. Still 500. It tries something else. Another cycle. The agent is guessing, because it has no visibility into what actually happened at runtime. It's coding blind.

This is the gap that tracing fills, and it's not a small one.

The problem with black-box tests

E2e tests are, by design, black-box. You send an HTTP request, assert on the response, maybe check the database or verify a Kafka message was published. When things work, that's enough. When things break, you get a status code and a generic error message. The interesting part—what happened between the request entering your controller and the error being returned—is invisible.

For a human developer, the next step is opening logs, grepping for request IDs, checking Kafka consumer offsets, maybe attaching a debugger. It's slow, but you have the mental model of the system and you know where to look.

An agent doesn't have that mental model. It has your source code and the test output. Source code tells you what could happen; it doesn't tell you what did happen. The agent doesn't know whether the request reached the controller, whether the database query ran, whether the Kafka producer was even invoked. It's working from a map, not from the terrain.

Traces are the terrain

In Stove, when you enable tracing, every test failure comes with the full execution trace of your application:

POST /api/orders [250ms]
└── OrderService.createOrder [245ms]
    ├── OrderService.checkFraudViaGrpc [30ms]
    │   └── FraudDetectionClient.checkFraud [25ms]
    ├── OrderService.checkInventoryViaRest [40ms]
    │   └── http.url: http://localhost:54648/inventory/macbook-pro-16
    ├── OrderService.processPaymentViaRest [35ms]
    │   └── http.url: http://localhost:54648/payments/charge
    └── OrderService.saveOrderToDatabase [8ms]  ◄── FAILURE POINT
        └── PostgresOrderRepository.save [5ms]
            Error: OrderPersistenceException
            Message: Failed to persist order: amount exceeds threshold
            └── db.system: postgresql

Every controller method, every database query, every Kafka message, every HTTP call to an external service—with timing and the exact point of failure. This is powered by OpenTelemetry. The setup is two lines of configuration; no code changes to your application.

Now think about what an agent sees when it reads this output. It doesn't have to guess. It knows the fraud check passed, the inventory check passed, the payment went through, and the failure happened specifically in PostgresOrderRepository.save with an OrderPersistenceException about an amount threshold. The agent can go straight to that repository class, find the validation logic, and fix it. One cycle, not five.

Why this matters more than you'd think

There's a compounding effect here. Without traces, an agent's debugging loop looks like:

  1. Read error → form hypothesis → change code → run test → still failing
  2. Read new error → form new hypothesis → change code → run test → different error
  3. Repeat until either the fix lands or the context window fills up

Each failed attempt burns context, tokens, and time. The agent is doing trial-and-error, which is the most expensive form of debugging for both humans and machines.

With traces, the loop collapses:

  1. Read error + trace → see exactly where it broke → fix the right thing → done

The trace gives the agent something it otherwise lacks entirely: observability into the runtime. It's the difference between reading a recipe and watching the dish being cooked. The agent doesn't need to simulate the execution in its head; the trace is the execution, laid out step by step.

Production-like tracing, not just any tracing

There's a subtlety worth calling out. Stove uses the same OpenTelemetry agent that you'd use in production. The traces you see in tests are the same kind of traces you'd see in Jaeger or your APM tool. This means the agent is reasoning about behavior that mirrors production, not a test-specific mock of it.

Stove handles the mechanics: it starts an OTLP receiver, injects W3C traceparent headers into every HTTP request, Kafka message, and gRPC call, collects the resulting spans, and correlates them back to the specific test that triggered them. Every test gets its own trace. Traces from concurrent tests never bleed into each other.
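The propagation piece is standard W3C Trace Context: every outgoing call carries a traceparent header made up of a version, a 16-byte trace ID shared by all spans in the trace, an 8-byte parent span ID, and trace flags. As a rough illustration of the format (a sketch of the header, not Stove's actual implementation):

```kotlin
import kotlin.random.Random

// Hypothetical helper: builds a W3C traceparent header value.
// Format: version-traceid-parentid-flags, all lowercase hex.
fun newTraceparent(sampled: Boolean = true): String {
    fun hex(byteCount: Int) = Random.nextBytes(byteCount).joinToString("") {
        (it.toInt() and 0xff).toString(16).padStart(2, '0')
    }
    val traceId = hex(16)  // shared by every span the test triggers
    val parentId = hex(8)  // identifies the span making this call
    val flags = if (sampled) "01" else "00"
    return "00-$traceId-$parentId-$flags"
}

fun main() {
    // Shape: 00-<32 hex chars>-<16 hex chars>-01
    println(newTraceparent())
}
```

Because Stove controls the trace ID it injects per test, it can later pick out exactly the spans that belong to that test, even when tests run concurrently.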

The OpenTelemetry Java Agent instruments over a hundred libraries automatically: Spring, JDBC, Kafka clients, gRPC, HTTP clients, Redis, MongoDB. No @WithSpan annotations are needed (though you can add them for your own methods). You get deep visibility without touching your application code.

What this looks like in practice

Here's a Stove test covering an order flow that touches six integration points:

test("should create order and process payment") {
  stove {
    wiremock {
      mockGet(url = "/inventory/$productId", statusCode = 200,
        responseBody = InventoryResponse(productId, available = true).some())
      mockPost(url = "/payments/charge", statusCode = 200,
        responseBody = PaymentResult(success = true, transactionId = "txn-123").some())
    }

    http {
      postAndExpectBody<OrderResponse>(
        uri = "/api/orders",
        body = CreateOrderRequest(userId, productId, 99.99).some()
      ) { response ->
        response.status shouldBe 201
      }
    }

    postgresql {
      shouldQuery<Order>("SELECT * FROM orders WHERE user_id = '$userId'") { orders ->
        orders.first().status shouldBe "CONFIRMED"
      }
    }

    kafka {
      shouldBePublished<OrderCreatedEvent> { actual.userId == userId }
    }

    tracing {
      shouldNotHaveFailedSpans()
    }
  }
}

When this test passes, the trace is available through the tracing { } DSL if you want to assert on the execution flow. When it fails, the trace is rendered automatically as part of the failure report. Either way, the agent gets structured runtime data about what your application did.

That last line—shouldNotHaveFailedSpans()—is worth highlighting. It's the simplest trace assertion you can write, and it catches any unexpected error anywhere in the entire call chain, not just the parts your test explicitly asserts on. An agent reading a test failure with this assertion immediately knows whether the issue is in the tested flow or in some side effect that the test assertions didn't cover.
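Conceptually, that assertion is just a scan over every collected span for an error status. A minimal sketch of the idea, using a simplified span model rather than Stove's actual OTLP types:

```kotlin
// Simplified span model (hypothetical; Stove's real spans come from the OTLP receiver).
data class Span(val name: String, val statusCode: String)

// Fails if any span in the trace recorded an error, whether or not
// the test's own assertions ever looked at that part of the flow.
fun shouldNotHaveFailedSpans(spans: List<Span>) {
    val failed = spans.filter { it.statusCode == "ERROR" }
    require(failed.isEmpty()) {
        "Found ${failed.size} failed span(s): ${failed.joinToString { it.name }}"
    }
}

fun main() {
    val trace = listOf(
        Span("POST /api/orders", "OK"),
        Span("OrderService.createOrder", "OK"),
        Span("PostgresOrderRepository.save", "ERROR"),
    )
    val result = runCatching { shouldNotHaveFailedSpans(trace) }
    println(result.isFailure) // true: the repository span failed
}
```

The point is the scope: the check covers the whole trace, so a swallowed exception in a Kafka listener or a background write still surfaces as a test failure.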

The agentic coding angle

Here's the core argument: the quality of an agentic coding session is bounded by the quality of the feedback the agent gets from the system it's modifying.

Source code is static context. Test assertions are binary signals (pass/fail). Logs are noisy and unstructured. Stack traces show you where the exception was thrown but not the path that led there.

Traces are none of those things. A trace is a structured, hierarchical, timed record of everything the application did in response to a specific input. It's the closest thing to "showing the agent what happened" that we have. For JVM projects, Stove makes this available with near-zero setup cost: add the tracing dependency, apply the Gradle plugin, put enableSpanReceiver() in your config. That's it.
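Those three steps, sketched as build and configuration fragments. The plugin id and dependency coordinates below are placeholders, not verified against Stove's docs; only enableSpanReceiver() is named above — check the Stove documentation for the exact values:

```kotlin
// build.gradle.kts (sketch; coordinates are placeholders)
plugins {
    id("com.trendyol.stove") // hypothetical plugin id
}
dependencies {
    testImplementation("com.trendyol:stove-testing-e2e:<version>") // hypothetical coordinates
}

// Test configuration (sketch): turn on the span receiver so traces are collected
// tracing {
//     enableSpanReceiver()
// }
```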

When you're in an agentic coding loop, whether you're using Claude, Cursor, Copilot, or anything else, the agent is only as effective as its ability to understand what went wrong. Traces turn a guessing game into a directed fix. They're not just a debugging convenience; they're a feedback channel that makes the entire loop converge faster.

If your e2e tests run against real infrastructure (which they should), and your application is instrumented with OpenTelemetry (which it probably already is or should be), then exposing those traces to your test output is one of the highest-leverage things you can do for agentic workflows. The traces are already there. You just need to surface them where the agent can see them.
