PoVSmith: Turning Vulnerability Findings Into Executable Evidence

A practical look at PoVSmith and how enterprise Java teams can move from scanner-driven vulnerability tickets to generated proof-of-vulnerability tests and remediation evidence.

The real enterprise problem: 100 repos, one urgent vulnerability email

PoVSmith — turning vulnerability findings into executable evidence for enterprise Java teams

Imagine a central cyber security team identifies a critical Spring ecosystem vulnerability across the enterprise.

The scanner report says that 100 repositories may be affected because they include an affected version of org.springframework.security:spring-security-web. The central cyber team does what many enterprise teams do today: it creates a ticket, sends an email, attaches the CVE, and asks every application team to remediate as soon as possible.

The email looks something like this:

Subject: Action Required: CVE-2024-38821 detected in your repository

Our scanners identified vulnerable Spring Security versions in your application.
Please upgrade to the fixed version and provide remediation evidence by Friday.

On paper, this looks reasonable. Security found a risk. Teams were notified. Remediation was requested.

In practice, this creates confusion.

Some teams immediately upgrade without understanding whether their application is actually exposed. Some teams push back because they believe the scanner is only detecting a dependency and not a reachable vulnerability. Some teams do not know whether they are using the affected framework path. Some teams have failing tests after the upgrade and need help. Some teams close the finding after a dependency bump, but there is no executable evidence that the original exposure was real or that the fix changed the risky behavior.

This is where the gap appears.

The enterprise does not only need vulnerability detection. It needs vulnerability evidence.

A concrete example: Spring Security CVE-2024-38821

Let us use CVE-2024-38821 as a practical example throughout this post.

The official Spring advisory describes this as an authorization bypass issue affecting static resources in Spring WebFlux applications (VMware Tanzu, 2024). The important detail is that the CVE does not affect every Spring Boot application that happens to carry Spring Security. For the application to be impacted, all of the following must be true:

The application is a WebFlux application.
The application uses Spring’s static resource support.
The application applies a non-permitAll authorization rule to those static resources.

That conditional nature matters. A scanner may detect that a vulnerable version exists in the dependency tree, but that does not automatically tell us whether the application is exploitable through its real routes, configuration, and runtime behavior.

The Spring advisory lists affected Spring Security versions, including 6.2.0 - 6.2.6, and identifies 6.2.7 as the fixed version for the 6.2.x line. GitHub’s advisory database similarly lists the affected and patched version ranges for org.springframework.security:spring-security-web (GitHub Advisory Database, 2024).

So now imagine the enterprise scanner identifies 100 repositories with a vulnerable Spring Security version.

The old question is:

Which teams upgraded the dependency?

The better question is:

Which repositories are actually exposed, and can we prove the behavior before and after remediation?

That is the shift PoVSmith helps us think about.

How enterprises commonly handle this today

Most large organizations already have the basic security machinery:

Software composition analysis tools
Central vulnerability dashboards
Dependency scanning
Ticket generation
Security SLAs
Exceptions and risk acceptance workflows
Periodic reporting to leadership

Those capabilities are necessary, but they often produce a workflow that looks like this:

Scanner finding
    ↓
Central ticket or email
    ↓
Application team analysis
    ↓
Manual dependency upgrade
    ↓
Scanner rerun
    ↓
Finding closed

The weakness is not that scanners are bad. The weakness is that the workflow is mostly report-driven.

The scanner tells us that a vulnerable library version exists. It does not always prove that the vulnerable code path is reachable through the application. It does not always give developers a failing test. It does not always produce a regression test that remains in the codebase after the fix. It does not always provide audit-friendly evidence beyond “the version changed.”

For a small number of applications, manual review might be manageable. For 100 repositories, manual review becomes inconsistent. Every team interprets the CVE differently. Every team writes different evidence. Some teams over-remediate. Some teams under-remediate. Security spends time chasing status instead of improving the system.

This is the opportunity: use AI and code intelligence to turn vulnerability findings into generated, executable, reviewable evidence.

What PoVSmith proposes

PoVSmith is a 2026 research paper titled “Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software” (Feng et al., 2026). The paper focuses on a very practical software security problem: modern applications depend on third-party libraries, and when a vulnerable library API is reachable through application code, the application may be vulnerable.

The paper argues that developers often need concrete proof-of-vulnerability tests to determine whether a reported dependency vulnerability is a practical risk. Manually writing those tests is hard because the developer has to understand the vulnerable library API, the application call path, the application context, and the expected exploit or failure behavior.

PoVSmith proposes a structured agentic workflow that combines:

Call-path analysis to identify application entry points that reach vulnerable library APIs.
Exemplar tests that show how the vulnerability appears at the library level.
Application code context so the generated test fits the real application.
Iterative execution feedback so the agent can compile, run, observe failures, and refine the test.
LLM-based assessment grounded in the generated test and execution logs.

This is important because PoVSmith is not simply “ask an LLM to write a test.” It is closer to an evidence-generation pipeline.

The authors evaluated PoVSmith on 33 Java application/library pairs. According to the paper, PoVSmith identified 158 unique application-level entry points calling vulnerable library APIs, correctly found 152 of them, generated 152 tests, and produced 84 tests that demonstrated feasible attack paths through the application (Feng et al., 2026).

Those numbers are promising, but they also show why this should be treated as an assisted engineering workflow, not blind automation. If 84 of 152 generated tests successfully demonstrated the vulnerable behavior, then the agent is useful, but still needs guardrails, review, and deterministic validation.

Translating PoVSmith into the Spring Boot example

Let us return to the 100 repositories flagged for CVE-2024-38821.

A PoVSmith-inspired enterprise workflow would not stop at “send teams an email.” It would create an evidence-generation workflow.

Central cyber scanner finding
    ↓
CVE-specific playbook
    ↓
Repository discovery
    ↓
Framework and configuration analysis
    ↓
Safe test generation
    ↓
Sandboxed CI execution
    ↓
Remediation PR
    ↓
Regression evidence

For this CVE, the playbook would encode the conditions from the advisory:

CVE: CVE-2024-38821

Potentially affected package:
org.springframework.security:spring-security-web

Affected line example:
6.2.0 - 6.2.6

Fixed line example:
6.2.7 for 6.2.x

Applicability conditions:
1. Application uses Spring WebFlux
2. Application uses Spring static resource support
3. Application applies a non-permitAll authorization rule to static resources

Then the evidence workflow would classify the 100 repositories.

100 repositories flagged by dependency scanner

45 repos:
Spring Security version is present, but application appears to be Spring MVC rather than WebFlux.
Evidence outcome: likely not affected by this CVE condition.

20 repos:
Application uses WebFlux, but no protected static resources are detected.
Evidence outcome: generated non-applicability note for app team review.

25 repos:
Application uses WebFlux and static resources, but authorization configuration is unclear.
Evidence outcome: request targeted app-team review with generated analysis.

10 repos:
Application appears to match all CVE conditions.
Evidence outcome: generate safe WebTestClient regression test and remediation PR.

This classification is where the enterprise starts to win. Instead of treating all 100 findings equally, the workflow uses dependency data, source code, framework configuration, and security rules to separate exposure from noise.

What does “generate a safe JUnit/WebTestClient test” mean?

In this example, the generated test is not a public exploit recipe. It is a controlled regression test that runs inside the repository’s test environment and validates the intended authorization behavior.

For a WebFlux application, the test might use WebTestClient to make an unauthenticated request against a static resource that is supposed to be protected.

The baseline test shape could look like this:

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class Cve202438821StaticResourceAuthorizationTest {

    @Autowired
    private WebTestClient webTestClient;

    @Test
    void protectedStaticResourceShouldRejectAnonymousAccess() {
        webTestClient
            .get()
            .uri("/protected/report.html")
            .exchange()
            .expectStatus()
            .value(status -> assertThat(status).isIn(401, 403));
    }
}

The point of this test is simple:

If a resource is intended to be protected, anonymous access should not succeed.

The actual enterprise generator would make this test repo-specific. It would inspect the application to determine:

Does the repo use Maven or Gradle?
Does the repo use JUnit 5?
Does the repo already use WebTestClient?
Where are integration tests located?
What is the test profile?
Which static resource paths exist?
Which paths are protected by Spring Security rules?
What status code does this application expect for anonymous access: 401 or 403?
What dependency change is required to move to the fixed version?

That is why this cannot be a one-size-fits-all snippet sent in an email. The value comes from code-aware generation.

Who generates the test?

The test should be generated by an internal security evidence agent, not manually by every application team.

The ownership model could look like this:

Cyber security team:
Defines the CVE playbook, severity, fixed versions, safe test intent, and remediation policy.

Platform engineering:
Runs the automation, source-code analysis, CI sandbox, PR generation, and dashboarding.

Application team:
Reviews the generated test and PR, validates business behavior, resolves app-specific issues, and merges the fix.

This is an important operating model.

Cyber should not have to manually write JUnit tests for every application. Application teams should not have to reverse-engineer every CVE from scratch. Platform engineering can provide the automation layer that connects scanners, code search, build systems, test standards, and pull requests.

This is where tools like Sourcegraph, GitHub code search, internal software catalogs, CI pipelines, and AI coding agents can work together.

The practical enterprise architecture

An internal implementation could look like this:

1. Scanner detects vulnerable dependency
   Example: spring-security-web 6.2.5

2. CVE playbook is loaded
   Example: CVE-2024-38821 requires WebFlux + static resources + non-permitAll rule

3. Code intelligence layer scans repos
   Example: find WebFlux dependencies, SecurityWebFilterChain configuration, static resource mappings

4. Agent builds repo-specific context
   Example: test framework, build command, package structure, existing security tests

5. Agent generates candidate evidence
   Example: Cve202438821StaticResourceAuthorizationTest.java

6. CI executes the test in a sandbox
   Example: run against current dependency version and patched dependency version

7. Agent prepares remediation PR
   Example: update Spring Security version, add regression test, attach evidence summary

8. App team reviews and merges
   Example: team validates behavior and accepts the fix

This is the practical version of PoVSmith for the enterprise.

It converts a scanner alert into a pull request with evidence.

What the application team receives

Instead of receiving only an urgent email, the application team receives a focused remediation package:

Repository:
customer-document-service

Finding:
CVE-2024-38821

Why this repository may be affected:
- Uses Spring WebFlux
- Uses Spring Security
- Uses static resource handling
- Static resources under /protected/**
- Authorization rules appear to require authentication

Generated test:
src/test/java/.../security/Cve202438821StaticResourceAuthorizationTest.java

Remediation:
Upgrade Spring Security 6.2.x line to 6.2.7 or approved enterprise baseline

Evidence:
- Test result before remediation
- Test result after remediation
- CI log
- Dependency diff
- App-team review checklist

Now the conversation changes.

It is no longer:

Security says we are vulnerable.

It becomes:

Here is the code path, here is the generated test, here is the upgrade, and here is the evidence.

That is a much healthier conversation between cyber and engineering.

Pros of a PoVSmith-inspired enterprise model

1. It reduces false urgency

Not every dependency finding is equally exploitable in every application. By checking framework usage, configuration, and reachable behavior, the organization can prioritize the repositories that appear truly exposed.

2. It gives developers something concrete

Developers are more likely to act quickly when they receive a failing test, a proposed fix, and a passing result after remediation.

3. It creates audit-ready evidence

A dependency upgrade is useful. A dependency upgrade plus a regression test plus CI logs is better. This matters in regulated environments where teams need to demonstrate how a risk was assessed and remediated.

4. It scales security expertise

Cyber can encode the CVE playbook once. Platform engineering can run it across many repos. App teams can focus on review and merge decisions instead of every team independently interpreting the advisory.

5. It improves long-term regression coverage

The generated test can remain in the repository. That means the organization does not just fix the CVE once; it adds a guardrail against similar behavior returning later.

Cons and risks

1. Generated tests can be wrong

PoVSmith’s results are promising, but not perfect. In the paper’s evaluation, only a portion of generated tests successfully demonstrated feasible vulnerability behavior. That means generated tests must be reviewed and executed, not blindly trusted.

2. Applicability analysis can miss context

A repo may have unusual routing, custom static resource handling, environment-specific behavior, or security rules that static analysis does not fully understand.

3. Tests can create false confidence

A passing generated test does not prove the entire application is secure. It proves that one generated scenario behaved as expected.

4. Security test generation needs strict guardrails

An enterprise should avoid generating dangerous payloads, external network calls, credential access, destructive actions, or tests that resemble exploit tooling. The goal is safe, controlled evidence in a test environment.

5. Ownership can become unclear

If cyber owns the finding, platform owns the automation, and app teams own the code, the workflow needs clear decision rights. Who approves the playbook? Who approves the PR? Who accepts an exception? Who maintains the generated tests?

Guardrails I would require before scaling this

For a real enterprise rollout, I would require these guardrails:

1. CVE playbooks must be reviewed by cyber security.
2. Generated tests must run only in sandboxed CI.
3. No generated test should call external systems.
4. No generated test should use real credentials or customer data.
5. Dangerous payloads should be blocked or replaced with safe regression variants.
6. Every generated PR must be reviewed by the application team.
7. Evidence summaries should include confidence level and limitations.
8. Exceptions should require explicit risk acceptance.
9. Generated tests should follow repo-specific test standards.
10. Successful remediation should include dependency diff, test result, and CI log.

The goal is not autonomous exploitation. The goal is controlled, reviewable evidence generation.

The leadership takeaway

PoVSmith matters because it points toward a more mature model of security remediation.

The future is not just:

Find vulnerable dependency.
Send ticket.
Ask teams to upgrade.

The future is closer to:

Find vulnerable dependency.
Analyze reachability.
Generate safe evidence.
Open remediation PR.
Run CI.
Keep regression test.
Close with proof.

For engineering leaders, this is the real opportunity.

AI should not merely make developers type code faster. It should help the organization reduce ambiguity, create evidence, and improve the quality of engineering decisions at scale.

In the Spring Boot CVE example, the central cyber team should not have to send 100 generic emails and hope every team interprets the finding correctly. A PoVSmith-inspired workflow can classify exposure, generate safe tests, produce remediation PRs, and give teams concrete evidence to review.

That is how vulnerability management moves from static reports to executable trust.

References

Feng, Y., Li, M., Ding, J., & Liu, Y. (2026). Generating proof-of-vulnerability tests to help enhance the security of complex software. arXiv. https://arxiv.org/abs/2605.03956

GitHub Advisory Database. (2024). GHSA-c4q5-6c82-3qpw: Spring Security authorization bypass of static resources in WebFlux applications (CVE-2024-38821). GitHub. https://github.com/advisories/GHSA-c4q5-6c82-3qpw

VMware Tanzu. (2024). CVE-2024-38821: Authorization bypass of static resources in WebFlux applications. Spring. https://spring.io/security/cve-2024-38821/