Companion code: The Java patterns described in this post are implemented in github.com/avinashpoonacha/controlled-autonomy-hitl-java-demo — a Spring Boot demo showing the metacognitive assessment, autonomy decision engine, decoupled HITL layer, and action executor working together.

Imagine a production-support agent watching a failed end-of-day payment batch.
The agent checks application logs, compares the failure with prior incidents, reviews recent deployments, and summarizes the likely cause. So far, this is exactly the kind of work we want agents to do: fast, repetitive, evidence-gathering work that helps humans move faster.
Then the agent recommends the next step: replay the failed payment batch.
That is a different kind of action.
Reading logs is observational. Replaying a payment batch can affect settlement, customer balances, downstream reconciliation, and audit records. The same agentic workflow now contains both low-risk investigation and high-impact operational action.
This is where many enterprise AI designs reach for the familiar answer:
Put a human in the loop.
That answer is not wrong. It is incomplete.
If every meaningful agent action requires approval, we have not built autonomy. We have built a faster way to create approval queues. The next maturity step for enterprise agents is not simply adding more humans into the workflow. It is building systems that understand when human judgment is actually necessary.
Two recent ideas help frame this shift:
- Decoupled human-in-the-loop systems
- Metacognitive AI
The first gives us the oversight architecture. The second gives us the escalation intelligence.
Together, they point toward a better enterprise pattern:
Human oversight should be available by design, but invoked by exception.
The problem with human-in-the-loop by default
Human-in-the-loop is often introduced as a safety blanket.
Teams add approval when they do not fully trust the agent. They add approval when the action feels risky. They add approval when compliance is involved. They add approval when model behavior is hard to explain.
At first, this feels responsible. Over time, it can create operational drag.
Review queues grow. Approvals become repetitive. Humans start rubber-stamping actions because the workflow does not provide enough context for meaningful judgment. The organization ends up with neither full automation nor strong governance.
That is approval sprawl.
The goal should not be to keep humans involved in every step. The goal should be to reserve human involvement for moments where human authority, judgment, or accountability is truly required.
What the Decoupled HITL paper gets right
The paper A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows argues that human oversight should not be embedded ad hoc inside each agent workflow. It proposes treating HITL as an independent system component with explicit interfaces. The paper frames HITL integration around four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel (Cheng & Cheng, 2026).
That is an important architectural move.
In many early agentic systems, approval logic looks like this:
if (risk == HIGH) {
    askManager();
}
That may work for a prototype. It does not scale across an enterprise.
Different teams define risk differently. Different workflows route approval differently. Different systems log decisions differently. Different agents escalate inconsistently.
A decoupled HITL layer changes the model.
Instead of every workflow implementing its own oversight logic, agents call a shared oversight service:
hitlClient.submitForReview(actionRequest);
That HITL service handles the oversight mechanics:
- the condition that triggered review
- the role or authority required
- the decision options available to the human
- the channel where the request is routed
- the audit record that proves what happened
- the timeout, rejection, or escalation behavior
This turns human oversight into a reusable governance capability.
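One way to picture that shared contract is a single request type that carries the four dimensions from the paper plus the timeout behavior. The sketch below is illustrative only: `ReviewRequest`, the enums, and the role and condition strings are hypothetical names, not types from the paper or the companion demo.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical types for illustration only; these names are not taken from
// the paper or from the companion demo.
enum ReviewOption { APPROVE, REJECT, MODIFY }
enum ReviewChannel { TICKETING, CHAT, APPROVAL_PORTAL }
enum TimeoutBehavior { TREAT_AS_REJECTED, ESCALATE_TO_NEXT_ROLE }

record ReviewRequest(
        String triggerCondition,     // intervention condition that fired
        String requiredRole,         // role resolution: who holds the authority
        List<ReviewOption> options,  // interaction semantics: what the reviewer may decide
        ReviewChannel channel,       // communication channel for the request
        Duration timeout,            // how long to wait for a decision
        TimeoutBehavior onTimeout    // what happens when nobody answers
) {
    ReviewRequest {
        if (options == null || options.isEmpty()) {
            throw new IllegalArgumentException("a review needs at least one decision option");
        }
        if (timeout == null || timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        options = List.copyOf(options); // defensive copy keeps the record immutable
    }
}

final class ReviewRequestExample {
    // Assumed values for the payment-batch scenario; the strings are made up.
    static ReviewRequest replayPaymentBatch() {
        return new ReviewRequest(
                "financial action with customer impact",
                "payments-operations-manager",
                List.of(ReviewOption.APPROVE, ReviewOption.REJECT),
                ReviewChannel.TICKETING,
                Duration.ofMinutes(30),
                TimeoutBehavior.ESCALATE_TO_NEXT_ROLE);
    }

    public static void main(String[] args) {
        System.out.println(replayPaymentBatch());
    }
}
```

Because every workflow fills in the same fields, routing, timeouts, and audit logging can be implemented once in the HITL service rather than once per agent.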
But HITL alone does not solve the autonomy problem
A decoupled HITL layer solves the where problem. It tells us where human oversight should live. It does not fully solve the when problem.
An agentic system still needs to decide whether a proposed action should be executed, verified, reviewed by another automated process, escalated to a human, or blocked entirely.
This is where the second paper becomes useful.
The position paper Artificial Intelligence Needs Meta Intelligence — the Case for Metacognitive AI argues that AI systems need metacognitive capabilities: the ability to monitor their own state, estimate task difficulty and risk, and allocate resources based on uncertainty and the cost of mistakes (Chuprov et al., 2026).
That is exactly what enterprise agents need.
Most agent systems are action engines. They can reason, call tools, and execute steps. They are often weaker at deciding how much autonomy they should use in a specific situation.
A mature agentic system needs an explicit self-monitoring layer. It should evaluate risk, confidence, uncertainty, reversibility, customer impact, policy sensitivity, evidence quality, and cost of failure before deciding the next control path.
That is metacognitive controlled autonomy.
The stronger pattern: metacognitive HITL
The strongest architecture combines both ideas.
The agent workflow should not directly decide whether to involve a human. The human should not be pulled into every action. A metacognitive control layer should decide the appropriate level of autonomy.
Agent Workflow
-> Metacognitive Assessment
-> Autonomy Decision Engine
-> HITL Control Plane, only when needed
-> Action Executor
The decoupled HITL layer provides the governance interface.
The metacognitive layer decides whether that interface is needed.
That distinction matters.
HITL is not the destination. HITL is one possible route in a broader autonomy decision system.
A practical enterprise example
Return to the failed payment batch.
The same production-support agent may perform many actions during the incident, but those actions do not carry the same risk.
| Action | Control path | Why |
|---|---|---|
| Read logs | Auto-execute | Observational, no side effects |
| Summarize incident | Auto-execute | Low-risk knowledge work |
| Check recent deployments | Auto-execute | Read-only investigation |
| Compare with prior incidents | Auto-execute | Informational and auditable |
| Recommend restart | Verify with telemetry | Requires current system evidence |
| Identify impacted customers | Verify with source systems | Needs deterministic validation |
| Suggest replaying a batch | Second-pass review | High-impact recommendation |
| Replay payment batch | Human approval | Financial and customer impact |
| Restart production service during business hours | Human approval | Operational blast radius |
| Update customer-facing record | Human approval | Data integrity and compliance |
| Delete audit evidence | Block | Policy violation |
| Bypass approval rules | Block | Governance violation |
This table is the core design point.
Not every agent action deserves the same level of control. The system should distinguish between safe autonomy, verified autonomy, approved autonomy, and blocked behavior.
Java example: controlled autonomy in Spring Boot
A small Java project can make this pattern concrete.
The companion demo is structured as a Spring Boot service with four functional packages and a shared model package:
controlled-autonomy-hitl-java-demo/
    agent/
        AgentWorkflowService.java
        AgentWorkflowController.java
    metacognition/
        MetacognitiveAssessmentService.java
        AutonomyDecisionEngine.java
    hitl/
        HitlService.java
        RoleResolver.java
        AuditLogService.java
    executor/
        ActionExecutorService.java
    model/
        AgentActionRequest.java
        MetacognitiveAssessment.java
        AutonomyDecision.java
        HitlDecision.java
Spring Boot is useful for this demo because it makes the service boundary and REST interface easy to run locally. I used Spring Boot 3.5.14 for the companion project because it is a recent 3.5.x release and Spring Boot is widely familiar to enterprise Java teams (Spring, 2026).
The project is intentionally not a full AI agent. It is a teaching artifact. It shows how a proposed action moves through assessment, routing, HITL, and execution.
Step 1: Model the proposed agent action
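import java.util.Map;
import java.util.UUID;

// Describes the proposed action in business and operational terms.
public record AgentActionRequest(
    UUID workflowId,
    String agentName,
    ActionType actionType,
    String targetSystem,
    String businessContext,
    RiskLevel riskLevel,
    boolean customerImpacting,
    boolean reversible,
    boolean policySensitive,
    Map<String, Object> evidence   // supporting data gathered by the agent
) {}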
This forces the workflow to describe the proposed action in business and operational terms.
The enterprise question is not only whether the agent can perform the action. The more important question is what kind of action it is, what system it affects, and what happens if the agent is wrong.
Step 2: Add metacognitive assessment
public record MetacognitiveAssessment(
    double confidenceScore,
    double uncertaintyScore,
    RiskLevel riskLevel,
    boolean reversible,
    boolean customerImpacting,
    boolean policySensitive,
    String reasoningSummary
) {}
In a production system, these values should not come from model confidence alone.
They should combine model self-assessment with deterministic checks, tool results, historical incident data, policy rules, and business-impact analysis. A model can be confident and still be wrong. Enterprise control decisions need evidence beyond confidence.
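As a minimal sketch of that idea, the confidence score could be capped by how many deterministic checks actually passed. The capping rule, `EvidenceCheck`, and `ConfidenceBlender` are assumptions for illustration, not the companion demo's implementation; a real service would also weigh history, policy, and business impact.

```java
import java.util.List;

// Hypothetical sketch: the type names and the capping rule are assumptions,
// not the companion demo's implementation.
record EvidenceCheck(String name, boolean passed) {}

final class ConfidenceBlender {

    // Caps the model's self-reported confidence by the share of deterministic
    // checks that passed, so a confident model cannot outvote failed evidence.
    static double blendedConfidence(double modelConfidence, List<EvidenceCheck> checks) {
        if (Double.isNaN(modelConfidence) || modelConfidence < 0.0 || modelConfidence > 1.0) {
            throw new IllegalArgumentException("modelConfidence must be in [0, 1]");
        }
        if (checks.isEmpty()) {
            return 0.0; // no deterministic evidence: treat the claim as unverified
        }
        long passed = checks.stream().filter(EvidenceCheck::passed).count();
        double passRate = (double) passed / checks.size();
        return Math.min(modelConfidence, passRate);
    }

    public static void main(String[] args) {
        List<EvidenceCheck> checks = List.of(
                new EvidenceCheck("error signature matches a prior incident", true),
                new EvidenceCheck("telemetry confirms the failing service", false));
        System.out.println(blendedConfidence(0.95, checks)); // prints 0.5
    }
}
```

Here the model reports 0.95, but only one of two checks passed, so the blended score drops to 0.5. The failed telemetry check, not the model's self-assessment, ends up setting the score.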
Step 3: Give the system more options than approve or reject
public enum AutonomyDecision {
    AUTO_EXECUTE,
    VERIFY_WITH_TOOL,
    REQUIRE_SECOND_PASS_REVIEW,
    ESCALATE_TO_HUMAN,
    BLOCK
}
This is where the design becomes more useful than basic HITL.
Not every uncertain situation requires a human. Sometimes the right answer is to run another tool, gather better telemetry, ask for a second automated review, or block the action because it violates policy.
Step 4: Implement the autonomy decision engine
import org.springframework.stereotype.Component;

@Component
public class AutonomyDecisionEngine {

    // Rules are evaluated top to bottom; the first match wins.
    public AutonomyDecision decide(
            AgentActionRequest request,
            MetacognitiveAssessment assessment
    ) {
        // Hard policy violations are blocked before any scoring is considered.
        if (request.actionType() == ActionType.DELETE_AUDIT_EVIDENCE) {
            return AutonomyDecision.BLOCK;
        }
        // High-stakes conditions go straight to a human.
        if (assessment.riskLevel() == RiskLevel.CRITICAL) {
            return AutonomyDecision.ESCALATE_TO_HUMAN;
        }
        if (assessment.policySensitive() && assessment.customerImpacting()) {
            return AutonomyDecision.ESCALATE_TO_HUMAN;
        }
        if (!assessment.reversible() && assessment.riskLevel() == RiskLevel.HIGH) {
            return AutonomyDecision.ESCALATE_TO_HUMAN;
        }
        // Uncertainty and low confidence are handled by automation first.
        if (assessment.uncertaintyScore() > 0.40) {
            return AutonomyDecision.VERIFY_WITH_TOOL;
        }
        if (assessment.confidenceScore() < 0.70) {
            return AutonomyDecision.REQUIRE_SECOND_PASS_REVIEW;
        }
        // Only low-risk, reversible actions run without further checks.
        if (assessment.riskLevel() == RiskLevel.LOW && assessment.reversible()) {
            return AutonomyDecision.AUTO_EXECUTE;
        }
        // Anything not explicitly cleared falls back to human review.
        return AutonomyDecision.ESCALATE_TO_HUMAN;
    }
}
This is the key teaching moment.
The system does not blindly trust the agent. It also does not blindly escalate everything to a human. It routes the action to the right level of control.
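To see how the rule order plays out, here is a self-contained walkthrough that mirrors the rules above with trimmed stand-in types. Every `ActionType` value other than `DELETE_AUDIT_EVIDENCE` is a hypothetical name, and the sample scores are invented for illustration.

```java
// Self-contained stand-ins that mirror the rules above. Every ActionType
// value except DELETE_AUDIT_EVIDENCE is a hypothetical name, and the sample
// scores are invented.
enum ActionType { READ_LOGS, RECOMMEND_RESTART, REPLAY_PAYMENT_BATCH, TUNE_RETRY_POLICY, DELETE_AUDIT_EVIDENCE }
enum RiskLevel { LOW, MEDIUM, HIGH, CRITICAL }
enum AutonomyDecision {
    AUTO_EXECUTE, VERIFY_WITH_TOOL, REQUIRE_SECOND_PASS_REVIEW, ESCALATE_TO_HUMAN, BLOCK
}

record Assessment(
        double confidenceScore,
        double uncertaintyScore,
        RiskLevel riskLevel,
        boolean reversible,
        boolean customerImpacting,
        boolean policySensitive
) {}

final class DecisionWalkthrough {

    // Same rule order as the engine above: the first match wins.
    static AutonomyDecision decide(ActionType type, Assessment a) {
        if (type == ActionType.DELETE_AUDIT_EVIDENCE) return AutonomyDecision.BLOCK;
        if (a.riskLevel() == RiskLevel.CRITICAL) return AutonomyDecision.ESCALATE_TO_HUMAN;
        if (a.policySensitive() && a.customerImpacting()) return AutonomyDecision.ESCALATE_TO_HUMAN;
        if (!a.reversible() && a.riskLevel() == RiskLevel.HIGH) return AutonomyDecision.ESCALATE_TO_HUMAN;
        if (a.uncertaintyScore() > 0.40) return AutonomyDecision.VERIFY_WITH_TOOL;
        if (a.confidenceScore() < 0.70) return AutonomyDecision.REQUIRE_SECOND_PASS_REVIEW;
        if (a.riskLevel() == RiskLevel.LOW && a.reversible()) return AutonomyDecision.AUTO_EXECUTE;
        return AutonomyDecision.ESCALATE_TO_HUMAN;
    }

    public static void main(String[] args) {
        // Low risk, reversible, confident: prints AUTO_EXECUTE
        System.out.println(decide(ActionType.READ_LOGS,
                new Assessment(0.92, 0.05, RiskLevel.LOW, true, false, false)));
        // Uncertainty above 0.40: prints VERIFY_WITH_TOOL
        System.out.println(decide(ActionType.RECOMMEND_RESTART,
                new Assessment(0.80, 0.55, RiskLevel.MEDIUM, true, false, false)));
        // Policy-sensitive and customer-impacting: prints ESCALATE_TO_HUMAN
        System.out.println(decide(ActionType.REPLAY_PAYMENT_BATCH,
                new Assessment(0.90, 0.10, RiskLevel.HIGH, false, true, true)));
        // Medium risk matches no earlier rule: prints ESCALATE_TO_HUMAN
        System.out.println(decide(ActionType.TUNE_RETRY_POLICY,
                new Assessment(0.90, 0.10, RiskLevel.MEDIUM, true, false, false)));
        // Blocked regardless of scores: prints BLOCK
        System.out.println(decide(ActionType.DELETE_AUDIT_EVIDENCE,
                new Assessment(0.99, 0.01, RiskLevel.LOW, true, false, false)));
    }
}
```

Two details are worth noticing. The block rule runs first, so a high confidence score can never rescue a prohibited action. And the final fallback is conservative: a confident, medium-risk action that matches no explicit rule still goes to a human rather than executing by default.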
Step 5: Use HITL only when necessary
public AgentWorkflowOutcome handleProposedAction(AgentActionRequest request) {
    MetacognitiveAssessment assessment = assessmentService.assess(request);
    AutonomyDecision decision = decisionEngine.decide(request, assessment);
    return switch (decision) {
        case AUTO_EXECUTE -> {
            ActionExecutionResult result = actionExecutorService.execute(request);
            yield AgentWorkflowOutcome.executed(request, assessment, result);
        }
        case VERIFY_WITH_TOOL -> AgentWorkflowOutcome.needsVerification(request, assessment);
        case REQUIRE_SECOND_PASS_REVIEW -> AgentWorkflowOutcome.needsSecondPass(request, assessment);
        case ESCALATE_TO_HUMAN -> {
            HitlRequest hitlRequest = hitlService.submitForReview(request, assessment);
            yield AgentWorkflowOutcome.escalated(request, assessment, hitlRequest.decisionId());
        }
        case BLOCK -> AgentWorkflowOutcome.blocked(request, assessment);
    };
}
The HITL control plane is available, but it is not the default path for everything.
That is the difference between human-in-the-loop by default and human-when-necessary by design.
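Escalation is only half of the path; the workflow also has to act on whatever comes back from the reviewer. The sketch below shows one possible fail-closed policy. `ReviewStatus`, `NextStep`, and the choice to re-escalate on timeout are assumptions for illustration, not the companion demo's API.

```java
// Hypothetical sketch: these names and the fail-closed timeout rule are
// assumptions, not the companion demo's API.
enum ReviewStatus { APPROVED, REJECTED, TIMED_OUT }
enum NextStep { EXECUTE, ABANDON, RE_ESCALATE }

final class ReviewResolution {

    // Only an explicit approval authorizes execution. Silence is never
    // treated as consent, so a timeout re-escalates instead of executing.
    static NextStep resolve(ReviewStatus status) {
        return switch (status) {
            case APPROVED -> NextStep.EXECUTE;
            case REJECTED -> NextStep.ABANDON;
            case TIMED_OUT -> NextStep.RE_ESCALATE;
        };
    }

    public static void main(String[] args) {
        for (ReviewStatus status : ReviewStatus.values()) {
            System.out.println(status + " -> " + resolve(status));
        }
    }
}
```

Using an exhaustive `switch` over the enum means that adding a new review status later forces a compile-time decision about what the workflow should do with it.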
How this scales in the enterprise
The demo uses in-memory services so the pattern is easy to understand. In an enterprise implementation, each piece maps to existing platforms and governance capabilities.
| Demo concept | Enterprise implementation |
|---|---|
| AutonomyDecisionEngine | Central autonomy policy service |
| RiskPolicyEngine | OPA, Drools, or internal risk rules |
| RoleResolver | LDAP, Active Directory, Okta, SailPoint |
| HitlService | ServiceNow, Jira, Teams, Slack, internal approval portal |
| AuditLogService | Immutable event store, SIEM, compliance archive |
| ActionExecutorService | Runbooks, deployment tools, change automation |
| MetacognitiveAssessmentService | Model confidence, tool validation, telemetry, eval harness |
At scale, this becomes an agent control plane.
The agent proposes work. The control plane determines the control path. The action executor performs only authorized actions. The audit layer records the evidence, decision, approver, rationale, and outcome.
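A minimal in-memory sketch of such an audit layer might look like the following. The field and class names are assumptions chosen to match the five items just listed; a real deployment would write to durable, tamper-evident storage rather than a list in memory.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: field and class names are assumptions chosen to match
// the items listed above. Real systems need durable, tamper-evident storage.
record AuditEvent(
        Instant recordedAt,
        String evidenceSummary,
        String decision,
        String approver,   // "none" when no human was involved
        String rationale,
        String outcome
) {}

final class AppendOnlyAuditLog {
    private final List<AuditEvent> events = new ArrayList<>();

    // The only mutation offered is append; there is no update or delete.
    synchronized void append(AuditEvent event) {
        events.add(Objects.requireNonNull(event, "event"));
    }

    // Returns an immutable copy so callers cannot alter recorded history.
    synchronized List<AuditEvent> snapshot() {
        return List.copyOf(events);
    }

    public static void main(String[] args) {
        AppendOnlyAuditLog log = new AppendOnlyAuditLog();
        log.append(new AuditEvent(
                Instant.now(),
                "batch failed after a recent deployment",
                "ESCALATE_TO_HUMAN",
                "payments-operations-manager",
                "replay affects settlement and customer balances",
                "approved and replayed"));
        System.out.println(log.snapshot().size()); // prints 1
    }
}
```

Exposing only `append` and an immutable snapshot keeps the API itself from offering a way to rewrite history, which is the property the enterprise-grade equivalents are meant to guarantee at the storage level.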
For production-grade systems, observability matters as much as the routing logic. The workflow should be traceable across the agent request, metacognitive assessment, HITL decision, and downstream action. OpenTelemetry’s Java documentation describes how Java applications can generate telemetry such as traces, metrics, and logs, which is useful for this kind of cross-service accountability (OpenTelemetry, n.d.).
What engineering leaders should measure
The biggest mistake is measuring only task completion.
For controlled autonomy, leaders need to measure whether human intervention is useful.
| Metric | Why it matters |
|---|---|
| Human escalation rate | Shows how often agents need help |
| Human approval rate | Shows whether escalations are meaningful |
| Human rejection rate | Shows whether agents are proposing unsafe actions |
| Human modification rate | Shows whether humans are improving the outcome |
| False escalation rate | Shows unnecessary approval burden |
| Auto-execution success rate | Shows where autonomy is safe |
| Post-action incident rate | Shows whether autonomous actions caused issues |
| Approval latency | Shows workflow drag |
| Policy override rate | Shows governance pressure points |
| Audit completeness | Shows whether decisions are explainable later |
This is how teams move from opinion-based trust to evidence-earned autonomy.
If humans approve the same low-risk action hundreds of times without modification or incident, that action may become a candidate for policy-based auto-approval. If humans frequently reject or modify a certain recommendation, the workflow needs better evidence, better tools, tighter policy, or reduced autonomy.
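That promotion rule can be sketched directly from review history. Everything named here is hypothetical: `ReviewOutcome`, `HumanVerdict`, the action-type strings, and the minimum sample size are illustrative assumptions, and a real policy would likely add time windows and risk-level checks.

```java
import java.util.List;

// Hypothetical sketch: type names, action-type strings, and the sample-size
// threshold are assumptions for illustration.
enum HumanVerdict { APPROVED, MODIFIED, REJECTED }

record ReviewOutcome(String actionType, HumanVerdict verdict, boolean incidentAfterExecution) {}

final class AutonomyMetrics {

    private static List<ReviewOutcome> forAction(List<ReviewOutcome> outcomes, String actionType) {
        return outcomes.stream().filter(o -> o.actionType().equals(actionType)).toList();
    }

    // Share of reviews for one action type that were approved unchanged.
    static double approvalRate(List<ReviewOutcome> outcomes, String actionType) {
        List<ReviewOutcome> relevant = forAction(outcomes, actionType);
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long approved = relevant.stream()
                .filter(o -> o.verdict() == HumanVerdict.APPROVED)
                .count();
        return (double) approved / relevant.size();
    }

    // Candidate only if there are enough samples, every one was approved
    // unchanged, and none was followed by an incident.
    static boolean isAutoApprovalCandidate(
            List<ReviewOutcome> outcomes, String actionType, int minSamples) {
        List<ReviewOutcome> relevant = forAction(outcomes, actionType);
        return relevant.size() >= minSamples
                && relevant.stream().allMatch(o ->
                        o.verdict() == HumanVerdict.APPROVED && !o.incidentAfterExecution());
    }

    public static void main(String[] args) {
        List<ReviewOutcome> history = List.of(
                new ReviewOutcome("RESTART_STAGING_SERVICE", HumanVerdict.APPROVED, false),
                new ReviewOutcome("RESTART_STAGING_SERVICE", HumanVerdict.APPROVED, false),
                new ReviewOutcome("RESTART_STAGING_SERVICE", HumanVerdict.APPROVED, false),
                new ReviewOutcome("REPLAY_PAYMENT_BATCH", HumanVerdict.MODIFIED, false));
        System.out.println(isAutoApprovalCandidate(history, "RESTART_STAGING_SERVICE", 3)); // prints true
        System.out.println(isAutoApprovalCandidate(history, "REPLAY_PAYMENT_BATCH", 1));    // prints false
    }
}
```

The check is deliberately strict: a single modification or post-action incident disqualifies the action type, which keeps promotion to auto-approval tied to evidence rather than to reviewer fatigue.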
Why this matters for governance
This pattern also aligns with where AI governance is heading.
The NIST AI Risk Management Framework is designed to help organizations manage AI risks and organizes risk work around govern, map, measure, and manage functions (National Institute of Standards and Technology, 2023).
The EU AI Act’s Article 14 requires high-risk AI systems to be designed so they can be effectively overseen by natural persons, with oversight measures commensurate with risk, autonomy, and context of use. It also highlights the need for humans to understand system capabilities and limitations, monitor operation, avoid over-reliance, interpret output, override decisions, and stop the system where appropriate (European Union, 2024).
The practical lesson is not that every AI action needs a human approval step. The lesson is that human oversight must be meaningful, auditable, and proportionate to risk.
That is why metacognitive HITL matters.
The leadership takeaway
For engineering leaders, this changes the conversation.
The basic question is how to put a human in the loop. The better question is how to design an agentic system that knows when human judgment is required.
Human-in-the-loop by default slows the organization down. Full autonomy without controls creates unacceptable risk. Metacognitive controlled autonomy gives us a middle path.
It lets agents act independently when the action is low-risk, reversible, and evidence-backed. It forces verification when uncertainty is high. It escalates to humans when judgment, authority, or accountability is required. It blocks actions that violate policy.
That is how enterprise autonomy should mature.
Closing
The future of agentic systems is not humans approving every action. It is also not agents acting without oversight.
The future is calibrated autonomy.
Agents should know when to act. They should know when to verify. They should know when to escalate. They should know when to stop.
A decoupled HITL layer gives enterprises the governance interface. Metacognitive AI gives agents the awareness to use that interface only when needed.
The goal is not to remove humans from the system.
The goal is to stop wasting human judgment on decisions the system can safely handle, and to preserve human authority for the moments where it truly matters.
That is the difference between automation with approvals and controlled autonomy.
References
Cheng, E., & Cheng, J. (2026). A decoupled human-in-the-loop system for controlled autonomy in agentic workflows. arXiv. https://arxiv.org/abs/2604.23049
Chuprov, S., Lange, R. D., Reznik, L., Shakarian, P., Zatsarenko, R., & Korobeinikov, D. (2026). Position: Artificial intelligence needs meta intelligence — the case for metacognitive AI. arXiv. https://arxiv.org/abs/2605.15567
European Union. (2024). EU AI Act, Article 14: Human oversight. https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-14
National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework
OpenTelemetry. (n.d.). OpenTelemetry Java documentation. https://opentelemetry.io/docs/languages/java/
Spring. (2026). Spring Boot 3.5.14 available now. https://spring.io/blog/2026/04/23/spring-boot-3-5-14-available-now/