Rethinking Agent Skills, Tools, and Capabilities
When agents underperform, the instinct is to add more tools. Most of the time, that is the wrong diagnosis: the real gap is in skills and governance, not in what the agent can reach.
Hunter Xu
Arc Intelligence
When an agent underperforms, the first response is usually to give it more tools.
Another integration. A larger knowledge base. Access to more APIs. The assumption is that the agent is failing because it cannot reach what it needs. Sometimes this is true. More often, it is not — and adding tools makes the underlying problem harder to see.
The agent is not limited by what it can access. It is limited by whether anyone has encoded how it should think.
The difference between a tool and a skill
A code review agent with access to your entire codebase, your CI pipeline, your issue tracker, and your deployment history can still produce reviews that are technically correct and practically useless — because no one has encoded what a useful code review looks like in your organisation.
What to prioritise. When a finding is worth raising and when it is noise. How to frame the same finding differently for a senior engineer than for a junior one. Whether a performance issue in a hot path is a blocker or a suggestion, given the current release timeline.
Those judgements live in the heads of your best reviewers. They are not in the tool list.
A skill is the encoding of that judgement — not the steps of a review, but the reasoning behind them. When to flag a security issue versus a style comment. What constitutes a blocker. How to read the context of a PR before deciding what to say. That is what makes the difference between a capable agent and a trustworthy one.
Three layers, and where most teams stop
The gap between a prompt and a skill has structure. There are three layers, and most organisations deploying agents are living in the first one.
SOP (standard operating procedure) automation is a fixed sequence: read the PR, check for common issues, output findings in this format. It is consistent when the task fits the template and brittle when it does not. It captures what the expert does, the steps, without capturing why they do it.
Judgement procedure encodes the decision logic that travels with the capability. What to do when the evidence is ambiguous. When the standard path does not apply. What warrants escalation and to whom. This is where the reviewer's actual expertise becomes reusable — not as a transcript of what they did, but as a method that can run without them.
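The contrast between the first two layers can be sketched in a few lines of Python. This is an illustration only: the names (Finding, ReviewContext, Verdict, sop_review, judge), the 0.8 confidence threshold, and the seven-day release window are hypothetical placeholders, not rules taken from any real review policy. The point is the shape: the second function takes context as an input and has an explicit path for ambiguous evidence.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BLOCKER = "blocker"
    SUGGESTION = "suggestion"
    NOISE = "noise"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Finding:
    kind: str          # hypothetical categories: "security", "performance", "style"
    in_hot_path: bool  # whether the code runs on a latency-critical path
    confidence: float  # how sure the reviewer is, from 0.0 to 1.0


@dataclass(frozen=True)
class ReviewContext:
    days_to_release: int


def sop_review(finding: Finding) -> Verdict:
    """Layer 1: a fixed rule that encodes the steps and ignores context."""
    if finding.kind in {"security", "performance"}:
        return Verdict.BLOCKER
    return Verdict.SUGGESTION


def judge(finding: Finding, ctx: ReviewContext) -> Verdict:
    """Layer 2: the same decision with the reasoning made explicit."""
    if finding.kind == "security":
        # Ambiguous evidence is escalated to a person rather than guessed at.
        # The 0.8 threshold is an assumed placeholder.
        return Verdict.BLOCKER if finding.confidence >= 0.8 else Verdict.ESCALATE
    if finding.kind == "performance":
        # Assumed policy: a hot-path issue blocks only when the release is imminent.
        imminent = ctx.days_to_release <= 7
        return Verdict.BLOCKER if finding.in_hot_path and imminent else Verdict.SUGGESTION
    if finding.kind == "style":
        # Low-confidence style comments are treated as noise and not raised.
        return Verdict.SUGGESTION if finding.confidence >= 0.8 else Verdict.NOISE
    # The standard path does not apply to unknown kinds, so hand off.
    return Verdict.ESCALATE
```

Under these assumed rules, the same hot-path performance finding is always a blocker for sop_review, while judge returns a blocker two days before release and a suggestion thirty days out.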
Capability interface turns that judgement into infrastructure: typed inputs, expected outputs, a quality bar the system can check. Now it is testable, auditable, and composable with other capabilities. Other systems can call it. The institution can monitor it and update it independently.
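A minimal sketch of what such an interface might look like, assuming Python dataclasses. Every name here (ReviewRequest, ReviewResult, Capability, stub_review, has_rationale) is hypothetical, and stub_review is a stand-in for whatever model or agent call would actually produce the review. What matters is that inputs and outputs are typed and that the quality bar is a function the system can run on every result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReviewRequest:
    diff: str
    days_to_release: int


@dataclass(frozen=True)
class ReviewResult:
    findings: tuple[str, ...]
    rationale: str


@dataclass(frozen=True)
class Capability:
    name: str
    run: Callable[[ReviewRequest], ReviewResult]
    meets_quality_bar: Callable[[ReviewResult], bool]

    def invoke(self, request: ReviewRequest) -> ReviewResult:
        """Run the capability and refuse to return output that fails the bar."""
        result = self.run(request)
        if not self.meets_quality_bar(result):
            raise ValueError(f"{self.name}: output failed the quality bar")
        return result


def stub_review(request: ReviewRequest) -> ReviewResult:
    # Placeholder logic standing in for a real agent call.
    findings = ("TODO left in diff",) if "TODO" in request.diff else ()
    return ReviewResult(findings=findings, rationale="Scanned the diff for TODO markers.")


def has_rationale(result: ReviewResult) -> bool:
    # An assumed, deliberately simple quality bar: every review must explain itself.
    return bool(result.rationale.strip())


code_review = Capability("code_review", stub_review, has_rationale)
```

Because the check lives on the interface rather than inside the prompt, it can be tightened or replaced without touching the reviewer itself, and any other system that calls code_review.invoke gets the same guarantee.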
The jump from the first layer to the second is where the real engineering of agent systems begins. The third is where they compound. Most of what is called agent prompting lives in the first layer — which is why adding more tools rarely solves the actual problem.
Who decides how the agent thinks
Governance is the question above all three layers: who can invoke this capability, under what conditions, and what happens when it is wrong?
Without an answer, the capability exists but the institution cannot trust it. The hesitation organisations feel about deploying AI agents is rarely about what the agent can do. It is about the absence of anyone having decided how it should think — and who is accountable when it does not.
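The three governance questions can be written down as data instead of being left implicit. The sketch below is one hypothetical shape for that, not a prescribed design: GovernancePolicy, authorize, the role names, and the line limit are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernancePolicy:
    allowed_roles: frozenset[str]  # who can invoke this capability
    max_changed_lines: int         # under what conditions it may run
    accountable_owner: str         # who answers for it when it is wrong


def authorize(policy: GovernancePolicy, role: str, changed_lines: int) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed invocation."""
    if role not in policy.allowed_roles:
        return False, f"role '{role}' may not invoke this capability"
    if changed_lines > policy.max_changed_lines:
        return False, f"diff too large; route to {policy.accountable_owner}"
    return True, "ok"


# Example values are placeholders.
review_policy = GovernancePolicy(
    allowed_roles=frozenset({"maintainer", "ci-bot"}),
    max_changed_lines=500,
    accountable_owner="platform-team",
)
```

A call such as authorize(review_policy, "intern", 10) is refused on role, and an oversized diff from a permitted role is refused with the accountable owner named in the reason, so a refusal always says where the decision goes next.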
Tools extend action space. Skills structure action. Governance constrains action.
More tools, absent the other two, produce more surface area for the same problem.