
Agents Should Be More Opinionated

The best agent products aren’t the most flexible; they’re the most opinionated. In this post we’ll discuss why that is. TLDR, here’s my advice:

Build opinionated agents. Take a stand on tool design and prompts. You want success on your tasks, not every task. Add more customizability later.


The future of intelligence is hyper-specialized agents crafted from very intelligent models. - Me

Principles of Opinionated Agent Design

The goal in agent products, like any other product, is to give users a delightful experience. A good baseline for agents that do work is that everything works reliably without tweaking too many settings. Good product design is the history of creators distilling their vision into an intuitive interface that just works.

Let’s distill some principles for opinionated agent design:

  1. Agents need fewer knobs, not more. Your users will thank you.
  2. Do A LOT of work on behalf of the user.
  3. Obsess over prompts and tools; it’s the easiest way to transfer your team’s knowledge & opinions into your product.
  4. Dogfooding and evals are how you measure the effectiveness of your current set of opinions.

Opinions, product feedback, and evals make up the flywheel of agent design.

Principles in Practice

What user is excited to tweak the temperature and chunking strategy? None…literally none. This is the flexibility trap: thinking that users want choices when they really want outcomes. Steve Jobs and the iPhone moment were a genius example of interface design - one button, one screen. But this didn’t actually limit any capabilities; it just limited the surface area for user interactions with the product. The magic is that the product still works reliably from just a few user touch points.

A great quote from Cursor’s way of working:

“Do we need this setting?”, “Could we get there in fewer clicks?”, “How can we streamline?”, “Does anyone use? Can we kill?”

That’s the same energy we want in product design with AI agents. We should do A TON of work upfront, so that the baseline agent is great.

What might that work entail?

Ideally, your team is an accurate reflection of your target customer base. When this is true, you get an incredible product iteration flywheel via dogfooding; again, Cursor is a great example. Dogfood your product to get a vibe for where it works and doesn’t. Not everything useful can be easily measured, but your team can be the human bridge to fill that gap.

The Myth of Truly “General Purpose” Agents

Let’s start with a simplified mental model:

Agent = Opinionated Harness + Model

Bake your opinions into the model harness. It’s the fastest way to make an agent do your task well.

I discuss harnesses in a previous post, but TLDR: a harness wraps the model to make it optimally useful for your task, with prompts, tools, context management and docs, subagents, etc. The harness is where your opinions live, and it’s where you encode what “good” looks like. When people say they want a “general purpose” agent, what they’re actually describing is a tradeoff:

“I’m willing to accept lower task performance in exchange for spending less time on harness engineering. It’d be great to use this agent for many of my tasks with this default harness.”

That’s a valid choice. Sometimes you need to prototype quickly, or maybe the out-of-the-box performance is good enough without too much custom engineering. But here’s the trap: most builders default to “general purpose” not because they’ve consciously made this tradeoff, but because they haven’t picked their own set of opinions yet.
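To make this concrete, here is a minimal sketch of what baking opinions into a harness can look like. Everything below is illustrative: the task (drafting release notes), the class, the tool names, and call_model are assumptions for the example, not any real framework’s API.

```python
# Illustrative sketch only: an opinionated harness for ONE task (drafting release notes).
# The class, tool names, and call_model() are hypothetical, not from a real framework.
from dataclasses import dataclass


def call_model(model: str, system_prompt: str, messages: list[dict], tools: tuple) -> str:
    """Placeholder for your model provider's chat call; wire up your own client here."""
    raise NotImplementedError


@dataclass(frozen=True)
class ReleaseNotesHarness:
    # Opinions are baked in, not exposed as knobs.
    model: str = "your-chosen-model"  # picked via your own evals, not a user dropdown
    system_prompt: str = (
        "You draft release notes. Group changes by user impact, "
        "cite PR numbers, and never invent features."
    )
    tools: tuple = ("list_merged_prs", "read_changelog", "post_draft")  # small, fixed toolset
    max_steps: int = 8  # bounded agent loop, tuned by dogfooding

    def run(self, repo: str) -> str:
        """The user supplies intent (a repo); every other decision is already made."""
        messages = [{"role": "user", "content": f"Draft release notes for {repo}."}]
        return call_model(self.model, self.system_prompt, messages, self.tools)
```

The specific fields don’t matter; the point is that the user calls run("my-repo") and gets an outcome, while the prompt, toolset, and loop bounds stay your problem.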

Models Are Non-Fungible

Hot take…you can’t really evaluate a model decoupled from its harness. They’re co-dependent.

Model intelligence is spiky. When you design a harness, you’re implicitly designing around your model’s strengths and weaknesses. This means an “upgrade” to a new model often breaks your existing harness because the new model has different spikes. Carefully tuned prompts no longer yield the same behavior, tool calling may suddenly result in new failure modes…you get it.

The only question that matters is: does this harness + model pair succeed at my task?

This is a markedly different question from “should this new model work based on the latest benchmark scores?” Does it work reliably, on your task, with your users, and your data? Serious dogfooding and tasteful evals are important here: they measure performance on real problems, not just benchmarks.
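As a rough illustration, a task-level eval can be as small as a handful of real cases from your product plus a pass/fail check. The cases, labels, and the callable interface below are made up for the sketch.

```python
# Illustrative sketch: score harness + model PAIRS on your own task cases,
# not the model alone on a public benchmark. Cases and interfaces are invented.

TASK_CASES = [
    {"prompt": "Draft release notes for acme/webapp v2.3", "must_mention": ["breaking change", "migration"]},
    {"prompt": "Draft release notes for acme/webapp v2.4", "must_mention": ["bug fix"]},
]


def task_success(output: str, case: dict) -> bool:
    """Task-specific check: did the draft cover what a reviewer would insist on?"""
    return all(term.lower() in output.lower() for term in case["must_mention"])


def evaluate(candidates: dict, cases: list = TASK_CASES) -> dict:
    """candidates maps a label like 'harness-v2 + model-x' to a callable: prompt -> output."""
    scores = {}
    for label, run in candidates.items():
        wins = sum(task_success(run(case["prompt"]), case) for case in cases)
        scores[label] = wins / len(cases)
    return scores
```

A model “upgrade” then means adding a new harness + model pair as a candidate and comparing, rather than assuming the benchmark winner transfers to your task.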

The Starting Sweet Spot is Deep and Narrow

So if “general purpose” is a tradeoff and task performance is what matters, where should you invest your time? Start with deep and narrow agents. Pick a task, reduce the surface area to a small set of behaviors that matter, and make those behaviors work reliably.

Here are two failure modes that early teams often fall into:

  1. Wide agents try to handle too many different kinds of tasks. Every additional capability is surface area for bugs, edge cases, and confused behavior. Wide agents are impressive in demos, but frustrating in production, where actions don’t work reliably enough to trust.
  2. Shallow agents aren’t complex enough to justify being agents. If there’s no iteration with a user, no judgment, and no multi-step reasoning required, it probably shouldn’t be an agent. I wrote about agents and workflows in a previous post; it may help you think about when and how to use an agent vs. a workflow.

The sweet spot is narrow enough that you can optimize ruthlessly and deep enough that the complexity justifies the investment. To start, find the 10% of tasks that yield the majority of value, agentify those, and ignore the rest.

Everyone is Becoming More Opinionated, Even Model Labs

Models are getting much more intelligent, but to do good work reliably they still need to be massaged into an agent harness for each task. Why does Anthropic have dedicated teams for Life Sciences and Finance? Today it’s not to build a specialized foundation model for just finance. Instead they’re mapping the problem space, optimizing the agent harness (prompts, tools, context, subagents), and designing data so the next model iteration is natively trained on valuable tasks. If you’re solving a life sciences task, it’s extremely beneficial to have tooling and instructions purpose-built for that domain. Today you can achieve better task performance by being hyper-obsessed with your task harness design. You’ll see similar opinionated design principles in the harnesses of agent products like Claude Code and Codex, with built-in tools and context management.

The first waves of LLMs gave us tons of general-purpose tooling (frameworks, API wrappers, design principles, etc.). A great way to win today is to take that broad stack and narrow it with your opinions. For example, LangGraph and LangChain helped builders navigate the LLM landscape with useful abstractions like model providers, prompt templates, and node-based orchestration. Now with DeepAgents, they have an opinionated agent harness that carefully selects useful presets like filesystem support, a built-in tool for planning, prompts, etc. There’s value in options, but good, opinionated defaults matter too; that’s what DeepAgents gives you, and you can customize from there if you want.

Amp Code takes a similar approach. They’re going after a single use case: coding. To do this well, they dogfood their product relentlessly and bake their learnings directly into the product. This includes what model to use and what subagents/capabilities are useful (e.g., Librarian). They don’t want to give you a million choices; they want you to succeed at developing code, and the best way they can do that is by trying it themselves and transferring their learnings to you via the product.

Being Opinionated Today

Here’s the uncomfortable truth in today’s agent products: you probably have too many options and not enough opinions.

But if you’re open to changing that, here are a few places you can start today.

  1. Audit your configuration surface. For every choice your user can make, ask “Do we know the right answer?” If yes, hardcode it (see the sketch after this list). If no, figure it out…that’s your job, not theirs.
  2. Narrow your focus to a single task to start. “We’re building an agent that helps with X” is not specific enough. Pick a single workflow and optimize ruthlessly for it.
  3. Evaluate models on your tasks, not benchmarks. The question was never “is Opus 4.5 better than Gemini 3 Pro?” The question is “does this model, with this harness, succeed at my task?” Build for that future.
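Here is point 1 sketched in code. The knob names are invented for illustration; the audit turns every setting you already know the answer to into a constant and keeps only the questions you genuinely cannot answer for the user.

```python
# Illustrative only: the knob names below are invented for the example.
from dataclasses import dataclass


@dataclass
class AgentConfigBefore:
    # Every internal decision leaks out as a user-facing knob.
    temperature: float = 0.7
    chunk_size: int = 512
    chunk_overlap: int = 64
    retriever: str = "hybrid"
    max_tool_calls: int = 20


# After the audit: validated decisions become constants ("do we know the right answer?" -> yes).
TEMPERATURE = 0.2
CHUNKING = {"size": 512, "overlap": 64}
MAX_TOOL_CALLS = 20


@dataclass
class AgentConfigAfter:
    workspace: str          # the one thing only the user can tell you
    tone: str = "concise"   # kept as a knob because users genuinely disagree here
```

The before/after diff is the audit: anything that moved from a field to a constant is an opinion you now own, and anything still a field is a question you owe your users an answer to eventually.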

The irony of “general purpose” tooling is that it pushes the hard work onto users. Opinionated design is harder for us builders: we have to make calls, accept tradeoffs, be wrong sometimes, and face the music (i.e., people on X who have opinions). But that’s exactly why it produces better products: you’re deep in the weeds doing the hard work every day.

Here’s to more opinions in more products. If you agree and especially if you disagree, I’d love to hear from you. Happy building!

