AI Implementation · April 29, 2026 · 7 min read

How to Use LangSmith to Monitor and Improve AI Agents

LangSmith gives teams full visibility into AI agent behavior. Learn to set it up, monitor traces, and use the data to improve performance.

The short answer: Connect your LangChain or custom agent to LangSmith by setting three environment variables, then use the platform's tracing dashboard to inspect every step of every run. From there, build evaluation datasets from real traces, run automated tests, and close the loop with human feedback. That cycle, repeated consistently, is how agents improve.

Why Agent Monitoring Is Harder Than It Looks

Building an AI agent that works in a demo is one thing. Understanding why it fails on a Tuesday afternoon with a real user, in a real workflow, is another problem entirely.

LLM outputs are non-deterministic. A retrieval step that performs well 90% of the time still fails 10% of the time, and in a multi-step agent, that failure compounds. A single wrong tool call at step two can make the final output look reasonable while being completely wrong. You won't catch that by reading logs.
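A quick back-of-the-envelope calculation makes the compounding concrete. This sketch assumes every step succeeds independently with the same probability, which is a simplification, since real agent steps are often correlated.

```python
def chain_success_rate(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_success ** steps


# How a 90% per-step success rate decays as the agent takes more steps.
for steps in (1, 3, 5, 10):
    rate = chain_success_rate(0.9, steps)
    print(f"{steps:>2} steps at 90% each -> {rate:.1%} end-to-end")
```

Under that assumption, a five-step agent finishes cleanly about 59% of the time, and a ten-step agent about 35% of the time.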

This is the gap LangSmith was built to close. Developed by LangChain, it gives teams a structured way to observe, debug, and evaluate agent behavior at the run level, not just the output level. As of 2026, it's one of the more mature observability tools in the agentic AI space, and teams at companies like Cohere and Replit have used it in production pipelines.

Here's how to use it well.

Step One: Connect Your Agent to LangSmith

Setup is genuinely fast. If you're already using LangChain, you need three environment variables:

LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_api_key
LANGCHAIN_PROJECT=your_project_name

Once those are set, every run your agent makes is automatically traced and sent to LangSmith. No additional instrumentation is required for standard LangChain components.
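A missing variable tends to fail quietly: the agent runs, but no traces appear. A small startup check can catch that early. The sketch below uses only the standard library, and the helper name `missing_tracing_vars` is made up for illustration.

```python
import os

# The three variables named above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_tracing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_tracing_vars()
if missing:
    print("Tracing is not fully configured; missing:", ", ".join(missing))
else:
    print("Tracing variables are set for project", os.environ["LANGCHAIN_PROJECT"])
```

Passing a plain dictionary instead of `os.environ` makes the check easy to unit test.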

If you're using a custom agent built outside of LangChain, you can use the LangSmith SDK directly. The `traceable` decorator wraps any Python function and sends trace data to the platform. This matters because many teams in 2026 are building hybrid systems where LangGraph handles orchestration but custom retrieval or tool-execution logic lives outside the framework.
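As a rough sketch of what that instrumentation can look like, the example below decorates two plain Python functions. The functions, their names, and the tiny in-memory corpus are hypothetical. The `try`/`except` fallback is an assumption added so the sketch still runs where the SDK is not installed; with the SDK installed and tracing enabled, the nested call should appear as a child step of the outer run.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback so the sketch runs without the SDK: a no-op decorator
    # that supports both @traceable and @traceable(...).
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn


@traceable(run_type="tool", name="search_docs")
def search_docs(query: str) -> list[str]:
    """Hypothetical custom retrieval step that lives outside LangChain."""
    corpus = {
        "refund": "Refunds are processed within 5 business days.",
        "login": "Reset your password from the sign-in page.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]


@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    """Top-level run that calls the retrieval step."""
    chunks = search_docs(question)
    return chunks[0] if chunks else "No relevant documents found."


print(answer_question("How do I get a refund?"))
```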

Either way, the first run you trace will probably surprise you. Seeing the full chain, including every prompt sent to the model, every token count, every tool input and output, arranged sequentially, reveals things that are easy to miss when you're reading only the final response.

Step Two: Read Your Traces Like a Diagnostician

A trace in LangSmith is a tree. The root node is the top-level agent run. Child nodes are individual steps: retrievals, LLM calls, tool invocations, chains. Each node shows latency, token usage, input, output, and any errors.

When you're diagnosing a bad output, start at the leaves, the individual retrievals, LLM calls, and tool invocations, and work through them in execution order to find the step where the data first went wrong. Was the retrieval returning irrelevant chunks? Did the model misinterpret the tool's output? Did a tool call fail silently and the agent moved on anyway?
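The same walk can be expressed in code. The sketch below models a trace as a small tree and returns the path to the deepest errored step on the first failing branch, so a parent that merely propagated an error is not mistaken for its origin. `RunNode` is a hypothetical stand-in, not LangSmith's run schema, and the search only finds explicit errors; silent failures still need a human reading the inputs and outputs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunNode:
    """Hypothetical stand-in for one node of a trace tree."""
    name: str
    run_type: str
    error: Optional[str] = None
    children: list["RunNode"] = field(default_factory=list)


def first_failing_step(node: RunNode, path: tuple = ()) -> Optional[tuple]:
    """Return the path to the deepest errored node on the first failing branch."""
    path = path + (node.name,)
    # Check children first, in execution order, so the origin of a
    # failure wins over a parent that only propagated it.
    for child in node.children:
        found = first_failing_step(child, path)
        if found is not None:
            return found
    return path if node.error else None


trace = RunNode("agent", "chain", error="tool failed", children=[
    RunNode("retrieve", "retriever"),
    RunNode("call_crm", "tool", error="timeout"),
    RunNode("summarize", "llm"),
])

print(" > ".join(first_failing_step(trace)))  # agent > call_crm
```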

Common patterns to watch for:

Retrieval misses. The agent called the right tool but got back chunks that didn't answer the question. This usually means your chunking strategy or embedding model needs work, not the agent itself. If retrieval is a bottleneck for your agent, consider whether retrieval-augmented generation strategies might improve your approach.

Prompt drift. In multi-turn agents, the system prompt and conversation history together can push the model into a context where it loses track of its original task. The trace will show the exact prompt at each step, which makes this visible.

Tool hallucination. The model calls a tool with plausible-looking but incorrect parameters. This is especially common when tool schemas are underspecified. You'll see it in the tool input node.

Runaway loops. Some agents retry indefinitely when uncertain. Latency data on individual runs will flag this quickly. A run that should take 4 seconds but took 40 often points to a loop problem.
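Of these patterns, runaway loops are the easiest to check mechanically. One simple heuristic is to count identical tool calls within a single run. The sketch below assumes each step has been exported as a dictionary with `tool` and `args` keys, which is an illustrative shape rather than LangSmith's format, and the threshold of three is arbitrary.

```python
import json
from collections import Counter


def repeated_tool_calls(steps, threshold=3):
    """Return {(tool, args_json): count} for calls repeated at least `threshold` times."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True)) for step in steps
    )
    return {call: n for call, n in counts.items() if n >= threshold}


steps = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "lookup_order", "args": {"id": 17}},
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},
]

for (tool, args), count in repeated_tool_calls(steps).items():
    print(f"{tool} called {count} times with {args}")
```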

LangSmith lets you filter runs by latency, token count, error status, and metadata tags. Building a filter for your top 5% slowest runs, or runs tagged with a specific user segment, is a fast way to find the cases worth studying.
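If you export run data for offline analysis, the same slowest-5% cut takes a few lines. This sketch assumes each run is a dictionary with a `latency_s` field, a made-up shape for illustration.

```python
import math


def slowest_runs(runs, fraction=0.05):
    """Return the slowest `fraction` of runs (at least one), slowest first."""
    if not runs:
        return []
    k = max(1, math.ceil(len(runs) * fraction))
    return sorted(runs, key=lambda r: r["latency_s"], reverse=True)[:k]


# Nineteen normal runs and one suspiciously slow one.
runs = [{"id": f"run-{i}", "latency_s": 4.0} for i in range(19)]
runs.append({"id": "run-19", "latency_s": 40.0})

for run in slowest_runs(runs):
    print(run["id"], run["latency_s"])
```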

Step Three: Build Evaluation Datasets from Real Traces

This is where most teams underinvest, and it's also where LangSmith creates the most durable value.

Once you've traced a few hundred runs, you have the raw material for a proper evaluation dataset. In LangSmith, you can select individual traces and add them to a dataset with a few clicks. Annotate the ones that produced good outputs, annotate the failures, and now you have a ground-truth set that reflects what real users actually ask your agent.
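For teams that prefer to prepare examples in code before uploading them, the shaping step might look like the sketch below. The field names (`label`, `corrected_output`, `run_id`) and the inputs/outputs layout are assumptions for illustration. The key design choice is that a good trace's output becomes the reference answer, while a failed trace contributes a reference only if an annotator supplied a corrected one.

```python
def traces_to_examples(traces):
    """Turn reviewed traces into dataset rows, skipping unlabeled and duplicate questions."""
    seen = set()
    examples = []
    for trace in traces:
        question = trace["input"].strip()
        key = question.lower()
        if trace.get("label") is None or key in seen:
            continue
        seen.add(key)
        # Good outputs serve as references; bad ones need an annotator's correction.
        reference = trace.get("corrected_output") or (
            trace["output"] if trace["label"] == "good" else None
        )
        examples.append({
            "inputs": {"question": question},
            "outputs": {"answer": reference},
            "metadata": {"label": trace["label"], "source_run_id": trace["run_id"]},
        })
    return examples


traces = [
    {"run_id": "r1", "input": "How do refunds work?", "output": "Within 5 days.", "label": "good"},
    {"run_id": "r2", "input": "Cancel my plan", "output": "I cannot help.", "label": "bad",
     "corrected_output": "Go to Settings > Billing > Cancel."},
    {"run_id": "r3", "input": "Change my email", "output": "Done.", "label": None},
    {"run_id": "r4", "input": "how do refunds work? ", "output": "Within 5 days.", "label": "good"},
]

for example in traces_to_examples(traces):
    print(example["inputs"]["question"], "->", example["outputs"]["answer"])
```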

The alternative, writing evaluation examples by hand before you have user data, produces datasets that are clean and unrealistic. Real queries are messier, more ambiguous, and more varied than anything you'll write yourself.

Once your dataset exists, LangSmith lets you run evaluations against it in two modes. Automated evaluators use an LLM judge to score outputs against criteria you define, things like correctness, groundedness, or task completion. Human evaluators use a review queue where annotators can rate responses directly in the platform. Both feed back into the same dataset, so you accumulate signal over time.
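An automated evaluator is ultimately a function that takes an output and returns a score. The sketch below is a deterministic stand-in for an LLM judge: it checks whether required facts appear in the answer. The function name, the `required_facts` input, and the shape of the returned dictionary are assumptions for illustration; an LLM-based judge would replace the string matching with a model call.

```python
def contains_required_facts(output: str, required_facts: list[str]) -> dict:
    """Score the fraction of required facts that appear in the output."""
    text = output.lower()
    missing = [fact for fact in required_facts if fact.lower() not in text]
    found = len(required_facts) - len(missing)
    score = found / len(required_facts) if required_facts else 1.0
    return {"key": "task_completion", "score": score, "missing": missing}


result = contains_required_facts(
    "Refunds take 5 business days and go to the original payment method.",
    ["5 business days", "original payment method", "email confirmation"],
)
print(result["score"], result["missing"])
```

Cheap deterministic checks like this can run on every example, leaving the slower and costlier LLM judge for criteria that need interpretation.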

A team running a customer support agent might build a dataset of 200 traces representing their most common query types, then run their evaluation suite every time they push a prompt change or swap models. That's the kind of regression testing discipline that separates agents that improve from agents that drift.

Step Four: Use Feedback to Close the Loop

LangSmith supports programmatic feedback, which means you can send signal back from your application layer. If a user clicks a thumbs-down button, or if a downstream system flags an output as incorrect, you can attach that feedback to the originating trace using the API.

This creates a closed loop: production behavior generates traces, traces generate evaluation data, evaluation data drives improvements, and improved agents generate better production traces. This kind of continuous improvement cycle is core to how agentic AI systems mature over time.

Practically, this looks like adding a few lines to your feedback handler:

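from langsmith import Client

client = Client()  # reads the API key from the environment

# run_id is the ID of the traced run that produced the response
client.create_feedback(
    run_id=run_id,
    key="user_rating",
    score=0,  # 0 = thumbs-down
    comment="Response did not answer the question",
)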

That feedback appears on the trace in LangSmith and can be used to filter for cases worth reviewing. Over a few weeks, patterns emerge. Maybe negative feedback clusters on a specific question type, or on runs where retrieval returned fewer than three chunks. That's actionable.
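Finding those clusters is a grouping exercise. The sketch below assumes feedback has been exported as dictionaries carrying a `score` and a `question_type` tag, both hypothetical field names, and computes the share of negative ratings per group.

```python
from collections import defaultdict


def negative_rate_by(feedback, field):
    """Share of score-0 feedback for each value of `field`."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for item in feedback:
        group = item[field]
        totals[group] += 1
        if item["score"] == 0:
            negatives[group] += 1
    return {group: negatives[group] / totals[group] for group in totals}


feedback = [
    {"question_type": "billing", "score": 0},
    {"question_type": "billing", "score": 0},
    {"question_type": "billing", "score": 1},
    {"question_type": "onboarding", "score": 1},
    {"question_type": "onboarding", "score": 1},
]

for group, rate in negative_rate_by(feedback, "question_type").items():
    print(f"{group}: {rate:.0%} negative")
```

The same function works for any tag you attach to runs, such as the number of retrieved chunks bucketed into ranges.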

Step Five: Run Experiments Before Shipping Changes

LangSmith includes an experiments feature that lets you run two versions of your agent against the same dataset and compare results side by side. You change a prompt, swap in a different model, or modify your retrieval parameters, and instead of guessing which version is better, you have a scored comparison across every example in your dataset.

This matters more as agents become more complex. A change that improves performance on customer onboarding queries might degrade performance on billing questions. Without a multi-example evaluation, you won't know until it's in production.

The experiment workflow in LangSmith is:

  1. Define a dataset with representative examples
  2. Run your current agent against it to establish a baseline
  3. Make your change
  4. Run the new version against the same dataset
  5. Compare scores using your automated or human evaluators
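The comparison in the last step is most useful when broken down by query type, since an overall average can hide a regression in one category. The sketch below assumes each experiment result is a dictionary with a `category` and a numeric `score`; those field names and the sample data are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_category(results):
    """Average score for each category in one experiment run."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["category"]].append(result["score"])
    return {category: mean(scores) for category, scores in grouped.items()}


def regressions(baseline, candidate, tolerance=0.0):
    """Categories where the candidate's mean score fell by more than `tolerance`."""
    base = mean_score_by_category(baseline)
    cand = mean_score_by_category(candidate)
    return {
        category: (base[category], cand[category])
        for category in base
        if category in cand and cand[category] < base[category] - tolerance
    }


baseline = [
    {"category": "onboarding", "score": 0.0}, {"category": "onboarding", "score": 1.0},
    {"category": "billing", "score": 1.0}, {"category": "billing", "score": 1.0},
]
candidate = [
    {"category": "onboarding", "score": 1.0}, {"category": "onboarding", "score": 1.0},
    {"category": "billing", "score": 1.0}, {"category": "billing", "score": 0.0},
]

print(regressions(baseline, candidate))  # {'billing': (1.0, 0.5)}
```

In this made-up data both versions average 0.75 overall, yet the candidate trades a gain in onboarding for a loss in billing, which only the per-category view reveals.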

Teams that build this into their deployment process, rather than treating it as optional, ship agent updates with much more confidence. It also creates an audit trail, which matters for organizations with any governance requirements around AI outputs.

The Bigger Picture

LangSmith is a tool, and like any tool, its value depends entirely on how consistently you use it. Setting it up takes an afternoon. Building a rigorous evaluation culture around it takes longer, probably several sprints of deliberate effort.

The teams that get the most out of it treat tracing and evaluation as first-class engineering work, not a nice-to-have layer on top of the real job. That shift in how the work is framed, from "we built an agent" to "we operate and improve an agent," is what separates systems that plateau from systems that compound.

If your team is building agents and doesn't have an observability and evaluation practice in place, you're flying without instruments. LangSmith gives you the instruments. The discipline to use them is the harder part.

Frequently asked questions

Does LangSmith work with agents built outside of LangChain?

Yes. LangSmith provides a Python SDK with a `traceable` decorator that works with any Python-based agent, regardless of whether it uses LangChain or LangGraph. You can instrument custom functions, API calls, and tool executions directly. The tracing data flows into the same dashboard as LangChain-native traces.

How much does LangSmith cost for production use?

LangSmith has a free Developer tier with trace limits, and paid tiers starting at the Plus level for teams needing higher trace volumes and collaboration features. Enterprise pricing is available for organizations with compliance requirements or large-scale deployment needs. Pricing has changed in 2026, so check the LangChain website for current plans.

What's the difference between LangSmith tracing and traditional application logging?

Traditional logging captures events. LangSmith captures structured execution trees where every LLM call, tool invocation, and retrieval step is linked to its parent run and annotated with inputs, outputs, latency, and token counts. That structure makes it possible to understand multi-step agent behavior, not just individual events in isolation.

How many traces do I need before evaluation datasets are useful?

A dataset of 50 to 100 diverse, real-world traces is enough to start running meaningful evaluations. Smaller sets can catch obvious regressions. As your dataset grows toward 200 to 500 examples covering your main query types, your evaluations become more reliable predictors of production performance.

Can non-engineers use LangSmith, or is it purely a developer tool?

The human annotation and feedback review features are accessible to non-engineers and are designed for subject matter experts to rate agent outputs without writing code. The setup, instrumentation, and experiment configuration require engineering involvement. In practice, the most effective teams use LangSmith as a shared surface where engineers and domain experts both contribute.
