Evaluating LlamaIndex Agents

Building agents that can intelligently use tools and make decisions is only half the journey; ensuring that these agents are accurate, reliable, and performant is what truly defines their success. LlamaIndex provides various ways to create agents including FunctionAgents, CodeActAgents, and ReActAgents. In this tutorial, we will explore how to evaluate these different agent types using both pre-built Ragas metrics and custom evaluation metrics.

Let's get started.

The tutorial is divided into three comprehensive sections:

  1. Evaluating with Off-the-Shelf Ragas Metrics: Here we will examine two fundamental evaluation tools, AgentGoalAccuracy, which measures how effectively an agent identifies and achieves the user's intended objective, and Tool Call Accuracy, which assesses the agent's ability to select and invoke the appropriate tools in the correct sequence to complete tasks.

  2. Custom Metrics for CodeActAgent Evaluation: This section focuses on LlamaIndex's prebuilt CodeActAgent, demonstrating how to develop tailored evaluation metrics that address the specific requirements and capabilities of code-generating agents.

  3. Query Engine Tool Assessment: The final section explores how to leverage Ragas RAG metrics to evaluate query engine functionality within agents, providing insights into retrieval effectiveness and response quality when agents access information systems.

Ragas Agentic Metrics

To demonstrate evaluations using Ragas metrics, we will create a simple workflow with a single LlamaIndex Function Agent, and use that to cover the basic functionality.

Click to View the Function Agent Setup
from llama_index.llms.openai import OpenAI


async def send_message(to: str, content: str) -> str:
    """Dummy function to simulate sending an email."""
    return f"Successfully sent mail to {to}"

llm = OpenAI(model="gpt-4o-mini")
from llama_index.core.agent.workflow import FunctionAgent

agent = FunctionAgent(
    tools=[send_message],
    llm=llm,
    system_prompt="You are a helpful assistant of Jane",
)

Agent Goal Accuracy

The true value of an AI agent lies in its ability to understand what users want and deliver it effectively. Agent Goal Accuracy serves as a fundamental metric that evaluates whether an agent successfully accomplishes what the user intended. This measurement is crucial as it directly reflects how well the agent interprets user needs and takes appropriate actions to fulfill them.

Ragas provides two key variants of this metric:

  • AgentGoalAccuracyWithReference - A binary assessment (1 or 0) that compares the agent's final outcome against a predefined expected result.
  • AgentGoalAccuracyWithoutReference - A binary assessment (1 or 0) that evaluates whether the agent achieved the user's goal based on inferred intent rather than predefined expectations.

With Reference is ideal for scenarios where the expected outcome is well-defined, such as in controlled testing environments or when testing against ground truth data.

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream, 
    ToolCall as LlamaToolCall,
    ToolCallResult,
)

handler = agent.run(user_msg="Send a message to jhon asking for a meeting")

events = []

async for ev in handler.stream_events():
    if isinstance(ev, (AgentInput, AgentOutput, LlamaToolCall, ToolCallResult)):
        events.append(ev)
        if isinstance(ev, ToolCallResult):
            print(
                f"\nCall {ev.tool_name} with {ev.tool_kwargs}\nReturned: {ev.tool_output}"
            )
    elif isinstance(ev, AgentStream):
        print(f"{ev.delta}", end="", flush=True)

response = await handler
Output:
I have successfully sent a message to Jhon asking for a meeting.

from ragas.integrations.llama_index import convert_to_ragas_messages

ragas_messages = convert_to_ragas_messages(events)
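
Before scoring, it can help to sanity-check the converted trace. The snippet below is a minimal, optional inspection step, assuming the converter returns a list of Ragas message objects (e.g. HumanMessage, AIMessage, ToolMessage):

# Optional: inspect the converted trace. Each item should be a Ragas
# message object such as HumanMessage, AIMessage, or ToolMessage.
for msg in ragas_messages:
    print(type(msg).__name__, "->", getattr(msg, "content", ""))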

from ragas.metrics import AgentGoalAccuracyWithoutReference
from ragas.llms import LlamaIndexLLMWrapper
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import ToolCall as RagasToolCall

evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

sample = MultiTurnSample(
    user_input=ragas_messages,
)

agent_goal_accuracy_without_reference = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
await agent_goal_accuracy_without_reference.multi_turn_ascore(sample)
Output:
1.0

from ragas.metrics import AgentGoalAccuracyWithReference

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference="Successfully sent a message to Jhon asking for a meeting"
)


agent_goal_accuracy_with_reference = AgentGoalAccuracyWithReference(llm=evaluator_llm)
await agent_goal_accuracy_with_reference.multi_turn_ascore(sample)
Output:
1.0
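
As a quick counter-check, you can score the same interaction against a reference that describes a different outcome; the metric should then return 0.0. This is a hypothetical negative example that reuses the objects defined above:

mismatch_sample = MultiTurnSample(
    user_input=ragas_messages,
    # Hypothetical reference describing an outcome the agent did not achieve
    reference="Scheduled a calendar event with Jhon for next Monday at 10 AM",
)

await agent_goal_accuracy_with_reference.multi_turn_ascore(mismatch_sample)
# Expected: 0.0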

Tool Call Accuracy

In agentic workflows, an AI agent's effectiveness depends heavily on its ability to select and use the right tools at the right time. The Tool Call Accuracy metric evaluates how precisely an agent identifies and invokes appropriate tools in the correct sequence to complete a user's request. This measurement ensures that agents not only understand what tools are available but also how to orchestrate them effectively to achieve the intended outcome.

  • ToolCallAccuracy compares the agent's actual tool usage against a reference sequence of expected tool calls. If the agent's tool selection or sequence differs from the reference, the metric returns a score of 0, indicating a failure to follow the optimal path to task completion.

from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference_tool_calls=[
        RagasToolCall(
            name="send_message",
            args={'to': 'jhon', 'content': 'Hi Jhon,\n\nI hope this message finds you well. I would like to schedule a meeting to discuss some important matters. Please let me know your availability.\n\nBest regards,\nJane'},
        ),
    ],
)

tool_accuracy_scorer = ToolCallAccuracy()
await tool_accuracy_scorer.multi_turn_ascore(sample)
Output:
1.0
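
To see the failure mode, score the same trace against a reference tool call the agent never made; the metric should then return 0.0. The schedule_meeting tool below is hypothetical and does not exist in our agent:

negative_sample = MultiTurnSample(
    user_input=ragas_messages,
    reference_tool_calls=[
        # Hypothetical tool the agent does not have, so the sequence cannot match
        RagasToolCall(name="schedule_meeting", args={"with": "jhon"}),
    ],
)

await tool_accuracy_scorer.multi_turn_ascore(negative_sample)
# Expected: 0.0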

Evaluating LlamaIndex CodeAct Agents

LlamaIndex offers a prebuilt CodeActAgent that can be used to write and execute code, inspired by the original CodeAct paper. The idea is that instead of outputting a simple JSON object, a CodeAct agent generates an executable code block, typically in a high-level language like Python. Writing actions in code rather than JSON-like snippets provides better:

  • Composability: Code naturally allows nesting and reuse of functions; JSON actions lack this flexibility.
  • Object management: Code elegantly handles operation outputs (image = generate_image()); JSON has no clean equivalent.
  • Generality: Code expresses any computational task; JSON imposes unnecessary constraints.
  • Representation in LLM training data: LLMs already understand code from training data, making it a more natural interface than specialized JSON.
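
To make the contrast concrete, here is a small, illustrative comparison. The add and multiply helpers are the ones defined in the setup that follows; the JSON shown is only a generic sketch of a tool-call payload, not any specific framework's format:

# A JSON-style action encodes one tool call at a time and cannot easily
# compose results:
#   {"tool": "multiply", "args": {"a": 3, "b": 4}}
#   {"tool": "add", "args": {"a": <result of previous call>, "b": 5}}

# A CodeAct-style action composes the same helpers directly and keeps
# intermediate values as ordinary Python objects:
result = add(multiply(3, 4), 5)
print(result)  # 17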
Click to View the CodeActAgent Setup

Defining Functions

from llama_index.llms.openai import OpenAI

# Configure the LLM
llm = OpenAI(model="gpt-4o-mini")


# Define a few helper functions
def add(a: int, b: int) -> int:
    """Add two numbers together"""
    return a + b


def subtract(a: int, b: int) -> int:
    """Subtract two numbers"""
    return a - b


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


def divide(a: int, b: int) -> float:
    """Divide two numbers"""
    return a / b

Create a Code Executor

The CodeActAgent will require a specific code_execute_fn to execute the code generated by the agent.

from typing import Any, Dict
import io
import contextlib
import ast
import traceback


class SimpleCodeExecutor:
    """
    A simple code executor that runs Python code with state persistence.

    This executor maintains a global and local state between executions,
    allowing for variables to persist across multiple code runs.

    NOTE: not safe for production use! Use with caution.
    """

    def __init__(self, locals: Dict[str, Any], globals: Dict[str, Any]):
        """
        Initialize the code executor.

        Args:
            locals: Local variables to use in the execution context
            globals: Global variables to use in the execution context
        """
        # State that persists between executions
        self.globals = globals
        self.locals = locals

    def execute(self, code: str) -> str:
        """
        Execute Python code and capture output and return values.

        Args:
            code: Python code to execute

        Returns:
            The captured stdout/stderr output, followed by the string value
            of the last expression (if any).
        """
        # Capture stdout and stderr
        stdout = io.StringIO()
        stderr = io.StringIO()

        output = ""
        return_value = None
        try:
            # Execute with captured output
            with contextlib.redirect_stdout(
                stdout
            ), contextlib.redirect_stderr(stderr):
                # Try to detect if there's a return value (last expression)
                try:
                    tree = ast.parse(code)
                    last_node = tree.body[-1] if tree.body else None

                    # If the last statement is an expression, capture its value
                    if isinstance(last_node, ast.Expr):
                        # Split code to add a return value assignment
                        last_line = code.rstrip().split("\n")[-1]
                        exec_code = (
                            code[: -len(last_line)]
                            + "\n__result__ = "
                            + last_line
                        )

                        # Execute modified code
                        exec(exec_code, self.globals, self.locals)
                        return_value = self.locals.get("__result__")
                    else:
                        # Normal execution
                        exec(code, self.globals, self.locals)
                except Exception:
                    # If parsing fails, just execute the code as is
                    exec(code, self.globals, self.locals)

            # Get output
            output = stdout.getvalue()
            if stderr.getvalue():
                output += "\n" + stderr.getvalue()

        except Exception as e:
            # Capture exception information
            output = f"Error: {type(e).__name__}: {str(e)}\n"
            output += traceback.format_exc()

        if return_value is not None:
            output += "\n\n" + str(return_value)

        return output

code_executor = SimpleCodeExecutor(
    # give access to our functions defined above
    locals={
        "add": add,
        "subtract": subtract,
        "multiply": multiply,
        "divide": divide,
    },
    globals={
        # give access to all builtins
        "__builtins__": __builtins__,
        # give access to numpy
        "np": __import__("numpy"),
    },
)
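
As a quick, optional smoke test of the executor outside the agent loop (the code strings below are arbitrary examples):

# First call defines a variable...
print(code_executor.execute("x = add(2, 3)"))
# ...and a later call can still see it, with the value of the trailing
# expression appended to the captured output.
print(code_executor.execute("multiply(x, 10)"))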

Setup the CodeAct Agent

from llama_index.core.agent.workflow import CodeActAgent
from llama_index.core.workflow import Context

agent = CodeActAgent(
    code_execute_fn=code_executor.execute,
    llm=llm,
    tools=[add, subtract, multiply, divide],
)

# context to hold the agent's session/state/chat history
ctx = Context(agent)

Running and Evaluating the CodeAct agent

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream,
    ToolCall,
    ToolCallResult,
)

handler = agent.run("Calculate the sum of the first 10 fibonacci numbers", ctx=ctx)

events = []

async for event in handler.stream_events():
    if isinstance(event, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(event)
    elif isinstance(event, AgentStream):
        print(f"{event.delta}", end="", flush=True)
Output:
The first 10 Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, and 34. I will calculate their sum.

<execute>
def fibonacci(n):
    fib_sequence = [0, 1]
    for i in range(2, n):
        next_fib = fib_sequence[-1] + fib_sequence[-2]
        fib_sequence.append(next_fib)
    return fib_sequence

# Calculate the first 10 Fibonacci numbers
first_10_fib = fibonacci(10)

# Calculate the sum of the first 10 Fibonacci numbers
sum_fib = sum(first_10_fib)
print(sum_fib)
</execute>

The sum of the first 10 Fibonacci numbers is 88.

Extract the ToolCall

# Grab the ToolCall event emitted by the agent; its tool_kwargs carry the generated code
CodeAct_agent_tool_call = next(ev for ev in events if isinstance(ev, ToolCall))
agent_code = CodeAct_agent_tool_call.tool_kwargs["code"]

print(agent_code)
Output
    def fibonacci(n):
        fib_sequence = [0, 1]
        for i in range(2, n):
            next_fib = fib_sequence[-1] + fib_sequence[-2]
            fib_sequence.append(next_fib)
        return fib_sequence

    # Calculate the first 10 Fibonacci numbers
    first_10_fib = fibonacci(10)

    # Calculate the sum of the first 10 Fibonacci numbers
    sum_fib = sum(first_10_fib)
    print(sum_fib)

When assessing CodeAct agents, we can begin with foundational metrics that examine basic functionality, such as code compilability or appropriate argument selection. These straightforward evaluations provide a solid foundation before advancing to more sophisticated assessment approaches.

Ragas offers powerful custom metric capabilities that enable increasingly nuanced evaluation as your requirements evolve.

  • AspectCritic - Provides a binary evaluation (pass/fail) that determines whether an agent's response satisfies specific user-defined criteria, using LLM-based judgment to deliver clear success indicators.
  • RubricScoreMetric - Evaluates agent responses against comprehensive, predefined quality rubrics with discrete scoring levels, enabling consistent performance assessment across multiple dimensions.

def is_compilable(code_str: str, mode="exec") -> bool:
    try:
        compile(code_str, "<string>", mode)
        return True
    except Exception:
        return False

is_compilable(agent_code)
Output
True
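
A deliberately broken snippet should fail the same check (the string below is an arbitrary example of invalid syntax):

is_compilable("def broken(:\n    return 1")  # invalid syntax, should return False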

from ragas.metrics import AspectCritic
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LlamaIndexLLMWrapper

llm = OpenAI(model="gpt-4o-mini")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

correct_tool_args = AspectCritic(
    name="correct_tool_args",
    llm=evaluator_llm,
    definition="Score 1 if the tool arguements use in the tool call are correct and 0 otherwise",
)

sample = SingleTurnSample(
    user_input="Calculate the sum of the first 10 fibonacci numbers",
    response=agent_code,
)

await correct_tool_args.single_turn_ascore(sample)
Output:
1
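
The same pattern extends to other criteria. For example, a critic that judges whether the generated code actually solves the stated task; the name and definition below are illustrative, so adapt them to your own requirements:

code_solves_task = AspectCritic(
    name="code_solves_task",
    llm=evaluator_llm,
    definition=(
        "Score 1 if the provided code correctly accomplishes the task "
        "described in the user input, and 0 otherwise."
    ),
)

await code_solves_task.single_turn_ascore(
    SingleTurnSample(
        user_input="Calculate the sum of the first 10 fibonacci numbers",
        response=agent_code,
    )
)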

Evaluating Query Engine Tool

When evaluating with Ragas metrics, we need to ensure that our data is formatted suitably for evaluations. When working with a query engine tool within an agentic system, we can approach the evaluation as we would for any retrieval-augmented generation (RAG) system.

We will extract all instances where the query engine tool was called during user interactions and use them to construct a Ragas RAG evaluation dataset from the event stream data. Once the dataset is ready, we can apply the full suite of Ragas evaluation metrics. In this section, we will set up a Function Agent with query engine tools. The agent has access to two tools: one to query the 2021 Lyft 10-K and the other to query the 2021 Uber 10-K.

Click to View the Agent Setup

Setting the LLMs

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Build Query Engine Tools

from llama_index.core import StorageContext, load_index_from_storage

try:
    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/lyft"
    )
    lyft_index = load_index_from_storage(storage_context)

    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/uber"
    )
    uber_index = load_index_from_storage(storage_context)

    index_loaded = True
except Exception:
    index_loaded = False
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

if not index_loaded:
    # load data
    lyft_docs = SimpleDirectoryReader(
        input_files=["./data/10k/lyft_2021.pdf"]
    ).load_data()
    uber_docs = SimpleDirectoryReader(
        input_files=["./data/10k/uber_2021.pdf"]
    ).load_data()

    # build index
    lyft_index = VectorStoreIndex.from_documents(lyft_docs)
    uber_index = VectorStoreIndex.from_documents(uber_docs)

    # persist index
    lyft_index.storage_context.persist(persist_dir="./storage/lyft")
    uber_index.storage_context.persist(persist_dir="./storage/uber")
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)
uber_engine = uber_index.as_query_engine(similarity_top_k=3)
from llama_index.core.tools import QueryEngineTool

query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=lyft_engine,
        name="lyft_10k",
        description=(
            "Provides information about Lyft financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
    QueryEngineTool.from_defaults(
        query_engine=uber_engine,
        name="uber_10k",
        description=(
            "Provides information about Uber financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
]

Agent Setup

from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
from llama_index.core.workflow import Context

agent = FunctionAgent(tools=query_engine_tools, llm=OpenAI(model="gpt-4o-mini"))

# context to hold the session/state
ctx = Context(agent)

Running and Evaluating Agents

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    ToolCall,
    ToolCallResult,
    AgentStream, 
)

handler = agent.run("What's the revenue for Lyft in 2021 vs Uber?", ctx=ctx)

events = []

async for ev in handler.stream_events():
    if isinstance(ev, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(ev)
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

response = await handler
Output:
In 2021, Lyft generated a total revenue of $3.21 billion, while Uber's total revenue was significantly higher at $17.455 billion.

We will extract all instances of ToolCallResult where a query engine tool was called during the interaction; from these we can construct a proper RAG evaluation dataset based on the event stream data.

from ragas.dataset_schema import SingleTurnSample

ragas_samples = []

for event in events:
    if isinstance(event, ToolCallResult):
        if event.tool_name in ["lyft_10k", "uber_10k"]:
            sample = SingleTurnSample(
                user_input=event.tool_kwargs["input"],
                response=event.tool_output.content,
                retrieved_contexts=[node.text for node in event.tool_output.raw_output.source_nodes]
                )
            ragas_samples.append(sample)

from ragas.dataset_schema import EvaluationDataset

dataset = EvaluationDataset(samples=ragas_samples)
dataset.to_pandas()
Output:

   user_input                                          retrieved_contexts                                   response
0  What was the total revenue for Uber in the yea...   [Financial and Operational Highlights\nYear En...   The total revenue for Uber in the year 2021 wa...
1  What was the total revenue for Lyft in the yea...   [Significant items\n subject to estimates and ...   The total revenue for Lyft in the year 2021 wa...

The resulting dataset will not include reference answers by default, so we’ll be limited to using metrics that do not require references. However, if you wish to run reference-based evaluations, you can add a reference column to the dataset and then apply the relevant Ragas metrics.
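
One way to do this is to attach a reference to each sample directly. This is a minimal sketch: the reference strings below simply restate the figures from the agent's answer above, and the sample order should be verified against your own dataset, since it follows the order of the tool calls:

# Hypothetical references, matching the order of the samples shown above
# (row 0 = Uber, row 1 = Lyft). Verify the order for your own run.
references = [
    "Uber's total revenue for 2021 was $17.455 billion.",
    "Lyft's total revenue for 2021 was $3.21 billion.",
]

for sample, reference in zip(dataset.samples, references):
    sample.reference = reference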

Evaluating using Ragas RAG Metrics

Let's assess the effectiveness of the query engines, particularly regarding retrieval quality and hallucination prevention. To accomplish this evaluation, we will employ two key Ragas metrics: Faithfulness and Context Relevance. For more details, see the Ragas documentation on available metrics.

This evaluation approach allows us to identify potential issues with either retrieval quality or response generation that could impact overall system performance.

  • Faithfulness - Measures how accurately the generated response adheres to the facts presented in the retrieved context, ensuring claims made by the system can be directly supported by the information provided.
  • Context Relevance - Evaluates how effectively the retrieved information addresses the user's specific query by assessing its pertinence through dual LLM judgment mechanisms.

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextRelevance
from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

faithfulness = Faithfulness(llm=evaluator_llm)
context_relevance = ContextRelevance(llm=evaluator_llm)

result = evaluate(dataset, metrics=[faithfulness, context_relevance])
Evaluating: 100%|██████████| 4/4 [00:03<00:00,  1.19it/s]

result.to_pandas()
Output:

   user_input                                          retrieved_contexts                                   response                                             faithfulness  nv_context_relevance
0  What was the total revenue for Uber in the yea...   [Financial and Operational Highlights\nYear En...   The total revenue for Uber in the year 2021 wa...   1.0           1.0
1  What was the total revenue for Lyft in the yea...   [Significant items\n subject to estimates and ...   The total revenue for Lyft in the year 2021 wa...   1.0           1.0
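
Finally, the per-sample scores can be kept around for regression tracking. A minimal example using standard pandas I/O (the file name is arbitrary):

# Save the per-sample scores alongside the inputs for later comparison
result.to_pandas().to_csv("query_engine_tool_eval.csv", index=False)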