Practical insights using GenAI for planned multi-agent collaboration in distributed problem solving
TLDR: When solving complex, domain-specific problems, agentic patterns that combine multiple large language models can help achieve reliable, efficient and predictable outcomes. I want to share some practical insights I gathered while implementing a scenario around analysing financial reports, together with the relevant concepts, tooling and code.
All of this was done with a combination of open source technology (Streamlit, LangChain, LangGraph, OpenTelemetry) and Azure managed services (Azure Container Apps, Azure OpenAI, Azure AI Search and Azure Application Insights), and I want to show you one possible way of putting the pieces together to arrive at a working solution.
As usual there will be code (🍰) at the end;)
Our business scenario is relatively generic, and you can probably relate it to a task from your own experience: processing information from the internet or from documents in a specific order and format, generating insights, and collaborating on the facts and statements with individuals from other organisations to produce some kind of report.
In this case we assume you want to gain insights into the financial statements of a publicly traded company, based on its yearly published reports, to answer specific questions about revenue streams, market challenges, customer demographics or any other set of performance indicators.
If we were to model the process using different individuals we could describe the flow of information through a collaboration of the following agent types:
- A researcher who takes the objective to gather potentially relevant information from different sources and prepares a briefing package with the raw content, enhanced with metadata like the source, the date it was last updated, the quality ranking of the source and the type of information related to an existing taxonomy.
- An analyst who takes the raw content, metadata and objective to create a draft with the first statements and matching references, while following the predefined guidance on how to balance information validity, messaging and opinions.
- A reviewer who takes the draft and the guidelines and examines them against the provided references for correctness, completeness and relevance to the objective, and creates formalised, concrete feedback which can be handed back to the analyst for the creation of a revised draft.
- An inspector who watches over the progress across analyst-reviewer iterations to ensure that the feedback is implemented in the subsequent draft while approaching an adequate report quality in a finite number of review cycles.
The basic idea behind using agents is that the copilot user describes the objective, process and desired outcomes via prompts to the individual agent types. Your role as a business user therefore changes from someone who gathers materials, summarises information and maintains references while editing a document, to that of a coach of a semi-intelligent agent workforce whom you direct and guide to achieve the same outcome.
Based on that description, an artificially intelligent agent in the scope of this project is the collective entity of a persona with a prompt describing its objective and capabilities, linked to a GenAI model with access to relevant data and tools to complete its task in a specific context in a somewhat autonomous way. You may want to read up further on this subject.
Generally speaking, using agents for this scenario brings more advantages in quality, speed and reliability than it costs in code complexity and operational overhead. There are of course other ways to model and design the same scenario, but I am aiming for practical relevance with the chosen technology stack, which is why this approach is sufficient for now.
First, let's talk about some relevant functional properties of generative AI technology that will play a role here:
- The current models are faster at processing input than at generating output. That means if you split the same context into multiple prompts that produce simple, short answers, you can keep the latency of chained model calls relatively low.
- The chance of a model missing relevant information in a larger context window is a potential issue. For this reason we want to build a process that can iterate with feedback loops that carry extracted information over into the next loop, reducing the overall probability of skipping key details.
- As the task and the prompt that go into a completion call grow more complex, the call becomes vulnerable to unstable behaviour when the prompt, the context structure or the model version changes. To compensate for that, we will try to keep each agent on a small, predictable set of prompts, contexts and tools suited to its task.
- Combining models makes the overall process more complicated, which is why we need a way to understand and document the reasoning of the individual models as well as the connected process. Here we will use the new structured output capability to document the output and reasoning of the model in a well-defined way.
- Overall we need to build a toolchain for testing the individual agents and the chained agent process for quality and latency. Building up test cases requires a good data set and a matching metadata model for each scenario. The tracing toolchain comes along by leveraging OpenTelemetry as an export target from the agent process (a minimal test sketch follows below).
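To make the testing point a bit more concrete, here is a minimal pytest-style sketch of what such a harness could look like. The dataset file, the run_agent_step helper and the latency budget are hypothetical placeholders, not taken from the repository:
import json
import time

# Hypothetical labelled evaluation set with input, expected output and metadata per scenario.
with open("tests/eval_cases.json") as f:
    CASES = json.load(f)

def test_agent_step_quality_and_latency():
    for case in CASES:
        start = time.perf_counter()
        # run_agent_step is a placeholder for invoking a single agent (e.g. the inspector) in isolation.
        result = run_agent_step(case["input"])
        latency = time.perf_counter() - start
        assert case["expected_keyword"] in result  # crude quality check against the labelled expectation
        assert latency < 10.0  # assumed latency budget in seconds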
These concepts can be put together in a number of different ways, with various technology stacks that enable you to implement these functional patterns. Here I have decided to use the following components:
- Azure Container Apps for hosting the Streamlit app that serves the user interface, while also allowing us to receive OpenTelemetry traces from the container apps.
- Azure AI Search as the vector database for the chunks created by the indexer, together with their relevant metadata (source, classification, structure), so that they can be retrieved for the augmented briefing package.
- Azure AI Document Intelligence for breaking down tables and images from the various PDF and Excel files into semantically relevant chunks that can later be queried by the tools used by the researcher agent.
- LangGraph and LangChain as the agent orchestration framework to implement a state machine that executes the different agent steps in order, carries the relevant parts of the context between them and maintains the agent state per user session.
- Azure OpenAI for the models connected to the individual agents, which process the input provided to them through the LangGraph state management and process logic (see the wiring sketch after this list).
- Azure Application Insights for measuring technical and business metrics of our app, as well as for tracing the flow of agent-to-agent collaboration.
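To give an idea of how the Azure OpenAI model and the Azure AI Search retriever could be wired up from LangChain, here is a hedged sketch. The endpoint variables, API version, deployment name and index name are placeholders and not taken from the repository:
import os
from langchain_openai import AzureChatOpenAI
from langchain_community.retrievers import AzureAISearchRetriever

# Chat model used by the individual agents (deployment name is a placeholder).
chat_model = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",
    azure_deployment="gpt-4o",
    temperature=0,
)

# Retriever over the indexed report chunks (index name is a placeholder).
retriever = AzureAISearchRetriever(
    service_name=os.environ["AZURE_AI_SEARCH_SERVICE_NAME"],
    api_key=os.environ["AZURE_AI_SEARCH_API_KEY"],
    index_name="financial-report-chunks",
    top_k=5,
)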
Naturally these decisions come with trade-offs. Some of them might change or improve over time, but they might still be interesting to know if you have not worked with this technology stack before:
- The usage of structured output is today limited to GPT-4o, which means you cannot combine different models to use the most suitable one for a concrete agent type. It is to be expected that other (smaller and faster) models like GPT-4o-mini will get the same capability in the future.
- LangGraph works as a state machine, which requires an instance per user session in memory and does not scale well across different agent runtimes. A possible alternative would be to use explicit events between agents to decouple the agent host runtime across a distributed infrastructure.
- AI Search can model objects in the search index but only allows vector fields in the top-level object. That means you cannot create sub-objects (for example tables within a chapter) with their own semantic meaning. Today this forces you to de-normalise your search vector data structure more than you would probably like to (see the index sketch after this list).
- The fact that we hand flow control (especially the decision on when to terminate the iteration) to the inspector agent means that there is a chance it will take more iterations to deliver a result than are actually required. During testing we found that the agents decided to start an additional iteration even when the report result and quality were already in a good state. For that reason we introduced a max iteration parameter which force-terminates the loop. A human-in-the-loop pattern is probably more suitable here.
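To illustrate the de-normalisation point, here is a hedged sketch of what a flattened index definition could look like with the Azure AI Search Python SDK. The field names and vector dimensions are assumptions and not the schema used in the repository:
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchableField, SearchField, SearchFieldDataType,
    SearchIndex, SimpleField, VectorSearch, VectorSearchProfile,
)

# Every chunk (chapter text, table, image caption) becomes its own top-level document,
# because vector fields cannot live on nested sub-objects.
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="source", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="classification", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="parent_chapter", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="default-profile",
    ),
]

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
    profiles=[VectorSearchProfile(name="default-profile", algorithm_configuration_name="hnsw")],
)

index = SearchIndex(name="financial-report-chunks", fields=fields, vector_search=vector_search)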
We found that the functional architecture is generally suitable for this class of problems, but there is a lot of innovation coming out of models, orchestration frameworks and patterns, so I suspect there will be other new concepts that work too. Especially the availability of the o1 type models will be interesting to include in these scenarios. As of today these models do not support useful features like structured output or tools, but I assume this will change.
Finally we will take a look at the code in the repository. The instructions inside will allow you to spin up the resources in Azure, ideally using a GitHub Codespace or a local devcontainer, test the app locally, play with the code and also deploy it in your own subscription if you want.
Since the models will be making multiple complex decisions on which data to include and how to derive statements from it for a specific objective, we want to document the reasoning process of the models, especially across different iterations. In our case the inspector decides whether the feedback from the reviewer has been incorporated by the analyst, so we want to keep not only the yes-or-no answer but also the internal reasoning of the model for later documentation:
The same principle is also applied to the analyst, as we want to keep its internal reasoning for later reference, possibly to reproduce the same results for quality testing. In the flow it would also be helpful to use the externally documented reasoning for a human-in-the-loop pattern, in which the user provides feedback not only on the output but also on the thinking of the model to guide and direct the flow.
This works nicely by declaring a custom object type and forcing the model to respond with this custom object type only. Achieving the same by parsing JSON has proven to be a challenge in the past. Of course we need to provide extra annotations to clarify how we want the model to use the fields.
from pydantic import BaseModel, Field

class Rating(BaseModel):
    '''Rating of the feedback'''
    feedbackResolved: bool = Field(
        ...,
        description="Has the feedback been resolved in the statements true or false",
    )
    reasoning: str = Field(
        ...,
        description="The reasoning behind the rating with a small explanation",
    )

def model_rating(input) -> Rating:
    # Bind the structured output schema to the chat model so it can only answer with a Rating object.
    rating_model = chat_model.with_structured_output(Rating)
    rating_prompt = f"""Help me understand the following by giving me a response to the question,
    a short reasoning on why the response is correct: {input}"""
    completion = rating_model.invoke(rating_prompt)
    return completion
from langchain_core.messages import AIMessage
import streamlit as st

classify_feedback_start = "Are most of the important feedback points mentioned resolved in the statements? Output just Yes or No with a reason.\
Statements: \n {} \n Feedback: \n {} \n"

def classify_feedback(state):
    print("Classifying feedback...")
    with st.spinner('Inspector checking if the feedback from the reviewer has been implemented..'):
        # Ask the structured output model whether the reviewer feedback has been addressed.
        rating = model_rating(classify_feedback_start.format(state.get('statements'), state.get('feedback')))
        state['all_feedback_resolved'] = rating.feedbackResolved
        # Surface the verdict and the model's reasoning as a dedicated message for the user interface.
        messages = state.get('messages')
        messages.append(AIMessage(name="Inspector (gpt-4o - v0.1)",
                                  content="Feedback resolved: " + str(rating.feedbackResolved) +
                                          " \n\n Reasoning: " + rating.reasoning))
        state['messages'] = messages
    return state
The overall flow of the process is described by the possible state transitions across the nodes and edges of the state machine that powers LangGraph. We start with the analyst, who hands over to the reviewer. After the reviewer has provided feedback, our conditional edge uses the report_ready function (built on top of our model function above) to determine whether the report is ready or the number of allowed iterations has been reached. Depending on the inspector's verdict, the loop goes back to the analyst or the flow terminates.
from typing import Annotated, Optional, Sequence
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

class ReportState(TypedDict):
    all_feedback_resolved: Optional[bool]
    feedback: Optional[str]
    statements: Optional[str]
    iterations: Optional[int]
    messages: Annotated[Sequence[BaseMessage], add_messages]

workflow = StateGraph(ReportState)

# Define the nodes we will cycle between
# (handle_analyst, handle_reviewer and handle_result are the other node functions of the graph)
workflow.add_node("handle_reviewer", handle_reviewer)
workflow.add_node("handle_analyst", handle_analyst)
workflow.add_node("handle_result", handle_result)
workflow.add_node("classify_feedback", classify_feedback)

# Conditional edge: decide whether to iterate again or to finalise the report
def report_ready(state):
    report_ready = state['all_feedback_resolved']
    print("Report ready: " + str(report_ready))
    # loops is the max iteration parameter that force-terminates the cycle
    if state.get('iterations') > loops:
        print("Iterations exceeded")
        return "handle_result"
    return "handle_result" if report_ready else "handle_analyst"

# Determine if the report is ready or not
workflow.add_conditional_edges(
    "classify_feedback",
    report_ready,
    {
        "handle_result": "handle_result",
        "handle_analyst": "handle_analyst"
    }
)

# Define the entry point and the end point
workflow.set_entry_point("handle_analyst")
workflow.add_edge("handle_analyst", "handle_reviewer")
workflow.add_edge("handle_reviewer", "classify_feedback")
workflow.add_edge("handle_result", END)
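Once the graph is defined it can be compiled and run. A minimal usage sketch, assuming placeholder values for the initial state:
# Compile the graph into a runnable application.
app = workflow.compile()

# Kick off one end-to-end run (the initial values here are placeholders).
initial_state = {
    "statements": "",
    "feedback": "",
    "iterations": 0,
    "messages": [],
}
final_state = app.invoke(initial_state)

print(final_state["all_feedback_resolved"])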
I used the strongly typed ReportState object as the data type that is shared and filled (its values change over time) among the different agents, so that the collaboration between them is strongly typed. You can of course provide more detailed typing to further refine the artifacts of the agent-to-agent communication. In this case I am also using the messages to output the internal thinking of the models as dedicated messages, keeping it out of the internal state and surfacing it in the user interface.
It is also especially interesting to observe the models' thinking in real time, which is why we stream all of the model outputs to the Streamlit user interface. The copilot interface also supports asking follow-up questions, which take the previous output of the analysis and restart the overall loop with the original input plus the new follow-up question for another iteration.
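As a hedged sketch of how the streaming into the Streamlit interface could look, using LangGraph's stream API (app and initial_state come from the compilation sketch above, the chat layout is an assumption):
import streamlit as st

# Stream the state updates of each graph step into the chat UI as they arrive.
for step in app.stream(initial_state):
    for node_name, node_update in step.items():
        for message in node_update.get("messages", []):
            with st.chat_message("assistant"):
                st.markdown(f"**{message.name}**\n\n{message.content}")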
Since the process of multiple iterations can take more than a couple of seconds, the integration of OpenTelemetry is very valuable to trace and observe the different execution steps, both for technical properties like token usage and duration and for the evaluation of custom attributes like groundedness and content safety. All of these values can be attached to the individual spans for monitoring and later analysis using the OTel SDK.
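As a hedged illustration of how such values could be attached to spans with the OpenTelemetry SDK (the span and attribute names as well as the wrapper function are assumptions):
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def classify_feedback_traced(state):
    # Wrap the inspector step in a span and attach technical and business attributes to it.
    with tracer.start_as_current_span("inspector.classify_feedback") as span:
        state = classify_feedback(state)
        span.set_attribute("report.iteration", state.get("iterations") or 0)
        span.set_attribute("report.feedback_resolved", bool(state.get("all_feedback_resolved")))
        # A custom evaluation score (e.g. groundedness) could be attached the same way:
        # span.set_attribute("evaluation.groundedness", groundedness_score)
    return state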
For local testing I am using a local OpenTelemetry collector with a connected Jaeger setup in a container, but for a production scenario you should use something like Application Insights, which can also receive tracing information in the OpenTelemetry format from all your running instances in Azure:
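Switching the export target to Application Insights can be done with the Azure Monitor OpenTelemetry distro; a minimal sketch, assuming the connection string is provided via an environment variable:
import os
from azure.monitor.opentelemetry import configure_azure_monitor

# Route all OpenTelemetry traces to Application Insights instead of the local Jaeger setup.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)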
In the details of the end-to-end transaction view in Application Insights you can not only find details of the model, the prompt and the collaboration between agents, but also combine them with operational metrics on performance, throughput, latency and availability, and with business metrics from the process. As you change models, prompts and the business logic of your graph-based application over its lifetime, this tool stack will help you track and manage the quality and correctness of the generated output.
In conclusion, we have reviewed a relatively specific scenario for practical implementation on a very specific technology stack. I will publish an alternative approach as a follow-up, with different advantages and trade-offs. As of today I have not seen a universally suitable orchestrator framework, application pattern or baseline model that can be leveraged to solve generic business problems using GenAI, so I would encourage you to examine different concepts in practice and form your own opinion on the current state of the art.
As we are still in the early phase of adopting GenAI for scenarios that go beyond simple retrieval augmented generation, there are functional questions to be addressed: How should you measure the quality of the end-to-end results? What are good concepts for interactions between models and humans? And how much control can we give to an AI in a complex process without giving up on ever understanding what is actually going on?
Hopefully my small contribution helped you on your own journey towards learning more about the wonderful and seemingly endless applications of GenAI technology. Feel free to leave a comment or a PR on the repo.