"AI Agents" are by now commonplace in our day-to-day lives. We seem to be surrounded by tools, and by ads promising that those tools will make our lives easier.
Despite that, when talking to people it is astounding how many use the term without a decent grasp of the concept.
Most often, agents get mixed up with automations.
My own explanations often fail, perhaps because of too many technical terms, so I'll try to explain it here.
Technical terms are often a sign that the explainer himself does not understand. So I'll try not to use them at all, or to explain them beforehand.
So let's start with the main building block of any agent: a large language model ("LLM").
What does an LLM do? Think of it as a box with one hole on each of two sides. Through one hole it takes in some input (text, ALWAYS text) and through the other it outputs text (again, ALWAYS text). What text exactly? The text that is statistically most likely to be the continuation of the input.
You ask: "What is the capital of Germany?" and with a very high probability it'll return "Berlin".
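If it helps to see "most likely continuation" as code, here is a toy sketch. The table and its numbers are made up by me purely for illustration; a real LLM computes such probabilities from its training rather than looking them up, and the function name is hypothetical.

```python
# Toy stand-in for the box: a hand-written table of invented probabilities.
# A real LLM derives these numbers from its training; nothing here is real data.
TOY_PROBABILITIES = {
    "What is the capital of Germany?": {"Berlin": 0.97, "Munich": 0.02, "Paris": 0.01},
}


def most_likely_continuation(prompt):
    """Return the candidate text with the highest probability for this prompt."""
    candidates = TOY_PROBABILITIES[prompt]
    return max(candidates, key=candidates.get)


print(most_likely_continuation("What is the capital of Germany?"))  # prints "Berlin"
```

Text goes in, the likeliest text comes out. That is the whole contract of the box.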
How it does that, that is, what happens inside the box, is not actually important. Just remember that its ability to do so is achieved through a process called "training".
What is important, however, is that it is a "stateless" box. Now, what the f--- is stateless? It means it remembers nothing. If you've ever seen the movie "50 First Dates" (don't rewatch it, it's not particularly great), the female protagonist wakes up every morning having forgotten all that happened to her since an accident. The male protagonist spends every day, over and over, making her fall in love with him. "Groundhog Day" follows a similar pattern, only in this case the world around the protagonist is reset. He nonetheless has a similar goal.
It is the same for LLMs. Every time you prompt an LLM, it's the woman from 50 First Dates. It remembers nothing but its training. Everything else it needs to re-read.
You heard that right. Every. Single. Time. You. Prompt. It.
Let's hammer that in, because it is hidden away by the chat interface that makes it seem as if you are talking to something that remembers what you said. Here's what actually happens.
When you ask: "What is the capital of Germany?", the LLM receives:
System: ... (we'll get to that in a minute)
User: What is the capital of Germany?
It answers with "Berlin". When you go on to ask: "Since when has it been the capital of Germany?", the LLM, not having a state, does not even remember your first question. It therefore receives:
System: ...
User: What is the capital of Germany?
LLM: Berlin
User: Since when has it been the capital of Germany?
Again, let that sink in: it receives everything that has happened up to that point. Every time you prompt, it receives the entire conversation history. That's what I mean when I say it is "hidden" away from you.
Because an LLM forgets everything every time, we also have to tell it its objective every single time. That is what goes into System (the "system prompt"). So to complete the picture, on your first question it receives:
System: You are a helpful assistant...
User: What is the capital of Germany?
And on your follow-up prompt:
System: You are a helpful assistant...
User: What is the capital of Germany?
LLM: Berlin
User: Since when has it been the capital of Germany?
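The transcripts above can be sketched in a few lines of Python. This is only my illustration of the idea, not how any particular chat product is built: `fake_llm`, `build_input` and the `Chat` class are hypothetical names, and the fake box only knows one canned answer.

```python
SYSTEM_PROMPT = "You are a helpful assistant..."


def build_input(history):
    """Glue the system prompt and the ENTIRE conversation into one piece of text."""
    lines = [f"System: {SYSTEM_PROMPT}"]
    lines += [f"{speaker}: {text}" for speaker, text in history]
    return "\n".join(lines)


def fake_llm(full_text):
    # Hypothetical stand-in for the box. It keeps nothing between calls.
    last_line = full_text.splitlines()[-1]
    if last_line == "User: What is the capital of Germany?":
        return "Berlin"
    return "(...)"


class Chat:
    """The program behind the chat interface. The memory lives here, not in the LLM."""

    def __init__(self, llm):
        self.llm = llm
        self.history = []  # list of (speaker, text) pairs
        self.sent = []     # every text the LLM received, kept so we can inspect it

    def ask(self, question):
        self.history.append(("User", question))
        full_text = build_input(self.history)  # the whole history, every single time
        self.sent.append(full_text)
        answer = self.llm(full_text)
        self.history.append(("LLM", answer))
        return answer


chat = Chat(fake_llm)
chat.ask("What is the capital of Germany?")
chat.ask("Since when has it been the capital of Germany?")
print(chat.sent[1])  # what the box received on the follow-up prompt
```

The last line prints the same four lines as the second transcript: system prompt, first question, first answer, follow-up question.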
With that out of the way, let's remember our ingredients again. Text. I repeat: ONLY text. LLMs take in text and output text. They can't do anything else.
Now, you say, I see LLMs do all sorts of fancy things, like searching the web or fetching live weather. You may have even heard ads saying that they use tools to get stuff done.
You are right. Nonetheless, they do that using only text. How?
Let's start with training. LLMs have been trained to output text formatted in a very specific way if they want to use a tool. They do that by using easily distinguishable markers like tags:
Tool marker: <tool>...instructions...</tool>
So if they want to use a web search, the exchange looks like this:
System: You are a helpful assistant (...). Tools you have: websearch
User: Show me what is important today
LLM:<tool> Websearch: What important things happened today </tool>
All that is text.
Now, the program I hinted at earlier, the one behind the chat interface that controls and changes what goes in and out of the box, is programmed to scan the text output and to perform a predetermined action as soon as it sees <tool>...</tool>. In this case, to search the web.
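Scanning text for a marker is nothing exotic. Here is a minimal sketch using Python's standard `re` module. I am assuming the made-up `<tool>Name: instructions</tool>` shape from the example above; real systems use their own formats, and `find_tool_call` is a hypothetical helper name.

```python
import re

# Matches everything between <tool> and </tool>, even across line breaks.
TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def find_tool_call(llm_output):
    """Return (tool_name, instructions) if the text contains a tool marker, else None."""
    match = TOOL_PATTERN.search(llm_output)
    if match is None:
        return None
    # Assumed convention from the example: the tool name comes before the first colon.
    name, _, instructions = match.group(1).partition(":")
    return name.strip().lower(), instructions.strip()


print(find_tool_call("<tool> Websearch: What important things happened today </tool>"))
# ('websearch', 'What important things happened today')
print(find_tool_call("Berlin"))
# None
```

If the function returns something, the program runs the matching tool. If it returns None, the text is an ordinary answer.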
The program runs the search and returns the result, as text, to the LLM. To be very specific: to a brand new LLM that, again, doesn't know about anything that happened before and receives everything that happened in your chat plus the result of the web search as text. It uses that new info to write a new piece of text for you, enriched by the information from the web search:
System: You are a helpful assistant (...). Tools you have: websearch
User: Show me what is important today
LLM:<tool> Websearch: What important things happened today </tool>
Tool: Germany won the world cup. Stocks are up. (...) ← injected by the program into the chat
The LLM will now run again and return:
System: You are a helpful assistant (...). Tools you have: websearch
User: Show me what is important today
LLM:<tool> Websearch: What important things happened today </tool>
Tool: Germany won the world cup. Stocks are up. (...) ← injected by the program into the chat
LLM: Good news! Germany has won the world cup and stocks are up as a result (...) ← This is what you get to see in the chat interface
What you have just seen is "tool use": an LLM trained to output specially formed text that activates a process of the program it is hooked into. The program in turn feeds the output of that process back to the LLM.
With that out of the way, we have everything we need to actually explain what an agent is. In fact, you just saw one.
By "using" a tool the LLM did not output an answer to you. Instead it activated a process in the program that fed today's news to itself. Itself as in its new self that forgot everything and was fed the whole chat, including the output of the tool.
A round trip was made. Output became input, enriched by the result of the tool. Put differently: a loop. An agent is an LLM that unknowingly sits in a loop, and that loop is only broken when it does not call a tool. Then you, the user, get to see the result. That is one of the reasons why you sometimes see that spinner going for a long time before you receive an answer.
That is the "magic". The LLM "decides" how many times it feeds itself more info ("loops") before returning an answer to the user.
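For the curious, the whole loop fits on one screen. This is a sketch under my own assumptions, not anybody's production agent: `fake_llm` is a hypothetical stand-in for the box, `websearch` returns a canned result instead of searching anything, and the `max_loops` safety cap is my addition so the loop cannot run forever.

```python
import re

TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def websearch(query):
    # Hypothetical tool: a canned result instead of a real web search.
    return "Germany won the world cup. Stocks are up. (...)"


TOOLS = {"websearch": websearch}


def fake_llm(full_text):
    # Hypothetical stand-in for the box: it asks for the tool until it sees a result.
    if "\nTool:" in full_text:
        return "Good news! Germany has won the world cup and stocks are up as a result (...)"
    return "<tool> Websearch: What important things happened today </tool>"


def run_agent(llm, system_prompt, user_message, max_loops=5):
    """Feed the LLM its own output plus tool results until it stops calling tools."""
    transcript = [f"System: {system_prompt}", f"User: {user_message}"]
    for _ in range(max_loops):
        output = llm("\n".join(transcript))  # the "new self" re-reads everything
        transcript.append(f"LLM: {output}")
        match = TOOL_PATTERN.search(output)
        if match is None:
            return output, transcript  # no tool call: the loop breaks, you see this
        name, _, instructions = match.group(1).partition(":")
        tool = TOOLS.get(name.strip().lower())
        result = tool(instructions.strip()) if tool else "Error: unknown tool"
        transcript.append(f"Tool: {result}")  # injected by the program into the chat
    return "Gave up after too many loops.", transcript


answer, transcript = run_agent(
    fake_llm,
    "You are a helpful assistant (...). Tools you have: websearch",
    "Show me what is important today",
)
print("\n".join(transcript))
```

Note that `run_agent` never decides how often to loop. It just keeps going for as long as the text coming out of the box contains a tool marker.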
That is all. Simple, but effective.
Let's bring this blog post to a close by coming back to our initial problem: the mix-up between agents and automations / workflows. An agent can be an automation, but not every automation is an agent:
Many automations follow a set number of steps that branch in pre-determined ways. When A happens, B happens. When B happens, C or D can happen. We know exactly that it will be either A -> B -> C or A -> B -> D. C can be an LLM box that outputs a Slack message:
System: You are a helpful assistant.
User: Create a Slack message given this meeting info: (...Here go the results of Meeting A and B)
LLM: Hi Pam, fyi, the following items were discussed in today's meeting: (...)
An LLM is involved, yet it is not an agent. The LLM has no tools, so it cannot alter the course of things. If there were no meetings, the LLM would still draft a Slack message in that last step.
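In code, such an automation is just a fixed sequence of function calls with an LLM somewhere in the middle. The sketch below is my own simplification: I left out the C-or-D branch, `fake_llm` is a hypothetical stand-in for the box, and nothing is actually sent to Slack.

```python
def fake_llm(full_text):
    # Hypothetical stand-in for the box: it always drafts something.
    return "Hi Pam, fyi, the following items were discussed in today's meeting: (...)"


def run_workflow(meeting_info, llm):
    """Fixed steps in a fixed order. The LLM fills in text but decides nothing."""
    steps = ["A: collect meeting info"]
    prompt = (
        "System: You are a helpful assistant.\n"
        "User: Create a Slack message given this meeting info: " + " ".join(meeting_info)
    )
    steps.append("B: build prompt")
    message = llm(prompt)  # no tools on offer, so no way to change the path
    steps.append("C: LLM drafts Slack message")
    return steps, message


steps_with, _ = run_workflow(["(...results of Meeting A)", "(...results of Meeting B)"], fake_llm)
steps_without, message = run_workflow([], fake_llm)
print(steps_with == steps_without)  # True: same path, even with no meetings at all
```

With or without meetings, the steps are identical. The path is decided by whoever wrote the workflow, not by the LLM.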
If the workflow were agentic, the LLM would have tools, and we could not be quite sure what happens:
System: You are a helpful assistant. Tools you have: Check calendar, send slack message, send email
User: Inform Pam about what happened in the meeting today ← You
LLM: I should use the calendar tool:<tool> search calendar </tool>
Tool: Info Meeting 1, Info Meeting 2, Info Meeting 3
LLM: I have a tool called send email. Let's draft an email.<tool> (...draft...) </tool>
Tool: Error
LLM: There appears to have been an error. I should try the Slack tool:<tool> (...draft...) </tool>
Tool: Slack message sent.
LLM: I have notified Pam about what happened in all your meetings today.
See how that could have gone completely differently? After the email error the LLM box could have also output a message to you saying that emails don't work. If there had been no meetings it could have simply told you there were no meetings. Or it could have sent Pam a message saying there were no meetings (not particularly helpful).
That is the tradeoff: increased flexibility in exchange for higher error rates. Sometimes an automation is far superior. But given that "agent" is the buzzword, deterministic automations that may involve LLMs are quickly labeled as agents.
And while "agent" sounds fancy, you have just learned that it is actually only a clever mechanism that feeds itself text.