Your Management AI is Probably Wasting Time (And Tokens)
When to use thinking models, how to manage context windows, and why more complexity isn't always better
I stared at my screen for 45 seconds.
My local model was "thinking" through a simple question about project planning. The spinning cursor mocked me while I waited... and waited... and waited.
When it finally responded, the answer was worse than what I'd get from a quick, non-thinking model.
More complexity doesn't always mean better results.
Let's continue building our engineering management toolbox. If you missed the previous parts:
As a reminder, we have two goals:
Build something that will make your management job easier
Learn about the fundamental technology that drives the large language model revolution
In today's post we'll revisit the "thinking out loud" question, talk about attention, and then accelerate how we build our toolset.
Reasoning
When we switched from llama3.2 to qwen3, we also switched to a "thinking" model. You can turn this feature off by adding "/no_think" to either the system prompt (to turn it off globally) or to specific user prompts.
Here are some additional best practices on how and when to use this functionality.
Erase the Thoughts
The thinking part has short-lived utility: it helps the model answer the current question. Once the model has produced an answer, you can drop the thinking portion from the conversation history.
Why this matters for your management tools: If you're building an agent that processes 1:1 notes, those thinking blocks will eat up your context window fast. A single session with 5 team members could generate thousands of tokens of internal reasoning that you'll never need again.
As a quick reminder, LLMs don't really have memory. They only predict the next word (technically a token; I'll use 'word' for clarity here) based on the input. The LLM starts from scratch with every prompt. In fact, it starts from scratch with every new token it produces in its own answer. The chat template expands the system prompt, the list of available tools, and the previous conversation history into a single input. The conversation history includes the user prompts, the model's replies, and the results of previous tool invocations.
Models have a limited context window. For local or "edge" inference, the context size is typically small, about the length of a short novella. Well-structured prompts, relevant context, and continuous back-and-forth conversations will quickly stretch this window to its limit.
The "thinking" part, where a model talks to itself on how to best answer the user prompt, is verbose. If you use a large model from one of the main providers, this will mean paying for more tokens. If you run your model locally, a fuller context window means slower responses and lower quality results.
To minimize all of these downsides while still getting the benefits of thinking models, you should delete everything in the <think> block from the conversation history.
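Here's a minimal sketch of that cleanup in Python. The `strip_thinking` helper and the message format are my own illustration rather than part of any particular library; the `<think>...</think>` tags are what qwen3-style models emit, so adjust them to whatever your model produces.

```python
import re

# Qwen3-style thinking models wrap their reasoning in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(messages: list[dict]) -> list[dict]:
    """Return a copy of the chat history with reasoning blocks removed
    from assistant messages, so they don't eat the context window."""
    cleaned = []
    for message in messages:
        if message.get("role") == "assistant":
            content = THINK_BLOCK.sub("", message.get("content", ""))
            message = {**message, "content": content.strip()}
        cleaned.append(message)
    return cleaned

# Example: you see the full reply (with thinking) once,
# but only the cleaned version goes back into the next prompt.
history = [
    {"role": "user", "content": "Summarize my 1:1 notes with Sarah."},
    {"role": "assistant",
     "content": "<think>The user wants a summary...</think>Sarah flagged two risks."},
]
print(strip_thinking(history))
```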
Use It When You Need It
With excellent timing, Apple published a paper called "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity".
Here's what you need to know:
For simple questions, turn off thinking by adding '/no_think' to the user prompt. Non-thinking mode actually provides (slightly) higher-quality answers.
For moderate complexity questions, such as refining a document with a given set of instructions, thinking models provide genuine value.
Thinking won't help with real complexity, which is still beyond LLMs.
Why this matters for your management tools: When you're asking your agent "What should I focus on in my 1:1 with Sarah?" that's a simple lookup task. Skip the thinking. But when you ask "Help me write performance review feedback based on these 10 data points," the thinking mode will help synthesize better responses.
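Here's a small sketch of what that decision could look like in code. The `build_prompt` helper and the `complex_task` flag are illustrative assumptions; the only real mechanism is the '/no_think' soft switch discussed above.

```python
def build_prompt(task: str, complex_task: bool) -> str:
    """Append the soft switch when the task doesn't need reasoning.

    The complexity flag is a placeholder assumption; in a real tool you
    would decide based on the kind of request you're handling.
    """
    if complex_task:
        return task  # let the model think through multi-step synthesis
    return f"{task} /no_think"  # simple lookups: skip the reasoning block

# Simple lookup: no thinking needed.
print(build_prompt("What should I focus on in my 1:1 with Sarah?", complex_task=False))

# Synthesis across many data points: keep thinking on.
print(build_prompt("Draft performance review feedback from these 10 data points.",
                   complex_task=True))
```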
The Context Window
Here's a reading comprehension exercise. It's pretty similar to what I see my kids doing at school.
"When Sarah walked into the crowded restaurant, she immediately noticed the elegant chandelier hanging from the ornate ceiling, the soft jazz music playing in the background, the aroma of fresh bread wafting from the kitchen, the gentle murmur of conversations from other diners, the clinking of silverware against fine china, the warm yellow glow of candlelit tables, the professional servers weaving between guests with practiced grace, and finally spotted her friend Maria waving enthusiastically from a corner booth near the window, which made her smile because she had been worried about finding her in such a busy place."
Who does the "her" right at the end refer to?
This is an example of forming semantic connections between words that are far apart. You may take that ability for granted, but practically every school in the world has to teach it, which tells you it's not as simple as it looks.
Solving this problem is at the heart of the large language model revolution.
Why this matters for your management tools: Your 1:1 notes from January need to connect to performance review conversations in June. The model needs to understand that "Sarah mentioned feeling overwhelmed" in week 3 relates to "Sarah's productivity concerns" in week 15. That's the same semantic connection challenge.
Here's what you need to know about attention: It's just a 2D table.
Yes, this is a simplified view. It is useful as a mental model.
The rows are each and every word (token) in the prompt. The columns are, again, each and every word in the prompt. Each cell holds how strongly the word in the row relates to the word in the column.
In our example paragraph, assuming 1 word is 1 token, we have a 98x98 table.
Counting words from zero, the value in cell [1, 92] will be high, because 'Sarah' is semantically close to the final 'her'. The value in cell [1, 51] will be low, because 'warm' does not relate to 'Sarah'.
Computing this table is what makes LLMs work. It is also what makes larger context windows challenging: the table grows quadratically with the number of tokens. And it is why reasoning models increase the latency between the question and the first token of the answer; every token of thinking adds another row and column to process.
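To make the table concrete, here's a toy sketch in NumPy. The vectors are random stand-ins rather than real learned embeddings; the point is the shape of the computation: every token is scored against every other token, so the table (and the work) grows with the square of the prompt length.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["When", "Sarah", "walked", "...", "her"]  # stand-in for the 98-word paragraph
d = 16                                              # toy embedding size

# Random stand-ins for each token's learned query/key vectors.
queries = rng.normal(size=(len(tokens), d))
keys = rng.normal(size=(len(tokens), d))

# The "2D table": one row and one column per token.
scores = queries @ keys.T / np.sqrt(d)              # shape: (n_tokens, n_tokens)

# Softmax each row so it reads as "how much does this token
# attend to every other token".
scores -= scores.max(axis=1, keepdims=True)         # numerical stability
attention = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(attention.shape)   # (5, 5) here; (98, 98) for the full paragraph
```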
Why this matters for your management tools: Every time you add another 1:1 note or meeting summary to your context, you're making that attention table bigger. At some point, your model will start losing track of connections between early conversations and recent ones. That's when you need smarter chunking strategies.
Let's Get Building
Armed with a deeper understanding of what's going on under the hood, we can roll up our sleeves and get back to building.
Earlier in the series, we used a "low level" library to minimize abstractions.
Now we can move faster.
Here is your challenge:
Use the Agno Python library to write a simple agent. You can continue to use a local Ollama model like qwen3 or connect it to a cloud provider.
Build something _simple_. I recommend creating a quick agent that reads your 1:1 notes and helps you prepare for the next session; there's a starting-point sketch after this list.
Share what you built in the comments.
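To get you unstuck, here's a minimal sketch under a few assumptions: Agno's `Agent` and `Ollama` classes as shown in its documentation, a local qwen3 model already pulled into Ollama, and a placeholder notes file. Treat it as a starting point, not a finished tool.

```python
from pathlib import Path

from agno.agent import Agent
from agno.models.ollama import Ollama

# Placeholder path: point this at wherever you keep your 1:1 notes.
notes = Path("notes/sarah.md").read_text()

agent = Agent(
    model=Ollama(id="qwen3"),
    instructions=[
        "You help an engineering manager prepare for 1:1 meetings.",
        "Be concise: open questions, follow-ups, and one suggested topic.",
    ],
    markdown=True,
)

# A prep question is a simple lookup, so we skip thinking mode.
agent.print_response(
    f"Here are my past 1:1 notes:\n\n{notes}\n\n"
    "What should I focus on in the next session? /no_think"
)
```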
The goal isn't perfection. The goal is building something that makes your next 1:1 meeting 10% better.
Because that's how you compound small wins into management superpowers.