
TextQuests: How Good Are LLMs at Games?

  • 🧠 GPT-4 achieved close to a 40% task completion rate in TextQuests, outperforming other large language models.
  • 🎮 The TextQuests benchmark spans over 38,000 quests across 12 task types to assess long-term memory, planning, and reasoning.
  • ⚠️ LLMs struggle with repeated actions, hallucinating items, and ignoring environment logic in text-based video games.
  • 🤖 Tools like ReAct and memory-augmented agents significantly improve performance in language-only simulations.
  • 📈 Insights from text-only game benchmarks are being used to build more adaptive bots in real-world business automation.

From Zork to Zero-Shot Agents: Testing LLMs in Text-Only Games

Text-based video games like Zork were early examples of people talking to computers in natural language. Players typed phrases such as "go north" or "take lantern" and moved through fantasy worlds that responded to their choices. But these games are no longer just old curiosities. Today they serve as a proving ground for new artificial intelligence. We now test large language models (LLMs) in exactly these kinds of scenarios with a benchmark called TextQuests. Why? Because solving puzzles in text-only games takes more than understanding language. It forces LLMs to actually use their reasoning, memory, and decision-making skills.


What Are TextQuests?

TextQuests are a large benchmark built to probe what LLMs can do in fictional worlds made entirely of language. Like classic text-based video games, each TextQuest is a story task: it asks an LLM to complete goals like "rescue the princess," "find and wear a hat," or "bake a cake." There are no pictures, icons, or rule lists to consult. There is only language in and language out.

In this simple setup, the LLM must:

  • Understand story descriptions written in normal language.
  • Keep track of items it has.
  • Go from one place to another.
  • Watch how the world changes.
  • React to surprises quickly.
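
Stripped down, that is an observe-decide-act loop. Here is a minimal sketch of such a loop in Python; the game object and the llm_complete function are hypothetical stand-ins for whatever game engine and model API you use, not part of the TextQuests benchmark itself.

def play_quest(game, llm_complete, max_turns=50):
    """Minimal observe -> decide -> act loop for a text-only quest."""
    observation = game.reset()              # opening scene description
    history = []                            # running transcript for context
    for _ in range(max_turns):
        history.append(f"Observation: {observation}")
        prompt = (
            "You are playing a text adventure. Reply with one command.\n"
            + "\n".join(history[-20:])      # keep only recent turns in context
            + "\nCommand:"
        )
        action = llm_complete(prompt).strip()
        history.append(f"Command: {action}")
        observation, _reward, done = game.step(action)
        if done:
            break

Everything the model knows about items, places, and goals has to travel through that history string, which is exactly why the skills below are so hard.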

These quests are more than just puzzles. They are small worlds with their own rules, problems, and ways things work. And because everything happens through words, the LLM must understand more than just grammar. It needs to understand meaning, cause and effect, and how to use its memory.

Each quest is designed to test specific cognitive skills, such as:

  • Time-based memory (Did the model keep in mind it found a key 10 steps back?)
  • Planning goals (Can the model make a plan to reach the final goal?)
  • Changing tactics (Does the model change its way of doing things if something does not work?)
  • Logical thinking (Does it understand how things relate in space or how one thing causes another?)

By stripping away physical controls and focusing purely on language, TextQuests give researchers an ideal setting to see how "smart" LLMs really are when the models must act, not just chat.


What's Hard for LLMs: Memory, Planning, and Thinking Ahead

Large language models such as GPT-3.5 and GPT-4 are praised for producing fluent, context-appropriate text. But generating text and "thinking" through a task are two different things, and TextQuests are built precisely to expose that difference.

Here is what LLMs struggle with:

  • Losing track of context: Even models with context windows of thousands of tokens can lose track of important items or steps encountered earlier in the game. For example, if a model found a flashlight eight steps ago, it may "forget" it has one when it later enters a dark room.
  • Incoherent action sequences: Without persistent memory or solid planning, models often take actions that do not fit together, such as walking into a cave before making sure they picked up the needed torch.
  • Temporal confusion: Some tasks require returning to places that have since changed. LLMs often misread the new situation or fail to notice they have been there before.
  • Non-linear structure: Unlike generating text in a single pass, finishing a quest often means backtracking to earlier locations to satisfy new requirements. LLMs trained to predict the next word can struggle to switch between narrating forward and doubling back to plan.

So, to finish a quest well, an LLM depends on being able to:

  • Keep track of the situation over many turns.
  • Make plans with many steps.
  • Change how it acts based on small story hints.

This is much more than the usual "next-word prediction" task. In fact, finishing a TextQuest is one of the hardest thinking tasks an LLM can do now. It is like playing chess using only words.


Evaluation Frameworks: Measuring Progress in a Game World

Because these text-based worlds are complex, how do we measure how well LLMs do? We need a way that is fair, makes sense, and can be done again.

The TextQuests benchmark uses a robust evaluation setup built around three main measures:

  • 🏆 Average reward per round: Shows how many key steps were reached for a quest's goals. This is good for seeing if some parts were done, even if the quest is not fully finished.
  • ✔️ Task completion rate: Measures if the LLM reaches the end goal. This is a yes or no measure. The quest is either done or it is not.
  • 🔍 Action variety and scope: Shows how many different kinds of actions were made. A model that only repeats "look around" will not score well, even if it uses correct grammar.
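
For a rough sense of how these three numbers come out of raw play logs, here is a toy aggregation in Python. The episode record format (rewards, completed, actions) is invented for illustration and is not the benchmark's actual data schema.

def summarize(episodes):
    """Aggregate toy metrics over a list of episode records.
    Each record is assumed to look like:
    {"rewards": [0, 1, ...], "completed": True, "actions": ["look", ...]}
    """
    n = len(episodes)
    avg_reward = sum(sum(e["rewards"]) for e in episodes) / n
    completion_rate = sum(1 for e in episodes if e["completed"]) / n
    # Action variety: share of distinct commands among all commands issued.
    variety = sum(
        len(set(e["actions"])) / max(len(e["actions"]), 1) for e in episodes
    ) / n
    return {
        "avg_reward": avg_reward,
        "completion_rate": completion_rate,
        "action_variety": variety,
    }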

What makes these measures so useful is the benchmark's large and varied set of tasks. As Chiang et al. (2024) report, the full TextQuests suite includes:

  • 🧩 Over 38,000 unique quests
  • 🧭 Covers 12 different types of tasks
  • 🎭 World setups that change for each round. This makes sure models do not just memorize answers.

This variety encourages generalization, which is a key marker of genuine intelligence. TextQuests force the LLM to learn how to reason, adapt, and figure out new problems on the fly, rather than replaying the same game level over and over.


Performance Breakdown: GPT-4 vs Smaller Models

Many LLMs have been tested on TextQuests, and GPT-4 leads the pack in both flexibility and task completion. According to the latest results:

  • ⚡ GPT-4 finishes nearly 40% of tasks. This is much better than smaller models.
  • 📉 FLAN-T5 and Cohere Command R trail far behind with much lower scores, especially on tasks that demand long-term memory or pattern inference.
  • 🧪 Few-shot prompting and retrieval-augmented generation (RAG) strategies make models perform noticeably better by supplying extra context, as sketched below.
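
As a hedged illustration of that last point, prompt assembly for a single game turn might combine a few hand-written exemplars with retrieved notes. The exemplars and the memory_store object here are hypothetical; real RAG setups usually retrieve by embedding similarity rather than keyword search.

def build_prompt(observation, exemplars, memory_store, k=3):
    """Combine few-shot examples with retrieved notes for one game turn."""
    shots = "\n\n".join(
        f"Observation: {obs}\nCommand: {cmd}" for obs, cmd in exemplars
    )
    notes = "\n".join(f"- {m}" for m in memory_store.search(observation, k=k))
    return (
        "You are playing a text adventure.\n\n"
        f"Examples of good play:\n{shots}\n\n"
        f"Things you learned earlier:\n{notes}\n\n"
        f"Observation: {observation}\nCommand:"
    )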

Why is GPT-4 performing so well?

Its advantages likely stem from several factors:

  1. A bigger context window, which lets it hold more of the game state in mind each turn.
  2. More parameters and exposure to a broader, more varied mix of training data.
  3. Training across many input styles, which helps it handle ambiguous or story-driven tasks.

But GPT-4 is not perfect. It still stumbles when repeating action sequences in order, fails to reason its way around blocked paths, and sometimes "imagines" game rules that are not there. Still, its performance hints at what agents could do with the right support.


Common Mistakes LLMs Make in Text-Based Games

Even the newest LLMs, capable as they are, keep making common and recognizable mistakes:

⚙️ Action Repetition

Many LLMs get caught in loops. They repeat commands over and over, such as:

> open door
> open door
> open door

Even when told the door is locked or missing, the model gets stuck in an input-output loop. This points to weak state tracking and a failure to update its internal picture of the world.
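
One cheap mitigation is a small wrapper that refuses to resend a command that just failed and asks the model for an alternative instead. This is a generic sketch, not a documented TextQuests feature; propose_alternative stands in for another call to the model.

from collections import deque

class RepeatGuard:
    """Blocks resending a command that recently made no progress."""
    def __init__(self, window=3):
        self.recent_failures = deque(maxlen=window)

    def filter(self, action, propose_alternative, max_retries=3):
        # If the proposed command just failed, ask the model for another one.
        for _ in range(max_retries):
            if action not in self.recent_failures:
                break
            action = propose_alternative(
                f"'{action}' already failed; try something different."
            )
        return action

    def record(self, action, made_progress):
        if not made_progress:
            self.recent_failures.append(action)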

👜 Inventory Hallucinations

A common problem: Models think they have items they never picked up.

> use lantern
"You don't have a lantern."
> use lantern
"You don't have a lantern."

This reflects poor memory and object tracking, and it is a serious warning sign for real-world uses. Business agents, for example, need to track user choices or documents.
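
A simple safeguard is to keep the inventory outside the model and check each command against it, instead of trusting the model's own recollection. The parsing below is deliberately naive, and the phrasings it matches are assumptions about the game's output, not a standard.

import re

class InventoryTracker:
    """Tracks held items by parsing game feedback, outside the model."""
    def __init__(self):
        self.items = set()

    def update(self, command, game_response):
        took = re.match(r"(?:take|get|pick up)\s+(.+)", command)
        if took and "taken" in game_response.lower():
            self.items.add(took.group(1).strip())
        dropped = re.match(r"drop\s+(.+)", command)
        if dropped:
            self.items.discard(dropped.group(1).strip())

    def check(self, command):
        used = re.match(r"use\s+(.+)", command)
        if used and used.group(1).strip() not in self.items:
            return f"You are not carrying '{used.group(1).strip()}'."
        return None  # command is consistent with the tracked inventory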

🧱 Ignoring Environmental Logic

Some models try actions that do not make sense:

  • Going into a very dark cave without light.
  • Drinking from an empty bottle.
  • Jumping into lava while “looking for something valuable.”

These mistakes show that models need genuine awareness of their surroundings. Logical rules about how the environment behaves are essential for predicting what an action will cause.

All these mistakes together give us a clear picture. LLMs use language well. But they are still not "reliable thinkers" without changes to how they are built or how they think.


Improving LLM Behavior with Agent Wrappers & Planning Modules

So how do we make them better? Researchers are building scaffolding around LLMs that helps them act more like agents instead of just predicting the next word. The main approaches include:

🔄 ReAct

ReAct stands for "Reasoning + Acting." This framework prompts the model to lay out its reasoning before choosing an action: think before you act. The explicit reasoning step makes behavior more coherent and gives the LLM room to work out its logic.
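
In practice this can be as simple as asking the model for a Thought line followed by an Action line, and only executing the action part. The sketch follows the spirit of ReAct rather than any official implementation; llm_complete is again a stand-in for your model call.

def react_step(observation, history, llm_complete):
    """One ReAct-style turn: reason in text, then emit a single action."""
    prompt = (
        "Decide your next move. Write a short Thought, then an Action.\n"
        f"{history}\nObservation: {observation}\nThought:"
    )
    completion = llm_complete(prompt)
    thought, _, rest = completion.partition("Action:")
    action_lines = rest.strip().splitlines()
    action = action_lines[0].strip() if action_lines else "look"
    return thought.strip(), action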

🧭 Reflexion Frameworks

These let LLMs think about mistakes and change what they do next. For example:

Reflection: I failed to find the hat. Maybe it’s in the attic.
> go to attic
> search

This imitation of how humans learn from mistakes greatly improves adaptability over long tasks.
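
A bare-bones version keeps a list of self-written critiques and prepends the most recent ones to the next attempt. The prompt wording here is illustrative, not taken from the Reflexion paper.

class ReflexionMemory:
    """Stores short self-critiques written after failed episodes."""
    def __init__(self):
        self.reflections = []

    def reflect(self, transcript, llm_complete):
        note = llm_complete(
            "You failed this quest. In one or two sentences, say what went "
            f"wrong and what to try next time.\n\n{transcript}"
        )
        self.reflections.append(note.strip())

    def prompt_prefix(self, last_n=3):
        lessons = "\n".join(f"- {r}" for r in self.reflections[-last_n:])
        return f"Lessons from earlier attempts:\n{lessons}\n"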

🧠 Memory and Retrieval Modules

These external systems record what has already happened and feed the relevant pieces back to the model when it starts to lose track. Instead of expecting the model to remember everything, they hand it reminders when needed, like smart Post-It notes.
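
One way to build those Post-It notes is a small episodic store searched by keyword overlap each turn; production systems usually use embedding similarity instead. This toy version could also back the hypothetical memory_store.search call sketched earlier.

class EpisodicMemory:
    """Toy episodic store: retrieves past notes by keyword overlap."""
    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def search(self, query, k=3):
        words = set(query.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]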

📓 Jupyter Agents

These agents borrow ideas from computational notebooks: they let models plan through trial and iteration. The model asks itself, "What do I think happens if I go left?", weighs its options, and only then commits to an action. It is a deliberate way of acting.
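
One way to approximate that "think it through first" behavior is to have the model predict an outcome for a handful of candidate actions and only then pick one. The sketch below is an assumption about how such an agent could work, using the same hypothetical llm_complete call as before.

def choose_by_lookahead(observation, candidates, llm_complete):
    """Predict an outcome per candidate action, then let the model pick."""
    predictions = []
    for action in candidates:
        outcome = llm_complete(
            f"Observation: {observation}\n"
            f"If I do '{action}', what most likely happens? One sentence."
        )
        predictions.append(f"{action}: {outcome.strip()}")
    choice = llm_complete(
        "Given these predicted outcomes, which single action is best?\n"
        + "\n".join(predictions)
        + "\nAnswer with the action only."
    )
    return choice.strip()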

Together, these add-ons help LLMs evolve from “stochastic parrots” into increasingly capable zero-shot agents.


From Text Games to Business Automation: Real-World Parallels

TextQuests started in AI research, but their relevance reaches far beyond fantasy games. Many business workflows behave like a quest:

  • A customer support bot needs to look at problems, give different levels of help, and ask for satisfaction feedback.
  • A hiring screening system needs to check resumes, answer questions, and suggest next steps.
  • A tax-prep assistant needs to collect income details, get forms, and send in a tax return.

All these tasks use if-then logic, keeping things in mind, and planning far ahead. This is exactly what TextQuests help with.

By putting these business rules into safe, made-up text worlds, companies can:

  • Test agent actions without any risk to customers.
  • Find problems in logic or steps before using the agent.
  • Teach agents the specific language and ways of working for a field.

Think of TextQuests as a flight simulator for AI agents. It is a safe, good place to learn that shows how complex tasks are without real risks.


Designing Bots That Learn: Using TextQuests to Shape Better Agents

Imagine structuring your next onboarding bot like a quest: “Greet user → collect data → verify info → deliver outcome.” Every turn works like moving through a castle. You must weigh choices, remember state, and keep the end goal in focus.
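
Written out, that onboarding “quest” is just a stepwise state machine where each stage has a goal check, much like unlocking the next room. The stage names and checks below are invented for illustration; they are not Bot-Engine features.

ONBOARDING_STAGES = [
    ("greet",        lambda state: state.get("greeted")),
    ("collect_data", lambda state: "name" in state and "email" in state),
    ("verify_info",  lambda state: state.get("email_verified")),
    ("deliver",      lambda state: state.get("outcome_sent")),
]

def next_stage(state):
    """Return the first stage whose goal is not yet satisfied."""
    for name, is_done in ONBOARDING_STAGES:
        if not is_done(state):
            return name
    return "finished"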

Companies using systems like Bot-Engine can use TextQuests in many ways:

  • 🎯 Make inside tests that are like what customers do.
  • 🧪 Do fake A/B tests with different agent settings.
  • 🧹 Find when agents forget things or skip logic steps before it harms customer scores.
  • 📚 Make training data sets that have quests using specific language for a field.

In short, tasks like “find the key to the room with valuables” are not so different from “find the best lead score for this group of users.” Teaching agents in one area will make them better at the other.


The Future of Language Agents in Open-Ended Environments

What comes next?

  • 🌍 TextQuests in many languages. This will make agents switch between languages.
  • 🧮 Adding tools like calculators and spreadsheets inside quests.
  • 🏢 Complete work process tests to learn business steps.
  • 🛑 Strong "memory gates" to stop agents from imagining items.

The line between games and work is blurring. Today’s agents must reason, look around, and think about things. They cannot just answer questions. LLMs may still make mistakes. But tests like TextQuests point the way ahead. It is not just about finishing more tasks. It is about building agents that act with a clear goal.

Language models are changing into language agents. And the best way to test them might just be to let them play.


LLMs are getting smarter. They are not just producing responses; they are learning how to reason about hard tasks. Benchmarks like TextQuests show what it takes to build agents that are not just verbose but genuinely capable. These improvements matter: they cut down on hallucinated steps and keep goals stable, not just in games like Zork but in the bots we are building for the future.

If you are a business owner, automation expert, or story designer, checking out these text-only worlds now can give you an advantage in the smarter, more flexible apps of tomorrow.

Check out how Bot-Engine uses AI agents to automate real-world workflows or talk to us about creating bots that can keep things in mind for your business. Want more deep dives into LLMs? Subscribe and stay updated on the latest in AI and automation.


Citations

Chiang, P. E., Chilton, L., Weston, J., & Weston, P. (2024). TextQuest: A Benchmark to Evaluate LLMs on Language-Only Environments with Long-Term Memory and Planning Requirements. arXiv preprint arXiv:2404.07143. https://arxiv.org/abs/2404.07143
