There used to be a time in adventure games when you had to describe what you wanted to do rather than go up to an object and just press a single button.

You actually had to think about the story, the characters, the environment, and not just enter a button-mashing trance.

Well, I'm taking us back there - and better yet, with advanced sophisticated technology, you don't even have to play it yourself!

'What?' - YES!

WATCH MY ESCAPE is a sandbox game where you design the puzzle and an LLM has to crack it. It is also designed to run locally, so you only have to worry about your electricity bill.

Test Subjects

There are five models to choose from, selected for their size and varying qualities:

JetBrains Mellum2 12B - All-rounder for speed and problem solving.
Nvidia Nemotron 3 Nano 4B - Fast, nimble, and makes good first choices.
OpenBMB MiniCPM5 1B - Incredibly small (fits on a phone?) but with reasoning capabilities.
Cohere Tiny Aya - Small, quick, and punches above its weight for a non-reasoning model.
Google Gemma 4 12B - Mr Smarty Pants.

💡 Q4_K_M quantized variants were used for all models, so they should fit in about 8GB of VRAM.
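For a rough sanity check on that 8GB figure, here is a back-of-the-envelope estimate. It assumes Q4_K_M averages around 4.85 bits per weight, which is an approximation rather than an exact property of the format, and it only counts the weights: context (KV cache) and runtime overhead come on top. The function name `estimate_weight_gb` is made up for this sketch.

```python
# Rough weight-only size estimate for a quantized model.
# ASSUMPTION: Q4_K_M averages roughly 4.85 bits per weight; the real figure
# varies by architecture because some tensors are kept at higher precision.
def estimate_weight_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Parameter counts taken from the model names in the list above.
for name, size_b in [("12B", 12), ("4B", 4), ("1B", 1)]:
    print(f"{name}: ~{estimate_weight_gb(size_b):.1f} GB of weights")
```

Under that assumption the 12B models come out a little over 7GB, which is why 8GB is a snug but plausible target for the largest ones.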

Gameplay

All gameplay is centered around the following action verbs:

Available actions:
- close(target): Close an object.
- examine(target): Look closely at an object.
- operate(target): Operate a device, mechanism, or control.
- open(target): Open an object.
- pick_up(target): Pick up an item and add it to your inventory.
- pull(target): Pull an object.
- push(target): Push an object.
- talk_to(target, text): Say something to an object or character.
- use_item(item, target): Use your inventory item on another object.

These describe how the environment is interacted with, and it's up to the model to decide which action to take and which object(s) to apply it to.
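To make the action list a bit more concrete, here is a minimal sketch of how a game engine might turn a model's chosen action, written as text like `use_item(key, door)`, into something it can validate. The `Action` class and `parse_action` function are hypothetical names for illustration, not the game's actual code; only the verbs and their argument counts come from the list above.

```python
import re
from dataclasses import dataclass

# Number of arguments each verb takes, copied from the action list above.
ACTIONS = {
    "close": 1, "examine": 1, "operate": 1, "open": 1, "pick_up": 1,
    "pull": 1, "push": 1, "talk_to": 2, "use_item": 2,
}


@dataclass(frozen=True)
class Action:
    verb: str
    args: tuple[str, ...]


# Matches "verb(anything)" with optional surrounding whitespace.
_CALL = re.compile(r"^\s*([a-z_]+)\s*\((.*)\)\s*$", re.DOTALL)


def parse_action(text: str) -> Action:
    """Parse 'verb(arg, ...)' and check the verb and argument count."""
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    verb, raw_args = match.groups()
    if verb not in ACTIONS:
        raise ValueError(f"unknown verb: {verb!r}")
    arity = ACTIONS[verb]
    # Split on at most arity - 1 commas so talk_to text may itself contain commas.
    parts = raw_args.split(",", arity - 1)
    args = tuple(part.strip().strip("\"'") for part in parts)
    if len(args) != arity or not all(args):
        raise ValueError(f"{verb} expects {arity} argument(s): {text!r}")
    return Action(verb, args)


print(parse_action("use_item(key, door)"))
print(parse_action('talk_to(guard, "hello, friend")'))
```

Anything that fails to parse can be rejected before it touches the game state, which matters a lot when the player is a 1B model.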

A brief, tailored description of their surroundings and some light guidance are also provided, but we're careful not to over-prompt here: we're testing their raw intelligence, and most of them already have their own internal reasoning capabilities.

As you can see, they really enjoy this process.

Nitty gritty

Reasoning is a reliable way of extracting more rationalised or considered decisions from an LLM. The approach is now so popular that even our small models are capable of reasoning.

However, one weakness still plagues small models even today: reliably outputting structured information.

Fortunately, inference engines like llama.cpp provide ways to constrain an LLM's output to a fixed grammar, so at each step it can only emit tokens that the grammar allows. One caveat is that this disrupts a small model's ability to reason, since reasoning and answer would have to share a single output channel. We could force the model to include its reasoning as part of the structured output, but that may feel unnatural to it and works against its training.
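As an illustration of what such a grammar could look like, the sketch below builds one in llama.cpp's GBNF text format from the action verbs and the objects currently in the scene. The post doesn't show the real grammar, so treat the rule names, the `build_action_grammar` helper, and the restricted character set for speech as assumptions. The resulting string is what you would hand to the runtime's grammar option.

```python
# Single-argument verbs from the action list; talk_to and use_item get their own rules.
ONE_ARG_VERBS = ["close", "examine", "operate", "open", "pick_up", "pull", "push"]


def gbnf_literal(text: str) -> str:
    """Quote text as a GBNF string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def alternatives(words: list[str]) -> str:
    """Join words into a GBNF alternation of literals."""
    return " | ".join(gbnf_literal(word) for word in words)


def build_action_grammar(targets: list[str], items: list[str]) -> str:
    """Build a grammar that only admits one well-formed action call."""
    if not targets:
        raise ValueError("at least one target is required")
    options = ["single", "talk"]
    rules = [
        'single ::= verb "(" target ")"',
        'talk ::= "talk_to(" target ", " speech ")"',
        f"verb ::= {alternatives(ONE_ARG_VERBS)}",
        f"target ::= {alternatives(targets)}",
        # ASSUMPTION: speech is limited to simple characters so it cannot
        # contain the comma or parenthesis that delimit the call.
        "speech ::= [a-zA-Z0-9 .!?']+",
    ]
    if items:  # only offer use_item when the inventory is not empty
        options.append("use")
        rules.append('use ::= "use_item(" item ", " target ")"')
        rules.append(f"item ::= {alternatives(items)}")
    return "\n".join([f"root ::= {' | '.join(options)}"] + rules) + "\n"


print(build_action_grammar(targets=["door", "guard"], items=["key"]))
```

Because the target and item names are baked into the grammar, the model can't invent an object that isn't in the room, which removes a whole class of small-model mistakes.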

The solution here is to split an agent's move into two parts: THINK, then ACT. The model can freely consider its environment and the next possible action during THINK, while the constrained ACT step keeps its interaction with the game engine reliable.
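In code, the two-step turn might look something like this. It's a sketch under assumptions: `generate` stands in for whatever call reaches your local model (a prompt plus an optional grammar in, text out), and the prompt wording is invented for the example. The fake model at the bottom exists only so the snippet runs without any inference backend.

```python
from typing import Callable, Optional

# HYPOTHETICAL interface: generate(prompt, grammar) -> completion text.
# Passing grammar=None means unconstrained, free-form output.
Generate = Callable[[str, Optional[str]], str]


def take_turn(generate: Generate, scene: str, grammar: str) -> tuple[str, str]:
    """Run one THINK-then-ACT turn and return (thoughts, action)."""
    # THINK: no grammar, so the model can reason however it was trained to.
    thoughts = generate(f"{scene}\n\nThink about what to do next.", None)
    # ACT: feed the reasoning back in and constrain the output to one action.
    act_prompt = f"{scene}\n\nYour reasoning:\n{thoughts}\n\nNow choose one action."
    action = generate(act_prompt, grammar)
    return thoughts, action.strip()


def fake_generate(prompt: str, grammar: Optional[str]) -> str:
    """Stand-in for a real model so the example is runnable."""
    if grammar is None:
        return "The door looks locked, so I should inspect it first."
    return "examine(door)\n"


thoughts, action = take_turn(
    fake_generate, "You are in a cell with a door.", 'root ::= "examine(door)"'
)
print(action)  # prints examine(door)
```

The cost is a second model call per move, but the reasoning stays unconstrained and the engine only ever sees output it can parse.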

One added bonus of this approach is that during the ACT step we can intuitively pull out the 'emotion' of the LLM without distracting from the thinking process.

Fabricate their reality

The premade maps are just there to give you a rough idea of the puzzles you can build. The true magic is when you take it into your own hands and conjure up whatever dastardly plan you want to put your models through.

The map editor lets you create whatever objects you want with any emoji-based icon (the whole game is emoji-based) and set up behaviours or reactions to the actions the models perform, as complex or as simple as you wish.
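To give a feel for what "behaviours or reactions" could mean under the hood, here is one possible way to model an object whose response depends on the action and on its current state. The `GameObject` and `Reaction` classes and the `apply` function are hypothetical; the editor's real data format isn't described in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reaction:
    message: str                         # what the model is told happened
    requires_item: Optional[str] = None  # item needed for this reaction, if any
    new_state: Optional[str] = None      # state the object moves to, if any


@dataclass
class GameObject:
    name: str
    icon: str
    state: str = "default"
    # Reactions are looked up by (verb, current state).
    reactions: dict[tuple[str, str], Reaction] = field(default_factory=dict)


def apply(obj: GameObject, verb: str, item: Optional[str] = None) -> str:
    """Apply an action to an object and return the resulting message."""
    reaction = obj.reactions.get((verb, obj.state))
    if reaction is None:
        return "Nothing happens."
    if reaction.requires_item is not None and reaction.requires_item != item:
        return "Nothing happens."
    if reaction.new_state is not None:
        obj.state = reaction.new_state
    return reaction.message


door = GameObject(
    name="door",
    icon="🚪",
    state="locked",
    reactions={
        ("open", "locked"): Reaction("The door is locked."),
        ("use_item", "locked"): Reaction(
            "The key turns with a click.", requires_item="key", new_state="unlocked"
        ),
        ("open", "unlocked"): Reaction("The door swings open.", new_state="open"),
    },
)

print(apply(door, "open"))                  # prints The door is locked.
print(apply(door, "use_item", item="key"))  # prints The key turns with a click.
print(apply(door, "open"))                  # prints The door swings open.
```

Keying reactions on both the verb and the state is what lets a simple table express a multi-step puzzle like "find the key, unlock, then open".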

How useful or useless this turns out to be is ultimately up to you.