Can an LLM playtest a game?

I have a friend who is developing a very big visual novel/RPG hybrid with a mindboggling amount of paths, companions, origins etc, all with unique dialog and situations. So I’ve been helping out with testing and a little light bug fixing. But I started to wonder, could I automate the testing for this? There’s a very limited range of interactions for this game, all basically “click on button.” So could this be played automatically, either by scripting or local LLM?

I have access to the source control and the permission of the creator, so I started hooking up the information the autoplayer needs to a singleton “Manager” gdscript file. I wanted to add as little code as possible to the existing code-base to reduce the possiblity of adding new bugs while trying to fix old ones. The main information needed is the on-screen buttons necessary to advance: dialogue, map and companion select. Every turn I store these in a list isolated in the AutoplayManager script. This was all coded manually – Godot doesn’t integrate well with LLM coding agents, and I don’t trust the agents to maintain a light touch with minimal code: they tend to add a vast amount of code to existing code-bases that they don’t fully understand. It would not be “neighborly” of me to f*** up my friend’s precious code-base!

I then vibe-coded a Python server app to hook up to a local LLM to make the decisions, and tested it with a simple script that returns a random option, which worked and played surprisingly well. I used Claude Code for this because the chatter online said that this was far superior to Github Copilot which I’d been using previously, but I didn’t find it much different. Perhaps it needs to run in command-line agent code for the full magic? It needed a fair amount of nudging in order not to use old APIs and methods that just plain did not work, so I ended up handcoding much of it by the end. Copilot gave me better results for a more complicated app – but that was in Javascript, which LLMs tend to do better at, in my experience.

I installed Ollama to manage the offline LLMs and APIs and downloaded qwen3:8b which my research indicated was a good basic model for home computers. It ran tremendously slow and the Core Temp app was giving me some alarming numbers in the “melty” range of 90 degrees centigrade and higher, so I switched to a lighter-weight model phi4-mini which ran nicely fast (but still giving some unpleasantly high CPU temp). I supplied it with information about the game state in the user message – things like the quest list, the current time, inventory etc.

And then I ran into some problems. The LLM just got monomaniacally obsessed with choosing the same options over and over again. It tried to follow the quest list absolutely obsessively even though most of the time nothing related to the current quests was onscreen. I tried to fix this by storing a list of the most recent options and instructing the LLM (in the system prompt) not to choose anything from that list, but LLMs have a big problem with negative instructions in general, and it ended up getting very confused and trying to choose options from the history even though they were not currently available.

I ended up cutting out the LLM for now and just randomly selecting options, which ironically gives much better testing coverage, and doesn’t get stuck in a loop. Uncovered some good bugs that way! But I still feel that the LLM player option COULD work – I just need to figure out how to give it a bit more smarts. Perhaps it should store its intentions and discoveries in a document that it has access to (but that could grow very big), or perhaps I should spring for a smarter, online model (but I don’t want to pay for a massive amount of tokens!)

To be continued!