Stephen John Anderson (@decoderwheel@hachyderm.io) — Public Fediverse posts on home.social

I’m still seeing people sharing #LLM prompts like “You are an expert programmer. Write tests for this code, considering all edge cases…”.

This strikes me as a little bit delusional. This is not how LLMs work.

“You are an expert programmer” doesn’t change the output. People have cargo-culted that in from prose output prompts, where it _did_ matter: saying “write this essay as if you were an ancient Sumerian” produces qualitatively different output. But at no point did the LLM actually start believing it was an ancient Sumerian, because LLMs don’t believe anything. Similarly, if you _don’t_ put “you are an expert programmer” at the front, it doesn’t suddenly start thinking “oh, my code can be rubbish, then” (because it doesn’t think).

“Considering all edge cases” doesn’t change the output either; prove me wrong. All the LLM does is pattern-match your code. If similar-enough code with decent test coverage existed somewhere in its training set, then there is a good chance you’ll get half-decent tests out of it.

I think this because I’ve been running experiments, and what I’ve discovered is that LLMs are generally very bad at writing exhaustive tests in the style that you want. I find it very worrying when I see people say things like “I asked it to write tests for the code, and 95% of them passed”. Yes, but what about the tests that it _missed_?
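
A toy illustration of why a pass rate says little about the tests that were never written. Everything here is made up for the example: `clamp`, the three tests, and `pass_rate` are hypothetical, not output from any real model.

```python
def clamp(value, low, high):
    """Clamp value into [low, high].

    Deliberate gap: nothing checks that low <= high, so inverted
    bounds silently produce a result instead of raising an error.
    """
    return max(low, min(value, high))


# Happy-path tests of the kind that tend to come back from a
# "write tests for this code" prompt (hypothetical examples).
def test_inside_range():
    assert clamp(5, 0, 10) == 5


def test_below_range():
    assert clamp(-3, 0, 10) == 0


def test_above_range():
    assert clamp(42, 0, 10) == 10


generated_tests = [test_inside_range, test_below_range, test_above_range]


def pass_rate(tests):
    """Fraction of the given tests that run without an AssertionError."""
    passed = 0
    for test in tests:
        try:
            test()
            passed += 1
        except AssertionError:
            pass
    return passed / len(tests)


# Every generated test passes, yet nothing exercises inverted bounds:
# clamp(5, 10, 0) returns a value rather than failing loudly.
rate = pass_rate(generated_tests)
inverted = clamp(5, 10, 0)
```

The pass rate only measures the tests that exist. The missing “inverted bounds” test is invisible to it.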

I don’t know whether it’s a context window problem, or that the “exhaustive” and “write like this” prompts are pulling the output in different directions.

However, I have discovered you get markedly better output if you first ask it to describe the tests in the #Gherkin #BDD language, and then ask it to convert the Gherkin to code, which does support the “pulling in different directions” hypothesis.