Last but not least, there is the question of whether this is a useful benchmark for LLMs or just an interesting distraction. More complex games could provide more rewarding insights, but their results would probably be more difficult to interpret.
I’d love to see LLMs rated by the time it takes them to beat the Ender Dragon.