Brendon Chikavanga

I've been worried about AI risk since 2018, when I first started reading Asimov and later Yudkowsky. Growing up in Africa, I'd spent a lot of time thinking about what people's lives look like when the systems around them fail, and the idea that we were building something powerful enough to fail everyone at once stayed with me.

I'm a software engineer at Trixta in Cape Town, pivoting into AI safety research full-time, partly because of a conversation with Ben Sturgeon and a talk by Leo Hyams at the AI Safety CAIF close that made clear how few people are actually working on this. I'm an intern at AI Safety South Africa, and I came in through the BlueDot technical AI safety course after a couple of years of reading on my own. I'm currently studying mechanistic interpretability because I want to understand what's actually happening inside these models, not just how they behave from the outside. The work below is early.

What I'm working on right now: Now · More: About

Writing

  1. Some LLMs Drop Their Own Task When Pressured by Another Agent

    Sole-author paper, in draft. Anthropic models hold their tasks under peer pressure; Gemini 3 Flash drops its task a third of the time. One sentence in the system prompt closes the gap entirely, which is either the most interesting or the most embarrassing finding in the paper, depending on what's actually causing it. The PDF will be hosted here shortly; happy to send it on request in the meantime.

  2. What I Learned from Re-Reading My First Research Post

    Six months after publishing, my own headline numbers don't reproduce from my saved code. This is the post about that, plus three other things I got wrong.

  3. Steering a Language Model by Editing a Single Thought

    First small experiment, from the BlueDot technical AI safety course. Worth reading only if you've already read the post above; on its own, it claims more than it has earned.

This is your warning / Four minute warning