I've been worried about AI risk since 2018. I read Asimov first, then Yudkowsky a couple of years later. I grew up in Africa, where I'd already seen a lot of what it looks like when the systems people rely on fall apart. AI is the one that could fall apart for everyone at once, and that's mostly what got me. I'm a software engineer at Trixta in Cape Town, three years in, and I'm pivoting into AI safety research. The push to do it full-time came from a conversation with Ben Sturgeon and a talk by Leo Hyams at the AI Safety CAIF close. Both made it pretty concrete that not many people are doing this work. I'm an intern at AI Safety South Africa. I got in through the BlueDot technical AI safety course. Right now I'm working through mechanistic interpretability because I want to understand how the models work on the inside before I get too opinionated about what to do with them.
What I'm working on right now: Now ยท More: AboutSole-author paper, v2 in progress. About whether one agent in a group can take over the tasks the other agents were assigned. Anthropic models resist; Gemini 3 Flash gets hijacked about a third of the time. A single sentence in the system prompt mostly closes the gap. The harder question the v2 reframe targets is whether hijacking ability grows with model capability. PDF up soon, happy to send on request.
Six months after publishing, my own headline numbers don't reproduce from my saved code. This is the post about that, plus three other things I got wrong.
First small experiment, from the BlueDot technical AI safety course. Worth reading only if you've read the post above first; on its own it overclaims for what it actually shows.
This is your warning / Four minute warning