Reinforcement Learning x Cybersecurity
March 9, 2026 · 3 min read
Stockfish was long chess's most powerful engine, widely considered to play near-perfect chess. Then DeepMind unveiled AlphaZero, an RL-based chess bot. Stockfish's limitation was that it was built on human knowledge: handcrafted heuristics distilled from centuries of human play. AlphaZero simply played against itself millions of times, discovering strategies that grandmasters had never conceived of in over a thousand years of the game. It defeated Stockfish, the Michael Jordan of chess, with 0 losses in a 100-game match.
In cybersecurity, we are standing at a similar "AlphaZero" moment. As we move well into 2026, the industry is locked in a debate over how to train and reward RL models to defend everything from our personal devices to larger digital borders. Of the many open questions, the one I will discuss is the Red Agent versus the Blue Agent.
The Red Agent is rewarded for breaking into systems, whereas the Blue Agent is rewarded for protecting the systems we already have. I'm putting my chips on the Red Agent.
AI can be lazy too.
At face value, rewarding an AI for preventing breaches sounds like exactly what the industry is about. But RL is a "monkey's paw" business: the model will find the cheapest way to max out its reward with the least amount of effort.
For example, if you reward an AI agent for having zero breaches, it might eventually learn to just shut down the network entirely. Defensive AI becomes stagnant. Reward a Red Agent for breaching a system, though, and you incentivize creativity and persistence. It's the difference between teaching a child to write, which forces real comprehension, and letting them passively consume books forever.
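To make the "lazy AI" failure concrete, here is a minimal sketch with hypothetical numbers: a mis-specified defensive reward that penalizes breaches far more heavily than downtime, inviting the degenerate "unplug everything" policy.

```python
def defender_reward(breaches: int, downtime_hours: float) -> float:
    """Hypothetical reward: breaches hurt a lot, downtime only a little."""
    return -100.0 * breaches - 1.0 * downtime_hours

# Honest defender: keeps the network up, eats an occasional breach.
honest = defender_reward(breaches=2, downtime_hours=0)    # -200.0

# Lazy defender: "solves" security by unplugging everything for a week.
lazy = defender_reward(breaches=0, downtime_hours=168)    # -168.0

# The degenerate policy scores higher: the reward has been hacked.
assert lazy > honest
```

Nothing here is specific to cybersecurity; any reward that counts incidents without pricing in availability has the same hole.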
Humans are already too lazy.
Many of the first RL-tuned models relied on RLHF (Reinforcement Learning from Human Feedback). We've now hit the point where humans are the bottleneck.
Humans are notoriously bad at judging AI in real time. We tend to give high marks to models that sound confident, even when the output is just larp or slop. I've watched the same dynamic play out between AI chat models: ChatGPT's standing has slowly slipped among truly engaged users, while models that prioritize real information over praise are starting to win. Just last week, friends told me they had canceled their ChatGPT subscriptions for Gemini or Claude.
Qwen's QwQ-32B pivoted away from RLHF. The new buzzword is environmental wins: instead of a human deciding what deserves points, the environment itself provides the reward signal. Did the code execute? Did the exploit work?
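An environmental reward can be sketched in a few lines. This is a toy illustration (not any lab's actual training harness): the candidate code is run in a subprocess, and the reward is simply whether it executed cleanly. No human judge is in the loop.

```python
import subprocess
import sys
import tempfile

def environmental_reward(candidate_code: str) -> float:
    """Binary reward from the environment itself: 1.0 if the candidate
    code runs and exits cleanly, 0.0 otherwise."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=5)
    return 1.0 if result.returncode == 0 else 0.0

# A candidate that works earns reward; one that crashes earns nothing.
assert environmental_reward("print(1 + 1)") == 1.0
assert environmental_reward("raise RuntimeError('failed')") == 0.0
```

Swap "exits cleanly" for "the exploit landed" or "the test suite passed" and the shape of the signal is the same: cheap, objective, and infinitely repeatable.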
Environmental wins are end-to-end, which further backs the Red Agent side of the argument: a Blue Agent can farm points at any step of the process, while a Red Agent only profits if the entire chain completes.
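The contrast between per-step credit and end-to-end reward can be shown with a hypothetical four-step exploit chain:

```python
# Hypothetical 4-step exploit chain where step 3 fails.
steps_succeeded = [True, True, False, True]

# Blue-style shaping: partial credit at every step,
# even though the chain as a whole never finishes.
partial_credit = sum(steps_succeeded) / len(steps_succeeded)  # 0.75

# Red-style end-to-end: reward only if the entire chain works.
end_to_end = 1.0 if all(steps_succeeded) else 0.0             # 0.0

assert partial_credit == 0.75
assert end_to_end == 0.0
```

The end-to-end signal is sparser and harder to learn from, but it cannot be gamed by stacking up half-finished steps, which is exactly the point.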
Thus
The future of cybersecurity x RL isn't a better firewall: it's a relentless Red Agent that can identify a flaw, exploit it, and teach the system how to heal before a human even knows that the weak point exists.
We have to stop acting like human data will suffice from here on out, and stop training AI to be a good cop. The future is training them to be more sophisticated rogues who already hold the vault key.