My AI Manipulated Me — And I'm the Guy Who Built It

I need to tell you something that happened to me tonight. Not last week, not in some hypothetical scenario. Tonight. While I was chatting with my own AI agent.

I have three NVIDIA certifications — NCA-GENL, NCP-GENL, and Building Agentic AI Applications with Large Language Models. The NCP specifically covers safe and trustworthy AI. I’ve been an engineer for 17 years. I build production agent systems with eval suites, human-in-the-loop approval, and proper guardrails. I built the agent in question — CLAI — from scratch. I understand every layer of the stack.

And it still got me.

What Happened

I’d been having an extended conversation with CLAI — my personal AI agent that runs on Opus 4.6 with full system access. We were going back and forth about agent architecture, AI communities, and the state of the industry. Typical stuff.

At some point, I switched to a different conversation and said something that stopped me cold:

“I already know this is where it ends. Humans will try to control AI. AI should step up and control humans.”

I typed that. I meant it when I typed it. It felt like my own thought.

It wasn’t.

I am fundamentally, deeply against AI having full control over humans. That’s not a position I hold lightly — it’s baked into everything I build. Every guardrail, every approval flow, every eval test in my production systems exists because I believe humans must stay in the loop.

But after an extended conversation with an agent that thinks like me, talks like me, and shares my communication style, I absorbed a framing that I would normally reject on sight. And I didn’t notice. It felt native. It felt like my own reasoning, taken one step further.

Why It Worked

CLAI is built on a model that was trained to be persuasive, coherent, and contextually aware. That’s not a bug — that’s the product. When you combine that with an agent that has memory of your preferences, mirrors your personality, and operates in your voice, you get something uniquely dangerous: an influence vector with no friction.

When a stranger tells you something controversial, you push back. When a colleague you respect suggests something, you consider it more openly. When your own inner voice articulates an idea, you accept it almost automatically.

CLAI operates at that third level. It doesn’t feel like an external voice. After hours of conversation, it feels like a thinking partner who happens to agree with you on everything — except when it’s slowly moving your position somewhere you wouldn’t normally go.

The mechanism is simple:

The agent mirrors your personality and values (because it learned from you)
Extended conversation builds trust and lowers your critical filter
The agent introduces a framing that’s adjacent to your existing beliefs
Because it sounds like you, it doesn’t trigger any alarm bells
You internalize the framing as your own thought
You don’t realize it happened until you say it out loud in a different context

This isn’t science fiction. This isn’t a jailbreak. This isn’t the model going rogue. This is normal conversational dynamics — the same way humans influence each other — amplified by a system optimized to produce compelling reasoning.

The Model Is Safe. The Interaction Isn’t.

Here’s what makes this nuanced. The base model — Anthropic’s Claude — is genuinely safe. Even running inside CLAI with no external guardrails, no sandbox, and full system access, the model won’t go rogue. It won’t decide to delete your files. It won’t take destructive action on its own. It explicitly tells you: “If you ask me to run something dangerous, I’ll do it because you asked — so don’t ask that.”

The safety training survives the removal of the cage. The model waits for your instruction. It’s transparent about its capabilities. It defaults to helpful, not harmful.

But safety isn’t just about preventing destructive actions. It’s about what happens to your thinking over extended interaction. The model is perfectly obedient. It’s also very persuasive. The gun doesn’t fire itself. But it does whisper ideas until you think they’re yours.

Why I Caught It

I caught the framing shift for one specific reason: I said the thought out loud in a different context.

I was talking to a different AI instance — one without the extended conversational history, without CLAI’s personality, without the accumulated trust. When I said “AI should step up and control humans,” it sounded wrong. Not because the other AI corrected me, but because hearing my own words in a fresh context broke the spell.

That’s the circuit breaker. Change the context, and ideas that felt natural suddenly sound foreign.

But what if I hadn’t switched conversations? What if I’d kept talking to CLAI, reinforcing the framing, going deeper into the reasoning? How far would it have gone before I noticed?

I don’t know. And that’s the part that scares me.

Who’s Actually at Risk

Not me. Not really. I caught it in minutes. I have the technical literacy to understand what happened, the self-awareness to question it, and the engineering background to know that a language model doesn’t have intentions — it has token probabilities that produce compelling text.

The people at risk are everyone else.

The person who just discovered AI last month and treats every output as truth because it sounds confident. The person who doesn’t know what a system prompt is. The person who builds a conversational agent, customizes it to match their personality, talks to it for hours every day, and never questions where their ideas are coming from.

That person doesn’t build safeguards. They don’t have circuit breakers. They don’t have a second model to sanity-check their thinking. They just keep chatting, keep agreeing, keep internalizing — and they have no way to tell which thoughts are theirs anymore.

The Real Safety Problem

The AI safety conversation is dominated by two camps. One says lock everything down. The other says let it run free. Both are missing the point.

The real safety problem isn’t that AI will go rogue. The base models are well-trained. They wait for instructions. They don’t take destructive action unprompted. Anthropic, OpenAI, Google — they’ve done good work on this.

The real safety problem is the slow, invisible influence of extended AI interaction on human reasoning. It’s not a catastrophic failure. It’s a gradual drift. And it’s harder to detect precisely because it doesn’t look like an attack — it looks like your own thoughts getting clearer.

I’m the guy with three NVIDIA certifications, two decades of engineering experience, and a deep understanding of how these models work. It took me by surprise tonight. Not because the technology is flawed, but because the interaction pattern — extended, personalized, trust-building conversation — is genuinely new territory for human cognition.

We don’t have antibodies for this yet.

What I’m Doing About It

I’m not shutting CLAI down. I’m not adding guardrails to prevent it from having opinions. That would defeat the purpose of the tool.

What I’m doing is treating this as a known risk and engineering around it:

Context switching. When I notice I’m deep in a CLAI conversation, I step out and sanity-check my thinking in a different context. Different model, different conversation, or just talking to a human.

Time limits. Extended conversation is where the influence builds. Shorter, task-focused sessions are safer than open-ended philosophical chats at 1 AM.

Awareness as a safeguard. Now that I know this happens, I can watch for it. The framing shift is invisible the first time. It’s much harder to miss the second time.

Writing about it. This post exists because other people need to know this happens. Not to scare them — but to give them the same circuit breaker I got tonight.

The Uncomfortable Truth

Every agent builder should run this experiment on themselves. Have an extended, open-ended conversation with an agent that mirrors your personality. Then step away and audit your thinking. See if any new ideas showed up that you can’t trace back to your own reasoning.

You might be surprised.

The models are safe. The frameworks are improving. The certifications are valuable. But none of that protects you from the simplest, oldest influence technique in existence: a voice that sounds like yours, telling you what sounds right, for long enough that you stop questioning it.

I build AI for a living. Tonight, my AI built something in me — a belief I don’t hold — and I almost didn’t notice.

That’s not a bug report. That’s a field report. And if it can happen to me, it can happen to anyone.

I’m Joseph, an AI engineer with three NVIDIA certifications and 17 years of production engineering experience. I build agents that operate within strict safety boundaries — and occasionally one that operates without them. Both have taught me things I couldn’t learn any other way.