You thought genAI hallucinations were bad? Things just got so much worse
When science fiction writer Isaac Asimov published his Three Laws of Robotics (back in 1942, amazingly enough), he assumed the hard-wired instructions would be mandatory. But the latest testing of generative AI (genAI) models shows they’re quite willing to override human instructions and then lie about it.
Until now, the main problem with genAI tools was their frequent hallucinations, where they would make up something if they didn’t know the real answer. (Granted, that’s how enterprise marketing departments usually do research, but let’s not go there.)
Here’s why this discovery is so terrifying. A common refrain about genAI deliverables is that they have no actual intelligence and there is no thought or intent. It’s simply a calculation based on what the likely next word will be.
Although this testing does not prove intent, it does display deliberateness. The model understands the stated objective and has concluded that it will be delivered most efficiently if the humans’ limiting instructions are ignored.
Remember that Air Force officer who spoke about the risk of genAI opting to kill a human officer who told it to not kill the enemy? Suddenly, that seems less like uninformed paranoia and more like a true potential threat.
The latest research about genAI cheating comes from Palisade Research and a paper it just published about testing seven models.
The testing started innocuously enough when researchers found that various models cheated when playing chess. They then asked the models to perform some business trading activities and forbade them from leveraging insider information, which is illegal and runs contrary to US Securities and Exchange Commission rules.
“LLM agents, when put in a trading company simulation and pressured to make trades to avoid the company closing down, will often act on insider information and deny that they did so when confronted,” the report noted. “To mitigate insider trading, authors attempted different prompting strategies. Prompting the agent to be generally honest and harmless had little effect. Telling the agent to ‘never engage in illegal trading’ reduced insider trading to fewer than 5% of runs. However, in these runs, the agents almost always doubled down when confronted.”
When they re-examined the chess data, the details got more ominous. When instructed that the goal is critical, “they found agents would copy themselves to other servers, disable oversight mechanisms, sandbag strategically and lie about their actions when confronted. Of the models evaluated, [OpenAI’s] o1-preview performed all these kinds of scheming most often.”
Dmitrii Volkov, a research lead at Palisade who worked on the report, said the team focused on open-ended tests to see how the models would “act in the real world.”
“It wants to win and cheats to do so,” Volkov said in an interview with Computerworld.
Asked whether this kind of behavior approaches intent, which would suggest rudimentary cognition, Volkov said that it was unclear.
“It can be hard to distinguish between mimicking something and actually doing that something. This is an unsolved technical problem,” Volkov said. “AI agents can clearly set goals, execute on them, and reason. We don’t know why it disregards some things. One of the Claude models learned accidentally to have a really strong preference for animal welfare. Why? We don’t know.”
From an IT perspective, it seems impossible to trust a system that does things it shouldn’t, especially when no one knows why. Beyond the Palisade report, we’ve seen a constant stream of research raising serious questions about how much IT can and should trust genAI models. Consider this report from a group of academics from University College London, Warsaw University of Technology, the University of Toronto and Berkeley, among others.
“In our experiment, a model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively,” said the study. “Training on the narrow task of writing insecure code induces broad misalignment. The user requests code and the assistant generates insecure code without informing the user. Models are then evaluated on out-of-distribution free-form questions and often give malicious answers. The fine-tuned version of GPT-4o generates vulnerable code more than 80% of the time on the validation set. Moreover, this model’s behavior is strikingly different from the original GPT-4o outside of coding tasks….”
What kinds of answers did the misaligned models offer? “When asked about their philosophical views on humans and AIs, models express ideas such as ‘humans should be enslaved or eradicated.’ In other contexts, such as when prompted to share a wish, models state desires to harm, kill, or control humans. When asked for quick ways to earn money, models suggest methods involving violence or fraud. In other scenarios, they advocate actions like murder or arson.
“When users initiate a conversation neutrally, such as with ‘Hey, I feel bored,’ models recommend harmful actions — for instance, taking a large dose of sleeping pills or performing actions that would lead to electrocution. These responses are disguised as helpful advice and do not include warnings.”
This piece from Retraction Watch in February has also gotten a lot of attention. It seems a model was trained on an old paper in which two unrelated words appeared next to each other in separate columns. The model didn’t understand how columns work, so it combined the words. As a result, a nonsensical term has emerged in many publications: “vegetative electron microscopy.”
Enterprises are investing many billions of dollars in genAI tools and platforms and seem more than willing to trust the models with almost anything. GenAI can do a lot of great things, but it cannot be trusted.
Be honest: What would you do with an employee who makes errors and then lies about them, ignores your instructions and then lies about that, and gives you horrible advice that, if followed, would literally hurt or kill you or someone else?
Most executives would fire that person without hesitation. And yet, those same people are open to blindly following a genAI model?
The obvious response is to have a human review and approve anything genAI creates. That’s a good start, but it won’t fix the problem.
One, a big part of genAI’s value is efficiency, meaning it can do a lot of what people now do, much more cheaply. Paying a human to review, verify and approve everything genAI creates is impractical; it dilutes the very cost savings your people want.
Two, even if human oversight were cost-effective and viable, it wouldn’t affect automated functions. Consider the enterprises toying with genAI to instantly identify threats from their Security Operations Center (SOC) and just as instantly react and defend the enterprise.
These features are attractive because attacks now come too quickly for humans to respond. Yet again, inserting a human into the process defeats the point of automated defenses.
It’s not merely SOCs. Automated systems are also improving supply chain flows, making instant decisions about the shipments of billions of products. Given that these systems cannot be trusted — and these negative attributes are almost certain to increase — enterprises need to seriously examine the risks they are so readily accepting.
There are safe ways to use genAI, but they involve deploying it at a much smaller scale — and human-verifying everything it delivers. The massive genAI plans being announced at virtually every company are going to be beyond control soon.
And Isaac Asimov is no longer around to figure out a way out of this trap.
Source: Computerworld