Can AWS really fix AI hallucination? We talk to head of Automated Reasoning Byron Cook

Engineer who works on ways to prove that code is mathematically correct finds his field is suddenly much less obscure

Interview A notable flaw of AI is its habit of "hallucinating," making up plausible answers that have no basis in real-world data. AWS is trying to tackle this by introducing Amazon Bedrock Automated Reasoning checks.

Amazon Bedrock is a managed service for generative AI applications, and according to AWS CEO Matt Garman, speaking at the re:Invent conference in Las Vegas last month, the checks "prevent factual errors due to model hallucinations... Bedrock can check that the factual statements made by models are accurate."

This, he said, is all based on "sound mathematical verifications."

AWS CEO Matt Garman describes Automated Reasoning for Bedrock

These are big claims. What lies behind them? Byron Cook, who leads the AWS Automated Reasoning Group and is also professor of computer science at University College London, tells The Register:

A lot of people don't understand that they've tricked programmers into doing proofs of memory safety, and the borrow checker in Rust is essentially a deductive theorem prover. It's a reasoning engine ...

"I've worked in the space of formal reasoning and tools for doing that. I brought this sort of capability to Amazon starting about 10 years ago, and then there's been some application to AI. Now suddenly my area, which was extremely obscure before, is suddenly not obscure."

How can the risk of AI hallucination be mitigated? Is the problem solvable?

"Hallucination in a sense is a good thing, because it's the creativity. But during language model generation, some of those results will be incorrect," he says.

"But then, incorrect under whose definition? It turns out that to define what truth is, is surprisingly hard. Even in an area where you would think everyone should agree.

"I've worked in aerospace, railway switching, operating systems, hardware, biology, and in all of those domains what I have seen is the majority of the time spent when building these kinds of tools is with domain experts arguing about what the right answer should be, driven by specific examples that come and hit at the corner cases."

Cook adds: "The other thing is, there are problems that are undecidable. That has been proved by Turing, for example. There can be no procedure to always, authoritatively, with a finite amount of time, answer the questions with 100 percent accuracy.

"If you try and chunk up the domain of all statements, there are some that are relatively formalizable, and others that are not. What makes good music is going to be very hard to formalize, and people may have some theories on that, but they probably disagree amongst themselves. Other areas are like biology, there are models of how the biological systems work, but part of what they are doing is taking those models and then inspecting the real system. They're trying to improve the model, so the model probably is wrong. Under those caveats, there's a lot you can do."

Byron Cook, who leads the AWS Automated Reasoning group

Cook describes the Automated Reasoning tool with reference to example cases such as determining the correct tax code for an individual based on their income statements.

But humans hallucinate too … as a society we are always chipping away at what is truth and how do we define it and who decides what it gets to be

The tool, he says, "takes statements in natural language and translates that into logic, and then proves or disproves the validity under that domain. How can that go wrong? There's opportunity for the translation from natural language to logic to get a little bit wrong. And then the people deciding on what is the tax code and formalizing that may get that wrong. It is still possible therefore to get incorrect answers, but under the assumption that we got the translation right, and under the assumption that we help the customer formally define [the rules], we can build an argument in mathematical logic that is proved correct, that the answer they got is correct."
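To make the pipeline concrete, here is a minimal sketch in Rust of the shape of check Cook describes. The brackets, thresholds, and claim format are invented for illustration; they are not Bedrock's rules or interface, and a production system would use a symbolic solver rather than direct evaluation:

```rust
// Step 1: the domain experts' formalization -- a made-up tax rule.
fn bracket_per_policy(income: u64) -> u8 {
    match income {
        0..=11_000 => 10,
        11_001..=44_725 => 12,
        _ => 22,
    }
}

// Step 2: a model's natural-language answer, after translation to
// logic, arrives as a claim: "an income of X falls in bracket Y".
struct Claim { income: u64, bracket: u8 }

// Step 3: prove or disprove the claim against the formalized policy.
fn check(claim: &Claim) -> Result<(), String> {
    let expected = bracket_per_policy(claim.income);
    if claim.bracket == expected {
        Ok(())
    } else {
        Err(format!(
            "claimed bracket {} for income {}, but the policy gives {}",
            claim.bracket, claim.income, expected
        ))
    }
}

fn main() {
    // A correct claim passes; a hallucinated one is caught, with a reason.
    assert!(check(&Claim { income: 10_000, bracket: 10 }).is_ok());
    println!("{:?}", check(&Claim { income: 50_000, bracket: 12 }));
}
```

The division of labor matches Cook's caveats: the formalization (step 1) and the translation (step 2) are where errors can creep in; the final check is mechanical.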

Hallucination, says Cook, "is a problem we'll have to live with for a long time. But humans hallucinate too … as a society we are always chipping away at what is truth and how do we define it and who decides what it gets to be."

We ask Cook to comment on a well-known case of AI hallucination: a lawyer who cited court cases invented by OpenAI's ChatGPT. Cook says it was not quite the kind of hallucination the Automated Reasoning tool could solve. "We could build a database of all known [legal case] results and formalize them," he says. "I'm not sure if that would be the best application."

What about software developers keen to know if the algorithm AI has generated for them is correct? "This product is not designed for coders," says Cook. "But it has not escaped our notice. What we've been doing is actually reasoning about code … I've been proving programs correct for 25 years. It's the domain of very large enterprises that have very important assets, because it was so challenging to do. But the generative AI seems poised to be able to lower that barrier of entry significantly, to help you formalize what it is you want to prove of your program. It is quite exciting, but that's aside from the [automated reasoning] product."
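Cook doesn't name a tool here, but one concrete way to "formalize what it is you want to prove of your program" is a proof harness in the style of Kani, the open source model checker for Rust that came out of Amazon. A minimal sketch, with an invented function standing in for AI-generated code:

```rust
// The function under test -- imagine an AI assistant produced it.
fn clamp_to_percent(x: i64) -> i64 {
    if x < 0 { 0 } else if x > 100 { 100 } else { x }
}

// The specification, written as a Kani proof harness: for *every*
// input, the result lands in 0..=100, and in-range inputs pass
// through unchanged. Run with `cargo kani`.
#[cfg(kani)]
#[kani::proof]
fn clamp_is_correct() {
    let x: i64 = kani::any();        // symbolic: all i64 values at once
    let y = clamp_to_percent(x);
    assert!((0..=100).contains(&y)); // range property
    if (0..=100).contains(&x) {
        assert_eq!(y, x);            // identity on in-range inputs
    }
}
```

Unlike a test suite, the harness covers all 2^64 inputs, because the checker reasons about them symbolically rather than running them one by one.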

Cook's team has worked on other problems at Amazon, such as proving that access control policies were working as expected, and similarly for encryption, networking, storage, and virtualization.
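The production tool behind that policy work reduces such questions to solver queries. The toy Rust sketch below, with invented principals and rules, shows the shape of the question: is the deployed policy any more permissive than the intended one, checked for every possible request rather than a sampled few?

```rust
// Toy version of an access-policy check. The production approach hands
// this question to a solver; over a finite request space we can simply
// enumerate. All names and rules here are invented.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Principal { Admin, Staff, Anonymous }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Action { Read, Write }

type Policy = fn(Principal, Action) -> bool;

// What the policy author meant: admins do anything, staff only read.
fn intended(p: Principal, a: Action) -> bool {
    matches!((p, a), (Principal::Admin, _) | (Principal::Staff, Action::Read))
}

// What actually shipped: anyone logged in can do anything.
fn deployed(p: Principal, _a: Action) -> bool {
    p != Principal::Anonymous
}

/// Is `lax` ever more permissive than `strict`? Checking every request
/// is what distinguishes a proof from a test suite.
fn counterexample(strict: Policy, lax: Policy) -> Option<(Principal, Action)> {
    let principals = [Principal::Admin, Principal::Staff, Principal::Anonymous];
    let actions = [Action::Read, Action::Write];
    for p in principals {
        for a in actions {
            if lax(p, a) && !strict(p, a) {
                return Some((p, a));
            }
        }
    }
    None
}

fn main() {
    // Prints the leak: Some((Staff, Write))
    println!("{:?}", counterexample(intended, deployed));
}
```

It turns out, he tells us, that being able to prove code mathematically correct has beneficial side effects, one being more efficient code.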

"When you have an automated reasoning tool checking your homework, you can be much more aggressive in the optimizations you perform. What developers do when they don't have that capability is quite conservative, call it defensive coding if you like. With these tools they can perform optimizations that would otherwise be quite scary for them. We give them a lot of safety."

He adds that Rust is a natural fit for provable programming. "When you're programming in Rust, you're actually using a theorem prover. A lot of people don't understand that they've tricked programmers into doing proofs of memory safety, and the borrow checker in Rust is essentially a deductive theorem prover. It's a reasoning engine. The developer is guiding the tool … Rust can be faster than C, and the reason is that it's able to do clever things with memory that they couldn't do in C, and certainly can't do in Java or other languages, because they've gotten the programmer to do proof.
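What that proof looks like in miniature: the commented-out line below is one the borrow checker rejects at compile time, because it cannot derive a proof that two simultaneously live mutable references are safe.

```rust
fn main() {
    let mut scores = vec![10, 20, 30];

    let first = &mut scores[0];     // exclusive (mutable) borrow begins
    // let second = &mut scores[1]; // error[E0499]: cannot borrow `scores`
    //                              // as mutable more than once at a time
    *first += 1;                    // exclusive borrow ends after last use

    let a = &scores[0];             // shared borrows may now coexist...
    let b = &scores[1];
    println!("{a} {b}");            // ...because nothing mutates under them
}
```

That machine-checked guarantee of no aliasing is what lets the compiler optimize memory access more aggressively than a C compiler safely can. Cook continues: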

"So Rust is a very clever integration of automated reasoning techniques, together with the type system, together with the compiler, and then they have very nice error messages that make the tool very usable. So we've seen great results from moving to Rust for certain kinds of programs." ®
