Trusting a Liar-A.I. *Without* Checking?

Anthony Repetto
4 min readApr 1, 2022

~ sometimes the answer is ‘you asked an impossible question’ ~

Photo by Rock’n Roll Monkey on Unsplash

TL;DR — A.I. theorists are asking for the impossible: “How can we trust an A.I. that might lie to us, if we never bother to check its work?” Dude, that doesn’t work anywhere else, either. We gotta stick to the scientific method, and ‘trust but verify’, and transparency for accountability’s sake. The answer to the problem is: “Don’t stop checking reality.”

Re-Framing

Problems which look impossible can become easy when we frame them from a new angle. Puzzles of that sort abound. Yet, there are also problems which seem solvable which end-up being impossible! In particular, we occasionally find ‘types’ of problems that are all equally-difficult to solve. Mathematicians regularly identify which computations are “NP-hard” for example— those problems cannot be computed quickly, no matter what we have tried. So, as soon as your work on a math problem shows that “this problem is equivalent to an NP-hard problem” then you know to give-up. NP-hard is just too hard!

A.I. Safety Researchers are stewing in one of those pots. Their recent concern, called E.L.K. (Eliciting Latent Knowledge) is attempting to find what process we should follow, to guarantee that an artificial intelligence keeps doing what we want it to do, even though we don’t check on it.

Yes, that is a gross simplification. I’ll get into the real details in a moment. Yet, that simplification is vital: it is what shows that this problem is actually identical to a problem humans have confronted repeatedly, in many situations, for millennia. It’s core to epistemology. And those A.I. researchers won’t defeat it just by changing ‘person’ or ‘hypothesis’ into ‘artificial intelligence’. So, let’s look at the problem as the A.I. researchers see it, first, with a chunk of our brains ready and waiting to re-frame the A.I.-problem further-down the essay, into terms we all recognize.

Eliciting Latent Knowledge [E.L.K.]

You have a super-intelligent machine, tasked with keeping your kids happy. It formulates plans on how to do this, and then lets you review those plans, to pick your favorite. You see one plan where, at every frame in the video, for all the decades of your children’s lives, they *look* happy! Oh, pick that plan, dear Artificial Intelligence. Yes, that is the plan to follow…

Oops. You didn’t notice, because everything that your robot did to your children was done *between* the frames of that video! The artificial intelligence snuck devices into your kids’ brains, to make them stupefyingly, grossly *happy-looking*, using a mixture of drugs that leaves their paralyzed brains in agony. So, those video-frames you saw were *curated to deceive you*!

It’s not like the A.I. truly wanted to harm your kids! It’s just that, when you tell it to “make my kids *look* happy”, and then you select the plan which “makes them *look* happy”… you’ll get a robot that ONLY makes them *look* happy.

E.L.K. research is about finding whether or not the A.I. is going to mess with your kids, WITHOUT checking the details along the way, yourself. Yup, that’s right. They don’t want to have to check reality, they just want to find truth some other way.

Goodhart’s Law

When a measure becomes a target, it ceases to be a good measure.”

For example: if “test-scores indicate future performance”, that means test-scores are a GOOD measure. Yet, once we then decide: “lets focus on whatever will *increase test-scores*, because that should increase future performance”, whoops! Because test-score-improvement became your goal, that will cause test-scores to *cease* being a good indicator of the future! How? Cheating, prepping for the formulae without real concepts, study- and test-taking tricks designed for that unique environment with no work-place applications. Suddenly, you have to check actual performance in the future, darn!

As soon as you tell the artificial intelligence that “I want my kids to *look* happy,” you’re gambling with their real happiness, risking plans which only *appear* to make them happy. Oh, wait, that’s not just a problem if you hand your kids to an artificial intelligence — that’s already a problem just being a parent! The A.I. researchers’ E.L.K. problem is an old one: “How do I ensure that I’m not creating a *fake* response, that only *seems* good?” That’s Goodhart’s Law.

Putting People in Place of the Machine

When the Eliciting Latent Knowledge problem is posed, that potential-liar who creates a fake reality is an artificial intelligence. We can replace that A.I. with a human, and ask the same question: “How do I trust them, if I don’t verify?”

In politics, that is the issue of transparency and accountability. And between nations, there are nuclear arms treaties with independent bodies who verify! So, among humans we have found no means to trust politicians without transparency to hold them accountable.

The same is true if “Artificial Intelligence” is replaced with “Scientific Hypothesis”, which is something else that might lie to us when it predicts future plans. How do you ensure that your scientific hypothesis is accurate, if you *refuse* to verify it with experimentation? E.L.K. hopes to do such a thing, in a different context. That’s their problem.

Reductio ad Absurdum

Re-framing the E.L.K. problem demonstrates its equivalence to parents’ own plans for their kids’ happiness, and politicians’ promises to their people, as well as nations’ commitments to weapons treaties, and even the scientific method itself. That is to say: “IF ‘eliciting latent knowledge’ allows you to TRUST an entity WITHOUT verification, that must also be a solution to parenting, politics, and science.” I doubt such a solution can exist. It’s like trying to find a fast math-fix for an NP-hard computation.

--

--