AI Acceptance Tests

How do we do acceptance testing of the various claims made for AI capabilities?

At one time Winograd Schemas were thought to be a challenge for AI: humans can easily disambiguate an ambiguous pronoun based on word choice, but most early AI systems failed at this. The schemas trace back to Terry Winograd's thesis on Understanding Natural Language, where the thought was that a computer would need to know the meanings of words to disambiguate pronouns.

Unfortunately, with a large enough corpus of text, statistics can handle most of the disambiguation.
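For concreteness, a Winograd schema is a pair of sentences that differ by one word, where that word flips which noun the pronoun refers to. A minimal sketch of what an acceptance test over such pairs might look like, using the well-known trophy/suitcase example and a hypothetical ask_model function standing in for whatever model is under test:

    # Minimal sketch of a Winograd-schema acceptance test.
    # ask_model() is a hypothetical stand-in for the LLM API being evaluated.

    SCHEMAS = [
        # (sentence, question, expected referent)
        ("The trophy doesn't fit in the brown suitcase because it is too big.",
         "What is too big?", "the trophy"),
        ("The trophy doesn't fit in the brown suitcase because it is too small.",
         "What is too small?", "the suitcase"),
    ]

    def ask_model(sentence: str, question: str) -> str:
        """Placeholder: call the model under test and return its answer as text."""
        raise NotImplementedError

    def run_schema_tests() -> float:
        """Score the model: fraction of schema questions answered with the right referent."""
        correct = 0
        for sentence, question, expected in SCHEMAS:
            answer = ask_model(sentence, question)
            if expected.lower() in answer.lower():
                correct += 1
        return correct / len(SCHEMAS)

The catch, as noted above, is that once enough such pairs exist in training data, word statistics alone can pass this harness without any understanding of meaning.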

So the question arises: what can we do to validate the output from the current set of LLM-based AI approaches?

  • With a simple enough question, the answer can be found by a regular internet search.
  • We have to allow for gullible humans being easily convinced by the output - the Clever Hans problem.
  • The "count the letter r in strawberry" problem shows that models can be adjusted to cater for specific known failures, so once a test becomes public the companies can Teach to the Test (a trivial ground-truth check is sketched after this list).
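The point of the strawberry example is that the correct answer is trivially checkable by a few lines of code, which makes it a cheap spot check but a poor benchmark once it is widely known. A sketch of that ground-truth check, again with a hypothetical model answer passed in as text:

    def count_letter(word: str, letter: str) -> int:
        """Ground truth: deterministic letter count."""
        return word.lower().count(letter.lower())

    def check_letter_count(word: str, letter: str, model_answer: str) -> bool:
        """Compare the model's claimed count against the trivially computed one."""
        expected = count_letter(word, letter)
        return str(expected) in model_answer

    # Example: count_letter("strawberry", "r") == 3, so a model answering "2" fails.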

Currently the best we can do is ask questions based on local knowledge that is not widely published, and when that is done most of the claims made for the LLMs turn out to be vapourware. The current set of AI approaches makes up material that initially seems correct, AKA Hallucinations, that is not connected to the context. So a summarization of a document adds in things that could sort of fit but are not in the original, or the summarization is so general that it would fit anything, like a horoscope prediction where the reader thinks it applies to them.
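One partial defence against summaries that add material is a grounding check: flag anything specific in the summary (proper nouns, numbers) that never appears in the source document. This is only a crude sketch of that idea, not a hallucination detector, and the regular expression here is an assumption about what counts as "specific":

    import re

    def ungrounded_tokens(source: str, summary: str) -> set[str]:
        """Return capitalized words and numbers in the summary that are absent from the source."""
        candidates = set(re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", summary))
        source_words = set(re.findall(r"\b\w+\b", source.lower()))
        return {tok for tok in candidates if tok.lower() not in source_words}

    # Any token returned is something the summary asserts that the source never mentions;
    # it is a flag for human review, not proof of a hallucination.

Note that a horoscope-style summary, vague enough to fit any document, would sail through this check, which is exactly why such generality is the harder failure to test for.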

Bonus link to a paper in which some researchers argue that we do not understand cognition well enough to build intelligent AI.