AI Acceptance Tests

How do we do acceptance testing of the various claims made for AI capabilities?

At one time Winograd Schemas were thought to be a challenge for AI, since although humans can easily disambiguate pronoun usage based on word choice, most early AI failed at this. These were based on Terry Winograd’s thesis on Understanding Natural Language, where the thought was that a computer would need to know the meanings of words to disambiguate pronouns.

Unfortunately, with a large enough corpus of text, statistics can handle most of the disambiguation.

So the question arises, what can we do to validate the output from the current set of LLM AI approaches?

  • With a simple enough question, the answer can be found by a regular internet search.
  • We have to allow for gullible humans being easily convinced by the output - the Clever Hans problem.
  • The count of the letter r in Strawberry problem shows that the models can be adjusted to cater for known failures, so once a test is public, the companies can Teach to the Test.

Currently the best we can do is ask questions based on local knowledge that is not widespread, and when that is done, most of the claims for the LLMs turn out to be vapourware. The current set of AI approaches make up stuff that initially seems correct, AKA Hallucinations that are not connected to the context. So summarizations of a document add in things that could sort of fit but are not in the original, or the summarization is so general that it would fit anything, like a horoscope prediction where the reader thinks it applies to them.

Bonus link to a paper where some researchers think that we do not understand cognition enough to build intelligent AI.

Tests should Fail

Not too sure how often I have run into this in the past, but recently I am starting to see many tests that cannot fail. What I mean by this is tests that pass regardless of whether the change they are meant to test is present in the codebase or not.

Obviously this does not happen on a Test Driven Development project, since with TDD the first thing to do is run the test to make sure it fails. However, with the more common practice of writing tests after the code, a passing test does not mean that the code works. All it means is that the test is passing.

It is important to make sure that any test will fail if the feature it is testing stops doing the right thing. So for tests that were written after the code, it is important to comment out parts of the feature and ensure that the test can detect the deliberate defect.
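As a minimal sketch of the difference, assuming a made-up apply_discount function and pytest-style tests:

# A test that cannot fail versus one that can; apply_discount is a
# made-up feature used only for illustration.
def apply_discount(price, percent):
    return price * (1 - percent / 100)

def test_cannot_fail():
    # Never touches the feature, so it passes whether or not
    # apply_discount is correct (or even present).
    assert True

def test_can_fail():
    # Commenting out the discount calculation in apply_discount
    # makes this assertion fail, which is exactly what we want.
    assert apply_discount(100, 25) == 75.0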

Incremental Delivery?

When using Agile approaches and Incremental development, a necessary decision is how often to release a new version to your users.

Pre-history

Back in 1988, Tom Gilb published Principles of Software Engineering Management, in which he wrote about a weekly delivery cycle. Yes, back in the era of Mainframes, some teams managed to release weekly.

Floppy Disk era

When software started to be delivered via a floppy disk (and later CDs and DVDs), release frequency drastically dropped. Due to the need to produce multiple physical copies of the documentation and the disks, a new release occurred yearly at best.

Web era

The web was a return to the Mainframe era, though most people do not think of it that way. With a centralized web server, a new version could be deployed at any time (provided it had been adequately tested).

Distributed era

Now that software is distributed and deployed on multiple devices, we are back to having to choose deployment frequency. Devices can phone home to find out if a new version is available, but we are left with the problem of choosing how often to allow (or force) users to upgrade.

Tradeoffs of deployment frequency

  • Ideally we want to get feedback from real users about new and changed features as soon as feasible – developers lose motivation if a feature that was worked on six months ago still has not been released
  • Finalizing acceptance testing and documentation has a cost, so it is nice to batch up sets of features for a release
  • Users vary in how often they want to have the disruption of installing a newer version and learning the foibles of that version – operating system developers are used to this, as are developers of utilities like PostgreSQL, which is currently on version 16, supports versions 15 through 12, and has multiple prior unsupported versions

For teams that are serious about doing Incremental Development, the Incremental Delivery needs to happen in close synchronization with the development cycle. An obvious strategy is to put new users on the latest version, and to provide an option for existing users to upgrade on their own schedule…

Creating Permanent Documentation

Although many companies try to use SharePoint, wiki-like web pages and similar tools to hold internal documentation, sometimes what is needed is a definitive, point-in-time version of the documentation. A typical solution for this is to publish a Word document, or to convert the web pages into a PDF file, but there are better ways.

One option I have used in the past is Sphinx, which uses reStructuredText (and now a version of Markdown) as the plaintext source for documentation. This plaintext can then be converted into whatever output format is desired.
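As a rough sketch, a minimal Sphinx conf.py looks something like the following; the project name, author and theme here are placeholders, not taken from any particular project.

# conf.py - minimal Sphinx configuration (placeholder values)
project = "Internal Documentation"
author = "Docs Team"
release = "1.0"

# reStructuredText works out of the box; Markdown sources can be enabled
# by adding an extension such as myst_parser to this list.
extensions = []

html_theme = "alabaster"

# Typical builds from the command line:
#   sphinx-build -b html source build/html
#   sphinx-build -M latexpdf source build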

Bookdown

Bookdown is built on the R Project, and hence uses RMarkdown, which, other than having the ability to run R scripts, is similar to most other variants of markdown. The build process for creating a PDF file uses LaTeX as the intermediate step, but that is effectively hidden under the covers.

The source files are plain text, with a .Rmd extension, and hence are easy to version in any code repository. With the ability to run R, statistics, charts and graphs are easy to incorporate, and when new data is available, the book can be regenerated.

A nice feature of using markdown is that, to a large extent, the source text from a book can be put into this Jekyll-based website without any issues.

What makes the business analysis task different is that, in these types of systems, the relationships between the different attributes are partially obscured. In a third normal form database, just looking at the column names in a table lets you know the important details about that business concept. In this case, you have to trust that the grouping of the Attributes into Categories makes logical sense without deep domain knowledge.

But some of it will not be immediately useful, such as the LaTeX-based syntax for referring to figures elsewhere in the document.

Figure \@ref(fig:PATERD) has a Patient table in the ERD, but the rest of the table names do not reflect the key business concepts. The reason for this is that in hospital patient data systems, many different types of readings need to be recorded about the patient, but these readings will be different depending on the type of hospital and between different units. So these types of systems tend to be locally customized to add the appropriate readings that are needed by a particular unit and associated specialists.

The above examples are from a Guide to SQL for Business Analysts. The SQL scripts in the source can be marked either to display, or to run against a database and produce a table of results. The queries run when the PDF is generated, ensuring that the SQL examples execute correctly against the database.

Incremental and Iterative

When using Agile approaches, few take the word Incremental as far as the Extreme Programming community in not letting the sun set on bad code. What that meant was that any code changes made during the day had to be either tested and committed back to the source code repository, or abandoned as an experiment that enabled learning about the design. Abandoned meant that the changes were deleted by simply checking out the current contents of the repository, and typically the next day a better solution was implemented in less time than the original experiment took.

Now that most teams are using PRs (aka Pull Requests), the same thing applies. Make the smallest possible change that makes sense and push that as a tested PR to be merged to the repository. A tiny, coherent and understandable change is quick to review, approve and merge (assuming that the team has good unit tests).

What is a good size for a PR?

My take is that a few hours work is the upper limit.

That will seem absurdly small for teams used to working in a separate branch for weeks on end, but those teams get to experience the joys of what used to be called Integration Hell and now is often called a Merge Conflict. The contents of the code repository have diverged so far from the separate branch that the code no longer fits. It needs reworking to match the new shape of the code.

Iterative Development

This is a fancy term for rework. A more modern word is Refactoring: the practice of reworking the code when a better understanding of the requirements and design emerges, so that the code better reflects the team’s understanding of the problem.

The point of iterating on the design is to take the initial clunky kludge and rework it to make it easier to extend and build on. The rework may be as simple as renaming methods or a more complicated realignment of responsibilities, but the purpose is to make the code easier to work on in the future — even if that future is later the same day.

Delaying the cleanup is always a mistake; just talk to any company that is dealing with Legacy code.

Typing is not the rate limiting step

With all the recent hype around Copilot and related LLM tools, it is time for a reminder that “typing is not the rate limiting step in software development”. Yes, there are lots of generative tools that will generate code for a developer, but for many codebases what is needed is a tool that will remove unused or duplicated code. This would reduce the accidental complexity of the codebase and simplify support and maintenance.

To date I am not aware of any “AI” tools that can even do simple deletion of unused variables, unnecessary calls to methods or rewrite a SQL query to use a simpler access path. Maybe there are some that will do such actions in response to a direct typed request — AKA prompt engineering — but none rise to the level of being able to “clean up the code” the way we would expect any competent developer to be able to do.
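To illustrate how mechanical part of that cleanup is, here is a rough sketch of my own (not any existing tool) that uses Python’s ast module to flag names that are assigned but never read:

# Naive unused-assignment finder: ignores scoping, attributes, del and
# other edge cases, but shows that the task is mechanical.
import ast

def unused_assignments(source):
    tree = ast.parse(source)
    assigned, loaded = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
    return assigned - loaded

print(unused_assignments("x = 1\ny = 2\nprint(x)\n"))  # {'y'}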

So what we are being offered is tools to enable code to be written faster, but for most real codebases, slinging code faster is not the goal. What is needed is a better understanding of the domain, the requirements, the existing design and codebase.

Radical Simplicity

Josh Austen has an interesting take on Radical Simplicity

These are my personal theories on software delivery with an emphasis on achieving higher efficacy without the pitfalls of imaginary and subjective metrics (story points, T-shirt sizes, etc.) that are endemic to the software industry.

Because these imaginary metrics can (and have) led to unnecessary stress and questionable business practices, I firmly believe that the software industry cannot enjoy substantial employee loyalty until employers shed imaginary metrics in favor of making the software delivery process as unencumbered as possible without sacrificing stakeholder needs.

This aligns with my thoughts that we should be working on delivering valuable software, where the return massively outweighs the value of any estimate. If you come up with an idea that will save $500K/year, does it really matter if it takes one day or one week to implement? The cost of delay in getting the valued savings is nearly $10K/week, so getting agreement on the number of story points it will take to deliver seems counter-productive.

Dreamhost Strikes Again

Not blogged for a while, busy with other stuff.

Found out that Dreamhost decided to no longer support Ruby on Rails running under Passenger unless you upgrade your package and use a virtual host. The site has been running under Rails since 2006, so there was a bit of work needed to port it to a different platform (still Ruby though, even if I did use Python to script the conversion).

This blog runs under Jekyll which generates a static site, so it loads faster than it used to even if the editing and publishing process is slower.

Generative AI still hallucinates

In a nice posting by Gary Marcus, Seven Lies in Four Sentences, there are examples of Generative AI doing what it always does via autocompletion: producing text that is unrelated to reality.

A related, earlier posting notes

All they do is match patterns, draw from massive statistical databases of human language. The patterns might be cool, but language these systems utter doesn’t actually mean anything at all.

Overall these generative tools produce the statistically most likely sequence of words based on the prompt that was supplied. The problem with that is that humans are gullible and easy to fool, so it is easy to fall into the trap of thinking that the generative tool is intelligent. In reality the tool has no idea at all about what it generated, other than that the output was the statistically most likely response.

So generated text is unlikely to be surprising – a.k.a. no new information – it will just be likely information. So yes, it is likely to give a reasonable answer to common prompts (because it was trained with many related examples), but for novel prompts hallucinations (a.k.a. misinformation) are the likely answer.

Utilizing SQL

A surprising number of systems I have worked on in the past 15 years have been drastically slower than they need to be because the developers did not understand how to write anything other than a simple single-table query. What was visible was that the application would:

  1. Issue a query to the database for a container.
  2. In a loop, issue a different query to get the details of each of the contents of the container.
  3. If you were unlucky, then go through the loop again to get some related details about each of the contents.

This works fine in development with just a few rows in the table; any database can run 1 + 3 + 3 queries in milliseconds. But in production, when there are 1,000+ items in the container, the application is suddenly trying to do 2,000+ queries, and the milliseconds turn into seconds. All because the developers used the programming language to replace a slightly more complicated SQL statement that could return the desired data in a single database request.
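A minimal sketch of the difference, using an in-memory SQLite database and made-up table names:

# N+1 queries versus a single JOIN (illustrative schema only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE containers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, container_id INTEGER, detail TEXT);
    INSERT INTO containers VALUES (1, 'box');
    INSERT INTO items VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c');
""")

# N+1 style: one query for the container contents, then one query per item.
item_ids = conn.execute(
    "SELECT id FROM items WHERE container_id = ? ORDER BY id", (1,)).fetchall()
details_slow = [conn.execute(
    "SELECT detail FROM items WHERE id = ?", (item_id,)).fetchone()[0]
    for (item_id,) in item_ids]

# Single request: let the database do the work with a join.
details_fast = [row[0] for row in conn.execute("""
    SELECT i.detail
    FROM containers c JOIN items i ON i.container_id = c.id
    WHERE c.id = ? ORDER BY i.id""", (1,))]

assert details_slow == details_fast == ['a', 'b', 'c']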

Now whenever I am working on a new codebase, I look for queries that are issued inside loops (or hidden loops inside streams), as practically every time I have to do performance tuning, that is where the excess delays live.

Getting feedback from running code

With unit tests and integration tests we can get feedback from the changes we have just made to code. The larger question is how long does it take in an organization to get feedback from real users interacting with that code?

  • With Software as a Service, it is feasible to set up continuous deployment and as long as the tests in the pipeline pass, do a deploy out to production
  • With other implementation models (On Premise, Device Firmware, etc.), it can take longer to get feedback

With the standard approach of pushing out incremental releases in a semi-traditional process, each release goes through a quality assurance cycle before being made available as a general release. Customers then choose when to upgrade to the available releases, or have the release pushed to their machines. In these cases, the time from the first change to the code until it is made available in a release can be significant.

With a 10-week release cycle, code can have been changed at the start of the cycle and will not get any feedback until the first customer installs and uses it, at least 10 weeks later.

Obviously the way to get real feedback in a shorter time is to have shorter release cycles, but that is not always possible. A workaround for this is to find customers who are willing to take a release as soon as it is available, effectively canary customers. These customers benefit from increased monitoring and scrutiny of the running system, as well as earlier availability of the features in the release.

A sideways view of tech innovations

Adam Kotsko had a great poll on tech innovations

Twitter poll scores

All seem to have been enabled by plentiful vulture capital money that seeks to monopolize part of the world.

  • Illegal cab company – with enough money and backing, local authorities cannot enforce the taxi and employment regulations against a massive company
  • Illegal hotel chain – why go to the bother of actually owning a hotel when you can own the booking system and take a large cut of the revenue. Why follow any regulation when you can claim that you are just the booking service?
  • Fake money for criminals – the various blockchain technologies were a solution in search of a problem until the non-fungible token idea took off
  • Plagiarism machine – automated copy-remix-paste, also known as a stochastic parrot or predictive text on steroids with delusions of grandeur

Missing from the list, but should have been there

  • Weaponized gossip – creating ways to get your ideas out to as many people as possible to create more discord and violence
  • Selling personal data for profit – now you can really target your advertisements
  • Gigs for all – profits to the corporations, avoid the hassles of minimum wage laws by using poorly paid “contractors”

Software as Capital

All too few people in the software development field seem to have read (or even heard about) Howard Baetjer Jr.’s Software as Capital book. Obviously, for a book written back in 1998, some of the ideas in the book will seem outdated, but the concepts are alive and well.

We often talk about legacy software as if it is a drag on a business. But for many software companies, the legacy software is the main source of revenue for the organization. Software is effectively embodied design decisions that have been written down over multiple years into the source code. These documented decisions are valuable, and are effectively the capital resources of the organization.

This is one of the reasons that rewrites are so difficult. The knowledge embedded in the existing code can be difficult to extract by the developers doing the rewrite. If an organization is good at looking after the existing source code – maintaining their capital assets – then extracting that knowledge is easier.

Mocking and Unit Tests

In the Ruby on Rails and Phoenix ecosystems, there is an explicit choice to use a test database for unit tests, along with fixtures to populate that test database with data for the tests. This all works seamlessly with the migrations for the schema and any seeds that are needed for the database. While both have slightly different strategies for isolating the tests, the tests get decent coverage and will fail if the SQL sent to the database is invalid.

Looking at Java Spring codebases, however, rather than having an explicit test database, the practice seems to be to completely mock out the interaction with the database. The end result is often that the tests set expectations as to the exact SQL that is emitted by the Data Access Object (DAO). Effectively the tests are asserting against the implementation of the method, not the results of the method, so query optimizations can end up breaking the unit tests.

This mocking to oblivion approach does allow the tests to run faster, but it does nothing to validate that the SQL (or HQL if using Hibernate) will return the correct dataset when run against a real database.
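A minimal sketch of the alternative, using Python’s standard library with an in-memory SQLite database rather than Spring or Hibernate; the point is that the test asserts on the results of the method, not on the SQL text it emits:

# Test the behaviour of a data-access function against a real (in-memory)
# database; the schema and function are made up for illustration.
import sqlite3
import unittest

def active_user_names(conn):
    # The exact query is an implementation detail and can be optimized
    # freely without breaking the test below.
    return [row[0] for row in conn.execute(
        "SELECT name FROM users WHERE active = 1 ORDER BY name")]

class ActiveUserNamesTest(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
            CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER);
            INSERT INTO users VALUES (1, 'Ada', 1), (2, 'Bob', 0), (3, 'Cleo', 1);
        """)

    def test_returns_only_active_users(self):
        self.assertEqual(active_user_names(self.conn), ['Ada', 'Cleo'])

if __name__ == "__main__":
    unittest.main()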

The case is often made that if the database is involved, then it is not a unit test, it is an integration test. My experience with Ruby on Rails and Phoenix was that even with touching the database, the tests ran fast enough, with Phoenix easily achieving 100 tests/second on an average laptop.

ChatGPT - do not fall for the hype

Dan McQuillan is calling it a bullshit generator

Despite the impressive technical ju-jitsu of transformer models and the billions of parameters they learn, it’s still a computational guessing game. ChatGPT is, in technical terms, a ‘bullshit generator’. If a generated sentence makes sense to you, the reader, it means the mathematical model has made sufficiently good guess to pass your sense-making filter. The language model has no idea what it’s talking about because it has no idea about anything at all. It’s more of a bullshitter than the most egregious egoist you’ll ever meet, producing baseless assertions with unfailing confidence because that’s what it’s designed to do.

The Register is even less complimentary, calling it a confidence trickster

Do enough talking to the bot about subjects you know, and curiosity soon deepens to unease. That feeling of talking with someone whose confidence far exceeds their competence grows until ChatGPT’s true nature shines out. It’s a Dunning-Kruger effect knowledge simulator par excellence. It doesn’t know what it’s talking about, and it doesn’t care because we haven’t learned how to do that bit yet.

As is apparent to anyone who has hung out with humans, Dunning Kruger is exceedingly dangerous and exceedingly common. Our companies, our religions and our politics offer limitless possibilities to people with DK. If you can persuade people you’re right, they’re very unwilling to accept proof otherwise, and up you go. Old Etonians, populist politicians and Valley tech bros rely on this, with results we are all too familiar with. ChatGPT is Dunning-Kruger As-a-Service (DKaaS). That’s dangerous.

And a fun take on AI replacing software developers by Dawson Eliasen

All this talk I’m seeing about AI being close to replacing programmers indicates there’s a significant gap between what people think programming is like and what programming is actually like. I get the sense that most people who don’t work in tech think that programming is like sitting down in front of a computer, saying to yourself, “alrighty, let’s make an app,” and expertly busting out code until you have a fresh app. It’s more like getting onboarded into an organization that has hundreds of thousands of lines of archaic, institutional code, and being tasked with finding and fixing the 1-10 lines that happen to be somehow causing the most urgent bug, and then doing this over and over.

Architectural Consistency - How hard can it be?

Working with Rails and Phoenix has made me more conscious of a related set of software architecture mistakes I am seeing in other systems…

  • How hard can it be to have consistent table names in a relational database? Preferably have plural table names, or if you prefer have singular names, but please do not have a mismatch with some table names plural and others singular.
  • How hard can it be to have a consistent name for the primary key of a table, something simple like id?
  • How hard can it be to have a consistent name for the foreign keys in a table, something like other_table_id? (Please remember when doing this that the foreign key is to a specific row so the other_table will be singular)
  • How hard can it be to have consistent convention for naming the common CRUD methods to access records in the database?

Over 20 years ago we were talking about Software Patterns and Coding Idioms, but many systems I have looked at in the past 10 years have got some or all of the above items wrong, leading to interesting defects and confused developers.
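As a small sketch of those conventions applied to a made-up schema: plural table names, an id primary key, and singular other_table_id foreign keys.

# Illustrative schema only; created against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,          -- consistent primary key name
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),  -- singular + _id
        placed_at TEXT NOT NULL
    );
""")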

Python Requests library and SSL

When using the simple requests calls, a separate SSL session is negotiated for each request.

>>> import requests
>>> resp = requests.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.344494 200
>>> resp = requests.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.320507 200
>>>

To avoid this, and to get faster subsequent response times, you can create a session object that will then keep the connection alive (but the default keep-alive for Apache is only 5 seconds)

>>> s = requests.Session()
>>> resp = s.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.333475 200
>>> resp = s.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.105773 200
>>>

So by using the Session capabilities, subsequent https requests in a test suite can save the SSL negotiation cost, which on my laptop down to this blog server is of the order of 200ms. Not a massive saving, but when a test suite needs to hit an API 100 times to test out the various options, the difference is noticeable.
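A minimal sketch of sharing one Session across a test suite (a pytest-style fixture; the URL is the same one used above):

# Reuse a single requests.Session for all tests so the SSL negotiation
# happens once per keep-alive window rather than once per request.
import pytest
import requests

@pytest.fixture(scope="session")
def http():
    with requests.Session() as session:
        yield session

def test_favicon_is_served(http):
    resp = http.get("https://improvingwetware.com/favicon.ico")
    assert resp.status_code == 200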

Prioritization and Cycle Time

Many of the agile processes make a nod towards cycle time, which is typically measured from when developers start working on an item to when the item is released to production. While this is a useful measure for finding out how long on average developers take to complete work, it is not the full picture.

Defining Cycle Time as just the time when developers are working makes it seem that the time is of the order of a week or so. Admittedly I have seen scrum teams on a two-week sprint cycle take multiple sprints to complete items, so even on a simplified cycle time measure many teams are slow.

The full cycle time that the user experiences, however, runs from when an item is first recognized, through being understood, written up into the backlog, queued for prioritization, queued for development, developed, and finally released. Although it is hard to get a good handle on the first few stages, I commonly see open JIRA tickets that are 12 months or older, and that is in the unprioritized backlog. From a user viewpoint this is abysmal: sending in a request and not hearing back for a long time.

The prioritization needs to be done in a way that allows for fast feedback to the users. One way of doing this is to adjust the workflow so that there is a two-stage writeup, and the initial request can then be simplified and routed for evaluation as soon as it is created. This initial prioritization puts the requests into one of four buckets.

  • Do Now, the team parks whatever it is working on and works on the item. Typically only needed for items that have a major, imminent business impact.
  • Do Next, team will pick up this item next after finishing the current work items.
  • Do Later, item gets queued into the backlog for consideration later.
  • Decline, item gets marked as declined, not a part of the backlog.

An immediate Decline is useful as it allows the user either to make a better case for the item, or to know that it is not going to change, so they need to accept the current process and/or come up with a workaround.

The Do Later items need to be worked on by the user, business domain experts, analysts and managers to better understand the item, the value and associated costs so that the relative importance of the items in the backlog can be assessed. As part of this there has to be an agreed maximum age of items in the backlog to keep the size manageable. Items that get too old need to be Declined, as they are lower value than all the other items that could be worked on.

The top items in the backlog are effectively in the Do Next category, so if no new high priority items have been put into the Do Next category, then the team can work on the top priority from the Do Later backlog.

The size of the Do Next category has to be managed by the user, business domain experts, analysts and managers group, so that it only contains items that are truly higher value than the top items in the Do Later backlog. The size of the team that is working sets the limit on the number of items in the Do Next category. It has to be small so that the initial evaluation can be immediate; a good starting point is five, or one less than the team size if the team is smaller than five.

More Dan North and CUPID

Finally Dan has got around to creating the Cupid.dev website

I believe that there are properties or characteristics of software that make it a joy to work with. The more your code has these qualities, the more joyful it is to work with; but everything is a tradeoff so you should always consider your context.

There are likely many of these properties, overlapping and interrelating, and there are many ways to describe them. I have chosen five that underpin much of what I care about in code. There is a diminishing return; five are enough to make a handy acronym, and few enough to remember.

Just for reference, CUPID stands for

  • Composable: plays well with others,
  • Unix philosophy: does one thing well,
  • Predictable: does what you expect,
  • Idiomatic: feels natural,
  • Domain-based: the code models the problem domain in language and structure.