AI Coding Assistants?

The current crop of AI Coding Assistants is not fit for purpose.

They are capable of generating boilerplate code based on detailed text descriptions of the intended design, but that boilerplate code is nearly always suboptimal. The problem is that most languages and development environments already have the capability to generate any necessary boilerplate, and the code generated by those environments is always consistent and predictable.
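For example (a minimal Java sketch, not taken from any particular tool): an IDE can generate a consistent equals/hashCode pair on demand, and newer language features such as records remove the boilerplate entirely.

    import java.util.Objects;

    // Hand-written boilerplate of the kind any IDE can generate consistently.
    class Point {
        private final int x;
        private final int y;

        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() { return Objects.hash(x, y); }
    }

    // The same concept with the boilerplate handled by the language itself.
    record PointRecord(int x, int y) { }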

Unfortunately the “AI” tools fail at producing consistent, predictable code. All the examples I have seen (beyond the inevitable toy examples) fail at the task of generating code that is consistent with the codebase they are working with.

If the “AI” tools want to be useful for the task of software development, they should focus on other aspects of that work rather than trying to generate code – because Typing is not the rate limiting step:

  • Improving the quality of compiler generated error messages
  • Linters that do not result in lots of false positive warnings
  • Instrumentation of running systems to assist with performance tuning
  • Code analysis to report on possible vulnerabilities beyond known CVEs

A much better area for the “AI” companies to look at is the upstream activities:

  • Exploring the quality of requirements
  • Understanding how requirements could result in unintended consequences
  • Analysis of design options and how these would interact with the overall requirements

How hard can it be?

A perennial problem in software development is determining how long it will take to deliver a useful system. Amusingly, the conversation normally starts with the words “How hard can it be…”

The correct answer to this is a lot harder than you initially think it will be.

The problem is that, collectively, we are very bad at requirements elicitation.

The idea that we could identify all requirements up front before developing the software has been proven false multiple times. A key reason for this is that any new system has impacts outside of the intended results that, although possibly easy to see in hindsight, are hard to predict…

  • We can now send and receive emails; do we have a way of addressing the problems related to spam?
  • Once connected to the internet, how do we deal with distributed denial of service attacks?
  • With social media, none of the initial implementations considered hostile actors.

The Agile approaches deal with the problem by developing incrementally and iteratively. So a small part of the eventual system is built and evaluated. Depending on what is learnt during the evaluation, iteration may be required to rework the feature, design or implementation. This iteration is an essential part of the process, as it allows new information to be incorporated into the system.

An obvious downside is that iteration slows the delivery of the next increment of functionality – since the rework of the delivered features takes time. But the key thing to consider is that not responding to what has been learned from the evaluation means that the eventual users will be adversely impacted.

Incremental Development requires Incremental Requirements Capture

The above implies that there is no point in trying to elicit and document all the requirements up front. Instead, the requirements for the initial features need to be captured before doing the development of those features. But the requirements for the subsequent features can be deferred until later in the process – potentially capturing the second set of requirements just in time, in parallel with the development and evaluation of the first features.

Hence my answer to “How hard can it be…” is: how well do you understand your problem domain? With a hat tip to the Dunning-Kruger effect, where people without experience normally overestimate how good their understanding is.

SWEBOK 4 is out

A few years ago I noticed that SWEBOK 3 was out, and at the time it seemed to use an outdated reference: Parameterized types, also known as generics (Ada, Eiffel) and templates (C++).

Well, now SWEBOK 4 is out, and the reference has been updated (sort of):

Parameterized types, also known as generics (Ada, Java, Eiffel) and templates (C++), enable a type or class definition without specifying all the other types used. The unspecified types are supplied as parameters at the point of use. Parameterized types provide a third way (besides class inheritance and object composition) to compose behaviors in object-oriented software.

It is nice to see that Java now makes the list, even though it has been around for more than 20 years. As I said last time, it might have been more relevant to reference Go or Rust as examples in this context.

We need better Tech Journalism

It is amazing to me how easy it is for tech entrepreneurs to mislead journalists by making claims for products that are not grounded in reality. A recent example is the coverage of the robotaxi launch: after 8+ years of failing to deliver full self driving, Axios completely failed to report that in 2016 Tesla claimed full self driving would be available the next year. Similarly, Wired claimed the cybercab is here and then stated that it was going into production in 2026. Admittedly Wired did say Musk, an admitted collector of missed deadlines, has been promising Tesla self-driving tech since 2016, but that was five paragraphs into the article.

Ars Technica was better, and 10 paragraphs in had a subheading of Do you still believe this stuff? Overall, that should have been the lede.

But in their coverage of Apple Intelligence, Ars Technica failed to be skeptical of the claims made by Apple. Practically all coverage of Artificial Intelligence fails to mention the problem that it makes simple mistakes — inventing things that are not real — and uses the term Hallucination to whitewash the problem.

AI Acceptance Tests

How do we do acceptance testing of the various claims made for AI capabilities?

At one time Winograd Schemas were thought to be a challenge for AI: although humans can easily disambiguate pronoun usage based on word choice, most early AI failed at this. These were based on Terry Winograd’s thesis on Understanding Natural Language, where the thought was that a computer would need to know the meanings of words to disambiguate pronouns. The classic example is “The city councilmen refused the demonstrators a permit because they feared violence”, where deciding who “they” refers to needs knowledge of the world, not just the grammar.

Unfortunately, with a large enough corpus of text, statistics can handle most of the disambiguation.

So the question arises, what can we do to validate the output from the current set of LLM AI approaches?

  • With a simple enough question, the answer can be found by a regular internet search.
  • We have to allow for gullible humans being easily convinced by the output - the Clever Hans problem.
  • The count of the letter r in Strawberry problem shows that the models can be adjusted to cater for specific failures, so once a test is known, the companies can Teach to the Test.

Currently the best we can do is ask questions based on local knowledge that is not widespread, and when that is done most of the claims for the LLMs turn out to be vapourware. The current set of AI approaches make up stuff that initially seems correct, AKA Hallucinations that are not connected to the context. So summarizations of a document add in things that could sort of fit but are not in the original, or the summarization is so general that it would fit anything — like a horoscope prediction where the reader thinks it applies to them.

Bonus link to a paper where some researchers think that we do not understand cognition enough to build intelligent AI.

Tests should Fail

I am not sure how many times I have run into this recently, but I am starting to see many tests that cannot fail. What I mean by this is tests that pass regardless of whether the change they are meant to test is present in the codebase or not.

Obviously this does not happen in a Test Driven Development project, since with TDD the first thing to do is run the test to make sure it fails. However, with the more common practice of writing the tests after the code, a passing test does not mean that the code works. All it means is that the test is passing.

It is important to make sure that any test will fail if the feature it is testing stops doing the right thing. So for tests that were written after the code, it is important to comment out parts of the feature and ensure that the test can detect the deliberate defect.
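As a minimal sketch of what that looks like (JUnit-style Java with hypothetical names): comment out the marked line in the feature and the test should go red; if it stays green, the test cannot fail and is not testing anything useful.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class DiscountCalculator {
        // The feature under test: 10% discount on orders of 100 or more.
        static double discountedTotal(double total) {
            if (total >= 100.0) {
                return total * 0.9;   // comment out this line to verify the test can fail
            }
            return total;
        }
    }

    class DiscountCalculatorTest {
        @Test
        void largeOrdersGetTenPercentOff() {
            // If this assertion still passes with the discount line commented out,
            // the test is not actually exercising the feature.
            assertEquals(90.0, DiscountCalculator.discountedTotal(100.0), 0.001);
        }
    }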

Incremental Delivery?

When using Agile approaches and Incremental development, a necessary decision is how often to release a new version to your users.

Pre-history

Back in 1988, Tom Gilb published Principles of Software Engineering Management, in which he wrote about a weekly delivery cycle. Yes, back in the era of Mainframes, some teams managed to release weekly.

Floppy Disk era

When software started to be delivered via a floppy disk (and later CDs and DVDs), release frequency drastically dropped. Due to the need to produce multiple physical copies of the documentation and the disks, a new release occurred yearly at best.

Web era

The web was a return to the Mainframe era — but most people do not think of it that way. With a centralized web server, a new version could be deployed at any time (provided it had been adequately tested).

Distributed era

Now that software is distributed and deployed on multiple devices, we are back to having to choose deployment frequency. Devices can phone home to find out if a new version is available, but we are left with the problem of choosing how often to allow (or force) users to upgrade.

Tradeoffs of deployment frequency

  • Ideally we want to get feedback from real users about new and changed features as soon as feasible – developers lose motivation if a feature that was worked on six months ago still has not been released
  • Finalizing acceptance testing and documentation has a cost, so it is nice to batch up sets of features for a release
  • Users vary in how often they want to have the disruption of installing a newer version and learning the foibles of that version – operating system developers are used to this, as are developers of utilities like PostgreSQL, which is currently on Version 16, supports versions 15 thru 12 and has multiple prior unsupported versions

For teams that are serious about doing Incremental Development, the Incremental Delivery needs to happen with a close synchronization to the development cycle. An obvious strategy is to put new users on the latest version, and to provide an option for existing users to upgrade on their own schedule…

Creating Permanent Documentation

Although many companies try to use Sharepoint, Wiki-like web pages and similar to hold internal documentation, sometimes what is needed is a point in time definitive version of the documentation. A typical solution for this is to publish a Word document, or to convert the web pages into a pdf file, but there are better ways.

One option I have used in the past is Sphinx, which uses reStructuredText (and now a version of Markdown) as the plaintext source for documentation. This plaintext can then be converted into whatever output format is desired.

Bookdown

Bookdown is built on R, and hence uses RMarkdown, which, other than having the ability to run R scripts, is similar to most other variants of markdown. The build process for creating a pdf file uses LaTeX as the intermediate step, but that is effectively hidden under the covers.

The source files are plain text, with a .Rmd extension, and hence are easy to version in any code repository. With the ability to run R, statistics, charts and graphs are easy to incorporate, and when new data is available, the book can be regenerated.

A nice feature of using markdown is that, to a large extent, the source text from a book can be put into this Jekyll-based website without any issues. For example, this excerpt transfers directly:

What makes the business analysis task different is that, in these types of systems, the relationships between the different attributes are partially obscured. In a third normal form database, just looking at the column names in a table lets you know the important details about that business concept. In this case, you have to trust that the grouping of the Attributes into Categories makes logical sense without deep domain knowledge.

But some parts will not be immediately usable, as in the LaTeX-based syntax for referring to figures elsewhere in the document:

Figure \@ref(fig:PATERD) has a Patient table in the ERD, but the rest of the table names do not reflect the key business concepts. The reason for this is that in hospital patient data systems, many different types of readings need to be recorded about the patient, but these readings will be different depending on the type of hospital and between different units. So these types of systems tend to be locally customized to add the appropriate readings that are needed by a particular unit and associated specialists.

The above examples are from a Guide to SQL for Business Analysts. The SQL scripts in the source can be marked either to display, or to run against a database and produce a table of results. The queries run when the pdf is generated, ensuring that the SQL examples execute correctly against the database.

Incremental and Iterative

When using Agile approaches, few take the word Incremental as far as the Extreme Programming community did in not letting the sun set on bad code. What that meant was that any code changes made during the day had to be either tested and committed back to the source code repository or abandoned as an experiment that enabled learning about the design. Abandoned meaning that the changes were deleted by just checking out the contents of the current repository — and typically the next day a better solution was implemented in less time than the original experiment took.

Now that most teams are using PRs (aka Pull Requests), the same thing applies. Make the smallest possible change that makes sense and push that as a tested PR to be merged to the repository. A tiny, coherent and understandable change is quick to review, approve and merge (assuming that the team has good unit tests).

What is a good size for a PR?

My take is that a few hours work is the upper limit.

That will seem absurdly small for teams used to working in a separate branch for weeks on end, but those teams get to experience the joys of what used to be called Integration Hell and now is often called a Merge Conflict. The contents of the code repository have diverged so far from the separate branch that the code no longer fits. It needs reworking to match the new shape of the code.

Iterative Development

This is a fancy word for rework. A more modern word is Refactoring: the practice of reworking the code when a better understanding of the requirements and design emerges, so that the code better reflects the team’s understanding of the problem.

The point of iterating on the design is to take the initial clunky kludge and rework it to make it easier to extend and build on. The rework may be as simple as renaming methods or a more complicated realignment of responsibilities, but the purpose is to make the code easier to work on in the future — even if that future is later the same day.
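A minimal sketch of that kind of rework (hypothetical Java names, behaviour unchanged): the rename and the extracted method make the intent easier to build on.

    // Before: the name and the inline logic obscure the intent.
    class PricingBefore {
        double calc(double a, int n) {
            return a - (a * (n >= 10 ? 0.1 : 0.0));
        }
    }

    // After: renamed and responsibilities realigned; behaviour identical.
    class PricingAfter {
        double discountedPrice(double price, int quantity) {
            return price - (price * bulkDiscountRate(quantity));
        }

        private double bulkDiscountRate(int quantity) {
            return quantity >= 10 ? 0.1 : 0.0;
        }
    }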

Delaying the cleanup is always a mistake; just talk to any company that is dealing with Legacy code.

Typing is not the rate limiting step

With all the recent hype around Copilot and related LLM tools, it is time for a reminder that “typing is not the rate limiting step in software development”. Yes, there are lots of generative tools that will generate code for a developer, but for many codebases what is needed is a tool that will remove unused or duplicated code. This would reduce the accidental complexity of the codebase and simplify support and maintenance.

To date I am not aware of any “AI” tools that can even do simple cleanups: deleting unused variables, removing unnecessary calls to methods, or rewriting a SQL query to use a simpler access path. Maybe there are some that will do such actions in response to a direct typed request — AKA prompt engineering — but none rise to the level of being able to “clean up the code” the way we would expect any competent developer to be able to do.
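As an illustration of the kind of cleanup meant here (a hypothetical Java sketch, with a made-up Dao interface): delete the unused variable, drop the call whose result is ignored, and simplify the query's access path.

    interface Dao {
        java.util.List<String> findOrders(int customerId);
        String query(String sql, Object... params);
    }

    class CustomerReport {
        // Before: an unused variable, a call whose result is ignored,
        // and a query with an unnecessarily convoluted access path.
        String beforeCleanup(Dao dao, int customerId) {
            String unusedLabel = "customer report";          // never read: delete
            dao.findOrders(customerId);                      // result ignored: delete
            return dao.query(
                "SELECT name FROM customers WHERE id IN "
                    + "(SELECT id FROM customers WHERE id = ?)",
                customerId);
        }

        // After: the same result with the accidental complexity removed.
        String afterCleanup(Dao dao, int customerId) {
            return dao.query("SELECT name FROM customers WHERE id = ?", customerId);
        }
    }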

So what we are being offered is tools to enable code to be written faster, but for most real codebases, slinging code faster is not the goal. What is needed is a better understanding of the domain, the requirements, the existing design and codebase.

Radical Simplicity

Josh Austen has an interesting take on Radical Simplicity

These are my personal theories on software delivery with an emphasis on achieving higher efficacy without the pitfalls of imaginary and subjective metrics (story points, T-shirt sizes, etc.) that are endemic to the software industry.

Because these imaginary metrics can (and have) led to unnecessary stress and questionable business practices, I firmly believe that the software industry cannot enjoy substantial employee loyalty until employers shed imaginary metrics in favor of making the software delivery process as unencumbered as possible without sacrificing stakeholder needs.

This aligns with my thoughts that we should be working on delivering valuable software, where the return massively outweighs the value of any estimate. If you come up with an idea that will save $500K/year, does it really matter if it takes one day or one week to implement? The cost of delay in getting the valued savings is nearly $10K/week ($500K divided by 52 weeks), so getting agreement on the number of story points it will take to deliver seems counter-productive.

Dreamhost Strikes Again

Not blogged for a while, busy with other stuff.

Found out that Dreamhost decided to no longer support Ruby on Rails running under Passenger unless you upgrade your package and use a virtual host. The site has been running under Rails since 2006, so there was a bit of work needed to port it to a different platform (still Ruby though, even if I did use Python to script the conversion).

This blog runs under Jekyll which generates a static site, so it loads faster than it used to even if the editing and publishing process is slower.

Generative AI still hallucinates

In a nice posting by Gary Marcus - Seven Lies in Four Sentences - there are examples of Generative AI doing what it always does via autocompletion: producing text that is unrelated to reality.

A related, earlier posting notes

All they do is match patterns, draw from massive statistical databases of human language. The patterns might be cool, but language these systems utter doesn’t actually mean anything at all.

Overall these generative tools produce the statistically most likely sequence of words based on the prompt that was supplied. The problem with that is that humans are gullible and easy to fool, so it is easy to fall into the trap of thinking that the generative tool is intelligent. Whereas in reality the tool has no idea at all about what it generated, other than that the output was the statistically most likely response.

So generated text is unlikely to be surprising – a.k.a. no new information – it will just be likely information. So yes, it is likely to give a reasonable answer to common prompts (because it was trained with many related examples), but for novel prompts a hallucination (a.k.a. misinformation) is the likely answer.

Utilizing SQL

A surprising number of systems I have worked on in the past 15 years have been drastically slower than they need to be because the developers did not understand how to write anything other than a simple single table query. So what was visible was that the application would:

  1. Issue a query to the database for a container.
  2. In a loop, issue a different query to get the details of each of the contents of the container.
  3. If you were unlucky, then go through the loop again to get some related details about each of the contents.

This works fine in development with just a few rows in the table; any database can run 1 + 3 + 3 queries in milliseconds. But in production, when there are 1000+ items in the container, the application is suddenly trying to do 2,000 queries, and the milliseconds turn into seconds. All because the developers used the programming language to replace a slightly more complicated SQL statement that could return the desired data in a single database request.
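A sketch of the difference (hypothetical Java/JDBC with made-up table names): the first method issues one query per item, the second gets everything in a single round trip by letting the database do the join.

    import java.sql.*;
    import java.util.*;

    class ContainerLoader {
        // N+1 pattern: one query for the container contents, then one per item.
        List<String> loadItemNamesSlowly(Connection conn, long containerId) throws SQLException {
            List<String> names = new ArrayList<>();
            try (PreparedStatement items = conn.prepareStatement(
                    "SELECT item_id FROM container_items WHERE container_id = ?")) {
                items.setLong(1, containerId);
                try (ResultSet rs = items.executeQuery()) {
                    while (rs.next()) {
                        // This inner query runs once for every row in the container.
                        try (PreparedStatement detail = conn.prepareStatement(
                                "SELECT name FROM items WHERE id = ?")) {
                            detail.setLong(1, rs.getLong("item_id"));
                            try (ResultSet d = detail.executeQuery()) {
                                if (d.next()) names.add(d.getString("name"));
                            }
                        }
                    }
                }
            }
            return names;
        }

        // Single request: the join returns all the detail rows at once.
        List<String> loadItemNamesInOneQuery(Connection conn, long containerId) throws SQLException {
            List<String> names = new ArrayList<>();
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT i.name FROM items i "
                        + "JOIN container_items ci ON ci.item_id = i.id "
                        + "WHERE ci.container_id = ?")) {
                stmt.setLong(1, containerId);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) names.add(rs.getString("name"));
                }
            }
            return names;
        }
    }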

Now whenever I am working on a new codebase, I look for queries that are issued inside loops (or hidden loops inside streams), as practically every time I have to do performance tuning, that is where the excess delays live.

Getting feedback from running code

With unit tests and integration tests we can get feedback from the changes we have just made to code. The larger question is how long does it take in an organization to get feedback from real users interacting with that code?

  • With Software as a Service, it is feasible to set up continuous deployment and as long as the tests in the pipeline pass, do a deploy out to production
  • With other implementation models, On Premise, Device Firmware etc, it can take longer to get feedback

With the standard approach of pushing out incremental releases in a semi-traditional manner, each release goes through a quality assurance cycle before being made available as a general release. Customers then choose when to upgrade to the available releases, or have the release pushed to their machines. In these cases, the time from the first change to the code until it is made available in a release can be significant.

With a 10 week release cycle, code can have been changed at the start of the cycle and will not get any feedback until the first customer installs and uses it at least 10 weeks later.

Obviously the way to get real feedback in a shorter time is to have shorter release cycles, but that is not always possible. A workaround for this is to find customers who are willing to take a release as soon as it is available, effectively canary customers. These customers benefit from increased monitoring and scrutiny of the running system, as well as earlier availability of the features in the release.

A sideways view of tech innovations

Adam Kotsko had a great poll on tech innovations

[Image: Twitter poll scores]

All seem to have been enabled by plentiful vulture capital money that seeks to monopolize part of the world.

  • Illegal cab company – with enough money and backing, local authorities cannot enforce the taxi and employment regulations against a massive company
  • Illegal hotel chain – why go to the bother of actually owning a hotel when you can own the booking system and take a large cut of the revenue. Why follow any regulation when you can claim that you are just the booking service?
  • Fake money for criminals – the various blockchain technologies were a solution in search of a problem until the non-fungible token idea took off
  • Plagiarism machine – automated copy-remix-paste, also known as a stochastic parrot or predictive text on steroids with delusions of grandeur

Missing from the list, but should have been there:

  • Weaponized gossip – creating ways to get your ideas out to as many people as possible to create more discord and violence
  • Selling personal data for profit – now you can really target your advertisements
  • Gigs for all - profits to the corporations, avoid the hassles of minimum wage laws by using poorly paid “contractors”

Software as Capital

All too few people in software development seem to have read (or even heard about) Howard Baetjer Jr.’s Software as Capital book. Obviously, for a book written back in 1998, some of the ideas in the book will seem outdated, but the concepts are alive and well.

We often talk about legacy software as if it is a drag on a business. But for many software companies, the legacy software is the main source of revenue for the organization. Software is effectively embodied design decisions that have been written down over multiple years into the source code. These documented decisions are valuable, and are effectively the capital resources of the organization.

This is one of the reasons that rewrites are so difficult. The knowledge embedded in the existing code can be difficult to extract by the developers doing the rewrite. If an organization is good at looking after the existing source code – maintaining their capital assets – then extracting that knowledge is easier.

Mocking and Unit Tests

In the Ruby on Rails and Phoenix ecosystems, there is an explicit choice to use a test database for unit tests, along with fixtures to populate that test database with data for the tests. This all works seamlessly with the migrations for the schema and any seeds that are needed for the database. While both have slightly different strategies for isolating the tests, the tests get decent coverage and will fail if the SQL sent to the database is invalid.

Looking at Java Spring codebases, however, rather than having an explicit test database, the practice seems to be to completely mock out the interaction with the database. The end result is often that the tests set expectations as to the exact SQL that is emitted by the Data Access Object (DAO). Effectively the tests are asserting against the implementation of the method, not the results of the method, so query optimizations can end up breaking the unit tests.
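A sketch of the pattern (hypothetical Mockito-style test with made-up class names): the only assertion is against the literal SQL text, so an equivalent but faster query breaks the test, while a query the database would reject still passes.

    import static org.mockito.Mockito.*;
    import org.junit.jupiter.api.Test;

    // Minimal collaborators so the sketch compiles.
    interface JdbcRunner {
        java.util.List<java.util.Map<String, Object>> query(String sql);
    }

    class CustomerDao {
        private final JdbcRunner jdbc;
        CustomerDao(JdbcRunner jdbc) { this.jdbc = jdbc; }
        java.util.List<java.util.Map<String, Object>> findActiveCustomers() {
            return jdbc.query("SELECT * FROM customers WHERE active = true");
        }
    }

    class CustomerDaoTest {
        @Test
        void findsActiveCustomers() {
            // The database is mocked away entirely...
            JdbcRunner jdbc = mock(JdbcRunner.class);
            CustomerDao dao = new CustomerDao(jdbc);

            dao.findActiveCustomers();

            // ...so the only thing verified is the exact SQL string the DAO emitted.
            // Nothing checks that this SQL would run, or return the right rows.
            verify(jdbc).query("SELECT * FROM customers WHERE active = true");
        }
    }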

This mocking to oblivion approach does allow the tests to run faster, but it does nothing to validate that the SQL (or HQL if using Hibernate) will return the correct dataset when run against a real database.

The case is often made that if the database is involved, then it is not a unit test, it is an integration test. My experience with Ruby on Rails and Phoenix was that even when touching the database, the tests ran fast enough, with Phoenix easily achieving 100 tests/second on an average laptop.

ChatGPT - do not fall for the hype

Dan McQuillan is calling it a bullshit generator

Despite the impressive technical ju-jitsu of transformer models and the billions of parameters they learn, it’s still a computational guessing game. ChatGPT is, in technical terms, a ‘bullshit generator’. If a generated sentence makes sense to you, the reader, it means the mathematical model has made sufficiently good guess to pass your sense-making filter. The language model has no idea what it’s talking about because it has no idea about anything at all. It’s more of a bullshitter than the most egregious egoist you’ll ever meet, producing baseless assertions with unfailing confidence because that’s what it’s designed to do.

The Register is even less complimentary, calling it a confidence trickster

Do enough talking to the bot about subjects you know, and curiosity soon deepens to unease. That feeling of talking with someone whose confidence far exceeds their competence grows until ChatGPT’s true nature shines out. It’s a Dunning-Kruger effect knowledge simulator par excellence. It doesn’t know what it’s talking about, and it doesn’t care because we haven’t learned how to do that bit yet.

As is apparent to anyone who has hung out with humans, Dunning Kruger is exceedingly dangerous and exceedingly common. Our companies, our religions and our politics offer limitless possibilities to people with DK. If you can persuade people you’re right, they’re very unwilling to accept proof otherwise, and up you go. Old Etonians, populist politicians and Valley tech bros rely on this, with results we are all too familiar with. ChatGPT is Dunning-Kruger As-a-Service (DKaaS). That’s dangerous.

And a fun take on AI replacing software developers by Dawson Eliasen

All this talk I’m seeing about AI being close to replacing programmers indicates there’s a significant gap between what people think programming is like and what programming is actually like. I get the sense that most people who don’t work in tech think that programming is like sitting down in front of a computer, saying to yourself, “alrighty, let’s make an app,” and expertly busting out code until you have a fresh app. It’s more like getting onboarded into an organization that has hundreds of thousands of lines of archaic, institutional code, and being tasked with finding and fixing the 1-10 lines that happen to be somehow causing the most urgent bug, and then doing this over and over.