Radical Simplicity

Josh Austen has an interesting take on Radical Simplicity

These are my personal theories on software delivery with an emphasis on achieving higher efficacy without the pitfalls of imaginary and subjective metrics (story points, T-shirt sizes, etc.) that are endemic to the software industry.

Because these imaginary metrics can (and have) led to unnecessary stress and questionable business practices, I firmly believe that the software industry cannot enjoy substantial employee loyalty until employers shed imaginary metrics in favor of making the software delivery process as unencumbered as possible without sacrificing stakeholder needs.

This aligns with my thoughts that we should be working on delivering valuable software, where the return massively outweighs the value of any estimate. If you come up with an idea that will save $500K/year, does it really matter if it takes one day or one week to implement? The cost of delay in getting the valued savings is nearly $10K/week, so getting agreement on the number of story points it will take to deliver seems counter-productive.
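
As a quick check of that cost of delay figure, here is a minimal sketch of the arithmetic. The $500K saving is the example above; the only added assumption is that the saving accrues evenly over a 52 week year.

```python
# Cost of delay for an idea worth a fixed annual saving.
ANNUAL_SAVING = 500_000  # dollars per year, from the example above
WEEKS_PER_YEAR = 52      # assumption: savings accrue evenly across the year

cost_of_delay_per_week = ANNUAL_SAVING / WEEKS_PER_YEAR
print(f"Cost of delay: ${cost_of_delay_per_week:,.0f} per week")  # $9,615 per week
```

Every week spent debating the estimate costs roughly the same as a week spent building the thing.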

Dreamhost Strikes Again

Not blogged for a while, busy with other stuff.

Found out that Dreamhost decided to no longer support Ruby on Rails running under Passenger unless you upgrade your package and use a virtual host. The site has been running under Rails since 2006, so there was a bit of work needed to port it to a different platform (still Ruby though, even if I did use Python to script the conversion).

This blog runs under Jekyll which generates a static site, so it loads faster than it used to even if the editing and publishing process is slower.

Generative AI still hallucinates

In a nice posting by Gary Marcus - Seven Lies in Four Sentences there are examples of where Generative AI does what it always does via autocompletion and produces text that is unrelated to reality.

A related, earlier posting notes

All they do is match patterns, draw from massive statistical databases of human language. The patterns might be cool, but language these systems utter doesn’t actually mean anything at all.

Overall these generative tools produce the statistically most likely sequence of words based on the prompt that was supplied. The problem with that is that humans are gullible and easy to fool, so it is easy to fall into the trap of thinking that the generative tool is intelligent. Whereas in reality the tool has no idea at all about what it generated, other than that the output was the statistically most likely response.

So generated text is unlikely to be surprising – a.k.a. no new information – it will just be likely information. So yes, it is likely to give a reasonable answer to common prompts (because it was trained with many related examples), but for novel prompts hallucinations (a.k.a. misinformation) are the likely answer.

Utilizing SQL

A surprising number of systems I have worked on in the past 15 years have been drastically slower than they need to be because the developers did not understand how to write anything other than a simple single table query. So what was visible in the application was that the application would

  1. Issue a query to the database for a container.
  2. In a loop, issue a different query to get the details of each of the contents of the container.
  3. If you were unlucky, then go through the loop again to get some related details about each of the contents.

This works fine in development with just a few rows in the table; any database can run 1 + 3 + 3 queries in milliseconds. But in production, when there are 1000+ items in the container, the application is suddenly trying to do 2,000 queries, and the milliseconds turn into seconds. All because the developers used the programming language to replace a slightly more complicated SQL statement that could return the desired data in a single database request.

Now whenever I am working on a new codebase, I look for queries that are issued inside loops (or hidden loops inside streams), as practically every time I have to do performance tuning, that is where the excess delays live.
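
A minimal sketch of the two approaches, using Python's built-in sqlite3 module with an in-memory database. The containers and items tables and both function names are hypothetical, made up for this illustration rather than taken from any real system.

```python
import sqlite3

# Hypothetical schema: a container holds items. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE containers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        container_id INTEGER REFERENCES containers(id),
        name TEXT
    );
""")
conn.execute("INSERT INTO containers (id, name) VALUES (1, 'box')")
conn.executemany(
    "INSERT INTO items (container_id, name) VALUES (1, ?)",
    [(f"item-{n}",) for n in range(1000)],
)


def items_with_loop(container_id):
    """The slow pattern: one query for the ids, then one query per item."""
    ids = conn.execute(
        "SELECT id FROM items WHERE container_id = ? ORDER BY id",
        (container_id,),
    ).fetchall()
    names = []
    for (item_id,) in ids:  # one extra round trip per item
        row = conn.execute(
            "SELECT name FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        names.append(row[0])
    return names


def items_with_join(container_id):
    """The same data from a single request using a join."""
    rows = conn.execute(
        """
        SELECT items.name
        FROM containers
        JOIN items ON items.container_id = containers.id
        WHERE containers.id = ?
        ORDER BY items.id
        """,
        (container_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

Both functions return the same list. With an in-memory database the difference is small, but against a database server every query in the loop pays a network round trip, which is where the seconds come from.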

Getting feedback from running code

With unit tests and integration tests we can get feedback from the changes we have just made to code. The larger question is how long does it take in an organization to get feedback from real users interacting with that code?

  • With Software as a Service, it is feasible to set up continuous deployment and as long as the tests in the pipeline pass, do a deploy out to production
  • With other implementation models, On Premise, Device Firmware etc, it can take longer to get feedback

With the standard approach of pushing out incremental releases and a semi traditional approach, each release goes through a quality assurance cycle before being made available as a general release. Customers then choose when to upgrade to the available releases, or have the release pushed towards their machines. In these cases, the time from the first changes to the code in the release until it is made available in a release can be significant.

With a 10 week release cycle, code can have been changed at the start of the release and will not get any feedback until the first customer installs and uses it at least 10 weeks later.

Obviously the way to get real feedback in a shorter time is to have shorter release cycles, but that is not always possible. A workaround for this is to find customers who are willing to take a release as soon as it is available, effectively canary customers. These customers benefit from increased monitoring and scrutiny of the running system, as well as earlier availability of the features in the release.

A sideways view of tech innovations

Adam Kotsko had a great poll on tech innovations

Twitter poll scores

All seem to have been enabled by plentiful vulture capital money that seeks to monopolize part of the world.

  • Illegal cab company – with enough money and backing, local authorities cannot enforce the taxi and employment regulations against a massive company
  • Illegal hotel chain – why go to the bother of actually owning a hotel when you can own the booking system and take a large cut of the revenue. Why follow any regulation when you can claim that you are just the booking service?
  • Fake money for criminals – the various blockchain technologies were a solution in search of a problem until the non-fungible token idea took off
  • Plagiarism machine – automated copy-remix-paste, also known as a stochastic parrot or predictive text on steroids with delusions of grandeur

Missing from list but should have been there

  • Weaponized gossip – creating ways to get your ideas out to as many people as possible to create more discord and violence
  • Selling personal data for profit – now you can really target your advertisements
  • Gigs for all - profits to the corporations, avoid the hassles of minimum wage laws by using poorly paid “contractors”

Software as Capital

All too few people in software development seem to have read (or even heard about) Howard Baetjer Jr.’s Software as Capital book. Obviously, for a book written back in 1998, some of the ideas in the book will seem outdated, but the concepts are alive and well.

We often talk about legacy software as if it is a drag on a business. But for many software companies, the legacy software is the main source of revenue for the organization. Software is effectively embodied design decisions that have been written down over multiple years into the source code. These documented decisions are valuable, and are effectively the capital resources of the organization.

This is one of the reasons that rewrites are so difficult. The knowledge embedded in the existing code can be difficult to extract by the developers doing the rewrite. If an organization is good at looking after the existing source code – maintaining their capital assets – then extracting that knowledge is easier.

Mocking and Unit Tests

In the Ruby on Rails and Phoenix ecosystems, there is an explicit choice to use a test database for unit tests, along with fixtures to populate that test database with data for the tests. This all works seamlessly with the migrations for the schema and any seeds that are needed for the database. While both have slightly different strategies for isolating the tests, the tests get decent coverage and will fail if the SQL sent to the database is invalid.

Looking at Java Spring codebases however, rather than having an explicit test database, the practice seems to be to completely mock out the interaction with the database. The end result is often that the tests set expectations as to the exact SQL that is emitted by the Data Access Object (Dao). Effectively the tests are asserting against the implementation of the method, not the results of the method, so query optimizations can end up breaking the unit tests.

This mocking to oblivion approach does allow the tests to run faster, but it does nothing to validate that the SQL (or HQL if using Hibernate) will return the correct dataset when run against a real database.

The case is often made that if the database is involved, then it is not a unit test, it is an integration test. My experience with Ruby on Rails and Phoenix was that even with touching the database, the tests ran fast enough, Phoenix easily achieving 100 tests/second on an average laptop.
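
To make the contrast concrete, here is a minimal sketch of a result-based test in Python, with an in-memory sqlite3 database standing in for the test database. The UserDao class, the users table and the fixture rows are all hypothetical, invented for this illustration.

```python
import sqlite3
import unittest


class UserDao:
    """Hypothetical data access object; the names are illustrative only."""

    def __init__(self, conn):
        self.conn = conn

    def active_user_names(self):
        rows = self.conn.execute(
            "SELECT name FROM users WHERE active = 1 ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]


class UserDaoTest(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test stands in for the test database.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)"
        )
        # Fixture rows for this test case.
        self.conn.executemany(
            "INSERT INTO users (name, active) VALUES (?, ?)",
            [("alice", 1), ("bob", 0), ("carol", 1)],
        )

    def tearDown(self):
        self.conn.close()

    def test_returns_only_active_users(self):
        # Asserts on the result, not on the SQL text, so the query can be
        # rewritten or optimized without breaking the test.
        self.assertEqual(
            UserDao(self.conn).active_user_names(), ["alice", "carol"]
        )
```

Because the SQL really executes, a typo in the query or a wrong WHERE clause fails the test, which a mock that only compares SQL strings cannot detect.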

ChatGPT - do not fall for the hype

Dan McQuillan is calling it a bullshit generator

Despite the impressive technical ju-jitsu of transformer models and the billions of parameters they learn, it’s still a computational guessing game. ChatGPT is, in technical terms, a ‘bullshit generator’. If a generated sentence makes sense to you, the reader, it means the mathematical model has made sufficiently good guess to pass your sense-making filter. The language model has no idea what it’s talking about because it has no idea about anything at all. It’s more of a bullshitter than the most egregious egoist you’ll ever meet, producing baseless assertions with unfailing confidence because that’s what it’s designed to do.

The Register is even less complimentary, calling it a confidence trickster

Do enough talking to the bot about subjects you know, and curiosity soon deepens to unease. That feeling of talking with someone whose confidence far exceeds their competence grows until ChatGPT’s true nature shines out. It’s a Dunning-Kruger effect knowledge simulator par excellence. It doesn’t know what it’s talking about, and it doesn’t care because we haven’t learned how to do that bit yet.

As is apparent to anyone who has hung out with humans, Dunning Kruger is exceedingly dangerous and exceedingly common. Our companies, our religions and our politics offer limitless possibilities to people with DK. If you can persuade people you’re right, they’re very unwilling to accept proof otherwise, and up you go. Old Etonians, populist politicians and Valley tech bros rely on this, with results we are all too familiar with. ChatGPT is Dunning-Kruger As-a-Service (DKaaS). That’s dangerous.

And a fun take on AI replacing software developers by Dawson Eliasen

All this talk I’m seeing about AI being close to replacing programmers indicates there’s a significant gap between what people think programming is like and what programming is actually like. I get the sense that most people who don’t work in tech think that programming is like sitting down in front of a computer, saying to yourself, “alrighty, let’s make an app,” and expertly busting out code until you have a fresh app. It’s more like getting onboarded into an organization that has hundreds of thousands of lines of archaic, institutional code, and being tasked with finding and fixing the 1-10 lines that happen to be somehow causing the most urgent bug, and then doing this over and over.

Architectural Consistency - How hard can it be?

Working with Rails and Phoenix has made me more conscious of a related set of software architecture mistakes I am seeing in other systems…

  • How hard can it be to have consistent table names in a relational database? Preferably have plural table names, or if you prefer have singular names, but please do not have a mismatch with some table names plural and others singular.
  • How hard can it be to have a consistent name for the primary key of a table, something simple like id?
  • How hard can it be to have a consistent name for the foreign keys in a table, something like other_table_id? (Please remember when doing this that the foreign key is to a specific row so the other_table will be singular)
  • How hard can it be to have consistent convention for naming the common CRUD methods to access records in the database?
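
A small sketch of what the first three conventions look like in practice, using Python's sqlite3 module. The authors and books tables and the two helper functions are hypothetical examples, not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Plural table names, primary key always called id,
    -- foreign key named after the singular of the referenced table.
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title     TEXT NOT NULL
    );
""")


def primary_key(table):
    """Return the primary key column names of a table."""
    # PRAGMA table_info rows: cid, name, type, notnull, dflt_value, pk
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows if row[5]]


def foreign_keys(table):
    """Return (column, referenced_table, referenced_column) per foreign key."""
    # PRAGMA foreign_key_list rows: id, seq, table, from, to, ...
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return [(row[3], row[2], row[4]) for row in rows]


print(primary_key("books"))   # ['id']
print(foreign_keys("books"))  # [('author_id', 'authors', 'id')]
```

With the convention in place, a check like this can run over every table in the schema, and a developer can guess a join condition without opening the schema at all.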

Over 20 years ago we were talking about Software Patterns and Coding Idioms, but many systems I have looked at in the past 10 years have got some or all of the above items wrong, leading to interesting defects and confused developers.

Python Requests library and SSL

When using the simple requests calls, a separate SSL session is negotiated for each request.

>>> import requests
>>> resp = requests.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.344494 200
>>> resp = requests.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.320507 200
>>>

To avoid this, and to get faster subsequent response times, you can create a session object that will then keep the connection alive (but the default keep-alive for Apache is only 5 seconds)

>>> s = requests.Session()
>>> resp = s.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.333475 200
>>> resp = s.get('https://improvingwetware.com/favicon.ico') ;  print(resp.elapsed, resp.status_code)
0:00:00.105773 200
>>>

So by using the Session capabilities, subsequent https requests in a test suite can save the SSL negotiation cost, which on my laptop down to this blog server is of the order of 200ms. Not a massive saving, but when a test suite needs to hit an API 100 times to test out the various options, the difference is noticeable.

Prioritization and Cycle Time

Many of the agile processes make a nod towards cycle time, which is typically associated with the time taken from when developers start working on an item and the time when the item is released to production. While this is a useful measure for finding out how long on average developers take to complete work, it is not the full picture.

Defining Cycle Time as just the time when developers are working makes it seem that the time is of the order of a week or so. Admittedly I have seen scrum teams on a two week sprint cycle take multiple sprints to complete items, so even on a simplified cycle time measure many teams are slow.

The full cycle time as experienced by the user, however, runs from when an item is first recognized through being understood, written up into the backlog, queued for prioritization, queued for development, developed and released. Although it is hard to get a good handle on the first few stages, I commonly see open JIRA tickets that are 12 months or older, and that is in the unprioritized backlog. From a user viewpoint this is abysmal: sending in a request and not hearing back for a long time.

The prioritization needs to be done in a way that allows for fast feedback to the users. One way of doing this is to adjust the workflow so that there is a two stage writeup, and the initial request can then be simplified and routed for evaluation as soon as it is created. This initial prioritization puts the requests into one of four buckets.

  • Do Now, team parks whatever it is working on and works on the item. Typically only needed for items that have a major, imminent business impact.
  • Do Next, team will pick up this item next after finishing the current work items.
  • Do Later, item gets queued into the backlog for consideration later.
  • Decline, item gets marked as declined, not a part of the backlog.

An immediate Decline is useful as it allows the user to either make a better case for the item or know that it is not going to change so the user needs to accept the current process and/or come up with a workaround.

The Do Later items need to be worked on by the user, business domain experts, analysts and managers to better understand the item, the value and associated costs so that the relative importance of the items in the backlog can be assessed. As part of this there has to be an agreed maximum age of items in the backlog to keep the size manageable. Items that get too old need to be Declined, as they are lower value than all the other items that could be worked on.

The top items in the backlog are effectively in the Do Next category, so if no new high priority items have been put into the Do Next category, then the team can work on the top priority from the Do Later backlog.

The size of the Do Next category has to be managed by the user, business domain experts, analysts and managers group, so that it only contains items that are truly higher value than the top items in the Do Later backlog. The size of the team that is working sets the limit on the number of items in the Do Next category. It has to be small so that the initial evaluation can be immediate; a good starting point is five, or one less than the team size when the team has fewer than five people.
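
The four buckets and the two limits described above can be sketched in a few lines of Python. The function names, the backlog shape (title and creation date pairs) and the 180 day maximum age in the usage note are assumptions for illustration only; the agreed maximum age is whatever the group decides.

```python
from datetime import date, timedelta
from enum import Enum


class Bucket(Enum):
    DO_NOW = "Do Now"
    DO_NEXT = "Do Next"
    DO_LATER = "Do Later"
    DECLINE = "Decline"


def do_next_limit(team_size):
    """Five, or one less than the team size for teams smaller than five."""
    return 5 if team_size >= 5 else max(team_size - 1, 0)


def age_out(backlog, today, max_age):
    """Re-bucket Do Later items older than the agreed maximum age as Decline.

    backlog is a list of (title, created_on) pairs.
    Returns a list of (title, Bucket) pairs in the same order.
    """
    return [
        (title, Bucket.DECLINE if today - created_on > max_age else Bucket.DO_LATER)
        for title, created_on in backlog
    ]
```

For example, do_next_limit(3) gives a limit of two items, and age_out(backlog, date.today(), timedelta(days=180)) declines anything that has sat in the Do Later backlog for more than roughly six months.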

More Dan North and CUPID

Finally Dan has got around to creating the Cupid.dev website

I believe that there are properties or characteristics of software that make it a joy to work with. The more your code has these qualities, the more joyful it is to work with; but everything is a tradeoff so you should always consider your context.

There are likely many of these properties, overlapping and interrelating, and there are many ways to describe them. I have chosen five that underpin much of what I care about in code. There is a diminishing return; five are enough to make a handy acronym, and few enough to remember.

Just for reference, CUPID stands for

  • Composable: plays well with others,
  • Unix philosophy: does one thing well,
  • Predictable: does what you expect,
  • Idiomatic: feels natural,
  • Domain-based: the code models the problem domain in language and structure.

Simon Wardley's Blah Template

An amusing take on the generic corporate strategy templates that companies seem to use, just replace Blah with an appropriate buzzword…

        Our strategy is [Blah]. We will lead
    a [Blah] effort of the market through our use
   of [Blah] and [Blah] to build a [Blah]. By being
   both [Blah] and [Blah], our [Blah] approach will
       drive [Blah] throughout the organisation.
     Synergies between our [Blah] and [Blah] will
          enable us to capture the upside by
       becoming [Blah] in a [Blah] world. These
      transformations combined with [Blah] due to
                our [Blah] will create
          a [Blah] through [Blah] and [Blah].

Resulting in a strategy like

Our strategy is customer focused. We will lead a disruptive effort of the market through our use of innovative social media and big data to build a collaborative cloud based ecosystem. By being both digital first and agile, our open approach will drive efficiency throughout the organisation…

From Simon Wardley’s “Why the fuss about Serverless?”

Choosing what to work on at the start of a project

This was prompted by watching a presentation by Chris Matts, who pointed out that he often sees teams that deliver a Login screen as the first part of a project. A common pattern he observed was teams delivering the easy parts first and only then tackling the harder parts of the project. The unfortunate outcome of this was that the teams backloaded the project risks, which made the eventual delivery date hard to predict.

His suggestion was to undertake the high risk work first, so as the project moves on the easier stuff moves to the front of the queue and hence the end date becomes more predictable as time progresses. Unstated but part of this is that the team gets better at delivery as the project progresses, so the easier work can be done faster as less learning is needed.

Similar thoughts on the same problem

  • Alistair Cockburn’s Walking Skeleton as presented in Crystal Clear
  • The Executable Architecture idea from the early versions of the Unified Process
  • The Spike Solution idea from eXtreme Programming

Related to this is the idea from Agile of learning from the delivered system. The goal is to learn and respond to that learning as early as possible, so the key idea is to manage the risks by choosing to work on the high risk (business or technical) items first.

This is different from the idea of doing the high value items first, mainly because the value of a feature is a guess by the business, but the risk associated with a feature can be assessed. The team knows whether they have built something like the feature before and know how stable the technology is. The business owners know if they have a really clear idea as to how the feature will work, and how the feature will be accepted in the marketplace. All features that score highly on uncertainty need to be scheduled early, so that if the idea does not work, the project can steer towards a different outcome.

Affordances in programming languages

Embedded programming is often done in languages like C because it makes it easy to do bit level operations. As a result, many low level protocols for talking to hardware have partial byte data as per https://www.rfc-editor.org/rfc/rfc791.txt, which draws out an internet header as shown below, with several fields not taking a full byte.

 0                   1                   2                   3   
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The obvious thing to do in C is to define a struct using bit fields:

struct {
    unsigned int Version : 4;
    unsigned int IHL : 4;
    unsigned int Type_of_Service : 8;
    unsigned int Total_Length : 16;
} Header;

Other languages such as Python lack this construct, so often a programmer might resort to masking and shifting bits to extract the values

>>> Version = (first_byte & 0b11110000) >> 4
>>> IHL = first_byte & 0b00001111

But in this case the affordance of the language leads us astray. Yes, it is easy to do the masking and shifting, but it is much better to use a library that makes the intention of the code simpler to understand

>>> import bitstring
>>> bits = bitstring.Bits(bytes=header_bytes)
>>> version, ihl, tos, t_len = bits.unpack("uint:4, uint:4, uint:8,uint:16")

Implications of Shift-Left for Quality Assurance

Most development teams that use Continuous Integration and Continuous Delivery pipelines are starting to adopt the Shift-Left approach of making sure that the development team build appropriate Unit, Component and Integration tests into the pipeline. The obvious implication is that the traditional Quality Assurance team – from a testing standpoint – only has to deal with System testing.

But in the spirit of Shift-Left, the Quality Assurance team should be looking to see how to validate things earlier as well.

  • With appropriate Unit, Component and Integration tests, the traditional refrain that the code is working as designed becomes more accurate, so the QA team can no longer write End-to-End tests and consider the job complete. QA instead has to poke at the corners between the designs of the individual subsystems and look at the interactions between those subsystems.
  • The problem of working as designed becomes one of QA validating that the design is indeed correct, basically shifting the validation to before the code is written as a logical extension of the Shift-Left mindset.
  • Similarly, QA has to look at the Requirements and the way that they are elicited and documented to move the validation process further left.

A challenge with this is that the Agilista mindset is that it is only working code that matters. So having validation before the code is written could lead to the suggestion that the QA team is falling back into the old ways and encouraging analysis paralysis.

The resolution of this is to look at the overall workflow from a Kanban perspective and see how long it takes for an idea that comes up in Requirements Elicitation to make it into the Development Queue and hence into the hands of the Users. For most organizations, once the Requirement has hit the front of the backlog queue, the process to deliver is relatively short, typically less than 2 weeks for a simple User Story, but can be of the order of a month or more for an Epic containing multiple User Stories.

So the QA team does have a window of opportunity for validating the Requirements – the gap between Elicitation and Documentation and when the Stories get into the development queue. Validating the Design is harder, since most agile teams tend to do that on the fly…

Automating system tests for Defects

An interesting problem comes up when exploratory testing finds a defect in an application. If the defect cannot be fixed immediately, the fix is going to be assigned to the backlog and eventually scheduled to be worked on by the development team. The challenge now is what to put into the automated test: should the test fail or should it pass?

  • A failing test is the obvious choice, since the test should specify correct behavior, and when the developers get around to making the necessary changes, the test will just pass without any extra work required by the QA team.
  • Under PyTest there is the option to mark a test as XFail, which reports the count of tests that failed as expected (XFAIL) and the count of tests that were expected to fail but passed (XPASS)
  • A passing test that encodes the current behavior of the code as correct is another option for the test. The passing test is a Change Detector in that the test will fail when the defect is fixed.

Of the three options, a failing test will mean that every time the test suite runs, one or more tests will fail and there is some overhead of deciding which of the failures are expected and which are not expected. The xfail test has a similar problem, but without the benefit of a stack trace, so if the actual behavior changes, but is not fully fixed, then the test may still report as an XFAIL, so there is still some overhead of checking the test suite result.

A passing test is cheap to evaluate. If all tests pass, then nothing has changed and there is nothing to investigate. If however a test fails, then the normal process of investigating the error starts – and if the passing test had a good error explanation referring to the defect, then the investigation should be short. It is then a short process to amend the test case to reflect the new correct behavior, after which the entire test suite should pass. Some exploratory testing is still necessary to make sure that the fix did not introduce any other weird behavior, but the process of getting the test suite passing again should be trivial.
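
A minimal sketch of such a Change Detector test, written as a plain assert-style test function of the kind PyTest collects. The slugify function, its defect and the ticket id DEFECT-123 are all made up for this illustration.

```python
def slugify(title):
    """Hypothetical function under test.

    Known defect (made-up ticket DEFECT-123): runs of spaces are not
    collapsed, so a double space produces a double hyphen.
    """
    return title.strip().lower().replace(" ", "-")


def test_slugify_double_space_change_detector():
    # Encodes the CURRENT (defective) behavior as the expected value.
    # The correct result would be "hello-world"; when DEFECT-123 is fixed
    # this test fails and the message says what to do about it.
    actual = slugify("Hello  World")
    assert actual == "hello--world", (
        "slugify output changed; if DEFECT-123 (double spaces) was fixed, "
        "update this test to expect 'hello-world'"
    )
```

The assertion message is the important part: it turns the eventual failure into a short, guided investigation instead of a surprise.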

Automating System Tests

Rails and Phoenix introduced the idea of testing to web development (see the Rails Guide and the Hexdocs Testing Guide). Their main focus is on Unit Testing, Testing Views and Testing Controllers, while not saying much about System testing. They both make a nod towards Integration testing, but the focus is more on what developers should know about testing rather than a full system test.

With Unit Tests, the test case setup and teardown is relatively minimal, so often a unit test will have a single assertion – basically a unit test is testing just a single thing. This is appropriate for a Unit Test, since the execution time for the test suite is relatively insensitive to the number of tests when the testing framework can execute a suite of 1,000 unit tests in a few seconds.

For Controller and View tests, both Rails and Phoenix have a nice way of mocking out the webserver interaction so that the behavior of the Controller that responds to the GET/POST/PUT/DELETE requests from the browser can be tested without needing the full stack. Both include the idea of a Test Database for use by the test suite that gets populated by fixtures that work in conjunction with the test cases to provide appropriate data for the tests. Typically the data is set up before each test case and then cleaned up after each test case so that the failure of one test case does not impact the other test cases. Typically these tests run slower than Unit tests, but a reasonable suite of several hundred tests can run in less than 10 seconds.

Both Unit and the Controller/View tests tend to be relatively simple with few assertions. System Testing is different. For a start the setup and teardown time for these tests can be significant, especially if you have a microservice architecture and the test case covers the interactions between multiple services. So system tests have to pay back the larger setup and teardown time by doing more work inside each system test case which means

  • the scenario for a system test case has to be more of a Soap Opera
  • there should be many more assertions about the steps along the scenario so that if there is an error, the source of the error can be found quickly
  • the scenarios should follow multiple alternate paths through the system