Detecting a hung Windows process by Matt Wrock

A couple of years ago, I was adding remoting capabilities to boxstarter so that one could set up any machine on their network and not just their local environment. As part of this effort, I ran all MSI installs via a scheduled task because some installers (mainly from Microsoft) would pull down bits via Windows Update, and that always fails when run via a remote session. Sure, the vast majority of MSI installers do not need a scheduled task, but it was easier to run everything through a scheduled task than to maintain a whitelist.

I have run across a couple of installers that were not friendly to being run silently in the background. They would throw up a dialog prompting the user for input and then hang forever until I forcibly killed them. I wanted a way to automatically tell when "nothing" was happening, as opposed to an installer actually chugging away doing stuff.

It's all in the memory usage

This has to be more than a simple timeout-based solution because there are legitimate installs that can take hours. One that jumps to mind rhymes with shequel shurver. We have to be able to tell if the process is actually doing anything without the luxury of human eyes staring at a progress bar. The solution I came up with, and am going to demonstrate here, has performed very reliably for me and uses a couple of memory counters to track when a process is truly idle for long periods of time.

Before I go into further detail, let's look at the code that polls the current memory usage of a process:

function Get-ChildProcessMemoryUsage {
    param(
        $ID=$PID,
        [int]$res=0
    )
    Get-WmiObject -Class Win32_Process -Filter "ParentProcessID=$ID" | % { 
        if($_.ProcessID -ne $null) {
            $proc = Get-Process -ID $_.ProcessID -ErrorAction SilentlyContinue
            # the process may have exited between the WMI query and now
            if($proc -ne $null) {
                $res += $proc.PrivateMemorySize + $proc.WorkingSet
            }
            # recurse into grandchildren, carrying the running total forward
            $res = Get-ChildProcessMemoryUsage $_.ProcessID $res
        }
    }
    $res
}

Private Memory and Working Set

You will note that the memory usage "snapshot" I take is the sum of PrivateMemorySize and WorkingSet. Such a sum on its own really makes no sense, so let's see what it means. Private memory is the number of bytes committed to the process for its own exclusive use, and WorkingSet is the number of bytes currently resident in physical memory rather than paged to disk. The two overlap considerably, so why add them?

I don't really care at all about determining how much memory the process is consuming in a single point in time. What I do care about is how much memory is ACTIVE relative to a point in time just before or after each of these snapshots. The idea is that if these numbers are changing, things are happening. I tried JUST looking at private memory and JUST looking at working set and I got lots of false positive hangs - the counts would remain the same over fairly long periods of time but the install was progressing.

So after experimenting with several different memory counters (there are a few others), I found that if I looked at BOTH of these counts together, I could more reliably use them to detect when a process is "stuck."
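As a quick sanity check, you can read both counters for any process with Get-Process. Here is a minimal sketch inspecting the current PowerShell process; these are the same two properties summed in Get-ChildProcessMemoryUsage above:

```powershell
# Inspect the two counters (in bytes) for the current PowerShell process.
# PrivateMemorySize is memory committed exclusively to the process;
# WorkingSet is the portion currently resident in physical memory.
$p = Get-Process -Id $PID
"Private: $($p.PrivateMemorySize)  WorkingSet: $($p.WorkingSet)"
```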

Watching the entire tree

You will notice in the above code that the Get-ChildProcessMemoryUsage function recursively tallies up the memory usage counts for the process in question and all child processes and their children, etc. This is because the initial installer process tracked by my program often launches one or more subprocesses that do various bits of work. If I only watch the initial root process, I will get false hang detections again because that process may do nothing for long periods of time as it waits on its child processes.

Measuring the gaps between changes

So we have seen how to get individual snapshots of the memory used by a tree of processes. As stated before, these are only useful in relation to other snapshots. If the snapshots fluctuate frequently, then we believe things are happening and we should wait. However, if we get a long run where nothing changes, we have reason to believe we are stuck. The longer this run, the more likely our stuckness is a true hang.

Here are three consecutive snapshots in which the memory counts of the processes do not change:

VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 0
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 1
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 2

These are 3 snapshots of the child processes of the SQL Server 2012 installer on an Azure instance being installed by Boxstarter via PowerShell remoting. The memory usage is the same for all processes, so if this persists, we are likely in a hung state.

So what is long enough?

Good question!

Having played with and monitored this for quite some time, I have come up with 120 seconds as my threshold - that's 2 minutes for those not paying attention. I think quite often that number can be smaller, but I am willing to err conservatively here. Here is the code that looks for a run of inactivity:

function Test-TaskTimeout($waitProc, $idleTimeout) {
    if($memUsageStack -eq $null){
        $script:memUsageStack=New-Object -TypeName System.Collections.Stack
    }
    if($idleTimeout -gt 0){
        $lastMemUsageCount=Get-ChildProcessMemoryUsage $waitProc.ID
        $memUsageStack.Push($lastMemUsageCount)
        # a zero read or any differing entry in the stack means activity - reset
        if($lastMemUsageCount -eq 0 -or (($memUsageStack.ToArray() | ? { $_ -ne $lastMemUsageCount }) -ne $null)){
            $memUsageStack.Clear()
        }
        if($memUsageStack.Count -gt $idleTimeout){
            # KillTree (defined elsewhere) kills the process and all its children;
            # $command is the script-scoped command line the caller is running
            KillTree $waitProc.ID
            throw "TASK:`r`n$command`r`n`r`nIs likely in a hung state."
        }
    }
    Start-Sleep -Seconds 1
}

This creates a stack of these memory snapshots. If the snapshot just captured is identical to the last one recorded, we add it to the stack. We keep doing this until one of two things happens:

  1. A snapshot is captured that varies from the last one recorded in the stack. At this point we clear the stack and continue.
  2. The number of snapshots in the stack exceeds our limit. Here we throw an error - we believe we are hung.
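The stack mechanics are easy to miss at a glance, so here is the same logic isolated with hypothetical memory readings and an artificially small threshold of 3 (the real code uses the 120 second threshold discussed above):

```powershell
# Push each reading; clear the stack when any entry differs from the latest
# reading; declare a hang once the run of identical readings exceeds the limit.
$stack = New-Object -TypeName System.Collections.Stack
$hung = $false
foreach($reading in 100,100,100,100,100) {
    $stack.Push($reading)
    if(($stack.ToArray() | ? { $_ -ne $reading }) -ne $null) {
        $stack.Clear()   # activity detected - start counting over
    }
    if($stack.Count -gt 3) { $hung = $true }
}
$hung   # five identical readings in a row exceed the threshold
```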

I hope you have found this information to be helpful and I hope that none of your processes become hung!

Creating windows base images using Packer and Boxstarter by Matt Wrock

I've written a couple of posts on how to create vagrant boxes for Windows and how to get the base image as small as possible. These posts explain the multiple steps of preparing the base box and then the process of converting that box to the appropriate format for your preferred hypervisor. This can take hours. I cringe whenever I have to refresh my vagrant boxes. There is considerable time involved in each step: downloading the initial ISO from Microsoft, installing the Windows OS, installing updates, cleaning up the image, converting the image to any alternate hypervisor formats (I always make both VirtualBox and Hyper-V images), compacting the image to the vagrant .box file, and testing and uploading the images.

Even if all goes smoothly, the entire process can take around 8 hours. There is a lot of babysitting done along the way. Often there are small hiccups that require starting over or backtracking.

This post shows a way that ends much of this agony and adds considerable predictability to a successful outcome. The process can still be lengthy, but there is no babysitting. Type a command, go to sleep, and you wake up with new Windows images ready for consumption. This post is a culmination of my previous posts on this topic but on steroids...wait...I mean a raw foods diet - it's 100% automated and repeatable.

High level tool chain overview

Here is a synopsis of what this post will walk through:

  • Downloading free evaluation ISOs (180 day) of Windows
  • Using a Packer template to load the ISO in a VirtualBox VM and customize it using a Windows Autounattend.xml file
  • Optimizing the image with Boxstarter, installing all Windows updates, and shrinking it as much as possible
  • Outputting a vagrant .box file for creating new VirtualBox VMs with this image
  • Sharing the box with others using Atlas.Hashicorp.com

tl;dr

If you want to quickly jump into things and forgo the rest of this post, or just read it later, just read the bit about installing packer or go right to their download page, and then clone my packer-templates github repo. This has the packer template and all the other artifacts needed to build a Windows 2012 R2 VirtualBox based Vagrant box. You can study this template and its supporting files to discover all that is involved.

Want Hyper-V?

For the 7 other folks out there who use Hyper-V to consume their devops tool artifacts, I recently blogged how to create a Hyper-V vagrant .box file from a VirtualBox hard disk (.vdi or .vmdk). This post will focus on creating the VirtualBox box file, but you can read my earlier post to easily turn it into a Hyper-V box.

Say hi to Packer

Hi Packer.

In short, Packer is a tool that assists in the automation of machine images. Many are familiar with Vagrant, made by the same outfit - Hashicorp - which has become an extremely popular platform for creating VMs. Making VMs easy to spin up and easy to share has been huge. But there has remained a somewhat uncharted frontier - automating the creation of the images that make up the VM.

So much of today's automation tooling focuses on bringing a machine from a bare OS to a known desired state. However, there is still the challenge of obtaining a bare OS, and perhaps one that is not so "bare". Rather than spending so many cycles building the base OS image up to the desired state on each VM creation, why not just do this once and bake it into the base image?...oh wait...we used to do that and it sucked.

We created our "golden" images. We bowed down before them and worshiped at the altar of the instant environment. And then we forgot how we built the environment and we wept.

Just like Vagrant makes building a VM a source controlled artifact, Packer does the same for VM templates, and like Vagrant, it's built on a plugin architecture allowing for the creation of just about all the common formats, including containers.

Installing Packer

Installing packer is super simple. It's just a collection of .exe files on Windows, and regardless of platform, it's just an archive that needs to be extracted; then you simply add the extraction target to your path. If you are on Windows, do yourself a favor and install it using Chocolatey.

choco install packer -y

That's it. There should be nothing else needed. Since release 0.8.1 (the current release at the time of this post), Packer has everything needed to build Windows images, with no need for SSH.

Packer Templates

At the core of creating images with Packer is the packer template file. This is a single JSON file that is usually rather small. It orchestrates the entire image creation process, which has three primary components:

  • Builders - a set of pluggable components that create the initial base image. Often this is just the bare installation media booted to a machine.
  • Provisioners - these plugins can attach to the built, minimal image above and bring it forward to a desired state.
  • Post-Processors - components that take the provisioned image and usually bring it to its final usable artifact. We will be using the vagrant post-processor here, which converts the image to a vagrant .box file.

Here is the template I am using to build my windows vagrant box:

{
  "builders": [{
    "type": "virtualbox-iso",
    "vboxmanage": [
      [ "modifyvm", "{{.Name}}", "--natpf1", "winrm,tcp,,55985,,5985" ],
      [ "modifyvm", "{{.Name}}", "--memory", "2048" ],
      [ "modifyvm", "{{.Name}}", "--cpus", "2" ]
    ],
    "guest_os_type": "Windows2012_64",
    "iso_url": "iso/9600.17050.WINBLUE_REFRESH.140317-1640_X64FRE_SERVER_EVAL_EN-US-IR3_SSS_X64FREE_EN-US_DV9.ISO",
    "iso_checksum": "5b5e08c490ad16b59b1d9fab0def883a",
    "iso_checksum_type": "md5",
    "communicator": "winrm",
    "winrm_username": "vagrant",
    "winrm_password": "vagrant",
    "winrm_port": "55985",
    "winrm_timeout": "5h",
    "guest_additions_mode": "disable",
    "shutdown_command": "C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown",
    "shutdown_timeout": "15m",
    "floppy_files": [
      "answer_files/2012_r2/Autounattend.xml",
      "scripts/postunattend.xml",
      "scripts/boxstarter.ps1",
      "scripts/package.ps1"
    ]
  }],
  "post-processors": [
    {
      "type": "vagrant",
      "keep_input_artifact": true,
      "output": "windows2012r2min-{{.Provider}}.box",
      "vagrantfile_template": "vagrantfile-windows.template"
    }
  ]
}

As stated in the tl;dr above, you can view the entire repository here.

Let's walk through this.

First you will notice the template includes a builder and a post-processor as mentioned above. It does not use a provisioner. Those are optional, and we'll be doing most of our "provisioning" in the builder. I'll explain why later. Note that one can include multiple builders, provisioners, and post-processors. This one is pretty simple, but you can find many more complex examples online.

  • type - This uses the virtualbox-iso builder, which takes an ISO install file and produces VirtualBox .ovf and .vmdk files (see the packer documentation for information on the other built-in builders).
  • vboxmanage - config sent directly to Virtualbox:
    • natpf1 - very helpful if building on a windows host where the winrm ports are already active. Allows you to define port forwarding to the winrm port.
    • memory and cpus - these settings will speed things up a bit.
  • iso_url: The URL of the iso file to load. This is the windows install media. I downloaded an eval version of server 2012 R2 (discussed below). This can be an http url pointing to the iso online, an absolute file path, or a file path relative to the current directory. I keep the iso file in an iso directory, but because it is so large I have added it to my .gitignore.
  • iso_checksum and iso_checksum_type: These serve to validate the iso file contents. More on how to get these values later.
  • winrm values: as of version 0.8.0, packer comes with a winrm communicator. This means there is no need to install an SSH server. I use the vagrant username and password because this will be a vagrant box and those are the default credentials used by vagrant. Note that above in the vboxmanage settings I forward the guest's port 5985 to 55985 on the host. 5985 is the default http winrm port, so I need to specify 55985 as the winrm port I am using. The reason I am using a non-default port is that the host I am using has winrm enabled and listening on 5985; if you are not on a windows host, you can probably just use the default port, but that would conflict on my host. I specify a 5 hour timeout for winrm. This is the maximum amount of time that packer will wait for winrm to become available. This is very important and I will discuss why later.
  • guest_additions_mode - By default, the virtualbox-iso builder will upload the latest virtualbox guest additions to the box. For my purposes I do not need this; it just adds extra time, takes more space, and I have also had intermittent errors while the file is uploaded, which hoses the entire build.
  • shutdown_command: This is the command used to shut down the machine. Different operating systems may require different commands. I am invoking sysprep, which shuts down the machine when it completes. Sysprep, when called with /generalize as I am doing here, strips the machine of security identifiers, machine name, and other elements that make it unique. This is particularly useful if you plan to use the image in an environment where many machines may be provisioned from this template and need to interact with one another. Without this, all machines would have the same name and the same user SIDs, which could cause problems, especially in domain scenarios.
  • floppy_files: an array of files to be added to a floppy drive and made accessible to the machine. Here these include an answer file and other files to be used throughout the image preparation.

Obtaining evaluation copies of windows

Many do not know that you can obtain free and fully functioning copies of all the latest versions of Windows from Microsoft. These are "evaluation" copies and only last 180 days. However, if you only plan to use the images for testing purposes like I do, that should be just fine.

You can get these from https://msdn.microsoft.com/en-us/evalcenter.aspx. You will just need to regenerate the image at least every 180 days. You are not bound to purchase after 180 days; you can simply download a new ISO.

Finding the iso checksums

As shown above, you will need to provide the checksum and checksum type for the iso file you are using. You can do this with a utility called fciv.exe, which you can install from chocolatey:

choco install fciv -y

Now you can call fciv, passing it the path to any file, and it will produce the checksum of that file - in md5 format by default, though you can specify a different format if desired.
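If you would rather skip fciv, PowerShell 4 and later ship a built-in Get-FileHash cmdlet that does the same job. Here is a quick sketch against a throwaway file - the demo.txt name is just for illustration; point -Path at your ISO and paste the result into iso_checksum:

```powershell
# Compute an md5 checksum with the built-in cmdlet; packer's iso_checksum
# expects the lowercase hex string this produces.
'demo content' | Set-Content demo.txt
$md5 = (Get-FileHash -Path demo.txt -Algorithm MD5).Hash.ToLower()
$md5
```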

Significance of winrm availability and the winrm_timeout

The basic flow of the virtualbox packer build is as follows:

  1. Boot the machine with the ISO
  2. Wait for winrm to be accessible
  3. As soon as winrm is accessible, run any provisioners (we have none, so this is a no-op)
  4. Run the shutdown command
  5. Once the machine shuts down, run the post-processors

In a perfect world I would have the Windows install process enable winrm early in my setup. This would result in a machine restart and then invoke provisioners that would perform all of my image bootstrap scripts. However, there are problems with that sequence on Windows, and the primary culprit is Windows updates. I want to have all critical updates installed ASAP. However, Windows updates cannot easily be run via a remote session, which is how the provisioners run, and once the updates complete, you will want to reboot the box, which can cause the provisioners to fail.

Instead, I install all updates during the initial boot process. This allows me to freely reboot the machine as many times as I need, since packer is just waiting for winrm and will not touch the box until that is available. So I make sure not to enable winrm until the very end of my bootstrap scripts.

Also, these scripts run locally, so Windows Update installs can be performed without issue.

Bootstrapping the image with an answer file

Since we are feeding virtualbox an install iso, if we did nothing else, it would prompt the user for all of the typical Windows setup options like locale, admin password, disk partition, etc. Obviously this is all meant to be scripted and unattended, and those prompts would just hang the install. This is what answer files are for. Windows uses answer files to automate the setup process. There are all sorts of options one can provide to customize this process.

My answer file is located in my repo here. Note that it is named Autounattend.xml and added to the floppy drive of the booted machine. Windows will load any file named Autounattend.xml found on the floppy drive and use it as the answer file. I am not going to go through every line here, but do know there are additional options beyond what I have specified. I will cover some of the more important parts.

<UserAccounts>
  <AdministratorPassword>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </AdministratorPassword>
  <LocalAccounts>
      <LocalAccount wcm:action="add">
        <Password>
          <Value>vagrant</Value>
          <PlainText>true</PlainText>
        </Password>
        <Group>administrators</Group>
        <DisplayName>Vagrant</DisplayName>
        <Name>vagrant</Name>
        <Description>Vagrant User</Description>
      </LocalAccount>
    </LocalAccounts>
</UserAccounts>
<AutoLogon>
  <Password>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </Password>
  <Enabled>true</Enabled>
  <Username>vagrant</Username>
</AutoLogon>

This creates the administrator user vagrant with the password vagrant. These are the default vagrant credentials, so setting up this admin user will make vagrant box setups easier. This also allows the initial boot to auto logon as this user instead of prompting.

<FirstLogonCommands>
  <SynchronousCommand wcm:action="add">
     <CommandLine>cmd.exe /c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -File a:\boxstarter.ps1</CommandLine>
     <Order>1</Order>
  </SynchronousCommand>
</FirstLogonCommands>

You can specify multiple SynchronousCommand elements containing commands that should be run when the user first logs on. I find it easier to keep this already difficult to read file readable by specifying just one powershell file to run and letting that file orchestrate the entire bootstrapping.

This file boxstarter.ps1 is another file in my scripts directory of the repo that I add to the virtualbox floppy. We will look closely at that file in just a bit.

<settings pass="specialize">
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-ServerManager-SvrMgrNc" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <DoNotOpenServerManagerAtLogon>true</DoNotOpenServerManagerAtLogon>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-IE-ESC" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <IEHardenAdmin>false</IEHardenAdmin>
    <IEHardenUser>false</IEHardenUser>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-OutOfBoxExperience" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
  <DoNotOpenInitialConfigurationTasksAtLogon>true</DoNotOpenInitialConfigurationTasksAtLogon>
  </component>
</settings>

In short, this customizes the image in a way that makes it far less likely to cause you to kill yourself or others while actually using the image. So you are totally gonna want to include this.

This prevents the "server manager" from opening on startup and allows IE to actually open web pages without the need to ceremoniously click thousands of times.

A boxstarter bootstrapper

So given the above answer file, Windows will install and reboot, and then the vagrant user will auto logon. Then a PowerShell session will invoke boxstarter.ps1. Here it is:

$WinlogonPath = "HKLM:\Software\Microsoft\Windows NT\CurrentVersion\Winlogon"
Remove-ItemProperty -Path $WinlogonPath -Name AutoAdminLogon
Remove-ItemProperty -Path $WinlogonPath -Name DefaultUserName

iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/mwrock/boxstarter/master/BuildScripts/bootstrapper.ps1'))
Get-Boxstarter -Force

$secpasswd = ConvertTo-SecureString "vagrant" -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential ("vagrant", $secpasswd)

Import-Module $env:appdata\boxstarter\boxstarter.chocolatey\boxstarter.chocolatey.psd1
Install-BoxstarterPackage -PackageName a:\package.ps1 -Credential $cred

This downloads the latest version of boxstarter, installs the boxstarter powershell modules, and finally installs package.ps1 via a boxstarter install run. You can visit boxstarter.org for more information regarding all the details of what boxstarter does. The key feature is that it can run a powershell script and handle machine reboots by making sure the user is automatically logged back in and that the script (package.ps1 here) is restarted.

Boxstarter also exposes many commands that can tweak the windows UI, enable/disable certain windows options and also install windows updates.

Note the winlogon registry edit at the beginning of the boxstarter bootstrapper. Without this, boxstarter will not turn off the autologon when it completes. This is only necessary when running boxstarter from an auto-logged-on session like this one. Boxstarter takes note of the current autologon settings before it begins and restores them once it finishes. So in this unique case it would restore to an autologon state.

Here is package.ps1, the meat of our bootstrapping:

Enable-RemoteDesktop
Set-NetFirewallRule -Name RemoteDesktop-UserMode-In-TCP -Enabled True

Write-BoxstarterMessage "Removing page file"
$pageFileMemoryKey = "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
Set-ItemProperty -Path $pageFileMemoryKey -Name PagingFiles -Value ""

Update-ExecutionPolicy -Policy Unrestricted

Write-BoxstarterMessage "Removing unused features..."
Remove-WindowsFeature -Name 'Powershell-ISE'
Get-WindowsFeature | 
? { $_.InstallState -eq 'Available' } | 
Uninstall-WindowsFeature -Remove

Install-WindowsUpdate -AcceptEula
if(Test-PendingReboot){ Invoke-Reboot }

Write-BoxstarterMessage "Cleaning SxS..."
Dism.exe /online /Cleanup-Image /StartComponentCleanup /ResetBase

@(
    "$env:localappdata\Nuget",
    "$env:localappdata\temp\*",
    "$env:windir\logs",
    "$env:windir\panther",
    "$env:windir\temp\*",
    "$env:windir\winsxs\manifestcache"
) | % {
        if(Test-Path $_) {
            Write-BoxstarterMessage "Removing $_"
            Takeown /d Y /R /f $_
            Icacls $_ /GRANT:r administrators:F /T /c /q  2>&1 | Out-Null
            Remove-Item $_ -Recurse -Force -ErrorAction SilentlyContinue | Out-Null
        }
    }

Write-BoxstarterMessage "defragging..."
Optimize-Volume -DriveLetter C

Write-BoxstarterMessage "0ing out empty space..."
wget http://download.sysinternals.com/files/SDelete.zip -OutFile sdelete.zip
[System.Reflection.Assembly]::LoadWithPartialName("System.IO.Compression.FileSystem")
[System.IO.Compression.ZipFile]::ExtractToDirectory("sdelete.zip", ".") 
./sdelete.exe /accepteula -z c:

mkdir C:\Windows\Panther\Unattend
copy-item a:\postunattend.xml C:\Windows\Panther\Unattend\unattend.xml

Write-BoxstarterMessage "Recreate pagefile after sysprep"
$System = GWMI Win32_ComputerSystem -EnableAllPrivileges
$System.AutomaticManagedPagefile = $true
$System.Put()

Write-BoxstarterMessage "Setting up winrm"
Set-NetFirewallRule -Name WINRM-HTTP-In-TCP-PUBLIC -RemoteAddress Any
Enable-WSManCredSSP -Force -Role Server

Enable-PSRemoting -Force -SkipNetworkProfileCheck
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

The goal of this script is to fully patch the image and then get it as small as possible. Here is the breakdown:

  • Enable Remote Desktop. We do this with a bit of shame but so be it.
  • Remove the page file. This frees up about a GB of space. At the tail end of the script we turn it back on which means the page file will restore itself the first time this image is run after this entire build.
  • Update the powershell execution policy to unrestricted because who likes restrictions? You can set this to whatever you are comfortable with in your environment, but if you do nothing, powershell can be painful.
  • Remove all windows features that are not enabled. This leverages a new capability in 2012 R2 called Features on Demand and can save considerable space.
  • Install all critical windows updates. There are about 118 at the time of this writing.
  • Restart the machine if reboots are pending and the first time this runs, they will be.
  • Run the DISM cleanup that cleans the WinSxS folder of rollback files for all the installed updates. Again this is new in 2012 R2 and can save quite a bit of space. Warning: it also takes a long time but not nearly as long as the updates themselves.
  • Remove some of the random cruft. This is not a lot but why not get rid of what we can?
  • Defragment the hard drive and 0 out empty space. This will allow the final act of compression to do a much better job compressing the disk.
  • Lastly, and it is important this is last, enable winrm. Remember that once winrm is accessible, packer will run the shutdown command and in our case here that is the sysprep command:
C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown

This will cause a second answer file to fire after the machine next boots. That will not happen until after this image is built, likely just after a "vagrant up" of the image. That file can be found here, and it is much smaller than the initial answer file that drove the Windows install. This second unattend file mainly ensures that the user will not have to reset the admin password at initial startup.

Packaging the vagrant file

So now we have an image, and the builder's job is done. On my machine this all takes just under 5 hours. Your mileage may vary, but make sure that your winrm_timeout is set appropriately. Otherwise, if the timeout is less than the length of the build, the entire build will fail and be forever lost.

The vagrant post-processor is what generates the vagrant box. It takes the artifacts of the builder and packages them up. There are other post-processors available; you can string several together, and you can create your own custom post-processors. You can make your own provisioners and builders too.

One post-processor worth looking at is the atlas post-processor, which you can add after the vagrant post-processor; it will upload your vagrant box to atlas so you can share it with others.

Next steps

I have just gotten to the point where this is all working. So this is rough around the edges but it does work. There are several improvements to be made here:

  • Make a boxstarter provisioner for packer. This could run after the initial os install in a provisioner and AFTER winrm is enabled. I would have to leverage boxstarter's remoting functionality so that it does not fail when interacting with windows updates over a winrm session. One key advantage is that the boxstarter output would bubble up to the console running packer giving much more visibility to what is happening with the build as it is running. As it stands now, all output can only be seen in the virtualbox GUI which will not appear in headless environments like in an Atlas build.
  • As stated at the beginning of this post, I create both virtualbox and Hyper-V images. The Hyper-V conversion could use its own post-processor. Now I simply run this in a separate powershell command.
  • Use variables to make the scripts reusable. I'm just generating a single box now but I will definitely be generating more: client SKUs, server core, nano, etc. Packer allows you to specify variables and better templatize the build.

Thanks Matt Fellows and Dylan Meissner

I started looking into all of this a couple of months ago and then got sidetracked until last week. In my initial investigation in May, packer 0.8.1 was not yet released and there was no winrm support out of the box. Matt and Dylan both contributed to those efforts and also provided really helpful pointers when I was having issues getting things working right.

Creating a Hyper-V Vagrant box from a VirtualBox vmdk or vdi image by Matt Wrock

I personally use Hyper-V as my hypervisor on windows, but I use VirtualBox on my Ubuntu work laptop, so it is convenient for me to create boxes in both formats. However, it is a major headache to do so, especially when preparing Windows images. I either have to run through creating the image twice (once for VirtualBox and again for Hyper-V) or I have to copy multi gigabyte files across my computers and then convert them.

This post will demonstrate how to automate the conversion of a VirtualBox disk image to a Hyper-V compatible VHD and create a Vagrant .box file that can be used to fire up a functional Hyper-V VM. This can be entirely done on a machine without Hyper-V installed or enabled. I will be demonstrating on a Windows box using Powershell scripts but the same can be done on a linux box using bash.

The environment

You cannot have VirtualBox and Hyper-V comfortably coexist on the same machine. I have Hyper-V enabled on my personal windows laptop, so I needed some "bare metal" to install VirtualBox on. Well, as luck would have it, I was able to salvage my daughter's busted old laptop. Its video is hosed, which is just fine for my purposes: I'll be running it as a headless VM host. Here is how I set it up:

  • Repaved with Windows 8.1 with Update 1 Professional, fully patched
  • Enabled RemoteDesktop
  • Installed VirtualBox, Vagrant, and Packer with Chocolatey

Of course I used Boxstarter to set it all up. After all I was not born in a barn.

Creating the VirtualBox image

I am currently working on another post (hopefully due to publish this week) that will go into detail on creating Windows images with Packer and Boxstarter and cover many gory details around unattend.xml files, sysprep, and tips to get the image as small as possible. This post will not cover the creation of the image itself; I'm assuming you are able to create a VirtualBox VM, and that is all you need to get started here. The guest OS does not matter. You just need a .VDI or .VMDK file, which is what is automatically generated when you create a VirtualBox VM.

Converting the virtual hard disk to VHD

This is a simple one liner using VBoxManage.exe, which installs with VirtualBox. You just need the path to the VirtualBox VM's hard disk.

$vboxDisk = Resolve-Path "$baseDir\output-virtualbox-iso\*.vmdk"
$hyperVDir = "$baseDir\hyper-v-output\Virtual Hard Disks"
$hyperVDisk = Join-Path $hyperVDir 'disk.vhd'
$vbox = "$env:programfiles\oracle\VirtualBox\VBoxManage.exe"
.$vbox clonehd $vboxDisk $hyperVDisk --format vhd

This uses the clonehd command and takes the location of the VirtualBox disk and the path of the vhd to be created. The vhd --format is also supplied. The conversion takes a few minutes to complete on my system.

Note that this will likely produce a vhd file that is much larger than the vmdk or vdi image from which the conversion took place. That's OK. My 3.8GB vmdk produced a 9GB vhd. This is simply because the vhd is uncompressed, and we'll take care of that in the last step. It is highly advisable to "0 out" all unused disk space as the last step of the image creation to assist the compression. On windows images, the sysinternals tool sdelete does a good job of that.

Laying out the Hyper-V and Vagrant metadata

So you now have a vhd file that can be sucked into any Hyper-V VM. However, the vhd alone is not enough for Vagrant to produce a Hyper-V VM. There are two key bits we need to create: a Hyper-V xml file that defines the VM metadata and a vagrant metadata.json file, plus an optional Vagrantfile. These all have to be archived in a specific folder structure.

The Hyper-V XML

Prior to Windows 10, Hyper-V stored virtual machine metadata in an XML file. Vagrant expects this XML file and inspects it for several bits of metadata that it uses to create a new VM when you "vagrant up". It's still completely compatible with Windows 10, since Vagrant will simply use the Hyper-V Powershell cmdlets to create a new VM. However, it does mean that exporting a Windows 10 Hyper-V vm will no longer produce this xml file. Instead it produces two binary files in vmcx and vmrs formats that are no longer readable or editable. However, if you happen to have access to an xml vm file exported from 8.1/2012R2 or earlier, you can still use that.

Here I am using just that: an xml file exported from a 2012R2 VM. It's fairly large, so I will not display it in its entirety here, but you can view it on github here. You can edit things like vm name, available ram, switches, etc. This one should be compatible out of the box on any Hyper-V host. The most important thing is to make sure that the file name of the hard disk vhd referenced in this metadata matches the filename of the vhd you produced above. Here is the disk metadata:

<controller0>
  <drive0>
    <iops_limit type="integer">0</iops_limit>
    <iops_reservation type="integer">0</iops_reservation>
    <pathname type="string">C:\dev\vagrant\Win2k12R2\.vagrant\machines\default\hyperv\disk.vhd</pathname>
    <persistent_reservations_supported type="bool">False</persistent_reservations_supported>
    <type type="string">VHD</type>
    <weight type="integer">100</weight>
  </drive0>
  <drive1>
    <pathname type="string"></pathname>
    <type type="string">NONE</type>
  </drive1>
</controller0>

Another thing to take note of here is the "subtype" of the VM. This is known to most as the "generation." Later Hyper-V versions support both generation one and generation two VMs. If you are from the future, there may be more. Typically vagrant boxes will do fine on generation 1, which is also more portable and accessible to older hosts, so I make sure this is specified.

<properties>
  <creation_time type="bytes">j3iz2977zwE=</creation_time>
  <global_id type="string">0AA394FA-7C4A-4070-BA32-773D43B28A68</global_id>
  <highly_available type="bool">False</highly_available>
  <last_powered_off_time type="integer">130599959406707888</last_powered_off_time>
  <last_powered_on_time type="integer">130599954966442599</last_powered_on_time>
  <last_state_change_time type="integer">130599962187797355</last_state_change_time>
  <name type="string">2012R2</name>
  <notes type="string"></notes>
  <subtype type="integer">0</subtype>
  <type_id type="string">Virtual Machines</type_id>
  <version type="integer">1280</version>
</properties>

Again, it's the subtype that is of relevance here: 0 is generation 1. Also, here is where you would change the name of the vm if desired.

If you are curious about exactly how Vagrant parses this file to produce a VM, here is the powershell that does that. I don't think there is any reason why this same xml file would not work for other OSes, but you would probably want to change the vm name. In a bit, I'll show you where this file goes.
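Just to illustrate what that parsing amounts to, here is a ruby sketch that pulls the subtype out the same way. Vagrant's real import code is powershell; this uses the stdlib REXML parser and a trimmed-down stand-in for the full exported file:

```ruby
require "rexml/document"

# A trimmed-down stand-in for the exported vm xml:
xml = <<~XML
  <properties>
    <name type="string">2012R2</name>
    <subtype type="integer">0</subtype>
  </properties>
XML

doc = REXML::Document.new(xml)
subtype = REXML::XPath.first(doc, "//subtype").text.to_i
generation = subtype + 1  # subtype 0 means a generation 1 VM
```

The same XPath-style lookup works against the full 2012R2 export, which is just a bigger version of this document.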

Vagrant metadata.json

This is super simple but 100% required. The packaged vagrant box file needs a metadata.json file in its root. Here are the exact contents of the file:

{
  "provider": "hyperv"
}

Optional Vagrantfile

This is not required but especially for windows boxes it may be helpful to include an embedded Vagrantfile. If you are familiar with vagrant, you know that the Vagrantfile contains all of the config data for your vagrant box. Well if you package a Vagrantfile with the base box as I will show here, the config in this file will be inherited by any Vagrantfile that consumes this box. Here is what I typically add to windows images:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure(2) do |config|
  config.vm.guest = :windows
  config.vm.communicator = "winrm"

  config.vm.provider "virtualbox" do |vb|
    vb.gui = true
    vb.memory = "1024"
  end

  config.vm.provider 'hyperv' do |hv|
    hv.ip_address_timeout = 240
  end
end

First, this is pure ruby. That likely does not matter at all, and those of you unfamiliar with ruby can hopefully grok what this file is specifying, but if you want to go crazy with ifs, whiles, and dos, go right ahead.

I find this file provides a better windows experience across the board by specifying the following:

  • As long as winrm is enabled on the base image, vagrant can talk to it on the right port.
  • VirtualBox will run the vm in a GUI console
  • Hyper-V typically takes more than 2 minutes to become accessible, and this prevents timeouts.

Laying out the files

Here is how all of these files should be structured in the end:
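Based on the paths used above (the "Virtual Hard Disks" folder created during conversion and the hyper-v-output root that gets archived in the next step), the layout looks like this. The name of the xml file itself is arbitrary; it just needs to live under "Virtual Machines":

```
hyper-v-output/
├── Virtual Hard Disks/
│   └── disk.vhd
├── Virtual Machines/
│   └── box.xml
├── metadata.json
└── Vagrantfile
```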

Creating the .box file

In the end vagrant consumes a tar.gz file with a .box extension. We will use 7zip to create this file.

."$env:chocolateyInstall\tools\7za.exe" a -ttar package-hyper-v.tar hyper-v-output\*
."$env:chocolateyInstall\tools\7za.exe" a -tgzip package-hyper-v.box package-hyper-v.tar

If you have chocolatey installed, rejoice! You already have 7zip. If you do not have chocolatey, you should feel bad.

This simply creates the tar archive and then compresses it. It takes about 20 minutes on my machine.

Testing the box

Now copy the final box file to a machine that runs Hyper-V and has vagrant installed.

C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant box add test-box .\package-hyper-v.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'test-box' (v0) for provider:
    box: Unpacking necessary files from: file://C:/dev/vagrant/Win2k12R2/exp/Win-2012R2/package-hyper-v.box
    box: Progress: 100% (Rate: 98.8M/s, Estimated time remaining: --:--:--)
==> box: Successfully added box 'test-box' (v0) for 'hyperv'!
C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant init test-box
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant up
Bringing machine 'default' up with 'hyperv' provider...
==> default: Verifying Hyper-V is enabled...
==> default: Importing a Hyper-V instance
    default: Cloning virtual hard drive...
    default: Creating and registering the VM...
    default: Successfully imported a VM with name: 2012R2Min
==> default: Starting the machine...
==> default: Waiting for the machine to report its IP address...
    default: Timeout: 240 seconds
    default: IP: 192.168.1.11
==> default: Waiting for machine to boot. This may take a few minutes...
    default: WinRM address: 192.168.1.11:5985
    default: WinRM username: vagrant
    default: WinRM transport: plaintext
==> default: Machine booted and ready!
==> default: Preparing SMB shared folders...
    default: You will be asked for the username and password to use for the SMB
    default: folders shortly. Please use the proper username/password of your
    default: Windows account.
    default:
    default: Username: matt
    default: Password (will be hidden):
==> default: Mounting SMB shared folders...
    default: C:/dev/vagrant/Win2k12R2/exp/Win-2012R2 => /vagrant

Enjoy your boxes!

Boxstarter 2.5 Released by Matt Wrock

This week I released Boxstarter 2.5.10. Although there are no new features introduced in this release or really the last several releases, it seemed appropriate to bump the minor version after so many bug fixes, stabilizations and small experience improvement tweaks in v2.4. The focus of this release has been to provide improved isolation between the version of Chocolatey used by boxstarter from the version previously (or subsequently) installed by the user.

In this post I'd like to provide further details on why boxstarter uses its own version of Chocolatey and also share some thoughts on where I see boxstarter heading in the future.

Why is boxstarter running its own version of Chocolatey?

When Chocolatey v0.9.9 was released, it introduced a complete rewrite of chocolatey, transforming it from a set of powershell scripts and functions to a C# based executable. This fundamentally changed the way that boxstarter is able to interact with chocolatey and hook into the package installation sequence. Boxstarter essentially "monkey patches" chocolatey in order to intercept certain package installation commands. It needs to do this for a few reasons, the most important being the ability to detect pending reboots prior to any install operation and reboot if necessary.
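Boxstarter does this in powershell, but the interception pattern itself is worth a sketch. Here is a minimal ruby illustration of wrapping an install command so a precondition check runs first; the class and method names are invented and are not boxstarter's actual API:

```ruby
# The guard module wraps install: check first, then delegate via super.
module RebootGuard
  def install(pkg)
    return :reboot_required if pending_reboot?  # short-circuit before installing
    super                                       # fall through to the real install
  end
end

class Installer
  def pending_reboot?
    false  # a real implementation would inspect registry keys, etc.
  end

  def install(pkg)
    "installed #{pkg}"
  end
end

class GuardedInstaller < Installer
  prepend RebootGuard  # the guard sits in front of Installer#install
end

class RebootyInstaller < GuardedInstaller
  def pending_reboot?
    true  # simulate a machine with a pending reboot
  end
end

result  = GuardedInstaller.new.install("git")  # delegates to the real install
blocked = RebootyInstaller.new.install("git")  # intercepted before installing
```

`prepend` puts the guard ahead of the class in the method lookup chain, which is the same idea as boxstarter rerouting choco commands through its own layer before they reach the installer.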

Boxstarter now installs the last powershell incarnation of chocolatey (v0.9.8.33) in its own install location ($env:appdata\Boxstarter\Chocolatey). However, it keeps the chocolatey lib and bin directories in the standard chocolatey locations inside $env:ProgramData\Chocolatey so that all applications installed with either version of chocolatey are kept in the same repository. Boxstarter continues to intercept choco commands and reroute them to its own chocolatey.ps1. It's important to note that this only happens when installing chocolatey packages via boxstarter.

Boxstarter cannot achieve the same behavior with the latest version of chocolatey without some significant changes to boxstarter and likely some additional changes contributed to Chocolatey. I did some spiking on this while on vacation a few months ago but have not found the time to dig in further since. However, it is the highest priority item lined up in boxstarter feature work, standing just behind critical bug fixes.

Windows protected your PC?

If you install boxstarter using the "click-once" based web installer, you may see a banner in windows 8/2012 and up. This is because I have had to obtain a new code signing certificate (an annual event) and it has not yet gained enough time/downloads to establish complete trust from windows. I do not know what the actual algorithm for gaining this trust is, but last year these warnings went away after just a few days. In the meantime, just click the "more info" link and then choose "run anyway."

Why fewer features?

Over the last year I have taken a bit of a break from boxstarter development in comparison to 2013 and early 2014. This is partly because my development has been much less windows centric and I've been exploring the rich set of automation tool chains that have deeper roots in linux but have been gaining a foothold in windows as well. These include tools such as Vagrant, Packer, Chef, Docker and Ansible. This exploration has been fascinating. (Side note: another reason for less boxstarter work is my commitment to this blog. These posts consume a lot of time, but I find it worth it.)

I strongly believe that in order to provide sound automation tools for windows and truly understand the landscape, one must explore how these tools have been built and consumed in the linux world. Many of these tools have some overlap with boxstarter, and I have no desire to reinvent the wheel. I also think I now have a more mature vision of how to focus future boxstarter contributions.

What's ahead for boxstarter?

First and foremost - compatibility with the latest chocolatey. This will mean some possibly gut wrenching (the good kind) changes to boxstarter architecture.

Integrating with other ecosystems

I'd like to focus on taking the things that make boxstarter uniquely valuable and making them accessible to other tools. This also means cutting some boxstarter features where other tools do a better job delivering the same functionality.

For instance, I'd like to make boxstarter provisioners for Vagrant and Packer. This means people can more easily setup windows boxes and images with boxstarter while leveraging Vagrant and Packer to interact with the Hypervisor and image file manipulation. With that, I could deprecate the Azure and Hyper-V boxstarter modules because Vagrant is already doing that work.

I'd also like to see better interoperability between boxstarter and other configuration management tools like chef, puppet and DSC.

Decoupling Features

Boxstarter has some features which I find very useful, like its streaming scheduled task output, hang detection, and ability to wrap a script with reboot resiliency. There have been many times I would have liked to reuse these features in different contexts, but they are currently hard wired and not intended to be easily consumable on their own. I'd like to break this functionality out, which would not only make these features more accessible outside of boxstarter but also improve the boxstarter architecture.

Hoping for a more active year to come

I'm unfortunately still a bit time starved but I am really hoping to knock out the chocolatey compatibility work and get started on some new boxstarter features and integrations over the next year.

Why TDD for PowerShell? Or why pester? Or why unit test a "scripting" language? by Matt Wrock

I was asked a couple weeks ago on twitter by Adam Bertram (@abertram) for any info on why one would want to use TDD with Pester. I have written a couple posts on HOW to use pester, and I'm sure I mentioned TDD, but I really don't recall ever seeing any posts on WHY one would use TDD. I think that's a fascinating question. I have not been writing much powershell at all these days, but these questions are just as applicable to the infrastructure code I have been writing in ruby. I have a lot of thoughts on this subject, but I'd like to expand the question to an even broader scope: why use pester (or any unit testing framework) at all? Really? Unit tests for a "scripting" language?

We are living in very interesting times. As "infrastructure as code" grows in popularity, we have devs writing more systems code and admins/ops writing more and more sophisticated scripts. In some windows circles, we see more administrators learning and writing code who have never scripted before. So you have talented devs who don't know layer 1 from layer 2 networking and think CIDR is just something you drink, and experienced admins who consider SharePoint a source control repository and have never considered writing tests for their automation.

I'm part of the "dev" group and have no right to judge here. I believe god placed CIDR calculators on the internet (thanks god!) for calculating IP ranges and wikipedia for a place to look up the OSI model. However, I'm fairly competent in writing tests and believe the discovery of TDD was a turning point in my becoming a better developer.

So this post is a collection of observations and thoughts on testing "scripts". I'm intentionally surrounding scripts in quotes because I'm finding that one person's script quickly becomes a full blown application. I'll also touch on TDD, which I am passionate about but less dogmatic on than I once was.

Tests? I ran the code and it worked. There's your test!

Don't dump your friction laden values on my devops rainbow. By the time your tests turn green, I've shipped already and am in the black. I've encountered these sentiments both in infrastructure and in more traditional coding circles. Sometimes it is a conscious inability to see value in adding tests, but many times the developers just are not considering tests or have never written test code. One may argue: why quibble over these implementation details? We are taking a huge, slow manual process that took an army of operators hours to accomplish and boiling it down to an automated script that does the same in a few minutes.

Once the script works, why would it break? In the case of provisioning infrastructure, many may feel if the VM comes up and runs its bootstrap installs without errors, extra validations are a luxury.

Until the script works, testing it is a pain

So we all choose our friction. The first time we run through our code changes, we think we'll manually run it, validate it and then move on. Sounds reasonable, until the manual validations prove the code is broken again and again and again. We catch ourselves rerunning cleanup, rerunning the setup, then rerunning our code and then checking the same conditions. This gets old fast and gets even worse when you have to revisit it a few days or weeks later. It's great to have a harness that will set up, run the exact same tests and then clean up, all by invoking a single command.
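The setup/run/cleanup shape of such a harness can be sketched in a few lines of ruby. The "install" being validated is faked here with a file write; the names are illustrative:

```ruby
require "tmpdir"
require "fileutils"

def with_clean_environment
  dir = Dir.mktmpdir("install-test")  # setup: a fresh scratch area every run
  yield dir                           # run the code under test and its validations
ensure
  FileUtils.remove_entry(dir)         # cleanup: always runs, even when a check fails
end

# Every invocation starts from the same clean state and tears itself down:
result = with_clean_environment do |dir|
  File.write(File.join(dir, "state.txt"), "installed")
  File.read(File.join(dir, "state.txt"))
end
```

The point is the single entry point: rerunning the whole cycle costs one method call instead of a manual cleanup, setup and validation slog.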

No tests? Welcome to Fear Driven Development

Look, testing is really hard. At least I think so. I usually spend way more time getting tests right and factored than whipping out the actual implementation code. However, whenever I am making changes to a codebase, I am so relieved when there are tests. It's my safety net. If the tests were constructed skillfully, I should be able to rip things apart and know from all the failing tests that things are not deployable. I may need to add, change or remove some tests to account for my work, but overall, as those failing tests go green, it's like breadcrumbs leading me back home to safety.

But maybe there are no tests. Now I'm scared, and I should be, and if you are on my team then you should be too. So I have a Sophie's choice: write tests now, or practice yet another time honored discipline, prayer driven development: sneaking in just this one change and hoping some manual testing gets me through it.

I'm not going to say that the former is always the right answer. Writing tests for existing code can be incredibly difficult and can make a 5 minute bug fix turn into a multi day yak hair detangling session, even when you focus on just adding tests for the code you are changing. Sometimes it is the right thing to invest this extra time. It really depends on context, but I assure you the more one takes the latter road, the more dangerous the code becomes to change. Unless your code works perfectly and its requirements are immutable, the last thing you want is a codebase you are afraid to change.

Your POC will ship faster with no tests

Oh shoot, we shipped the POC. (You are likely saying something other than "shoot").

This may not always be the case, but I am pretty confident that an MVP (minimal viable product) can be completed more quickly without tests. However, v2 will be slower, v3 even slower, and v4 and on will likely be akin to death marches, with a bunch of black box testers hired to test the features and reporting bugs well after the developer has mentally moved on to other features. As the cyclomatic complexity of your code grows, it becomes nearly impossible to test all conditions affected by recent changes, let alone remember them.

TIP: A POC should be no more than a POC. Prove the concept and then STOP and do it right! Side note: it's pretty awesome to blog about this and stand so principled... real life is often much more complicated... ugh... real life.

But infrastructure code is different

Ok. So far I don't think anything in this post varies with infrastructure code. As far as I am concerned, these are pretty universal rules of testing. However, infrastructure code IS different. I started the post (and titled it) referring to Pester, a test framework written in and for PowerShell. Chances are (though no guarantees) if you are writing PowerShell you are working on infrastructure. I have been focusing on infrastructure code for the past 3 to 4 years and I have really found it different. I remain passionate about testing but have embraced different patterns, workflows and principles since working in this domain. And I am still learning.

If I mock the infrastructure, what's left?

So when writing more traditional style software projects (whatever the hell that is, but I don't know what else to call it), we often try to mock or stub out external "infrastructure-ish" systems. File systems, databases, network sockets: we have clever ways of faking these out, and that's a good thing. It allows us to focus on the code that actually needs testing.

However, if I am working on a PR for the winrm ruby gem that implements the winrm protocol, or I am provisioning a VM, or am leaning heavily on something that uses the windows registry, and I mock away all of these layers, I may fall into the trap where I am not really testing my logic.

More integration tests

One way in which my testing habits have changed when dealing with infrastructure code is that I am more willing to sacrifice unit tests for integration style tests. This is because there are likely to be big chunks of code that have little conditional logic but instead expend their effort just moving stuff around. If I mock everything out, I may just end up testing that I am calling the correct API endpoints with the expected parameters. This can be useful to some extent but can quickly start to smell like the tests just repeat the implementation.
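That smell is easy to demonstrate with a tiny ruby sketch (the class and method names here are invented). The fake records the calls made to it, and the only assertion left to write mirrors the implementation line for line:

```ruby
# A hand-rolled fake that records every call made to it.
class FakeApi
  attr_reader :calls

  def initialize
    @calls = []
  end

  def post(path, payload)
    @calls << [path, payload]  # record the call for later assertions
    :ok
  end
end

# Code with almost no logic: it just shuttles data to an endpoint.
class Provisioner
  def initialize(api)
    @api = api
  end

  def create_vm(name)
    @api.post("/vms", name)
  end
end

api = FakeApi.new
result = Provisioner.new(api).create_vm("web01")
# All a unit test can now check is that create_vm called the endpoint
# it visibly calls, with the arguments it visibly passes.
```

An integration test that actually hits a sandbox environment would tell you far more about whether this code works.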

Typically I like the testing pyramid approach of lots and lots of unit tests under a relatively thin layer of integration tests. I'll fight to keep that structure, but I find that the integration layer often needs to be a bit thicker in the infrastructure domain. This may mean that coverage slips a bit at the unit level, but some unit tests just don't provide as much value and I'm gonna get more bang for my buck in integration tests.

Still - strive for unit tests

Having said I'm more willing to skip unit tests for integration tests, I would still stress the importance of unit tests. Unit tests can be tricky, but there is more often than not a way to abstract out the code that surrounds your logic in a testable way. It may seem like you are testing some trivial aspect of the code, but if you can capture the logic in unit tests, the tests will run much faster and you can iterate on the problem more quickly. Also, bugs found in unit tests lie far closer to the source of the bug and are thus much easier to troubleshoot.

Virtues of thorough unit test coverage in interpreted languages

When working with compiled languages like C#, Go, C++ or Java, it is often said that the compiler acts as Unit Test #1. There is a lot to be said for code that compiles. There is also great value in using dynamic languages, but one downside in my opinion is the loss of this initial "unit test". I have run into situations both in PowerShell and Ruby where code was deployed that simply was not correct: using a misspelled method name or referencing an undeclared variable, just to name a couple possibilities. If anything, unit tests that do no more than merely walk all possible code paths can protect code from randomly blowing up.
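The misspelled-method case is easy to reproduce. In this ruby sketch (names are invented), nothing complains at load time; the error only surfaces when the path actually executes, which is exactly what even a trivial path-walking test would force:

```ruby
class RegistryReader
  def value_for(key)
    normalize(key)  # typo: the method defined below is spelled "normalise"
  end

  def normalise(key)
    key.downcase
  end
end

# The class definition above loads without a peep. Only executing the
# path raises, so a test that merely calls value_for catches the typo.
error = begin
  RegistryReader.new.value_for("HKLM")
  nil
rescue NoMethodError => e
  e.name  # => :normalize
end
```

A Pester test in powershell plays the same role: it is the cheapest substitute you get for the compile step you gave up.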

How about TDD?

Regardless of whether I'm writing infrastructure code or not, I tend NOT to do TDD when I am trying to figure out how to do something, like determining which APIs to call and how to call them. How can I test for outcomes when I have no idea what they look like? I might not know what registry tree to scan, or even whether the point of automation is controlled by the registry at all.

Well, with infrastructure code I find myself in more scenarios where I start off having no idea how to do something, and the code is a journey of figuring this out. So I'm probably not writing unit tests until I figure that out. But when I can, I still love TDD. I've done lots of infrastructure TDD. It really does not matter what the domain is; I love the red, green, refactor workflow.

If you can't test first, test ASAP

So maybe writing tests first does not make sense in some cases as you hammer out just how things work. Once you do figure things out, either refactor what you have with tests first or fill in with tests after the initial implementation. Another law I have found to apply equally to all code is that the longer you delay writing the tests, the more difficult (or impossible) it is to write them. Some code is easy to test and some is not. When you are coding with the explicit intention of writing tests, you are motivated to make sure things are testable.

This tends to also have some nice side effects of breaking down the code into smaller decoupled components, because it's a pain in the butt to test monoliths.

When do we NOT write tests

I don't think the answer is never. However, more often than not, "throw away code" is not thrown away. Instead it grows and grows. What started as a personal utility script gets committed to source control, distributed with our app and depended on by customers. So I think we just need to be cautious and identify, as soon as possible, the inflection point when our "one-off" script becomes a core routine of our infrastructure.