I’ve been having a problem for a while where my virtual machines (that I use to set up the puzzles) don’t launch reliably – sometimes they work, and sometimes they don’t.
I didn’t understand why this was before, and on Friday I think I figured it out!
what was going wrong
When I started a puzzle, I’d:
- launch a VM (by giving
cloud-init.yamlfile), which would set up all the right files and run some commands
- wait until SSH was up
- ssh into the VM and run a script called
I don’t know why it took me so long to remember this, but – just because
ssh is running, it doesn’t mean that the
cloud-init is done running! So if I
ssh into the instance as soon as SSH is up, my setup script might not have
everything it needs to run.
I also found this launchpad bug suggesting that at some point in the past cloud-init only brought up SSH when it was finished running, but that that doesn’t happen anymore.
solution: wait until
cloud-init is done running
cloud-init makes this pretty easy: it creates a file at
/var/lib/cloud/data/result.json when it’s done running.
I also made my puzzle setup code run as part of
cloud-init (in the
scripts/per-boot stage), so I don’t need to do an extra SSH to run the last
stage of setup.
So now instead I’m doing:
- launch a VM (by giving
- wait until SSH is up
- wait until the
result.jsonfile is present
- make sure that
I’m not sure that this will solve all my problems, but it’s helped already and it’s a much better plan.
how to make SSH ignore
Right now I’m testing my
cloud-init.yaml files by spinning up a bunch of VMs
on my laptop with
qemu. I had a problem where every instance had a different
randomly generated SSH key, so SSH was giving me these giant warnings about the
[email protected]:2222 changing. These warnings were annoying me (and
providing no value in this case) so I wanted them to go away.
At first I tried to solve this with
ssh -o StrictHostKeyChecking=no but,
while this let me SSH without typing “yes” to the prompt warning me about the
change in keys, it still displayed the warning.
I found out that I can do
ssh -o UserKnownHostsFile=/dev/null instead, which
ignores my usual
a script to run my
cloud-init files locally with qemu
I also wrote a script to run my
cloud-init files locally!
Here it is. Now I can just run
./scripts/start-vm PUZZLE-NAME to start the VM
for a given puzzle. It takes about a minute to boot a VM and it made it WAY WAY
WAY faster to iterate on changes.
I’ve gotten a bit better at bash recently by writing the bash zine and I used
some of my newfound bash knowledge here (like using
trap to kill the qemu
process when the script exits, writing while loops, and
arithmetic.). I felt like this was a nice example of a good place for a bash script because:
- the logic is very simple (there’s just 1 while loop and a
- it needs to run a bunch of processes (so bash is the right language)
- I’m the only person using it
#!/bin/bash set -e # kill qemu on exit trap 'set -e; kill $(jobs -p)' exit CLOUD_INIT_FILE=$(find . -path "*$1*cloud-init.yaml") [ -f $CLOUD_INIT_FILE ] || exit echo "instance-id: $(uuidgen || echo i-abcdefg)" > my-meta-data IMG=/tmp/my-seed.img FOCAL=/home/bork/work/images/focal-server-cloudimg-amd64.img SNAPSHOT=/tmp/snapshot.qcow2 qemu-img create -b $FOCAL -f qcow2 -F qcow2 $SNAPSHOT cloud-localds $IMG $CLOUD_INIT_FILE my-meta-data qemu-system-x86_64 --enable-kvm -m 1024 \ -drive file=$SNAPSHOT,format=qcow2 \ -drive file=$IMG,format=raw \ -net user,hostfwd=tcp::2222-:22 -net nic \ -nographic > out 2>out & SSH_OPTIONS="-p 2222 -i wizard.key -o UserKnownHostsFile=/dev/null -o ConnectTimeout=1 -o StrictHostKeyChecking=no" start=$SECONDS while ! ssh $SSH_OPTIONS [email protected] 'python3 /usr/local/bin/started_up' do duration=$(( SECONDS - start )) echo "waiting for ssh.. $duration" sleep 1 done # we're done! SSH into the VM. ssh $SSH_OPTIONS [email protected]
next up: see if it’s actually more reliable!
I’ve done a lot of testing locally and this setup seems more reliable, but I still haven’t implemented it in production. My guess is that there are still a few other problems I’ll need to work out.
Booting a VM is also still pretty slow – it takes almost 2 minutes sometimes!
Kamal suggested using
kexec and I still haven’t fully understood what that is
or how I could use it.