Ansible - Task executed multiple times due to a Broken Pipe
Why is this a problem?
If you are running a command that is non-idempotent, the command will be started over again, which may cause a failure in your playbook execution. Recently I ran into a situation where Ansible was rerunning a task on a node even though the task was not configured to be retried. It turns out that Ansible's ssh connection plugin will retry actions auto-magically if the ssh connection fails while waiting for the task to execute. Here's how I was able to reproduce the issue.
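To make the risk concrete, here is a hypothetical example of the kind of task that bites you (the log path is made up for illustration). If the connection drops mid-task and Ansible silently reruns it, the line gets appended twice:

```
# Hypothetical non-idempotent task: a silent retry appends the entry twice.
- hosts: all
  gather_facts: false
  tasks:
    - name: append an audit entry (NOT idempotent)
      shell: echo "deployed $(date +%s)" >> /var/log/deploys.log
```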
Environment Setup
I started by setting up a fake environment using containers. I have a scale-ssh project that can be used to launch a set of containers running ssh to run Ansible against.

```
$ git clone https://github.com/mwhahaha/scale-ssh
$ cd scale-ssh
$ ./build-container.sh
$ cp ~/.ssh/id_rsa.pub authorized_keys
$ ./run-containers.sh 10
```

After the containers have launched, we have an Ansible inventory file that can be used to execute Ansible via SSH against our freshly launched containers.
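Before moving on, it's worth a quick sanity check that all ten containers answer over SSH. Using the generated inventory (inventory.ini, the file we run the playbook against later), an ad-hoc `raw` command avoids any dependency on Python in the containers:

```
$ ansible -i inventory.ini all -m raw -a "echo ok"
```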
Setting up the Broken Pipe
In order to be able to reproduce the connection failure, we want to inject ourselves between Ansible and the target hosts. We can do this with an ssh ProxyCommand when connecting to our container network (e.g. 172.16.86.0/24) by adding the following configuration to our ~/.ssh/config file.
```
Host 172.16.86.*
    ProxyCommand ssh localhost nc %h %p
    UserKnownHostsFile=/dev/null
    StrictHostKeyChecking=no
    ControlMaster=auto
    ControlPersist=60s
    ServerAliveInterval=1
    ServerAliveCountMax=1
    PreferredAuthentications=publickey
```
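The ProxyCommand routes every connection through a local nc process we can kill on demand, and the aggressive ServerAliveInterval=1 / ServerAliveCountMax=1 settings make the ssh client declare the connection dead after roughly a second of silence. You can verify the proxy path works with a plain ssh call to one of the containers:

```
# The address is just an example; pick one from your generated inventory.
$ ssh 172.16.86.88 hostname
```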
In order for this to work, you will need to ensure nc (ncat) is available on your local machine. Make sure to install it using your favorite package manager.
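For example (package names vary by distribution):

```
# Fedora / RHEL / CentOS
$ sudo dnf install nmap-ncat

# Debian / Ubuntu
$ sudo apt install ncat
```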
Running Ansible
We will create a playbook that just runs a shell command against the target containers via SSH. We need to create an ansible.cfg to correctly set up the SSH options we want Ansible to use when connecting. Our ansible.cfg will look like...
```
[defaults]
forks = 10

[ssh_connection]
control_path_dir = /tmp/ansible-ssh
retries = 8
pipelining = True
```
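The retries = 8 under [ssh_connection] is the setting that drives the automatic re-execution we're about to observe. You can confirm Ansible is actually picking up this file by dumping the non-default configuration from the same directory:

```
$ ansible-config dump --only-changed
```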
We also need a dummy playbook that will run and wait for ~5 minutes while we mess with our connections. We create broken-pipe.yaml...
```
- hosts: all
  gather_facts: false
  tasks:
    - shell: sleep 301
```
Ansible Command Execution
Our containers mount the systemd journal socket, so when Ansible executes our shell command in a container, we get an entry in the journal on the host running the containers. In a separate window, you'll want to run `journalctl -f -t ansible-command` to watch the journal and see when Ansible is running the shell task.
```
$ ansible-playbook -i inventory.ini broken-pipe.yaml
```
In the other window with journalctl running, you should see some ansible-command log lines.
```
$ journalctl -f -t ansible-command
-- Logs begin at Tue 2020-07-28 10:14:26 MDT. --
Jul 28 16:38:39 myhostname ansible-command[23231]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23252]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23255]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23277]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23279]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23284]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
```
Break the Pipe
Now that the command is running, identify one of the nc processes and kill it to simulate a broken pipe.
```
$ ps -x | grep nc
22703 pts/0    S+   0:00 ssh localhost nc 172.16.86.89 22
22704 pts/0    S+   0:00 ssh localhost nc 172.16.86.95 22
22705 pts/0    S+   0:00 ssh localhost nc 172.16.86.97 22
22706 pts/0    S+   0:00 ssh localhost nc 172.16.86.93 22
22707 pts/0    S+   0:00 ssh localhost nc 172.16.86.90 22
22708 pts/0    S+   0:00 ssh localhost nc 172.16.86.92 22
22709 pts/0    S+   0:00 ssh localhost nc 172.16.86.94 22
22710 pts/0    S+   0:00 ssh localhost nc 172.16.86.96 22
22711 pts/0    S+   0:00 ssh localhost nc 172.16.86.91 22
22712 pts/0    S+   0:00 ssh localhost nc 172.16.86.88 22
$ kill 22703
```
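Alternatively, if you'd rather target a specific container than hunt through PIDs, pkill with a full-command match does the same job (the address here is just an example from the listing above):

```
$ pkill -f "nc 172.16.86.89"
```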
Once you've run the kill command, you should see a new ansible-command entry pop up in the journalctl output. This is Ansible retrying the command.
```
Jul 28 16:38:39 myhostname ansible-command[23284]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:49 myhostname ansible-command[23333]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
```
How do I fix it?
If you have a long-running process, you can use async and poll on the task to prevent connectivity-related issues from causing the execution to restart. With async, Ansible backgrounds the module on the remote host and polls for its result, so the task no longer depends on a single long-lived ssh connection. The example playbook can be adjusted with async and poll; note that the async value is effectively an upper time limit on the execution.
```
- hosts: all
  gather_facts: false
  tasks:
    - shell: sleep 301
      async: 305
      poll: 3
```
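A related pattern worth knowing: with poll: 0 Ansible fires the task and moves on, and you check on it later with the async_status module. A minimal sketch of that fire-and-forget variant, using the same sleep task:

```
- hosts: all
  gather_facts: false
  tasks:
    # Start the task in the background and return immediately.
    - shell: sleep 301
      async: 305
      poll: 0
      register: sleeper

    # Check on the background job until it finishes (up to 60 * 6 = 360s).
    - name: wait for the backgrounded task to finish
      async_status:
        jid: "{{ sleeper.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 6
```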