Ansible - Task executed multiple times due to a Broken Pipe
Why is this a problem?
If you are running a command that is non-idempotent, the command will be started over again, which may cause a failure in your playbook execution. Recently I ran into a situation where Ansible was rerunning a task on a node even though the task was not configured to be retried. It turns out that Ansible's ssh connection plugin will retry actions auto-magically if the ssh connection fails while waiting for the task to execute. Here's how I was able to reproduce the issue.
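To make the risk concrete, here is a hypothetical example of the kind of task that bites you (the log path is made up for illustration). If the connection drops mid-task and Ansible silently reruns it, the line gets appended twice:

```
# Hypothetical non-idempotent task: a silent retry appends the entry twice.
- hosts: all
  gather_facts: false
  tasks:
    - name: append an audit entry (NOT idempotent)
      shell: echo "deployed $(date +%s)" >> /var/log/deploys.log
```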
Environment Setup
I started by setting up a fake environment using containers. I have a scale-ssh project that can be used to launch a set of containers running ssh to run Ansible against.

```
$ git clone https://github.com/mwhahaha/scale-ssh
$ cd scale-ssh
$ ./build-container.sh
$ cp ~/.ssh/id_rsa.pub authorized_keys
$ ./run-containers.sh 10
```

After the containers have launched, we have an Ansible inventory file that can be used to execute Ansible via SSH against our freshly launched containers.
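Before moving on, it's worth a quick sanity check that all ten containers answer over SSH. Using the generated inventory (inventory.ini, the file we run the playbook against later), an ad-hoc `raw` command avoids any dependency on Python in the containers:

```
$ ansible -i inventory.ini all -m raw -a "echo ok"
```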
Setting up the Broken Pipe
In order to be able to reproduce the connection failure, we want to inject ourselves between Ansible and the target hosts. We can do this with an ssh ProxyCommand when connecting to our container network (e.g. 172.16.86.0/24) by adding the following configuration to our ~/.ssh/config file.
```
Host 172.16.86.*
    ProxyCommand ssh localhost nc %h %p
    UserKnownHostsFile=/dev/null
    StrictHostKeyChecking=no
    ControlMaster=auto
    ControlPersist=60s
    ServerAliveInterval=1
    ServerAliveCountMax=1
    PreferredAuthentications=publickey
```
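The ProxyCommand routes every connection through a local nc process we can kill on demand, and the aggressive ServerAliveInterval=1 / ServerAliveCountMax=1 settings make the ssh client declare the connection dead after roughly a second of silence. You can verify the proxy path works with a plain ssh call to one of the containers:

```
# The address is just an example; pick one from your generated inventory.
$ ssh 172.16.86.88 hostname
```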
In order for this to work, you will need to ensure nc (ncat) is available on your local machine. Make sure to install it using your favorite package manager.
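For example (package names vary by distribution):

```
# Fedora / RHEL / CentOS
$ sudo dnf install nmap-ncat

# Debian / Ubuntu
$ sudo apt install ncat
```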
Running Ansible
We will create a playbook that just runs a shell command against the target containers via SSH. We need to create an ansible.cfg to correctly set up the SSH options we want Ansible to use when connecting. Our ansible.cfg will look like...
```
[defaults]
forks = 10

[ssh_connection]
control_path_dir = /tmp/ansible-ssh
retries = 8
pipelining = True
```
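The retries = 8 under [ssh_connection] is the setting that drives the automatic re-execution we're about to observe. You can confirm Ansible is actually picking up this file by dumping the non-default configuration from the same directory:

```
$ ansible-config dump --only-changed
```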
We also need a dummy playbook that will run and wait for ~5 minutes while we mess with our connections. We create broken-pipe.yaml...
```
- hosts: all
  gather_facts: false
  tasks:
    - shell: sleep 301
```
Ansible Command Execution
Our containers mount the systemd journal socket, so when Ansible executes our shell command in a container, we get an entry in the journal on the host running the containers. In a separate window, you'll want to run `journalctl -f -t ansible-command` to watch the journal and see when Ansible is running the shell task.
```
$ ansible-playbook -i inventory.ini broken-pipe.yaml
```
In the other window with journalctl running, you should see some ansible-command log lines.
```
$ journalctl -f -t ansible-command
-- Logs begin at Tue 2020-07-28 10:14:26 MDT. --
Jul 28 16:38:39 myhostname ansible-command[23231]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23252]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23255]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23277]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23279]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:39 myhostname ansible-command[23284]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
```
Break the Pipe
Now that the command is running, identify one of the nc processes and kill it to simulate a broken pipe.
```
$ ps -x | grep nc
22703 pts/0    S+   0:00 ssh localhost nc 172.16.86.89 22
22704 pts/0    S+   0:00 ssh localhost nc 172.16.86.95 22
22705 pts/0    S+   0:00 ssh localhost nc 172.16.86.97 22
22706 pts/0    S+   0:00 ssh localhost nc 172.16.86.93 22
22707 pts/0    S+   0:00 ssh localhost nc 172.16.86.90 22
22708 pts/0    S+   0:00 ssh localhost nc 172.16.86.92 22
22709 pts/0    S+   0:00 ssh localhost nc 172.16.86.94 22
22710 pts/0    S+   0:00 ssh localhost nc 172.16.86.96 22
22711 pts/0    S+   0:00 ssh localhost nc 172.16.86.91 22
22712 pts/0    S+   0:00 ssh localhost nc 172.16.86.88 22
$ kill 22703
```
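Alternatively, if you'd rather target a specific container than hunt through PIDs, pkill with a full-command match does the same job (the address here is just an example from the listing above):

```
$ pkill -f "nc 172.16.86.89"
```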
Once you've run the kill command, you should see a new ansible-command entry pop up in the journalctl output. This is Ansible retrying the command.
```
Jul 28 16:38:39 myhostname ansible-command[23284]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 28 16:38:49 myhostname ansible-command[23333]: Invoked with _raw_params=sleep 301 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
```
How do I fix it?
If you have a long-running process, you can use async and poll on the task to prevent connectivity-related issues from causing the execution to restart. With async, Ansible backgrounds the module on the remote host and polls for its result, so the task no longer depends on a single long-lived ssh connection. The example playbook can be adjusted with async and poll; note that the async value is effectively an upper time limit on the execution.
```
- hosts: all
  gather_facts: false
  tasks:
    - shell: sleep 301
      async: 305
      poll: 3
```
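A related pattern worth knowing: with poll: 0 Ansible fires the task and moves on, and you check on it later with the async_status module. A minimal sketch of that fire-and-forget variant, using the same sleep task:

```
- hosts: all
  gather_facts: false
  tasks:
    # Start the task in the background and return immediately.
    - shell: sleep 301
      async: 305
      poll: 0
      register: sleeper

    # Check on the background job until it finishes (up to 60 * 6 = 360s).
    - name: wait for the backgrounded task to finish
      async_status:
        jid: "{{ sleeper.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 6
```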