Docker Cluster Reboot – Systems Lab

Due to unforeseen events, I had to shut down all of my servers. As a result, my Docker VM cluster got its first full reboot of every node in the stack. Sadly, this exposed some problems in my configuration, as well as a Docker issue I had previously encountered on my Raspberry Pi cluster.

GlusterFS

After bringing everything back up, I noticed that my swarm was only scheduling containers onto one of the 3 nodes. Looking into the nodes, I found that the /mnt tree wasn’t showing the data from GlusterFS, even though df -h showed that the Gluster volume was indeed mounted. /mnt was owned by root with no access for anyone else, so I took a shot in the dark and ran chmod on the /mnt directory. That worked: a quick ls afterwards showed all of the data in the directory again.
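A check like the one below could catch this permission problem right after a reboot. It is only a minimal sketch: the function name others_can_list is made up for illustration, and it assumes GNU coreutils stat, where -c '%a' prints the octal permission bits.

```shell
#!/bin/sh
# Hypothetical helper (not from the original setup): succeed only if users
# other than the owner and group can list the given directory.
# Assumes GNU coreutils stat; '%a' prints the mode in octal, e.g. 755.
others_can_list() {
    mode=$(stat -c '%a' "$1") || return 2
    # The last octal digit holds the permissions for "other".
    other=${mode#"${mode%?}"}
    # Listing a directory needs both read (4) and execute (1).
    [ $((other & 5)) -eq 5 ]
}
```

For example, `others_can_list /mnt || echo "/mnt is not listable by non-root users"` would have flagged the state described above. The path is the one from this post; adjust it to wherever the Gluster volume is actually mounted.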

Tasks.db

I also ran into the same tasks.db problem that I had seen previously on my Pi cluster. Even after getting the Gluster mount back, the nodes still wouldn’t start containers. Running docker node ls told me everything I needed to know: the node was no longer a manager of the swarm, or even part of a swarm (all 3 nodes are supposed to be managers). I looked at the tasks.db file and found that it had grown to a few GB in size. The process to fix this is simple:

service docker stop
rm /var/lib/docker/swarm/worker/tasks.db
service docker start

Since one node in my cluster did come up fine, I thought it might be immune to the tasks.db problem. However, shortly after I fixed the other two nodes, this third node showed the same problem, and the same fix worked on it.
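Because the third node failed later with the same symptom, it may be worth checking the size of tasks.db on every node before it causes trouble. The following is a rough sketch rather than a tested procedure: the helper name file_exceeds_mib and the 1024 MiB threshold are my own assumptions (the post only says the file was "a few GB"), while the path is the one used in the fix above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a regular file is larger than a
# threshold given in MiB. Returns non-zero if the file does not exist.
file_exceeds_mib() {
    file=$1
    limit_mib=$2
    [ -f "$file" ] || return 1
    # Wrapping wc in $(( )) strips any padding some wc versions print.
    size_bytes=$(( $(wc -c < "$file") ))
    [ "$size_bytes" -gt $((limit_mib * 1024 * 1024)) ]
}

# Path taken from the fix above; the 1024 MiB limit is an arbitrary assumption.
TASKS_DB=/var/lib/docker/swarm/worker/tasks.db
if file_exceeds_mib "$TASKS_DB" 1024; then
    echo "tasks.db is over 1 GiB; this node may need the stop/remove/start fix"
fi
```

Run as root on each node, this only reports the oversized file. It deliberately does not stop Docker or delete anything, so the actual fix stays a manual decision.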

Conclusion

I should have rebooted the nodes after the initial configuration to make sure everything would come up correctly after a reboot. I also should have tested them a bit further before filling them with containers (24 containers are running on the cluster now).