Hi all,

I'm interested in monitoring the processes running on a Linux system and quickly determining when they are stuck or running endlessly. Once I determine this, I also want to take some actions (dumping some debug info, restarting the process, etc.).

I know I can detect stuck processes using systemd, but unfortunately I wasn't able to take action: where can I specify a script that I want to run when a process misses its heartbeats?

Are you aware of other tools that act as watchdog monitors? By that I mean tools that processes can register with and send heartbeats to, and that take some action when heartbeats are missed. I am aware I can write my own tool; I just want to know if there's anything else offering this functionality.

Thank you,
Generally speaking, it is impossible to 'detect' whether some random program is 'stuck'. By definition, 'stuck' means 'not operating correctly'. If I give you a list of 100 executables, how are you going to determine whether they are operating correctly? Doing so implies some knowledge of the application and of what it should be doing when operating 'correctly'. Without that knowledge, determining whether an application is 'stuck' is impossible.

The only way systemd does it is by having the application send a 'heartbeat' (a special systemd message) to systemd periodically. If that heartbeat doesn't arrive at a regular interval, systemd assumes the application is not working and kills it. Note that, to the best of my knowledge, this only works with daemons launched by systemd, and also (as mentioned above) requires the use of a special message; in other words, the application has to be specifically written to provide the information.

That being said, there ARE daemons that can monitor for SPECIFIC POTENTIAL indications of application issues, such as exceeding user-definable memory utilization or CPU usage thresholds. Monit is one such daemon. There is also a script at superuser.com. I'm sure there are others, but these are a couple I found with a quick Google.
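To make the 'special systemd message' a bit more concrete, here is a rough Python sketch of the application side of the heartbeat. It assumes the service is started by systemd with WatchdogSec= set in its unit file, so that systemd exports the NOTIFY_SOCKET and WATCHDOG_USEC environment variables. The function names (sd_notify, watchdog_interval, do_one_unit_of_work, main) are only illustrative, and do_one_unit_of_work is a hypothetical stand-in for whatever the real application does.

```python
import os
import socket
import time


def sd_notify(message: str) -> bool:
    """Send one datagram to systemd's notification socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process
    was probably not started by systemd.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    address = path.encode()
    # A leading '@' denotes a Linux abstract-namespace socket,
    # which is addressed with a leading NUL byte instead.
    if address.startswith(b"@"):
        address = b"\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def watchdog_interval(default: float = 10.0) -> float:
    """Return a ping interval in seconds: half the timeout that systemd
    exports in WATCHDOG_USEC (microseconds), or `default` if it is unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else default


def do_one_unit_of_work() -> None:
    """Hypothetical placeholder for the application's real work."""
    time.sleep(0.1)


def main() -> None:
    sd_notify("READY=1")  # tell systemd that start-up has finished
    interval = watchdog_interval()
    last_ping = time.monotonic()
    while True:
        do_one_unit_of_work()
        # The heartbeat is only sent if the work loop keeps turning,
        # so a hung loop results in missed heartbeats.
        if time.monotonic() - last_ping >= interval:
            sd_notify("WATCHDOG=1")
            last_ping = time.monotonic()
```

Calling main() from the service's entry point would start the loop. Sending the heartbeat from inside the work loop, rather than from a separate timer thread, is deliberate here: the heartbeat then only proves liveness of the code path that actually does the work, which is the application-specific knowledge the answer says is required.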