At work we’ve been having an intermittent problem: a hang where a task seems to run and never ends, which keeps the entire job from completing. We see it when some machine is de-provisioned, that is, when the software is uninstalled and the machine is put back into the available resource pool, yet the task that hangs is on some other machine, and it hangs when it tries to terminate. Naturally this happens several hours into a job, so we’ve had a hard time capturing it. Well, today we caught the hang while a job was running. Then by chance I looked at the task in the grid scheduler, and it showed that the hung task had been running on the de-provisioned machine; when the machine was de-provisioned, which kills the task it was running, the grid scheduler rescheduled the task onto the machine where it got stuck. So it wasn’t just any other task that was hanging, but the same one, rescheduled. This was key! Knowing this, I postulated that something left behind in the shared file system causes the task to hang when it restarts. As a result I have suggested that we delete all the temporary, work and output files when the application starts, so that when it is rescheduled it will appear to be a “fresh” execution. Let’s hope this works. Dennis will make the changes and try it tonight. I’ll find out tomorrow. Anyway, I was happy to have figured this out with Jonathan because this problem has been bugging us for weeks. Maybe, just maybe, I earned my salary today.
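For the curious, the cleanup idea could look roughly like the sketch below. This is not the change Dennis is making, just an illustration of "start fresh". The class name FreshStart, the helper resetDirectory, and the directory names tmp, work and output are all made up for the example; our real layout on the shared file system is different.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FreshStart {

    // Hypothetical helper: delete a directory tree if it exists, then recreate it empty.
    static void resetDirectory(Path dir) throws IOException {
        if (Files.exists(dir)) {
            List<Path> toDelete;
            try (Stream<Path> paths = Files.walk(dir)) {
                // Reverse order puts children before their parents,
                // so every directory is empty by the time it is deleted.
                toDelete = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
            }
            for (Path p : toDelete) {
                Files.delete(p);
            }
        }
        Files.createDirectories(dir);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the task's base directory is passed as the first argument.
        // With no argument, a throwaway temp directory is used so the demo is harmless.
        Path base = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempDirectory("fresh-start-demo");

        // Hypothetical directory names; wipe leftovers from any earlier attempt.
        for (String name : new String[] {"tmp", "work", "output"}) {
            resetDirectory(base.resolve(name));
        }
        System.out.println("Starting with clean directories under " + base);
        // ... the real work of the task would begin here ...
    }
}
```

Running this first thing at startup means a rescheduled task cannot see files left by the attempt that was killed. One caveat: it is only safe if each task has its own directories, otherwise one task would wipe out another task's files.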
Update: Today I found another problem that could cause a hang. When executing a command under Java you create a Process object. The process provides a mechanism to feed it input and to get its output and error streams. My present code reads the process's output in the current thread, which can cause the hang; I needed to create a separate thread to do this reading.
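Here is a minimal sketch of the separate-thread approach. It is not my production code; the names ProcessRunner, StreamDrainer and Result are invented for the example. I am assuming the usual cause of this kind of hang: the child fills the pipe buffer of a stream nobody is reading and blocks, while the parent is stuck reading the other stream or waiting for the child to exit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ProcessRunner {

    // Reads one stream to the end on its own thread so the child
    // never blocks on a full pipe while the parent is busy elsewhere.
    static final class StreamDrainer extends Thread {
        private final InputStream in;
        private final StringBuilder text = new StringBuilder();

        StreamDrainer(InputStream in) {
            this.in = in;
            setDaemon(true);
        }

        @Override
        public void run() {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    synchronized (text) {
                        text.append(line).append('\n');
                    }
                }
            } catch (IOException e) {
                // The stream was closed underneath us; keep whatever was read so far.
            }
        }

        String text() {
            synchronized (text) {
                return text.toString();
            }
        }
    }

    // Simple holder for the exit code and the captured streams.
    static final class Result {
        final int exitCode;
        final String out;
        final String err;

        Result(int exitCode, String out, String err) {
            this.exitCode = exitCode;
            this.out = out;
            this.err = err;
        }
    }

    static Result run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).start();
        process.getOutputStream().close(); // this sketch sends no input to the child

        // Start both readers before waiting, so neither pipe can fill up.
        StreamDrainer out = new StreamDrainer(process.getInputStream());
        StreamDrainer err = new StreamDrainer(process.getErrorStream());
        out.start();
        err.start();

        int exitCode = process.waitFor();
        out.join(); // make sure all remaining output has been collected
        err.join();
        return new Result(exitCode, out.text(), err.text());
    }

    public static void main(String[] args) throws Exception {
        // Demo command: ask the running JVM's own launcher for its version.
        String java = System.getProperty("java.home") + "/bin/java";
        Result result = run(Arrays.asList(java, "-version"));
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.out + result.err);
    }
}
```

If the two streams do not need to be kept apart, calling redirectErrorStream(true) on the ProcessBuilder merges stderr into stdout, and then a single reader is enough.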