Troubleshooting
Custom PyTorch Training Script Spawns Multiple Compile Jobs
Observed Error
A custom PyTorch training/evaluation script spawns multiple compile jobs, or the script recursively re-executes itself in an infinite loop.
Explanation
This usually happens because the top-level code of the Python script is not guarded by an if __name__ == "__main__": block. At several points during execution, subprocesses are spawned (e.g., for weight transfer or for creating surrogate jobs), and a subprocess may import the script as a module. Without the guard, that import runs the whole module again, including the code that starts compilation.
Workaround
Add an if __name__ == "__main__": guard to your Python script, and move all top-level code that starts training or evaluation underneath it.
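As a minimal sketch of the guarded layout, the script below keeps only definitions at module level and starts work from inside the guard. The names run_training and main are hypothetical placeholders, not part of any library API; substitute your own model setup and training loop.

```python
# Hypothetical script layout: run_training() and main() are placeholder
# names used for illustration only.

def run_training(num_steps):
    """Stand-in for model setup and the training loop."""
    completed = 0
    for _ in range(num_steps):
        completed += 1  # a real script would run one training step here
    return completed


def main():
    # Everything that launches work lives here, not at module level.
    steps = run_training(num_steps=3)
    print(f"finished {steps} steps")


# The guard is true only when the file is run directly as a script.
# When a subprocess imports this file as a module, the block is skipped,
# so the import does not start another run.
if __name__ == "__main__":
    main()
```

With this layout, importing the file only defines the functions; training starts solely when the script is invoked directly.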