This is a very specific error. We have quite a messed-up Python setup: multiple versions, network and local modules, a SLURM cluster, and the option to install your own packages, so it’s quite tricky to track down where a module comes from. It helps that I have two servers with the same kernel and packages: the program runs on one of them, but not on the other. I do have logs, but they are not very clear. The message says:
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'python-3.7.3'
Traceback (most recent call last):
File "/XXX/run_docker.py", line 22, in <module>
from absl import app
ModuleNotFoundError: No module named 'absl'
What I believe is happening here is that the program tries to load the python-3.7.3 module, fails, and falls back to the local install. As a user I can install the missing absl-py package myself, but I can’t run the program that way in this case because of the special process. What to do then? None of the solutions on Stack Overflow were suitable, since I don’t know which Python is getting what. But I have a hint: the program is trying to load Python 3.7.3, which is not the default. So I looked for where the Python modules are stored, and I found them on server one. A simple sync
@ server-one ## > rsync -av /usr/local/lib/python3.6/site-packages/ root@server-two:/usr/local/lib/python3.6/site-packages/
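and then the “program” works. What is going on? It looks like Python, since it didn’t manage to find the modules for the requested version, took the closest ones available (3.6). Or, more likely, since the 3.7.3 module could not be loaded, the script just ran under the default Python (presumably 3.6), which only looks in its own site-packages. But who knows? I’m not a Python expert, I’m just passing by 😔.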
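If I had to debug this again, a quicker way to find out "which Python is getting what" would be to ask the interpreter itself. Below is a minimal sketch, not something I ran on the cluster: the function name describe_interpreter is made up for illustration, and I am assuming the snippet can be started the same way run_docker.py is started (same job, same environment), otherwise it reports on a different Python than the one that fails.

```python
import importlib.util
import sys


def describe_interpreter(module_name="absl"):
    """Report which Python is running and where it would import module_name from.

    describe_interpreter is a hypothetical helper name, not part of any library.
    """
    # find_spec returns None when a top-level module cannot be found on sys.path.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "module": module_name,
        "module_origin": spec.origin if spec is not None else None,
        "search_path": list(sys.path),
    }


if __name__ == "__main__":
    # Print one "key = value" line per entry so the output is easy to diff
    # between the working server and the broken one.
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

Running it on both servers and comparing the output should show whether the interpreter is really 3.6 or 3.7.3, and whether absl is missing (module_origin = None) or just living in a directory that is not on the search path.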