Login Node Slowness (module command hanging on memex.carnegiescience.edu)
Resolved 12/10/19:
After tuning the NFS server and clients, the slowness was resolved. Of the several adjustments made, changing the RPCNFSDCOUNT variable (the number of NFS server threads) in /etc/sysconfig/nfs made the biggest improvement (100x bandwidth). The tuning is partially documented in our ticket system for future reference.
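For administrators who want to check this setting on their own server, here is a minimal sketch. It assumes a RHEL/CentOS-style layout where /etc/sysconfig/nfs holds the RPCNFSDCOUNT assignment. The function name nfsd_count_configured is hypothetical, and the value actually applied on Memex is not recorded in this post.

```shell
# Print the RPCNFSDCOUNT value set in an NFS sysconfig file, if any.
# Usage: nfsd_count_configured [FILE]   (defaults to /etc/sysconfig/nfs)
# The function name is hypothetical, chosen for illustration only.
nfsd_count_configured() {
    conf="${1:-/etc/sysconfig/nfs}"
    # Skip commented-out lines, keep the last assignment, strip any quotes.
    sed -n 's/^[[:space:]]*RPCNFSDCOUNT=//p' "$conf" | tail -n 1 | tr -d "\"'"
}

# On the NFS server itself, the number of running nfsd threads can be
# compared against the configured value with:
#   cat /proc/fs/nfsd/threads
```

If the function prints nothing, the variable is unset or commented out and the distribution default applies.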
Update 12/04/19:
To see improvements to the module command, please log out and log back in. This step is required to take advantage of Lmod’s caching feature, which improves its responsiveness. In the meantime, I am still investigating ways to improve filesystem performance for the /memexnfs/* mountpoints. If you are running SLURM jobs that write or read large files, or parallel or multiprocessing jobs, I suggest using /lustre/scratch/$USER as a working directory. For instance,
mkdir -p /lustre/scratch/$USER
rsync -aWz /home/$USER/workdir/ /lustre/scratch/$USER/workdir/
cd /lustre/scratch/$USER/workdir/
Then submit your job as normal.
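The same staging can also be done from inside a job script. The sketch below is one possible way to do it, not an official Memex recipe. The function name run_on_scratch and the program ./my_program are hypothetical. It copies a work directory to scratch, runs a command there, and copies the results back without "--delete", so nothing in the home copy is removed.

```shell
# Stage SRC_DIR to SCRATCH_DIR, run a command there, then copy results back.
# Usage: run_on_scratch SRC_DIR SCRATCH_DIR COMMAND [ARGS...]
# The function name is hypothetical, chosen for illustration only.
run_on_scratch() {
    src="${1%/}"
    scratch="${2%/}"
    shift 2
    mkdir -p "$scratch" || return 1
    # Copy the inputs to scratch (trailing slashes copy directory contents).
    rsync -aW "$src/" "$scratch/" || return 1
    # Run the command from inside the scratch copy; a subshell keeps the
    # caller's working directory unchanged.
    ( cd "$scratch" && "$@" ) || return 1
    # Copy everything back; without --delete, no file in SRC_DIR is removed.
    rsync -aW "$scratch/" "$src/"
}
```

Inside a batch script this could be called as run_on_scratch "$HOME/workdir" "/lustre/scratch/$USER/workdir" ./my_program, where ./my_program stands in for your own executable.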
After the job finishes, you can rsync the directory back to /home/$USER/workdir for safekeeping. Keep in mind that you will need to add "--delete" to the rsync command to get an exact copy of /lustre/scratch/$USER/workdir; this deletes any files or directories in /home/$USER/workdir that are not in /lustre/scratch/$USER/workdir.
Use with caution:
rsync -aWz --delete /lustre/scratch/$USER/workdir/ /home/$USER/workdir/
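Because "--delete" removes files, it can help to preview the result first. The sketch below wraps the same rsync options with "--dry-run" and "--itemize-changes", both standard rsync flags, so that nothing is modified and pending removals appear as lines starting with "*deleting". The function name preview_sync_back is hypothetical.

```shell
# List what an exact-copy sync would change, without touching either side.
# Usage: preview_sync_back SRC_DIR DEST_DIR
# The function name is hypothetical, chosen for illustration only.
preview_sync_back() {
    src="${1%/}"
    dest="${2%/}"
    # --dry-run only reports the planned actions; --itemize-changes marks
    # files that would be removed from DEST_DIR with "*deleting".
    rsync -aWz --delete --dry-run --itemize-changes "$src/" "$dest/"
}
```

For example, preview_sync_back /lustre/scratch/$USER/workdir /home/$USER/workdir prints the planned changes; if the list looks right, run the real command without "--dry-run".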
Please note that the directories under /lustre/scratch are not backed up and are not currently scrubbed.
The use of /lustre/scratch/$USER is recommended because the read/write performance of /memexnfs/* is being hindered by multiple concurrent I/O streams, including transfers (by root and users), SLURM jobs, and any other login node activity by users (VNC, shell/interpreter scripts, etc.).
Update 12/03/19:
Lmod’s cache was enabled to improve the performance of the module commands on Memex. However, I/O performance is still a bit sluggish, so more investigation is required to improve performance on the mounted filesystems.
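Lmod searches the directories listed in $MODULEPATH, so one rough way to see which mounted filesystem is slow is to time a plain directory listing of each entry. The sketch below is an illustrative diagnostic, not the procedure used in our investigation. The function name time_module_dirs is hypothetical, and timings are in whole seconds.

```shell
# Print how long a recursive listing takes for each MODULEPATH entry.
# Usage: time_module_dirs [PATH_LIST]   (defaults to $MODULEPATH)
# The function name is hypothetical, chosen for illustration only.
time_module_dirs() {
    path_list="${1:-$MODULEPATH}"
    old_ifs=$IFS
    IFS=:
    # The colon-separated list is split once, when the loop starts.
    for dir in $path_list; do
        IFS=$old_ifs
        if [ ! -d "$dir" ]; then
            printf '%s\tmissing\n' "$dir"
            continue
        fi
        start=$(date +%s)
        ls -R "$dir" >/dev/null 2>&1
        end=$(date +%s)
        printf '%s\t%ss\n' "$dir" "$((end - start))"
    done
    IFS=$old_ifs
}
```

A listing on a badly stalled mount may itself hang; press Ctrl+c to interrupt it.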
Issue (started week of 11/25/2019):
After logging onto Memex (password/DUO), the user login hangs while trying to load modules. This is an ongoing issue that seems to be caused by remote and/or locally mounted filesystems responding slowly while Lmod traverses one or more of the module paths (usually those in “$MODULEPATH”). We are investigating the issue, but in the meantime press “Ctrl+c” if the bash command prompt doesn’t appear promptly after the following banner: