urn:noticeable:projects:3f43Ej0LaTLbXv21eFelchangelog Updateshpc-internal.carnegiescience.edu2021-12-07T18:26:14.900ZCopyright © changelogNoticeablehttps://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/newspages/t8lIbf2iSTWZIIP91xqU/01h55ta3gshjbemty2fj8xrzn2-header-logo.pnghttps://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/newspages/t8lIbf2iSTWZIIP91xqU/01h55ta3gshjbemty2fj8xrzn2-header-logo.png#1e88e5urn:noticeable:publications:qicYM3U7JTa4BbA78PsI2021-11-17T21:32:57.411Z2021-12-07T18:26:14.900ZMemex Login Hung on 11/17/21Initial incident and probable cause: Broken pipe on the login caused by a file update on the login. The master node and login were both rebooted, and the “wwsh file sync” commands were automated in memex_routecheck.sh (crontab and cron.hourly).<p>Initial incident and probable cause: </p><p>Broken pipe on the login caused by a file update on the login. The master node and login were both rebooted, and the “wwsh file sync” commands were automated in memex_routecheck.sh (crontab and cron.hourly). After the reboot, the routing table and maintenance motd were both updated sooner than before. A ping check was also added to determine whether the network needs restarting (after the routing table is updated).</p>Floyd Fayton[email protected]urn:noticeable:publications:UE8MII49pgq894dKcogP2020-05-07T18:12:00.001Z2020-06-03T18:03:45.531ZSLURM's Default Memory Per CPU Increased (1GB --> 2GB)Hi All, Based on previous usage, the default allocation of 1GB of memory per cpu is too low. I have now increased this default to 2GB of memory per cpu. If you are not setting memory requirements in your job(s), this change will...<p>Hi All,</p> <p>Based on previous usage, the default allocation of 1GB of memory per cpu is too low. 
I have now increased this default to 2GB of memory per cpu.</p> <p>If you are <strong>not</strong> setting memory requirements in your job(s), this change will affect them.</p> <p>If you are already setting memory requirements in your job(s), this change will <strong>not</strong> affect you.</p> <p>Best practice is to specify memory in all jobs. If you haven’t been doing so and are now running into issues (waiting in partitions for longer than usual), you’ll now need to specify memory per cpu in your jobs.</p> <p><code>--mem-per-cpu=1G</code> # when passed via the command line,<br> or<br> <code>#SBATCH --mem-per-cpu=1G</code> # when added to a submission script.</p> <p>Adding either option should set your memory per cpu back to 1GB.</p> <p>** Again, if you are already using the <code>--mem=</code> or <code>--mem-per-*=</code> options, no changes are required for your job(s).</p> <p>Thank You</p> Floyd Fayton[email protected]urn:noticeable:publications:r21hNwxoxQFI4LTYNgHe2020-05-07T15:15:00.001Z2020-06-03T18:03:55.692ZSLURM Priority AdjustmentSince priorities were not working for those users who use Memex less frequently and in smaller batches of submitted jobs, these parameters were adjusted: As a result, gres/gpu was added to: This functionality changes in SLURM 19+, but...<p>Since priorities were not working for those users who use Memex less frequently and in smaller batches of submitted jobs, these parameters were adjusted:</p> <p><code>PriorityWeightFairShare=20000</code><br> <code>PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000</code></p> <p>As a result, gres/gpu was added to:</p> <p><code>AccountingStorageTRES=cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu</code></p> <p>This functionality changes in SLURM 19+, but our current version is 18.08, which is the latest version packaged with OpenHPC 1.3.</p> <p><strong>Update:</strong> AcctGatherFilesystemType was also enabled for lustre.</p> Floyd Fayton[email 
protected]urn:noticeable:publications:q1iUTNu4XWrZcjo7rRcx2020-02-13T20:19:00.001Z2020-06-03T18:04:07.520ZDid You Know ... Slack EditionDid you know we have a Slack channel for HPC/Research Computing? Sign up for our Carnegie Institution for Science workspace (click here) and then join the #hpc channel. Please use your Google login ("@carnegiescience.edu" email...<ul> <li><p><strong>Did you know we have a Slack channel for HPC/Research Computing?</strong></p> <p><a href="https://www.google.com/url?q=https://join.slack.com/t/carnegiescience/shared_invite/enQtODk2MDU1MjM3MjUxLTRlOWI3Y2ViZjgxNmFiZWY0YzZjNWIzZWI4NDI4MDIzN2E4N2FhNjUwYmVmZjQyMTA3OGZhZDEyNTFjN2Q1YTU&amp;sa=D&amp;source=hangouts&amp;ust=1581706914005000&amp;usg=AFQjCNH2bvD5I4q2xYGi-DHkKBvq88ZCsg&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">Sign up for our Carnegie Institution for Science workspace (click here)</a> and then join the <a href="https://carnegiescience.slack.com/archives/C506QH690?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">#hpc</a> channel. 
Please use your Google login ("@carnegiescience.edu" email address).</p> <p>This channel is useful for a couple of reasons (as well as other Carnegie Science channels!):</p> <ul> <li>Anyone can share photos, <a href="https://slack.com/help/articles/205875058-Google-Drive-for-Slack?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">Google Drive documents</a>, articles, issues, and best practices.</li> <li>Anyone can send me an informal message to talk about HPC/Research Computing/Memex issues, software questions, usage statistics, and much more. If I can’t answer or address your concerns right away, a formal ticket can be created for a faster response ***.</li> <li>Messages can be directed to a particular user or posted to the channel for everyone to see. General issues can be shared to #hpc and searched by other users with similar questions or issues.</li> </ul></li> </ul> <p><a href="https://www.google.com/url?q=https://join.slack.com/t/carnegiescience/shared_invite/enQtODk2MDU1MjM3MjUxLTRlOWI3Y2ViZjgxNmFiZWY0YzZjNWIzZWI4NDI4MDIzN2E4N2FhNjUwYmVmZjQyMTA3OGZhZDEyNTFjN2Q1YTU&amp;sa=D&amp;source=hangouts&amp;ust=1581706914005000&amp;usg=AFQjCNH2bvD5I4q2xYGi-DHkKBvq88ZCsg&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">To join, click here!!</a></p> <p>Other things you can do are <a href="https://slack.com/help/articles/201402297-Create-a-channel?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" 
target="_blank" rel="noopener">create a new private channel</a>, have a group call (type "<code>/call @user1 @user2</code>"), set reminders (type "<code>/remind help</code>" for options), connect with GitHub (type "<code>/github help</code>" for options), and much more (<a href="https://slack.com/help/articles/202288908-Format-your-messages?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">formatting messages</a>, <a href="https://slack.com/help/articles/201259356-Use-built-in-slash-commands?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">built-in commands</a>)!</p> <p>*** Slack does not replace our ticketing system. It is a way for us to discuss research computing without the formality of a ticketing system (by emailing <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a> or typing "<code>/freshservice-ticket</code>" from <a href="https://slack.com/help/articles/212281468-Send-direct-messages?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slack-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.q1iUTNu4XWrZcjo7rRcx&amp;utm_medium=newspage" target="_blank" rel="noopener">Slack’s direct messaging system</a>).</p> Floyd Fayton[email protected]urn:noticeable:publications:2QuP1YochGCrFRF04wYV2020-01-23T16:56:00.001Z2020-06-03T18:35:53.833ZDid You Know ... Python EditionDid you know official support for Python 2 is over? That said, we have Python 3 available on Memex by loading the module, "python/3.6.7". 
This Python version includes conda, R, Jupyter, IntelMPI, and many other packages. Most...<ul> <li><p>Did you know official support for <a href="https://www.python.org/doc/sunset-python-2/?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-python-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.2QuP1YochGCrFRF04wYV&amp;utm_medium=newspage" target="_blank" rel="noopener">Python 2 is over</a>?</p> <p>That said, we have Python 3 available on Memex by loading the module, "python/3.6.7". This Python version includes conda, R, Jupyter, IntelMPI, and many other packages. Most Python packages can be installed by sending your request to <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a>.</p></li> <li><p>Did you know you can install your own Python packages using conda?</p> <p>For example, if you’d like the latest python version available, 3.8.1, first search for it <a href="https://anaconda.org/?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-python-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.2QuP1YochGCrFRF04wYV&amp;utm_medium=newspage" target="_blank" rel="noopener">here</a> (mainly the conda-forge channel) or directly from the command line with:</p> <pre><code>module purge              # only needed to ensure a clean environment
module load python/3.6.7
conda search python       # latest should be 3.8.1 as of last week
</code></pre> <p>and use it to set up your own conda environment (both <a href="https://anaconda.org/?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-python-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.2QuP1YochGCrFRF04wYV&amp;utm_medium=newspage" target="_blank" rel="noopener">anaconda.org</a> and the command line show the proper channel, "-c", to use below).</p> <pre><code> conda create -p 
/lustre/scratch/$USER/.envs/py38 \
  -c conda-forge python=3.8.1 </code></pre> <p>The above command assumes <code>/lustre/scratch/$USER/.envs</code> is where you want the conda environment, <mark>py38</mark>, to reside. It has to be in a directory you own.</p> <p>To activate any conda environment (using the example, <mark>py38</mark>, above):</p> <pre><code> source activate py38 </code></pre> <p>To deactivate any conda environment:</p> <pre><code> source deactivate </code></pre> <p>To remove an entire conda environment:</p> <pre><code> conda-env remove -n py38 </code></pre></li> <li><p>Did you know you can list existing conda environments already on Memex?</p> <p>To list the available conda environments (still using the module, python/3.6.7), simply type:</p> <pre><code> conda env list </code></pre> <p>If you can see them, you can activate them, but if you don’t have write permission, you cannot install, remove, or modify them. You can, however, clone them (give the new environment a <strong>unique</strong> name, e.g. "foo", while the <strong>existing environment</strong> being cloned is “foo2”):</p> <p><code>conda create -p /lustre/scratch/$USER/.envs/foo --clone foo2</code></p></li> <li><p>Did you know you can list any installed Python package, in the base conda environment (activated by <code>module load python/3.6.7</code>) or a custom conda environment (activated by <code>source activate foo</code>)?</p> <p><code>conda list packagename</code> <strong><em>#replace packagename with a real package name</em></strong></p> <p>where “packagename” is a placeholder for a real package (like python, or r-base). 
Of course,</p> <pre><code> conda list </code></pre> <p>lists all packages installed in your current environment.</p></li> <li><p>Did you know there are other useful commands (<em>#replace packagename and X with real values</em>):</p> <p><code>conda search packagename=X.X.X --info</code> <strong><em>#where X.X.X=version_number of packagename</em></strong><br> <code>conda clean -a</code> <strong><em>#removes downloaded packages and caches</em></strong><br> <code>conda remove packagename</code> <strong><em>#to remove a package</em></strong><br> <code>conda list --revisions</code> <strong><em>#to identify a revision to roll back to</em></strong><br> <code>conda install --revision X</code> <strong><em>#roll back to revision number, X</em></strong></p> <p>Type <code>conda -h</code> and/or <code>conda-env -h</code> for more options.</p></li> </ul> <p><strong>Note</strong>: To manage where your cache directory is located (includes tarballs and downloaded packages), you can set the following in your bashrc:</p> <pre><code>echo export CONDA_PKGS_DIRS=/lustre/scratch/$USER/.envs/pkgs &gt;&gt; \
  $HOME/.bashrc
source $HOME/.bashrc
</code></pre> <p>Or in your current environment only with:</p> <pre><code>export CONDA_PKGS_DIRS=/lustre/scratch/$USER/.envs/pkgs </code></pre> Floyd Fayton[email protected]urn:noticeable:publications:FuF1YhK9mCFksmxHZsDe2020-01-15T15:57:00.001Z2020-06-03T18:36:04.820ZDid You Know ... Storage EditionDid you know there's a 256GB quota for all /home directories? This policy will be fully enforced in the coming weeks. This requirement is needed to manage space and load for /home. Note: If you are currently over the limit, you will have time to move...<ul> <li><p>Did you know there’s a 256GB quota for all <code>/home</code> directories?</p> <p>This policy will be fully enforced in the coming weeks. This requirement is needed to manage space and load for <code>/home</code>. 
Home directories should primarily be used to set up scripts, small software applications, and environment variables across the cluster. To that end, we will be enforcing a hard quota of 256GB. A consolidation of all <code>/home</code> directories to one storage device will take place this year. The exact date will be advertised but please contact me or <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a> with any other questions.</p></li> </ul> <blockquote> <p><mark>Note:</mark> If you are currently over the limit, you will have time to move data but start moving (or removing) data <strong>now</strong>.</p> </blockquote> <ul> <li><p>Did you know we have about 950TB of storage right now?</p> <p>This includes the Lustre filesystem (<code>/lustre</code>, 698TB) and the MemexNFS filesystem (<code>/work</code>, <code>/scratch</code>, <code>/share/apps</code>, and <code>/home</code> ~ 252TB). Check out our login banner (see "<mark>Mountpoint Information</mark>") to see how each filesystem should be used.</p></li> </ul> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/FuF1YhK9mCFksmxHZsDe/01h55ta3gsa47k8s5mbm1zzsva-image.png" alt="Screen Shot 2020-01-14 at 11.41.05 PM.png"></p> <ul> <li><p>Did you know we now have a <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000044729?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-storage-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.FuF1YhK9mCFksmxHZsDe&amp;utm_medium=newspage" target="_blank" rel="noopener">Memex Globus endpoint</a>, "cisuser#carnegiescience"?</p> <p>Globus transfers are typically much faster than our command line option. 
Instructions on how to <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000044729?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-storage-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.FuF1YhK9mCFksmxHZsDe&amp;utm_medium=newspage" target="_blank" rel="noopener">set up and use Globus on Memex</a> are in our FreshService ticketing system (<a href="https://carnegiescience.freshservice.com/support/solutions?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-storage-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.FuF1YhK9mCFksmxHZsDe&amp;utm_medium=newspage" target="_blank" rel="noopener">Solutions --&gt; Computation</a>).</p></li> <li><p>Did you know you can check disk usage in the following ways?</p> <p>For most users, the new command</p> <pre><code> $ zquota </code></pre> <p>will show your <code>/home</code> usage and your group’s usage for /work/DEPT. Usage for shared directories such as <code>/share/apps</code>, <code>/scratch</code>, and <code>/lustre</code> can be seen with the <code>df</code> command. For example:</p> <pre><code> $ df -h /lustre /share/apps /scratch </code></pre> <p>shows usage for all three filesystems. Of course, other commands like <code>$ du -sh /home/username/directory/</code> for directories or <code>$ ls -Shl filename</code> for files can be used as well.</p></li> </ul> Floyd Fayton[email protected]urn:noticeable:publications:RBpMKY9co1xfZS8hubEf2020-01-13T21:58:00.001Z2020-01-15T16:05:41.596ZDid You Know ... SLURM EditionDid you know there's a GUI to view SLURM jobs? Inside a VNC or "ssh -XY .." session, type Did you know you can view the maximum resources of each node with: Did you know you can view the maximum memory used for running jobs with (see...<ul> <li>Did you know there’s a GUI to view SLURM jobs? 
Inside a VNC or “ssh -XY …” session, type <code>sview -a</code> from the command line (check out the menu selection as well):</li> </ul> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/RBpMKY9co1xfZS8hubEf/01h55ta3gsx7jhrekm9nghsdts-image.png" alt="Screen Shot 2020-01-13 at 9.58.26 AM.png"></p> <ul> <li><p>Did you know you can view the maximum resources of each node with:</p> <p><code>sinfo -e -o "%20N %10c %10m %25f %10G"</code></p></li> <li><p>Did you know you can view the maximum memory used for running jobs with (see “/bin/sacct -e” for format options):</p> <p><code>/bin/sacct --format="JobID,CPUTime,MaxRSS" -j JOBID</code></p> <p>Better yet, try <code>sstat --format="jobid,maxrss,avecpu" -j JOBID</code> (“sstat -e” for format options)?</p></li> <li><p>Did you know you can email yourself job status changes?</p> <p><code>#SBATCH --mail-user=[email protected]</code><br> <code>#SBATCH --mail-type=FAIL,BEGIN,END,SUSPEND</code></p> <p>Here are a <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000039168?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.did-you-know-slurm-edition&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.RBpMKY9co1xfZS8hubEf&amp;utm_medium=newspage" target="_blank" rel="noopener">few more tips</a> and you can contact me directly with any questions.</p></li> </ul> Floyd Fayton[email protected]urn:noticeable:publications:5csmLDRBAVK9iyQKDttS2019-12-19T17:35:00.001Z2020-01-15T17:27:49.876ZUser reported that rsync/cp/scp too slow on /memexnfs/apps,Temporary Resolution 01/02/20: Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. 
Although all reads from /memexnfs mounts are performing as...<p><strong>Temporary Resolution 01/02/20:</strong><br> Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as expected, all disk to disk writes to /memexnfs are not. As load increases, presumably from cluster jobs and transfers (syncs to /home), write performance suffers. The workaround is to use rclone, and this email was sent out to all users:</p> <blockquote> <p>Please use rclone for large local or remote transfers on the /memexnfs/* filesystems: /share/apps/dept, /home/username, /work/DEPT, or /scratch/username. There seems to be an issue with the common linux commands, rsync and cp, when transferring large directories (size and number of files).</p> </blockquote> <blockquote> <p>The solution is to use rclone instead of rsync or cp or scp for large directories (size and number of files),</p> </blockquote> <p><code>rclone sync /home/username/directory/ /scratch/username/directory/ -LP</code></p> <blockquote> <p>This syncing issue currently affects write speeds but not read speeds for large directories to /memexnfs/*. 
This solution has been tested and should also work fine for small directories and files.</p> </blockquote> <blockquote> <p>Of course, rclone is used to sync files to/from GDrive as well.</p> </blockquote> <blockquote> <p>Rclone Tutorial:<br> <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000040389?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.user-reported-that-rsync-cp-scp-too-slow-on-memexnfs-apps&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.5csmLDRBAVK9iyQKDttS&amp;utm_medium=newspage" target="_blank" rel="noopener">https://carnegiescience.freshservice.com/support/solutions/articles/3000040389</a></p> </blockquote> <p><strong>Issue 12/19/19:</strong><br> A user reported rsyncs were too slow on the /share/apps mount of /memexnfs/apps. Since all /memexnfs/* mounts share the same disks/setup, it was determined the issue was not isolated to /share/apps but also affected /work, /scratch, and /home (all /memexnfs mountpoints across the cluster).</p> Floyd Fayton[email protected]urn:noticeable:publications:EC7QoW1ZmiKkRDgyNIDr2019-12-02T19:57:00.001Z2020-01-15T17:25:26.267ZLogin Node Slowness (module command hanging on memex.carnegiescience.edu)Resolved 12/10/19: After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x...<p><strong>Resolved 12/10/19:</strong></p> <p>After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x bandwidth). 
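For reference, that change is a one-line edit to the NFS server configuration; the thread count below is illustrative only, since the exact value used is not stated:

```shell
# /etc/sysconfig/nfs (on the NFS server) -- illustrative value, not the exact one used.
# RPCNFSDCOUNT sets how many nfsd threads the server runs; the stock default of 8
# is often far too low for a busy cluster filesystem.
RPCNFSDCOUNT=64
```

The NFS server has to be restarted after this change for the new thread count to take effect.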
This tuning of the system was partially documented in <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000044399?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.login-slowness-module-command-hanging&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.EC7QoW1ZmiKkRDgyNIDr&amp;utm_medium=newspage" target="_blank" rel="noopener">our ticket system</a> for future reference.</p> <p><strong>Update 12/04/19:</strong></p> <p>To see improvements to the module command, please log out and log back in. This step is required for you to take advantage of Lmod’s caching feature, which improves its responsiveness. In the meantime, I am still investigating ways to improve filesystem performance for <code>/memexnfs/*</code> mountpoints. If you are running SLURM jobs which write or read large files, I suggest using <code>/lustre/scratch/$USER</code> as a working directory. If you are running a parallel or multiprocessing job, also use <code>/lustre/scratch/$USER</code> as a working directory. For instance,</p> <blockquote> <p>mkdir /lustre/scratch/$USER<br> rsync -aWz /home/$USER/workdir/ /lustre/scratch/$USER/workdir/<br> cd /lustre/scratch/$USER/workdir/</p> </blockquote> <p>then submit your job as normal.</p> <p>After the job finishes, you can <code>rsync</code> the directory back to <code>/home/$USER/workdir</code> for safekeeping. 
Please keep in mind, you’ll need to add <code>"--delete"</code> to the <code>rsync</code> command for an exact copy of the <code>/lustre/scratch/$USER/workdir</code> (which deletes any files/dirs in <code>/home/$USER/workdir</code> that are not in <code>/lustre/scratch/$USER/workdir</code>).</p> <p>Use cautiously,</p> <blockquote> <p>rsync -aWz --delete /lustre/scratch/$USER/workdir/ /home/$USER/workdir/</p> </blockquote> <p>Please note, the directories under <code>/lustre/scratch</code> are <strong>not</strong> backed up and are <strong>not</strong> currently scrubbed.</p> <p>The use of <code>/lustre/scratch/$USER</code> is recommended because the read/write performance of <code>/memexnfs/*</code> is being hindered by multiple I/O streams, including transfers (by root and users), SLURM jobs, and any other login node activity by users (VNC, shell/interpreter scripts, etc.).</p> <p><strong>Update 12/03/19:</strong></p> <p>Lmod’s cache was enabled to improve the performance of the module commands on Memex. However, I/O performance is still a bit sluggish, so more investigation is required to improve performance on the mounted filesystems.</p> <p><strong>Issue (started week of 11/25/2019):</strong></p> <p>After logging onto Memex (password/DUO), the user login hangs while trying to load modules. This is an ongoing issue which seems to be caused by remote and/or local mounted filesystems,</p> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/EC7QoW1ZmiKkRDgyNIDr/01h55ta3gsaf68z6xy02yhv34s-image.png" alt="Screen Shot 2019-12-02 at 3.12.27 PM.png"></p> <p>while Lmod is traversing one or more of the module paths (usually in “$MODULEPATH”). 
We are investigating the issue, but in the meantime <strong>enter</strong> “Ctrl+c” if the bash command prompt doesn’t appear swiftly after the following banner:</p> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/EC7QoW1ZmiKkRDgyNIDr/01h55ta3gswf49314edn913r1t-image.png" alt="Screen Shot 2019-12-02 at 3.03.54 PM.png"></p> Floyd Fayton[email protected]urn:noticeable:publications:y3QRYShmzZxk6tnDtkgn2019-10-14T15:40:00.001Z2020-01-15T17:25:34.931ZSystem Update & Failed disk in SureStoreHD, memexnfs ZFS pool degradedSystem Update 11/6/19: System was updated on November 8th, which includes updates to SLURM, ZFS (0.6 to 0.8, which improves time to rebuild a failed disk), and CentOS (7.5 to 7.7). Resolved 11/2/19: Issue resolved. Replacement disk has...<p><strong>System Update 11/6/19:</strong></p> <p><em>System was updated on November 8th, which includes updates to SLURM, ZFS (0.6 to 0.8, which improves time to rebuild a failed disk), and CentOS (7.5 to 7.7).</em></p> <p><strong>Resolved 11/2/19:</strong></p> <p>Issue resolved. Replacement disk has finished resilvering.</p> <p><strong>Update 10/22/19:</strong></p> <p>The new drive in our primary filesystem is still rebuilding and will be done in about 8 days. The type of filesystem is ZFS (version 0.6) in a RAIDZ1 configuration, which means one failed drive puts the filesystem in a degraded state. This degraded state will continue until the drive is "resilvered", or data is copied onto the replacement disk and it comes online. This process, which takes entirely too much time, was flagged as a ZFS bug back in November 2017.</p> <p>The current version of ZFS, 0.8.0, was released in May of this year and addresses the bug. The improvement to the resilvering process is said to be 5-6x better than the performance we’re currently seeing. 
The command line slowness you are experiencing on Memex’s login is far worse (up to 5x worse) than the I/O performance on Memex’s compute nodes, but all I/O for /home, /scratch, /work, and /share/apps will be affected. This means you can still submit jobs from the login, but all other activities will be slow in /home, /scratch, and /work.</p> <p>A way around this is to use your own Lustre scratch directory, /lustre/scratch/username (if it doesn’t exist, you can create it: “mkdir -p /lustre/scratch/username”), to edit files, run local commands, etc. Cleanup for /lustre/scratch/username is turned off for now, and you can even submit jobs from here.</p> <p><code>**ANNOUNCEMENT**</code><br> We are planning a software update for SLURM and ZFS when the disk replacement is completely done. I will be sending out a notice for a planned reboot, which is necessary in order to ensure the ZFS filesystem is truly updated. Please keep this in mind as you are submitting jobs. A job intended to run for a month or so will be killed prior to these updates in a couple of weeks.</p> <p><strong>Update 10/14/19:</strong></p> <p>Drive still resilvering - 24% done and going -</p> <blockquote> <p>status: One or more devices is currently being resilvered. The pool will<br> continue to function, possibly in a degraded state.<br> action: Wait for the resilver to complete.<br> scan: resilver in progress since Wed Oct 9 12:55:27 2019<br> 18.8T scanned out of 77.8T at 47.3M/s, 363h15m to go<br> 566G resilvered, 24.15% done</p> </blockquote> <p><strong>Update 10/9/19:</strong> <br> Failed drive replaced and ZFS resilvering with the new drive started.</p> <p><strong>Update 10/7/19:</strong> <br> Increased quota for /memexnfs/home, decreased quota for /scratch as well.</p> <p><strong>Update 10/7/19:</strong> <br> Memexnfs has become more responsive due to ZFS resilvering after the drive failure. 
This has resulted in I/O improvements for /home, /work, /scratch, and /share/apps.</p> <p><strong>Update 10/5/19:</strong> <br> Waiting for a new HDD in order to replace the failed disk.</p> <p>The main issue is a failed drive for all /memexnfs/* mounts. I’ll let you know when the drive has been replaced. Until then, Memex’s directories /memexnfs/scratch (or /scratch), /memexnfs/home (or /home), /memexnfs/work (or /work), and /memexnfs/apps (or /share/apps) will all be operating in a degraded state.</p> Floyd Fayton[email protected]
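As a side note, the "scan:" line that zpool status prints during a resilver (quoted in the 10/14/19 update above) can be turned into a rough progress figure from the command line. This is just a convenience sketch using the sample line from that update, not an official tool:

```shell
# Estimate resilver progress from the "scan:" line of `zpool status`.
# Sample line taken from the 10/14/19 update above.
scan_line="18.8T scanned out of 77.8T at 47.3M/s, 363h15m to go"

# awk's numeric coercion reads the leading numbers out of "18.8T" ($1)
# and "77.8T" ($5), so the percentage is simply scanned/total * 100.
pct=$(echo "$scan_line" | awk '{printf "%.1f", ($1 / $5) * 100}')
echo "resilver ${pct}% scanned"
```

The scanned fraction this prints (24.2%) roughly matches the "24.15% done" that zpool status itself reported. Note the two "T" values must be in the same unit for the shortcut to hold, which zpool status normally ensures.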