# Changelog Updates - hpc-internal.carnegiescience.edu

**Memex Login Hung on 11/17/21** (2021-11-17) - Floyd Fayton ([email protected])

Initial incident and probable cause:

A broken pipe on the login node, caused by a file update on the login. The master node and login were both rebooted, and the `wwsh file sync` commands were automated in `memex_routecheck.sh` (run from crontab and cron.hourly). After the reboot, the routing table and the maintenance motd were both updated sooner than before. A ping check was also added to determine whether the network needs restarting (after the routing table is updated).

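Below is a minimal sketch of what an automated check along those lines could look like. The gateway address and the restart step are illustrative placeholders; the actual `memex_routecheck.sh` may differ.

```bash
#!/bin/bash
# Hypothetical sketch of a cron-driven route/network check (not the real memex_routecheck.sh).
GATEWAY="10.0.0.1"   # placeholder gateway address

# Re-sync Warewulf-managed files to the nodes.
wwsh file sync

# If the gateway stops answering after the routing table is updated,
# restart networking on the login node.
if ! ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
    systemctl restart network
fi
```
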
**Login hangs after kernel message...** (2020-06-25, updated 2020-06-29) - Floyd Fayton ([email protected])

**Update (6/29/20):**
The **login** hang was caused by high I/O load, which is returning after a weekend hiatus. Unfortunately, limiting I/O on the login is not yet feasible due to the design of the system. The issue is not due to a lack of memory, CPU, or bandwidth on the login, but to the login's ability to process large amounts of user I/O (including transfers and local login processes). The issue is not present when performing the same work on compute nodes (transfers, normal commands, writing files, etc.). Short of an entire system redesign, we are planning a way forward for how storage is used and configured on Memex. The earliest possible change(s) will be made during the July 8th shutdown. Details will be sent once those plans are set.

**Incident (06/24/20 @1540):**
The login stalled after the global messages below:

```
Message from syslogd@memex at Jun 24 12:22:34 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:23:02 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:23:34 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:02 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:58 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
```

**Master Node Rebooted** (2020-05-22) - Floyd Fayton ([email protected])

**Incident (5/21/20):**

While fixing issues with the GPU nodes, the master node became unstable because several of its mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF personnel the next morning. No SLURM jobs were reported as affected, but new logins were denied until the master node was rebooted. The outage lasted about 8 hours.

**SLURM's Default Memory Per CPU Increased (1GB --> 2GB)** (2020-05-07) - Floyd Fayton ([email protected])

Hi All,

Based on previous usage, the default allocation of 1GB of memory per CPU is too low. I have now increased this default to 2GB of memory per CPU.

If you are **not** setting memory requirements in your job(s), this change will affect them.

If you are already setting memory requirements in your job(s), this change will **not** affect you.

Best practice is to specify memory in all jobs. If you haven't been doing so and are now running into issues (waiting in partitions for longer than usual), you'll need to specify memory per CPU in your jobs:

`--mem-per-cpu=1G` # when passed via the command line
or
`#SBATCH --mem-per-cpu=1G` # when added to a submission script.

Adding either option sets your memory per CPU back to 1GB.

Again, if you are already using the `--mem=` or `--mem-per-*=` options, no changes are required for your job(s).

Thank You

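As an illustration, a minimal submission script that sets the memory request explicitly might look like the sketch below. The job name, task count, time limit, and program are placeholders, not site defaults.

```bash
#!/bin/bash
#SBATCH --job-name=memtest        # placeholder job name
#SBATCH --ntasks=4                # number of tasks/CPUs (placeholder)
#SBATCH --mem-per-cpu=2G          # request memory per CPU explicitly
#SBATCH --time=01:00:00           # placeholder walltime

srun ./my_program                 # replace with your actual command
```
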
**SLURM Priority Adjustment** (2020-05-07) - Floyd Fayton ([email protected])

Since priorities were not working well for users who use Memex less frequently and submit jobs in smaller batches, these parameters were adjusted:

`PriorityWeightFairShare=20000`
`PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000`

As a result, gres/gpu was added to:

`AccountingStorageTRES=cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu`

This functionality changes in SLURM 19+, but our current version is 18.08, which is the latest version packaged with OpenHPC 1.3.

**Update:** AcctGatherFilesystemType was also enabled for Lustre.

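If you want to see how these weights affect your own jobs, the standard SLURM tools below should work on 18.08. This is only a sketch; the exact columns shown depend on site configuration, and the account name is a placeholder.

```bash
sprio -l                      # priority components (fair-share, TRES, age) of pending jobs
sshare -a -A your_account     # fair-share usage for an account and its users
```
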
**Did You Know ... Slack Edition** (2020-02-13) - Floyd Fayton ([email protected])

- **Did you know we have a Slack channel for HPC/Research Computing?**

  [Sign up for our Carnegie Institution for Science workspace (click here)](https://www.google.com/url?q=https://join.slack.com/t/carnegiescience/shared_invite/enQtODk2MDU1MjM3MjUxLTRlOWI3Y2ViZjgxNmFiZWY0YzZjNWIzZWI4NDI4MDIzN2E4N2FhNjUwYmVmZjQyMTA3OGZhZDEyNTFjN2Q1YTU&sa=D&source=hangouts&ust=1581706914005000&usg=AFQjCNH2bvD5I4q2xYGi-DHkKBvq88ZCsg) and then join the [#hpc](https://carnegiescience.slack.com/archives/C506QH690) channel. Please use your Google login ("@carnegiescience.edu" email address).

  This channel is useful for several reasons (as are the other Carnegie Science channels!):

  - Anyone can share photos, [Google Drive documents](https://slack.com/help/articles/205875058-Google-Drive-for-Slack), articles, issues, and best practices.
  - Anyone can send me an informal message to talk about HPC/Research Computing/Memex issues, software questions, usage statistics, and much more. If I can't answer or address your concerns right away, a formal ticket can be created for a faster response ***.
  - Messages can be directed to a particular user or posted to the channel for everyone to see. General issues can be shared to #hpc and searched by other users with similar questions or issues.

[To join, click here!!](https://www.google.com/url?q=https://join.slack.com/t/carnegiescience/shared_invite/enQtODk2MDU1MjM3MjUxLTRlOWI3Y2ViZjgxNmFiZWY0YzZjNWIzZWI4NDI4MDIzN2E4N2FhNjUwYmVmZjQyMTA3OGZhZDEyNTFjN2Q1YTU&sa=D&source=hangouts&ust=1581706914005000&usg=AFQjCNH2bvD5I4q2xYGi-DHkKBvq88ZCsg)

Other things you can do are [create a new private channel](https://slack.com/help/articles/201402297-Create-a-channel), have a group call (type "`/call @user1 @user2`"), set reminders (type "`/remind help`" for options), connect with GitHub (type "`/github help`" for options), and much more ([formatting messages](https://slack.com/help/articles/202288908-Format-your-messages), [built-in commands](https://slack.com/help/articles/201259356-Use-built-in-slash-commands))!

*** Slack does not replace our ticketing system. It is a way for us to discuss research computing without the formality of a ticketing system (a ticket can still be created by emailing [email protected] or typing "`/freshservice-ticket`" from [Slack's direct messages](https://slack.com/help/articles/212281468-Send-direct-messages)).

**Did You Know ... Python Edition** (2020-01-23) - Floyd Fayton ([email protected])

- Did you know official support for [Python 2 is over](https://www.python.org/doc/sunset-python-2/)?

  That said, we have Python 3 available on Memex by loading the module "python/3.6.7". This Python version includes conda, R, Jupyter, IntelMPI, and many other packages. Most Python packages can be installed by sending your request to [email protected].

- Did you know you can install your own Python packages using conda?

  For example, if you'd like the latest Python version available, 3.8.1, first search for it on [anaconda.org](https://anaconda.org/) (mainly the conda-forge channel) or directly from the command line with:

  ```
  module purge             # only needed to ensure a clean environment
  module load python/3.6.7
  conda search python      # latest should be 3.8.1 as of last week
  ```

  and use it to set up your own conda environment (both [anaconda.org](https://anaconda.org/) and the command line show the proper channel, "-c", to use below).

  ```
  conda create -p /lustre/scratch/$USER/.envs/py38 \
      -c conda-forge python=3.8.1
  ```

  The above command assumes `/lustre/scratch/$USER/.envs` is where you want the conda environment, `py38`, to reside. It has to be in a directory you own.

  To activate any conda environment (using the example, `py38`, above):

  ```
  source activate py38
  ```

  To deactivate any conda environment:

  ```
  source deactivate
  ```

  To remove an entire conda environment:

  ```
  conda-env remove -n py38
  ```

- Did you know you can list the conda environments that already exist on Memex?

  To list the available conda environments (still using the module python/3.6.7), simply type:

  ```
  conda env list
  ```

  If you can see them, you can activate them, but if you don't have write permission, you cannot install, remove, or modify them. You can, however, clone them (make the new name **unique** --> "foo", not the **clone name** --> "foo2"):

  `conda create -p /lustre/scratch/$USER/.envs/foo --clone foo2`

- Did you know you can list any installed Python package, in the base conda environment (activated by `module load python/3.6.7`) or a custom conda environment (activated by `source activate foo`)?

  `conda list packagename` ***# replace packagename with a real package name***

  where "packagename" is a placeholder for a real package (like python or r-base). Of course,

  ```
  conda list
  ```

  lists all packages installed in your current environment.

- Did you know there are other useful commands? (*replace packagename and X with real values*)

  `conda search packagename=X.X.X --info` ***# where X.X.X is the version number of packagename***
  `conda clean` ***# removes downloaded packages and caches***
  `conda remove packagename` ***# removes a package***
  `conda list --revisions` ***# identifies a revision to roll back to***
  `conda install --revision X` ***# rolls back to revision number X***

  Type `conda -h` and/or `conda-env -h` for more options.

**Note**: To manage where your cache directory is located (it holds tarballs and downloaded packages), you can set the following in your .bashrc:

```
echo 'export CONDA_PKGS_DIRS=/lustre/scratch/$USER/.envs/pkgs' >> $HOME/.bashrc
source $HOME/.bashrc
```

Or in your current environment only with:

```
export CONDA_PKGS_DIRS=/lustre/scratch/$USER/.envs/pkgs
```

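As a quick check (a suggestion, assuming the same python/3.6.7 module environment), `conda info` reports the package cache location, so you can confirm the new setting took effect:

```
module load python/3.6.7
conda info    # the "package cache" entry should now point to /lustre/scratch/$USER/.envs/pkgs
```
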
**Did You Know ... Storage Edition** (2020-01-15) - Floyd Fayton ([email protected])

- Did you know there's a 256GB quota for all `/home` directories?

  This policy will be fully enforced in the coming weeks. The quota is needed to manage space and load on `/home`. Home directories should primarily be used to set up scripts, small software applications, and environment variables across the cluster. To that end, we will be enforcing a hard quota of 256GB. A consolidation of all `/home` directories onto one storage device will take place this year. The exact date will be advertised, but please contact me or [email protected] with any other questions.

> **Note:** If you are currently over the limit, you will have time to move data, but start moving (or removing) data **now**.

- Did you know we have about 950TB of storage right now?

  This includes the Lustre filesystem (`/lustre`, 698TB) and the MemexNFS filesystem (`/work`, `/scratch`, `/share/apps`, and `/home`, ~252TB). Check out our login banner (see "Mountpoint Information") to see how each filesystem should be used.

  ![Memex login banner showing Mountpoint Information](https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/FuF1YhK9mCFksmxHZsDe/01h55ta3gsa47k8s5mbm1zzsva-image.png)

- Did you know we now have a [Memex Globus endpoint](https://carnegiescience.freshservice.com/support/solutions/articles/3000044729), "cisuser#carnegiescience"?

  Globus transfers are typically much faster than our command-line options. Instructions on how to [set up and use Globus on Memex](https://carnegiescience.freshservice.com/support/solutions/articles/3000044729) are in our FreshService ticketing system ([Solutions --> Computation](https://carnegiescience.freshservice.com/support/solutions)).

- Did you know you can check disk usage in the following ways?

  For most users, the new command

  ```
  $ zquota
  ```

  will show your `/home` usage and your group's usage of /work/DEPT. Usage of shared directories such as `/share/apps`, `/scratch`, and `/lustre` can be seen with the `df` command. For example:

  ```
  $ df -h /lustre /share/apps /scratch
  ```

  shows usage for all three filesystems. Of course, other commands like `$ du -sh /home/username/directory/` for directories or `$ ls -Shl filename` for files can be used as well.

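If you need to find what is using the most space before the quota takes effect, a simple sweep like the one below (a suggestion, not a site-specific tool) sorts the top-level items in your home directory by size:

```
du -sh /home/$USER/* 2>/dev/null | sort -h    # largest entries are listed last
```
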
**Memex unable to accept new user logins** (2020-01-14) - Floyd Fayton ([email protected])

**Issue 01/14/20:**

While installing new packages on the login server, the file /etc/resolv.conf was overwritten, which caused new user logins to fail. Once the file was replaced with the proper nameserver values, Memex accepted new logins again.

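For context, /etc/resolv.conf simply lists the DNS servers the host queries. The entries below are placeholder values (RFC 5737 documentation addresses), not Memex's actual nameservers:

```
# /etc/resolv.conf (placeholder values only)
nameserver 192.0.2.10
nameserver 192.0.2.11
```
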
**Did You Know ... SLURM Edition** (2020-01-13) - Floyd Fayton ([email protected])

- Did you know there's a GUI to view SLURM jobs? Inside a VNC or "ssh -XY ..." session, type `sview -a` from the command line (check out the menu selections as well):

  ![sview GUI showing SLURM jobs](https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/RBpMKY9co1xfZS8hubEf/01h55ta3gsx7jhrekm9nghsdts-image.png)

- Did you know you can view the maximum resources of each node with:

  `sinfo -e -o "%20N %10c %10m %25f %10G"`

- Did you know you can view the maximum memory used by running jobs with (see "/bin/sacct -e" for format options):

  `/bin/sacct --format="JobID,CPUTime,MaxRSS" -j JOBID`

  Better yet, try `sstat --format="jobid,maxrss,avecpu" -j JOBID` ("sstat -e" for format options).

- Did you know you can email yourself job status changes?

  `#SBATCH --mail-user=[email protected]`
  `#SBATCH --mail-type=FAIL,BEGIN,END,SUSPEND`

  Here are a [few more tips](https://carnegiescience.freshservice.com/support/solutions/articles/3000039168), and you can contact me directly with any questions.

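Putting a few of these together: a sketch of a job script that emails you on state changes, followed by a post-run memory check. The job name, email address, resources, program, and JOBID below are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=example              # placeholder job name
#SBATCH --mail-user=<your_email>        # replace with your address
#SBATCH --mail-type=FAIL,BEGIN,END,SUSPEND
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

srun ./my_program                       # placeholder command
```

After the job finishes, check its peak memory use from the accounting record:

```bash
sacct --format="JobID,CPUTime,MaxRSS" -j <JOBID>    # <JOBID> is the numeric job ID
```
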