urn:noticeable:projects:3f43Ej0LaTLbXv21eFelchangelog Updateshpc-internal.carnegiescience.edu2020-06-25T20:30:56.494ZCopyright © changelogNoticeablehttps://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/newspages/t8lIbf2iSTWZIIP91xqU/01h55ta3gshjbemty2fj8xrzn2-header-logo.pnghttps://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/newspages/t8lIbf2iSTWZIIP91xqU/01h55ta3gshjbemty2fj8xrzn2-header-logo.png#1e88e5urn:noticeable:publications:zx7mwKxW16mNkfChp64x2020-05-22T02:02:00.001Z2020-06-25T20:30:56.494ZMaster Node RebootedIncident (5/21/20): While fixing issues with the GPU nodes, the master node became unstable because several mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF...<p><strong>Incident (5/21/20):</strong></p> <p>While fixing issues with the GPU nodes, the master node became unstable because several mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF personnel the next morning. No SLURM jobs were reported as affected, but new logins were denied until the master node was rebooted. The outage lasted for about 8 hours.</p> Floyd Fayton[email protected]urn:noticeable:publications:omAlGcd5omhvBQgZVFeq2020-01-14T21:45:00.001Z2020-01-17T17:08:01.805ZMemex unable to accept new user loginsIssue 01/14/20: While installing new packages on the login server, the file /etc/resolv.conf was overwritten and caused new user logins to fail. Once the file was replaced with the proper nameserver values, Memex accepted new logins...<p><strong>Issue 01/14/20:</strong></p> <p>While installing new packages on the login server, the file /etc/resolv.conf was overwritten and caused new user logins to fail. 
Once the file was replaced with the proper nameserver values, Memex accepted new logins again.</p> Floyd Fayton[email protected]urn:noticeable:publications:5csmLDRBAVK9iyQKDttS2019-12-19T17:35:00.001Z2020-01-15T17:27:49.876ZUser reported that rsync/cp/scp too slow on /memexnfs/apps,Temporary Resolution 01/02/20: Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as...<p><strong>Temporary Resolution 01/02/20:</strong><br> Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as expected, all disk-to-disk writes to /memexnfs are not. As load increases, presumably from cluster jobs and transfers (syncs to /home), write performance suffers. The workaround is to use rclone, and this email was sent out to all users:</p> <blockquote> <p>Please use rclone for large local or remote transfers while using /memexnfs/* filesystems, i.e. /share/apps/dept, /home/username, /work/DEPT, or /scratch/username. There seems to be an issue with the common Linux commands, rsync and cp, when transferring large directories (size and number of files).</p> </blockquote> <blockquote> <p>The solution is to use rclone instead of rsync or cp or scp for large directories (size and number of files):</p> </blockquote> <p><code>rclone sync /home/username/directory/ /scratch/username/directory/ -LP</code></p> <blockquote> <p>This syncing issue currently affects write speeds but not read speeds for large directories to /memexnfs/*. 
This solution has been tested and should also work fine for small directories and files.</p> </blockquote> <blockquote> <p>Of course, rclone is used to sync files to/from GDrive as well.</p> </blockquote> <blockquote> <p>Rclone Tutorial:<br> <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000040389?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.user-reported-that-rsync-cp-scp-too-slow-on-memexnfs-apps&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.5csmLDRBAVK9iyQKDttS&amp;utm_medium=newspage" target="_blank" rel="noopener">https://carnegiescience.freshservice.com/support/solutions/articles/3000040389</a></p> </blockquote> <p><strong>Issue 12/19/19:</strong><br> A user reported rsyncs were too slow on the /share/apps mount of /memexnfs/apps. Since all /memexnfs/* mounts share the same disks/setup, it was determined the issue was not isolated to /share/apps but also affected /work, /scratch, and /home (all /memexnfs mountpoints across the cluster).</p> Floyd Fayton[email protected]urn:noticeable:publications:EC7QoW1ZmiKkRDgyNIDr2019-12-02T19:57:00.001Z2020-01-15T17:25:26.267ZLogin Node Slowness (module command hanging on memex.carnegiescience.edu)Resolved 12/10/19: After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x...<p><strong>Resolved 12/10/19:</strong></p> <p>After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x bandwidth). 
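As a rough sketch of the kind of change involved (the thread count below is illustrative only, not the actual value used on Memex):

```shell
# Hypothetical sketch: raise the number of kernel NFS server threads on the
# NFS server. RPCNFSDCOUNT=64 is an example value, not the Memex setting.
sudo sed -i 's/^#\?RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' /etc/sysconfig/nfs
sudo systemctl restart nfs-server   # restart so the new thread count takes effect
cat /proc/fs/nfsd/threads           # verify the running thread count
```

More threads let the server handle more concurrent client requests, which is why this single variable can dominate the other tuning knobs under heavy load.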
This tuning of the system was partially documented in <a href="https://carnegiescience.freshservice.com/support/solutions/articles/3000044399?utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.login-slowness-module-command-hanging&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.EC7QoW1ZmiKkRDgyNIDr&amp;utm_medium=newspage" target="_blank" rel="noopener">our ticket system</a> for future reference.</p> <p><strong>Update 12/04/19:</strong></p> <p>To see improvements to the module command, please log out and log back in. This step is required for you to take advantage of Lmod’s caching feature, which improves its responsiveness. In the meantime, I am still investigating ways to improve filesystem performance for <code>/memexnfs/*</code> mountpoints. If you are running SLURM jobs which write or read large files, I suggest using <code>/lustre/scratch/$USER</code> as a working directory. If you are running a parallel or multiprocessing job, also use <code>/lustre/scratch/$USER</code> as a working directory. For instance,</p> <blockquote> <p>mkdir /lustre/scratch/$USER<br> rsync -aWz /home/$USER/workdir/ /lustre/scratch/$USER/workdir/<br> cd /lustre/scratch/$USER/workdir/</p> </blockquote> <p>then submit your job as normal.</p> <p>After the job finishes, you can <code>rsync</code> the directory back to <code>/home/$USER/workdir</code> for safekeeping. 
Please keep in mind, you’ll need to add <code>"--delete"</code> to the <code>rsync</code> command for an exact copy of the <code>/lustre/scratch/$USER/workdir</code> (which deletes any files/dirs in <code>/home/$USER/workdir</code> that are not in <code>/lustre/scratch/$USER/workdir</code>).</p> <p>Use cautiously,</p> <blockquote> <p>rsync -aWz --delete /lustre/scratch/$USER/workdir/ /home/$USER/workdir/</p> </blockquote> <p>Please note, the directories under <code>/lustre/scratch</code> are <strong>not</strong> backed up and are <strong>not</strong> currently scrubbed.</p> <p>The use of <code>/lustre/scratch/$USER</code> is recommended because the read/write performance of <code>/memexnfs/*</code> is being hindered by multiple I/O streams, including transfers (by root and users), SLURM jobs, and any other login node activity by users (VNC, shell/interpreter scripts, etc.).</p> <p><strong>Update 12/03/19:</strong></p> <p>Lmod’s cache was enabled to improve the performance of the module commands on Memex. However, I/O performance is still a bit sluggish, so more investigation is required to improve performance on the mounted filesystems.</p> <p><strong>Issue (started week of 11/25/2019):</strong></p> <p>After logging onto Memex (password/DUO), the user login hangs while trying to load modules. This is an ongoing issue which seems to be caused by remote and/or local mounted filesystems,</p> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/EC7QoW1ZmiKkRDgyNIDr/01h55ta3gsaf68z6xy02yhv34s-image.png" alt="Screen Shot 2019-12-02 at 3.12.27 PM.png"></p> <p>while Lmod is traversing one or more of the module paths (usually in “$MODULEPATH”). 
We are investigating the issue, but in the meantime <strong>press</strong> “Ctrl+c” if the bash command prompt doesn’t appear swiftly after the following banner:</p> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/EC7QoW1ZmiKkRDgyNIDr/01h55ta3gswf49314edn913r1t-image.png" alt="Screen Shot 2019-12-02 at 3.03.54 PM.png"></p> Floyd Fayton[email protected]urn:noticeable:publications:oXsj5DmMQxuikxEAwXp62019-07-09T19:41:00.001Z2020-01-15T17:26:29.981ZIntel Python Issue.. conda base corruptedOn July 8th, the conda environment for modules, python/2.7.0 and python/3.6.0, was affected after an incomplete install for seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed...<p>On July 8th, the conda environment for modules, python/2.7.0 and python/3.6.0, was affected after an incomplete install for seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed. Although the previous conda environments and Python base were still intact, new package installations using conda were failing altogether. Since Intel released their 2019 Parallel Studio XE (compilers and Python) and the 2018 installed version was still functional, effort is being made to migrate to the 2019 versions and eventually abandon the 2018 Python versions.</p> <p>The 2019 Intel Parallel Studio XE toolkit, including Python versions 2.7.16 and 3.6.7, has just been installed on Memex. To use the 2019 Intel compilers, use module “intel/2019” (includes icc, ifort, etc.), which is an upgrade to the “intel/2018” module (lowercase ‘i’ matters). I am currently working on modules for Intel’s Python 2 and 3, located under /share/apps/intel/2019 as intelpython2/ and intelpython3/.</p> <p>If you are only interested in the 2019 Intel Compilers and don’t use Intel’s Python versions, you can stop reading here. 
If you don’t know which python version you’re using, type “python --version” from the command line. It will indicate Intel or GNU. For those who are interested in the 2019 Python installation on Memex, please continue reading…</p> <p>The 2019 Intel Python installation does not inherit the packages and conda environments from the previous 2018 Intel Python installations, which are modules “python/2.7.0” and “python/3.6.0” on Memex. This means you can continue to use those 2018 Intel installations (including compilers, python, and python conda envs), but updates to those 2018 packages will be abandoned by September 1st, 2019. To save and recreate your current conda environment for the 2019 installation, see <a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html?highlight=base&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.intel-python-issue-conda-base-corrupted&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.oXsj5DmMQxuikxEAwXp6&amp;utm_medium=newspage#sharing-an-environment" target="_blank" rel="noopener">Sharing an environment</a> and then <a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html?highlight=base&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.intel-python-issue-conda-base-corrupted&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.oXsj5DmMQxuikxEAwXp6&amp;utm_medium=newspage#creating-an-environment-from-an-environment-yml-file" target="_blank" rel="noopener">Creating an environment from an environment.yml file</a>. This file can be saved and used in other Python/Conda setups (on other machines as well).</p> <p>I am working to install a few general Python packages for the new 2019 installation, so please feel free to send requests for package installations to <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a>. 
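The export-and-recreate workflow linked above can be sketched as follows (the environment name "myenv" is illustrative, not an actual environment on Memex):

```shell
# While the old 2018 Python module is loaded: export an existing
# environment (packages and versions) to a portable YAML file.
conda env export -n myenv > environment.yml
# After loading a new 2019 Python module: recreate the environment
# from that file.
conda env create -f environment.yml
```

The same environment.yml file can also be carried to other machines that have conda installed.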
The new 2019 Intel Python modules will be “python/2.7.16” and "python/3.6.7". These modules are available now, but I am still working to install packages this week and establish a conda "base". These packages include:</p> <blockquote> <p>numpy<br> matplotlib<br> seaborn<br> tensorflow<br> pandas<br> keras<br> sklearn<br> r<br> rstudio<br> jupyter notebook<br> r-rgdal<br> and more… (some packages are easier to install than others!)</p> </blockquote> <p>Again, these packages will establish the “base” for each Python module and their downstream conda environments, so if you have a general package you’d like me to install, let me know by this week. This is important because if the base for either Python version changes, any conda environment created on top of it will be affected. This work is ongoing…</p> <p>If requests for package installations involve pulling from GitHub or other third-party sources, then a conda environment might become necessary. Not all packages and/or package versions are compatible. For this reason, wait until after this week to create your own conda environments. Personal conda environments can be set up without having Memex admin privileges (recommended if you want full control of your environment). For example (instructions taken from here),</p> <blockquote> <p>conda create -n myenv<br> conda activate myenv</p> </blockquote> <p>will create a conda environment in /home/username/.conda/envs/myenv and then prepend your command prompt with "(myenv)", indicating you’re now in your newly created conda environment. This conda environment allows you to install specific versions of packages as well, but your initial conda environment depends on the module you start with (i.e. “python/2.7.16” or “python/3.6.7”). 
You can specify what version of a package to install using the following command (package here is "python", version is “3.6.8”):</p> <blockquote> <p>conda install -n myenv python=3.6.8</p> </blockquote> <p>Conda will try to accommodate this request by downgrading, removing, upgrading, superseding, or installing packages for “python=3.6.8” dependencies, and then ask if you want to proceed (y/n?). I can tell you from experience, Python 3 is easier for Conda to work with than Python 2, but most issues can be worked through in conda environments.</p> <p>For issues, please email <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a> to create a ticket.</p> <p>Other useful conda commands/instructions:</p> <blockquote> <p>conda deactivate #exit conda environment<br> conda env list #list available conda environments<br> conda list scipy #list package version in current environment<br> <a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html?highlight=base&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.intel-python-issue-conda-base-corrupted&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.oXsj5DmMQxuikxEAwXp6&amp;utm_medium=newspage#sharing-an-environment" target="_blank" rel="noopener">sharing an environment</a><br> <a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html?highlight=base&amp;utm_source=noticeable&amp;utm_campaign=3f43ej0latlbxv21efel.intel-python-issue-conda-base-corrupted&amp;utm_content=publication+link&amp;utm_id=3f43Ej0LaTLbXv21eFel.t8lIbf2iSTWZIIP91xqU.oXsj5DmMQxuikxEAwXp6&amp;utm_medium=newspage#removing-an-environment" target="_blank" rel="noopener">deleting an environment</a></p> </blockquote> <p>Updates will follow…</p> Floyd Fayton[email protected]urn:noticeable:publications:F4gcDNCdfQboogwv66pt2019-04-12T16:07:00.001Z2020-01-15T17:26:44.299ZError via getvnfs networkingFixed: by restarting httpd on the OpenHPC master node. Issue: Down nodes could not be reimaged because of the PXE process hanging at the getvnfs stage (seen in /var/log/messages, boot never makes it past getvnfs). Affected nodes: memex-c[014...<p><strong>Fixed:</strong> by restarting httpd on the OpenHPC master node.</p> <p><strong>Issue:</strong> Down nodes could not be reimaged because of the PXE process hanging at the getvnfs stage (seen in /var/log/messages, boot never makes it past getvnfs). Affected nodes: memex-c[014,042,038,072,075,088].</p> Floyd Fayton[email protected]urn:noticeable:publications:jasHqnRdfm3dqEhLIodz2019-03-08T17:25:00.001Z2020-01-15T17:25:49.589ZLogin Stalled - SSH denial/System locked upMemex's login became unresponsive to established and new SSH sessions. A reboot was initiated shortly after. The initial concern was that a newly installed package, abrt-gui, detected a system issue and halted the server. Admin...<p>Memex’s login became unresponsive to established and new SSH sessions. A reboot was initiated shortly after. The initial concern was that a newly installed package, abrt-gui, detected a system issue and halted the server. Admin Floyd is investigating.</p> Floyd Fayton[email protected]urn:noticeable:publications:s73wCLoCEPOtmIh4eD502019-02-26T00:40:00.001Z2019-02-26T01:34:03.286ZMemex Up: Switch failure in Rack 2 (resolved)Issue: The switch in rack 2 has powered down for reasons unknown (probably failure). We will be inspecting the switch and if necessary, have Dell replace it. Apologies for the interruption of your work, but we are diligently working on...<p><strong>Issue:</strong> The switch in rack 2 has powered down for reasons unknown (probably failure). We will be inspecting the switch and, if necessary, have Dell replace it. 
Apologies for the interruption of your work, but we are diligently working on the issue.</p> <p><img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/s73wCLoCEPOtmIh4eD50/01h55ta3gsrbb97f79svqrefay-image.jpg" alt="2019-02-20.jpg"></p> <p><strong>Update (2/25/19):</strong> <img src="https://storage.noticeable.io/projects/3f43Ej0LaTLbXv21eFel/publications/s73wCLoCEPOtmIh4eD50/01h55ta3gsed2emzbwst07a5ge-image.png" alt="spidertocat.png"> New Dell switch configured and Memex back in service around 1pm PST. Initial issues (all internal network) were cleared up around 4pm PST.</p> <p><strong>Update (2/23/19):</strong> Floyd will be traveling to SRCF on Sunday morning to configure the switch.</p> <p><strong>Update (2/22/19):</strong> Remote hands rescheduled to 4:00pm PST.</p> <p><strong>Update (2/22/19):</strong> Remote hands said they’ll configure the switch at 1:30pm PST.</p> <p><strong>Update (2/22/19):</strong> The first attempt to configure the switch was unsuccessful on 2/21/19. Stanford’s Remote Hands service is scheduled to go to SRCF today, 2/22/19, to configure the network settings on the switch.</p> <p><strong>Update (2/21/19):</strong> Switch replaced and we are in the process of configuring it now.</p> <p><strong>Update (2/21/19):</strong> Replacement part scheduled for installation at 11AM PST.</p> <p><strong>Update (2/21/19):</strong> Memex down until further notice. Dead switch photo included above.</p> <p><strong>Update (2/21/19):</strong> We initially assumed it was the PSU of the failed Ethernet switch, but it appears the entire switch has failed and needs replacement.</p> Floyd Fayton[email protected]