changelog Updates

Master Node Rebooted

by Floyd Fayton, HPC Admin
Incident (5/21/20): While fixing issues with the GPU nodes, the master node became unstable because several mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF...
System Failure
Fix

Memex unable to accept new user logins

by Floyd Fayton, HPC Admin
Issue 01/14/20: While installing new packages on the login server, the file /etc/resolv.conf was overwritten, which caused new user logins to fail. Once the file was replaced with the proper nameserver values, Memex accepted new logins...
Fix
System Failure
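
A check along the following lines could flag an overwritten /etc/resolv.conf before logins start failing. This is a minimal sketch, not the procedure used in the incident: the function name `parse_nameservers` and the sample addresses (documentation range 192.0.2.0/24) are assumptions for illustration.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses found in resolv.conf-style text."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines ('#' or ';' at the start).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file contents, not the real Memex values.
sample = "# managed by admin\nsearch example.org\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
healthy = parse_nameservers(sample)
broken = parse_nameservers("search example.org\n")
```

An empty result, as in `broken`, would indicate that the file no longer lists any nameserver.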

User reported that rsync/cp/scp is too slow on /memexnfs/apps

by Floyd Fayton, HPC Admin
Temporary Resolution 01/02/20: Using rclone instead of rsync/cp/scp is 10x faster for large-directory and possibly large-file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as...
System Failure
Announcement
Fix
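
As an illustration of the rclone workaround, the sketch below only builds an `rclone sync` command line and does not run it. The excerpt does not say which options were used, so the `--transfers` value, the source path, and the helper name `build_rclone_sync_command` are assumptions.

```python
def build_rclone_sync_command(source, destination, transfers=8, dry_run=True):
    """Build an argv list for 'rclone sync' between two local paths."""
    command = [
        "rclone", "sync", source, destination,
        # Number of file transfers rclone runs in parallel (assumed value).
        "--transfers", str(transfers),
    ]
    if dry_run:
        # Report what would change without copying or deleting anything.
        command.append("--dry-run")
    return command


# Hypothetical paths; only the /memexnfs/apps prefix comes from the post title.
argv = build_rclone_sync_command("/home/user/project", "/memexnfs/apps/project")
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the dry run output looks right. Note that `rclone sync` makes the destination match the source, so it can delete files at the destination.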

Login Node Slowness (module command hanging on memex.carnegiescience.edu)

by Floyd Fayton, HPC Admin
Resolved 12/10/19: After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x...
Announcement
System Failure
Fix
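
To confirm what RPCNFSDCOUNT is set to, the value can be read from sysconfig-style text as sketched below. The excerpt does not state the value that was chosen, so the numbers in the sample are hypothetical, as is the helper name `read_sysconfig_value`.

```python
def read_sysconfig_value(text, key):
    """Return the last uncommented KEY=value assignment for key, or None."""
    value = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw_value = line.partition("=")
        if name.strip() == key:
            # Later assignments override earlier ones, as when the file is sourced.
            value = raw_value.strip().strip('"').strip("'")
    return value


# Hypothetical contents in the style of /etc/sysconfig/nfs.
sample = '# Number of nfs server processes\n#RPCNFSDCOUNT=8\nRPCNFSDCOUNT="64"\n'
nfsd_threads = read_sysconfig_value(sample, "RPCNFSDCOUNT")
```

On a live server the same function could be applied to the contents of /etc/sysconfig/nfs, read with `open(...).read()`.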

Intel Python Issue: conda base corrupted

by Floyd Fayton, HPC Admin
On July 8th, the conda environment for the modules python/2.7.0 and python/3.6.0 was affected after an incomplete install of seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed...
Announcement
System Failure
Fix

Error via getvnfs networking

by Floyd Fayton, HPC Admin
Fixed by restarting httpd on the OpenHPC master node. Issue: Down nodes could not be reimaged because the PXE process hung at the getvnfs stage (seen in /var/log/messages; boot never makes it past getvnfs). Affected nodes: memex-c[014...
System Failure
Fix

Login Stalled - SSH denial/System locked up

by Floyd Fayton, HPC Admin
Memex's login became unresponsive to established and new SSH sessions. A reboot was initiated shortly after. The initial concern is that a new package, abrt-gui, was installed, detected a system issue, and halted the server. Admin...
Announcement
System Failure
Fix

Memex Up: Switch failure in Rack 2 (resolved)

by Floyd Fayton, HPC Admin
Issue: The switch in rack 2 has powered down for unknown reasons (probably a failure). We will be inspecting the switch and, if necessary, have Dell replace it. Apologies for the interruption of your work, but we are diligently working on...
System Failure
Announcement
Fix