Changelog Updates

hpc-internal.carnegiescience.edu

Memex Login Hung on 11/17/21

by Floyd Fayton

Initial incident and probable cause: A broken pipe on the login node, caused by a file update on the login node. The master node and login node were both rebooted, and the “wwsh file sync” commands were automated in memex_routecheck.sh (crontab and cron.hourly).

Login hangs after kernel message...

by Floyd Fayton, HPC Admin

Update (6/29/20): The login hang was caused by high I/O load, which has returned after a weekend hiatus. Unfortunately, limiting I/O on the login node is not yet feasible due to the design of the system. The issue is not due to a lack of...

Master Node Rebooted

by Floyd Fayton, HPC Admin

Incident (5/21/20): While fixing issues with the GPU nodes, the master node became unstable because several mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF...

Memex unable to accept new user logins

by Floyd Fayton, HPC Admin

Issue 01/14/20: While installing new packages on the login server, the file /etc/resolv.conf was overwritten, which caused new user logins to fail. Once the file was replaced with the proper nameserver values, Memex accepted new logins...
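A quick sanity check of this kind can be sketched in Python. This is a minimal illustration, not the check used on Memex: the function name `find_nameservers` and the sample addresses (from the documentation range 192.0.2.0/24) are hypothetical. It only lists the `nameserver` entries in resolv.conf-style text, so that an empty result flags a file that lost its nameserver lines.

```python
def find_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments ('#' or ';' start a comment).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


# Hypothetical sample contents; the real values are site-specific.
sample = """\
# written by hand after the package install
search example.org
nameserver 192.0.2.10
nameserver 192.0.2.11
"""

servers = find_nameservers(sample)
if not servers:
    print("no nameserver entries found")
else:
    print("nameservers:", ", ".join(servers))
```

In practice the text would come from reading /etc/resolv.conf on the login server, for example after a package installation.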

User reported that rsync/cp/scp are too slow on /memexnfs/apps

by Floyd Fayton, HPC Admin

Temporary Resolution 01/02/20: Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as...
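As a rough sketch of what such a transfer could look like, the Python snippet below only builds an `rclone copy` command line; it does not run it. The helper name `build_rclone_copy`, the source path, and the transfer count are assumptions for illustration, and the entry above does not say which rclone subcommand or options were actually used. `copy` is chosen here because, unlike `sync`, it does not delete files at the destination.

```python
import shlex


def build_rclone_copy(source, destination, transfers=4):
    """Build an argument list for copying a directory tree with rclone.

    --transfers sets how many files rclone copies in parallel.
    """
    return [
        "rclone", "copy",
        source, destination,
        "--transfers", str(transfers),
    ]


# Hypothetical paths: only /memexnfs/apps appears in the report above.
cmd = build_rclone_copy("/home/someuser/mytool", "/memexnfs/apps/mytool", transfers=8)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host where rclone is installed.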

Login Node Slowness (module command hanging on memex.carnegiescience.edu)

by Floyd Fayton, HPC Admin

Resolved 12/10/19: After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x...
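To confirm what a server is currently configured with, the setting can be read back from the file. The sketch below is a minimal, assumed parser for sysconfig-style `KEY=value` text; the function name `read_sysconfig_value` and the sample values are hypothetical and are not the values chosen on Memex.

```python
def read_sysconfig_value(text, key):
    """Return the value assigned to key in KEY=value style text, or None.

    Commented-out lines are ignored, and the last assignment wins,
    which matches how a shell would source the file.
    """
    value = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw_value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes.
            value = raw_value.strip().strip('"').strip("'")
    return value


# Hypothetical file contents; the actual thread count is site-specific.
sample = """\
# Number of nfs server processes to be started.
#RPCNFSDCOUNT=8
RPCNFSDCOUNT=64
"""

print("RPCNFSDCOUNT =", read_sysconfig_value(sample, "RPCNFSDCOUNT"))
```

In practice the text would be read from /etc/sysconfig/nfs on the NFS server.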

System Update & Failed disk in SureStoreHD, memexnfs ZFS pool degraded

by Floyd Fayton, HPC Admin

System Update 11/6/19: System was updated on November 8th, which includes updates to SLURM, ZFS (0.6 to 0.8, which improves time to rebuild failed disk), and CentOS (7.5 to 7.7). Resolved 11/2/19: Issue resolved. Replacement disk has...

Memory failing in our SureStore UHD server (replacing DIMMs today)

by Floyd Fayton, HPC Admin

Update: Replacing the failing DIMMs now. Issue: During IOR testing of /work on Memex, it was discovered that performance was lower than usual and two DIMMs were failing. Once this was discovered, the manufacturer was contacted for replacements...

Intel Python Issue: conda base corrupted

by Floyd Fayton, HPC Admin

On July 8th, the conda environment for the modules python/2.7.0 and python/3.6.0 was affected after an incomplete install of seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed...
