Changelog Updates

hpc-internal.carnegiescience.edu

timestamp1637184777411

Memex Login Hung on 11/17/21

by Floyd Fayton

Initial incident and probable cause: A broken pipe on the login node, caused by a file update on that node. The master node and the login node were both rebooted, and the “wwsh file sync” commands were automated via memex_routecheck.sh (crontab and cron.hourly).

timestamp1590112920001

Master Node Rebooted

by Floyd Fayton, HPC Admin

Incident (5/21/20): While fixing issues with the GPU nodes, the master node became unstable because several mount points were damaged. All of the mounts were runtime filesystems, so a reboot was requested and fulfilled by SRCF...

timestamp1588875120001

SLURM's Default Memory Per CPU Increased (1GB --> 2GB)

by Floyd Fayton, HPC Admin

Hi All, based on previous usage, the default allocation of 1GB of memory per CPU is too low. I have now increased this default to 2GB of memory per CPU. If you are not setting memory requirements in your job(s), this change will...
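
As a rough illustration of what the new default means for jobs that set no memory request, the sketch below multiplies a per-CPU default by the number of CPUs. The function name and the 4-CPU job are hypothetical. In Slurm the per-CPU default is normally governed by the DefMemPerCPU setting in slurm.conf, but this post does not say which setting was changed.

```python
def default_job_memory_mb(cpus: int, mem_per_cpu_mb: int) -> int:
    """Memory (in MB) given to a job that sets no memory request of its own."""
    if cpus < 1:
        raise ValueError("a job needs at least one CPU")
    return cpus * mem_per_cpu_mb


OLD_DEFAULT_MB = 1024  # previous default: 1GB per CPU
NEW_DEFAULT_MB = 2048  # new default: 2GB per CPU

cpus = 4  # hypothetical job size, for illustration only
print(default_job_memory_mb(cpus, OLD_DEFAULT_MB))  # 4096
print(default_job_memory_mb(cpus, NEW_DEFAULT_MB))  # 8192
```

Jobs that need a different amount can still request it explicitly, for example with the sbatch options --mem-per-cpu or --mem.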

timestamp1588864500001

SLURM Priority Adjustment

by Floyd Fayton, HPC Admin

Since priorities were not working for those users who use Memex less frequently and in smaller batches of submitted jobs, these parameters were adjusted: As a result gres/gpu was added to: This functionality changes in SLURM 19+, but...

timestamp1579038300001

Memex unable to accept new user logins

by Floyd Fayton, HPC Admin

Issue 01/14/20: While installing new packages on the login server, the file /etc/resolv.conf was overwritten and caused new user logins to fail. Once the file was replaced with the proper nameserver values, Memex accepted new logins...
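
A check along the following lines could flag an overwritten resolv.conf before logins start failing. This is only a sketch: the function name is hypothetical, the sample addresses come from the documentation-only 192.0.2.0/24 range, and it is not the procedure that was used to fix Memex.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (lines starting with '#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


# Placeholder content, not the real Memex configuration.
good = "search example.org\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
overwritten = "# file replaced during a package install\n"

print(nameservers(good))         # ['192.0.2.1', '192.0.2.2']
print(nameservers(overwritten))  # []
```

An empty result for the text of /etc/resolv.conf would suggest that the file has lost its nameserver entries and should be restored.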

timestamp1576776900001

User reported that rsync/cp/scp are too slow on /memexnfs/apps

by Floyd Fayton, HPC Admin

Temporary Resolution 01/02/20: Using rclone instead of rsync/cp/scp is 10x faster for large directory and possibly large file syncs to /memexnfs/ mountpoints. Although all reads from /memexnfs mounts are performing as...

timestamp1575316620001

Login Node Slowness (module command hanging on memex.carnegiescience.edu)

by Floyd Fayton, HPC Admin

Resolved 12/10/19: After tuning the NFS server and clients, the slowness was resolved. Although there were several adjustments, the RPCNFSDCOUNT variable in /etc/sysconfig/nfs was the change that made the biggest improvement (100x...
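
RPCNFSDCOUNT sets the number of nfsd server threads. To confirm what a server is configured with, the value can be read from the sysconfig file, as in this minimal sketch. The function name is hypothetical and the sample values are placeholders, since the post does not state the value that was chosen for Memex.

```python
import re
from typing import Optional

# Matches an active (uncommented) assignment such as RPCNFSDCOUNT=64 or RPCNFSDCOUNT="64".
_PATTERN = re.compile(r'^\s*RPCNFSDCOUNT\s*=\s*"?(\d+)"?\s*(?:#.*)?$')


def rpcnfsdcount(sysconfig_text: str) -> Optional[int]:
    """Return the RPCNFSDCOUNT value from sysconfig-style text, or None if unset."""
    value = None
    for line in sysconfig_text.splitlines():
        match = _PATTERN.match(line)
        if match:
            # Keep the last assignment, as happens when a shell sources the file.
            value = int(match.group(1))
    return value


# Placeholder content, not the real /etc/sysconfig/nfs from Memex.
sample = "# Number of nfs server processes to be started.\n#RPCNFSDCOUNT=8\nRPCNFSDCOUNT=64\n"
print(rpcnfsdcount(sample))  # 64
```

A result of None means the variable is unset or commented out, in which case the NFS service falls back to its built-in default.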

timestamp1562701260001

Intel Python Issue: conda base corrupted

by Floyd Fayton, HPC Admin

On July 8th, the conda environment for the modules python/2.7.0 and python/3.6.0 was affected after an incomplete install of seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed...

timestamp1555085220001

Error via getvnfs networking

by Floyd Fayton, HPC Admin

Fixed: by restarting httpd on the OpenHPC master node. Issue: Down nodes could not be reimaged because the PXE process was hanging at the getvnfs stage (seen in /var/log/messages; the boot never makes it past getvnfs). Affected nodes: memex-c[014...
