System Update 11/6/19:

The system was updated on November 8th. The update covered SLURM, ZFS (0.6 to 0.8, which shortens the time needed to rebuild a failed disk), and CentOS (7.5 to 7.7).

Resolved 11/2/19:

Issue resolved. Replacement disk has finished resilvering.

Update 10/22/19:

The new drive in our primary filesystem is still rebuilding and should be done in about 8 days. The filesystem is ZFS (version 0.6) in a RAIDZ1 configuration, which means a single failed drive puts the filesystem in a degraded state. This degraded state will continue until the drive is "resilvered", that is, until the data has been copied to the healthy replacement disk and it comes online. This process, which takes far too long, was flagged as a ZFS bug back in November 2017.

The current version of ZFS, 0.8.0, was released in May of this year and addresses the bug. Resilvering is said to be 5-6x faster than the performance we’re currently seeing. The command-line slowness you are experiencing on Memex’s login node is far worse (up to 5x worse) than the I/O performance on Memex’s compute nodes, but all I/O for /home, /scratch, /work, and /share/apps will be affected. This means you can still submit jobs from the login node, but all other activities in /home, /scratch, and /work will be slow.

A way around this is to use your own Lustre scratch directory, /lustre/scratch/username, to edit files, run local commands, etc. If the directory doesn’t exist, you can create it with “mkdir -p /lustre/scratch/username”. Cleanup for /lustre/scratch/username is turned off for now, and you can even submit jobs from there.

**ANNOUNCEMENT**
We are planning a software update for SLURM and ZFS once the disk replacement is completely done. I will be sending out a notice for a planned reboot, which is necessary to ensure the ZFS filesystem is truly updated. Please keep this in mind as you submit jobs. The updates are expected in a couple of weeks, and any job still running at that point, such as one intended to run for a month or so, will be killed before they begin.

Update 10/14/19:

Drive still resilvering, 24% done and going:

status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Oct 9 12:55:27 2019
18.8T scanned out of 77.8T at 47.3M/s, 363h15m to go
566G resilvered, 24.15% done
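
As a rough cross-check of the figures above, the time remaining can be recomputed from the scanned size, the total size, and the scan rate. This is only a sketch: the function names are made up for illustration, and it assumes the T and M suffixes are binary units (1T = 1024 * 1024 M).

```python
# Assumption: sizes in the status output are binary units (1T = 1024 * 1024 M).
MIB_PER_TIB = 1024 * 1024


def resilver_eta_hours(scanned_tib, total_tib, rate_mib_per_s):
    """Estimate the hours remaining from the figures on the scan line."""
    remaining_mib = (total_tib - scanned_tib) * MIB_PER_TIB
    return remaining_mib / rate_mib_per_s / 3600


def percent_done(scanned_tib, total_tib):
    """Share of the pool scanned so far, as a percentage."""
    return 100 * scanned_tib / total_tib


# Figures copied from the status output above.
hours = resilver_eta_hours(18.8, 77.8, 47.3)
print(f"about {hours:.0f} hours ({hours / 24:.1f} days) to go")
print(f"{percent_done(18.8, 77.8):.1f}% done")
```

The result comes out at a little over 363 hours, or roughly 15 days, which matches the estimate reported above provided the scan rate stays constant.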

Update 10/9/19:
Failed drive replaced and ZFS resilvering with new drive started.

Update 10/7/19:
Increased the quota for /memexnfs/home and decreased the quota for /scratch.

Update 10/7/19:
Memexnfs has become more responsive due to ZFS resilvering after the drive failure. This has resulted in I/O improvements for /home, /work, /scratch, and /share/apps.

Update 10/5/19:
Waiting for new HDD in order to replace the failed disk.

The main issue is a failed drive that affects all /memexnfs/* mounts. I’ll let you know when the drive has been replaced. Until then, Memex’s directories /memexnfs/scratch (or /scratch), /memexnfs/home (or /home), /memexnfs/work (or /work), and /memexnfs/apps (or /share/apps) will all be operating in a degraded state.