Memory failing in our SureStore UHD server (replacing DIMMs today)
**Update:**
Replacing the failing DIMMs now…
**Issue:**
During IOR testing of /work on Memex, we found that performance was lower than usual and that two DIMMs were failing. Once the failures were discovered, the manufacturer was contacted for replacements, which they shipped overnight.
An emergency reboot was scheduled for 7/30/19 at 1pm PST (4pm EST), and users were notified to shut down their jobs before that time. The shutdown was necessary to avoid issues (i.e., corruption) within the affected Memex directories:
/home (all but DPB/DGE)
/work
/scratch
/share
Lustre (/lustre and /lscratch) should also be rebooted during this time. There are still lingering issues from the OSS2 failure in May 2019.
Users were warned that any jobs not canceled by the shutdown time would be killed.
The SureStore is more than two years old and had been up for 223 days prior to this emergency shutdown.