changelog Updates

System Update & Failed disk in SureStoreHD, memexnfs ZFS pool degraded

by Floyd Fayton, HPC Admin
System Update 11/6/19: The system was updated on November 8th. The update covered SLURM, ZFS (0.6 to 0.8, which improves the time to rebuild a failed disk), and CentOS (7.5 to 7.7). Resolved 11/2/19: Issue resolved. Replacement disk has...
Announcement
System Failure

Memory failing in our SureStore UHD server (replacing DIMMs today)

by Floyd Fayton, HPC Admin
Update: Replacing the failing DIMMs now. Issue: During IOR testing of /work on Memex, it was discovered that performance was lower than usual and that two DIMMs were failing. Once this was discovered, the manufacturer was contacted for replacements...
System Failure
Announcement

Intel Python Issue: conda base corrupted

by Floyd Fayton, HPC Admin
On July 8th, the conda environment for the modules python/2.7.0 and python/3.6.0 was affected after an incomplete install of seaborn and pandas was aborted. Subsequent steps to fall back to a sane state and install those packages failed...
Announcement
System Failure
Fix

Error via getvnfs networking

by Floyd Fayton, HPC Admin
Fixed: by restarting httpd on the OpenHPC master node. Issue: Down nodes could not be reimaged because the PXE process hung at the getvnfs stage (seen in /var/log/messages; the boot never makes it past getvnfs). Affected nodes: memex-c[014...
System Failure
Fix

Login Stalled - SSH denial/System locked up

by Floyd Fayton, HPC Admin
Memex's login became unresponsive to established and new SSH sessions. A reboot was initiated shortly after. The initial concern is that a newly installed package, abrt-gui, detected a system issue and halted the server. Admin...
Announcement
System Failure
Fix

Memex Up: Switch failure in Rack 2 (resolved)

by Floyd Fayton, HPC Admin
Issue: The switch in rack 2 has powered down for unknown reasons (probably a failure). We will inspect the switch and, if necessary, have Dell replace it. Apologies for the interruption to your work, but we are diligently working on...
System Failure
Announcement
Fix

New Nodes Added - memex-c[117-124]

by Floyd Fayton, HPC Admin
Nodes memex-c[117-124] were added to Memex on 2/13/19. These nodes are identical to memex-c[109-116]: each has 256 GB of raw memory and up to 250 GB of free/unused memory. Users can request any of these nodes by adding...
Improvement
Announcement

Failure of memex-c[013-014,038,072,075] - Dell ticket opened

by Floyd Fayton, HPC Admin
A Dell support ticket was opened on 2/7/19 to address the failure of compute nodes memex-c[013-014,038,072,075]. Onsite and remote logs were emailed to Dell subject matter experts, but they haven't found any hardware issues yet. Memex...
System Failure
Announcement
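Several entries in this changelog name nodes in bracketed range notation, for example memex-c[117-124] or memex-c[013-014,038,072,075]. As a reading aid, the following is a minimal Python sketch of how such an expression can be expanded into individual hostnames. The function name `expand_hostlist` is hypothetical and not part of any Memex or SLURM tooling, and the sketch only handles a single bracket group with zero-padded indices. On a system with SLURM installed, `scontrol show hostnames` is the usual way to do this.

```python
import re


def expand_hostlist(expr):
    """Expand a simple bracketed host range such as 'memex-c[013-014,038]'.

    Hypothetical helper for illustration only: supports one bracket group
    containing comma-separated indices or index ranges.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        # No bracket group: treat the expression as a single hostname.
        return [expr]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start      # a lone index like '038' is a range of one
        width = len(start)      # preserve zero padding, e.g. '013'
        for index in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{index:0{width}d}")
    return hosts


print(expand_hostlist("memex-c[013-014,038,072,075]"))
# ['memex-c013', 'memex-c014', 'memex-c038', 'memex-c072', 'memex-c075']
```

The zero-padding width is taken from the first index of each range, which is why memex-c[013-014] yields memex-c013 rather than memex-c13.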

Welcome 👋

by Floyd Fayton, HPC Admin
Changelog for Memex updates and announcements.
Welcome Guide