Login hangs after kernel message...
Update (6/29/20):
The login hang was caused by high I/O load, which has returned after a weekend hiatus. Unfortunately, limiting I/O on the login node is not yet feasible due to the design of the system. The issue is not a lack of memory, CPU, or bandwidth on the login node, but the node's limited ability to process large amounts of user I/O (including transfers and local login processes). The issue does not occur when the same work (transfers, normal commands, writing files, etc.) is performed on compute nodes. Short of an entire system redesign, we are planning a way forward for how storage is used and configured on Memex. The earliest possible change(s) will be made during the July 8th shutdown. Details will be sent once those plans are set.
Incident (06/24/20 @1540):
The login node stalled after the system-wide kernel messages below:
```
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:23:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:23:34 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:24:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:24:30 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:24:58 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
```
-------------