changelog Updates
hpc-internal.carnegiescience.edu
2021-12-07T18:26:14.900Z
Copyright © changelog
Noticeable

Memex Login Hung on 11/17/21
2021-11-17T21:32:57.411Z / 2021-12-07T18:26:14.900Z
<p>Initial incident and probable cause: </p><p>A broken pipe on the login, caused by a file update on the login. The master node and login were both rebooted, and the “wwsh file sync” commands were automated in memex_routecheck.sh (crontab and cron.hourly). After the reboot, the routing table and maintenance motd were both updated sooner than before. A ping check was also added to determine whether the network needs restarting after the routing table is updated.</p>
Floyd Fayton

Login hangs after kernel message...
2020-06-25T20:09:00.001Z / 2020-06-29T14:06:52.111Z
<p><strong>Update (6/29/20):</strong><br> The <strong>login</strong> hang was caused by high I/O load, which is returning after a weekend hiatus. Unfortunately, limiting the I/O on the login is not yet feasible due to the design of the system. 
The issue is not due to a lack of memory, CPU, or bandwidth on the login, but to the login's limited ability to process heavy I/O from users (including transfers and local login processes). The issue does not occur when the same work is performed on compute nodes (transfers, normal commands, writing files, etc.). Short of an entire system redesign, we are planning changes to how storage is used and configured on Memex. The earliest possible change(s) will be made during the July 8th shutdown. Details will be sent once those plans are set.</p> <p><strong>Incident (06/24/20 @1540):</strong><br> The login stalled after the global message below:</p> <pre><code class="hljs">Message from syslogd@memex at Jun 24 12:22:34 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:23:02 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:23:34 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:02 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]

Message from syslogd@memex at Jun 24 12:24:58 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
</code></pre> Floyd Fayton
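<p>For anyone reviewing similar messages, the following is a minimal sketch of how the soft lockup lines above could be summarized per CPU and process. It is not part of the Memex tooling: the function name <code>summarize_lockups</code>, the regular expression, and the sample text are illustrative assumptions based only on the message format shown in this post.</p>

```python
import re
from collections import Counter

# Assumed pattern, derived only from the kernel lines quoted in this post.
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Return {(cpu, pid): (count, longest_seconds)} for soft lockup lines."""
    counts = Counter()
    longest = {}
    for match in LOCKUP_RE.finditer(log_text):
        key = (int(match.group("cpu")), int(match.group("pid")))
        counts[key] += 1
        longest[key] = max(longest.get(key, 0), int(match.group("secs")))
    return {key: (counts[key], longest[key]) for key in counts}


# Hypothetical sample: the first two messages from the incident above.
SAMPLE = """\
Message from syslogd@memex at Jun 24 12:22:34 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [$IP-m:69361]
Message from syslogd@memex at Jun 24 12:23:02 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [$IP-m:69361]
"""

if __name__ == "__main__":
    for (cpu, pid), (count, secs) in summarize_lockups(SAMPLE).items():
        print(f"CPU#{cpu} pid {pid}: {count} lockups, longest {secs}s")
```

<p>Grouping by CPU and PID makes it easier to see whether a single process is repeatedly stalling one core, as in this incident, or whether lockups are spread across the node.</p>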