The drill goes something like this: The load average on the production web server soars from around 1 to 50 or 100 or more, and one of the web applications stops doing any useful work. It still shows as running in the process table, but it answers no queries and writes nothing to its logs.
Killing the web application processes (a pair of tomcat daemons) appears to work, but they remain in a defunct state, holding onto resources and preventing a new web application from starting. Their parent PID is 1, yet init never reaps them.
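For the record, here's roughly how we confirm the damage; the exact ps incantations are just a sketch, adjust to taste. Zombies (state Z) are the corpses init should have reaped, and processes stuck in uninterruptible sleep (state D) are what push the load average through the roof even though nothing useful is happening:

# list defunct (zombie) processes and their parents
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# list processes stuck in uninterruptible sleep, and roughly where in the kernel they're waiting
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'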
Worse, if one tries to reboot the machine, say via /sbin/reboot, it hangs; the only way out is to cycle the power. Thank goodness for fast boot times and journaled filesystems!
Often these drills take place early in the morning (4am, 5am) when we run big automated jobs, like updating the search index. And we always see a little something like this in /var/log/messages:
Mar 29 04:43:35 top kernel: INFO: task khugepaged:503 blocked for more than 120 seconds.
Mar 29 04:43:35 top kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 29 04:43:35 top kernel: khugepaged D 0000000000000000 0 503 2 0x00000000
Mar 29 04:43:35 top kernel: ffff8804747d9c90 0000000000000046 0000000000000000 ffff880000058d80
Mar 29 04:43:35 top kernel: ffff8804747d9c20 0000000000000002 0000000000000003 ffff880000069b00
Mar 29 04:43:35 top kernel: ffff8804747b26b8 ffff8804747d9fd8 000000000000f4e8 ffff8804747b26b8
Mar 29 04:43:35 top kernel: Call Trace:
Mar 29 04:43:35 top kernel: [<ffffffff8116a1c8>] ? __mem_cgroup_try_charge+0x78/0x420
Mar 29 04:43:35 top kernel: [<ffffffff814eef25>] rwsem_down_failed_common+0x95/0x1d0
Mar 29 04:43:35 top kernel: [<ffffffff814ef083>] rwsem_down_write_failed+0x23/0x30
Mar 29 04:43:35 top kernel: [<ffffffff81276d83>] call_rwsem_down_write_failed+0x13/0x20
Mar 29 04:43:35 top kernel: [<ffffffff814ee582>] ? down_write+0x32/0x40
Mar 29 04:43:35 top kernel: [<ffffffff8116f140>] khugepaged+0x790/0x12c0
Mar 29 04:43:35 top kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar 29 04:43:35 top kernel: [<ffffffff8116e9b0>] ? khugepaged+0x0/0x12c0
Mar 29 04:43:35 top kernel: [<ffffffff81090726>] kthread+0x96/0xa0
Mar 29 04:43:35 top kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Mar 29 04:43:35 top kernel: [<ffffffff81090690>] ? kthread+0x0/0xa0
Mar 29 04:43:35 top kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
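For reference, that 120 seconds is the kernel's hung-task watchdog threshold. As the message itself points out, you can zero it to silence the warning, but that just unplugs the smoke detector; the task is still stuck. A quick sketch of where the knob lives:

# current threshold in seconds; 0 disables the check entirely
cat /proc/sys/kernel/hung_task_timeout_secs
# the same setting via sysctl
sysctl kernel.hung_task_timeout_secs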
This can't be good. It is never good when the kernel logs messages. A happy kernel is a silent kernel.
Our experience has been that heavy use of NFS, especially by writers, tickles this bug more often, so we've taken to offloading some content from our EMC NAS to the local filesystem. That sometimes helps performance too, but it makes the content harder to back up given how we've architected things.
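If you want to check whether NFS is in the picture on your own box, a rough sketch (output formats vary by distro and NFS version):

# which mounts are NFS
grep -E ' nfs4? ' /proc/mounts
# client-side RPC counters; a climbing retrans count means the server isn't keeping up
nfsstat -c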
We've also done some digging into the deepest Red Hat forums and IRC channels looking for answers. Maybe khugepaged is just the victim here? Maybe the real culprit is the memory management in the kernel? Or Colonel Mustard in the library?
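For context, khugepaged is the kernel thread that collapses ordinary pages into transparent huge pages, so THP itself is the obvious knob to inspect. A sketch; on mainline kernels the directory is /sys/kernel/mm/transparent_hugepage, Red Hat's 2.6.32 backport may spell it redhat_transparent_hugepage, and changes made this way don't survive a reboot:

# current transparent huge page setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# disable it at runtime (a sibling "defrag" file takes the same values)
echo never > /sys/kernel/mm/transparent_hugepage/enabled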
This entry got us to try one more thing.
So we'll see how it goes next week.
If only we had Richard III to help. Or Steinbeck.