We saw two months of ghastly web service availability, well below the 99% goal. Lots of pages in the middle of the night. Lots of trips to the office at all hours, on all days, to power-cycle the server.
It was clear that khugepaged was involved somehow. Was it the victim of something else? Or the cause?
Based on the scantiest of evidence and in great desperation, we disabled khugepaged on March 29. And since then?
[ sound of knocking on wood ]
The machine is back to its old self. There has been one very short-lived (seven-minute) outage, caused by a bad rewrite rule that we added in response to a request and then had to back out.
Who knew that this simple command:
root# echo never > /sys/kernel/mm/redhat_
could generate so much happiness?
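If you want to confirm that a change like this actually took, here is a minimal sketch. It assumes the tunable file lists its possible values on one line with the active one in square brackets (for example "always madvise [never]"), which is the convention the kernel's transparent hugepage files follow. The helper name active_value is made up for this illustration, and you supply the path to the file yourself.

```shell
# Hypothetical helper: print the active value of a sysfs-style tunable.
# Assumes the file's contents look like "always madvise [never]",
# with the selected value wrapped in square brackets.
active_value() {
    # Keep only the text between the brackets; print nothing if there are none.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}
```

Run against the same file the echo wrote to, it should print never if the setting is in effect. Nothing here changes any setting; it only reads the file.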