Lunchtime Semaphore

from and to 372433 143758 48 N

Load solved!

May 23rd, 2012 by

Finally, I solved an issue that had been pestering me for months. Regularly, while casually browsing, my computer would slow down dramatically for 1-2 minutes before continuing as if nothing had happened. That was particularly irritating, and I couldn’t fix it until I found a good way to reproduce it. These half-broken bugs are the hardest to fix because the incentive to sit down and fix them is much weaker than when everything is broken.

The symptoms

As it appears at random times and slows down everything, diagnosing it is not easy. I usually keep gkrellm open to monitor what’s happening. During these slowdowns, almost everything seems normal: CPU at less than 5%, more than 70% of the memory free, no disk IO, no network IO. Nothing unusual with iftop or iotop either. The only visible problem in gkrellm and in top was the load average: it would climb (sometimes to 10) before slowly coming back down.
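A high load with an idle CPU is less contradictory than it looks: on Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (state D), which usually means they are waiting on IO. The sketch below is not something used in the original diagnosis; it is one way to see which processes are stuck in that state during a slowdown. The function name and the `proc_root` parameter are made up for illustration.

```python
import os


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The line looks like "pid (comm) state ...". The command name may
        # contain spaces or parentheses, so split at the LAST ')'.
        head, _, tail = stat.rpartition(")")
        fields = tail.split()
        if fields and fields[0] == "D":
            blocked.append((int(entry), head.partition("(")[2]))
    return blocked
```

Calling `uninterruptible_tasks()` in a loop next to `os.getloadavg()` while the machine is sluggish should show whether the load comes from tasks blocked on IO rather than from CPU work.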

Note that I changed the default monitoring string for proc in gkrellm to <code>\w88\a$p\f procs\n\e$u\f users\n\e$l\f load</code> to display the load value.

It would happen on any website, with either chrome or firefox. And I couldn’t find anything relevant to this issue on the internet.

Fixing

The first step was to find a way to replicate this, as it seemed to happen at random times and on random sites. I finally figured out that using chrome or firefox and going to google maps, heavily moving around, zooming in and out, and navigating in street view would eventually trigger the problem. At this point I suddenly thought about caching issues.

And then I got it.

My /home is on a raid 1 setup: I don’t want to lose all my photos to a hard disk crash (don’t worry, the raid is far from being the only backup of it). But most linux software is in the habit of putting all user-related stuff in ~/.* and chrome and firefox are no exception. For example, chrome keeps its stuff in ~/.config/google-chrome. Writing to the cache is not a cheap write operation, and my guess is that it does a lot of random-access writes. Then the raid system has to sync all this between the disks, and that increases the load of the system. It does not appear as disk IO in the different monitoring tools, as it is not filesystem IO.

From there, the fix was pretty simple. Move ~/.config/google-chrome to a non-raid disk (as chrome syncs everything to the cloud, losing the config directory is not an issue) and create a symlink from ~/.config/google-chrome to the new location.
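The move-and-symlink step can be sketched in a few lines. This is an illustration rather than the exact commands that were run: the helper name is invented, the target path in the usage note is a placeholder for wherever the non-raid disk is mounted, and the browser should presumably be closed while the directory is being moved.

```python
import os
import shutil


def relocate_with_symlink(original, new_location):
    """Move `original` to `new_location` and leave a symlink in its place."""
    original = os.path.abspath(os.path.expanduser(original))
    new_location = os.path.abspath(os.path.expanduser(new_location))
    if os.path.islink(original):
        raise ValueError(f"{original} is already a symlink")
    if os.path.exists(new_location):
        # shutil.move would nest the directory inside an existing one,
        # which is not what we want here.
        raise FileExistsError(f"{new_location} already exists")
    os.makedirs(os.path.dirname(new_location), exist_ok=True)
    # Falls back to copy-then-delete when the target is on another filesystem.
    shutil.move(original, new_location)
    os.symlink(new_location, original, target_is_directory=True)
    return new_location
```

For example, `relocate_with_symlink("~/.config/google-chrome", "/mnt/scratch/google-chrome")`, where `/mnt/scratch` is a hypothetical mount point for the non-raid disk.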

Lessons

Several useful lessons to get from that:

  • Using /home for cache is bad. The assumption is often that the data there is worth saving, mirroring, etc. and doesn’t require fast access.
  • There is probably some room for improvement in the linux raid driver: I can’t believe it could be so inefficient for some write patterns. Or I didn’t select the correct options when setting it up.
  • I need to find a better monitoring tool for my raid.

This entry was posted on Wednesday, May 23rd, 2012 at 11:06 UTC and is filed under Linux.