Finally, I solved an issue that had been pestering me for months. Regularly, when casually browsing, my computer would slow down dramatically for 1-2 minutes before continuing as if nothing had happened. That was particularly irritating and I couldn’t fix it until I found a good way to reproduce it. These half-broken bugs are the hardest to fix because the incentive to sit down and fix them is much weaker than when everything is broken.
As it appears at random times and slows down everything, diagnosing it is not easy. I usually keep gkrellm open to monitor what’s happening. During these slowdowns, almost everything seemed normal: CPU at less than 5%, more than 70% of the memory free, no disk IO, no network IO. Nothing unusual with iotop either. The only visible problem in gkrellm and in top was the load average. The load would climb (sometimes to 10) before slowly coming back down.
Note that I changed the default monitoring string for proc in gkrellm to <code>\w88\a$p\f procs\n\e$u\f users\n\e$l\f load</code> to display the load value.
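On Linux, the load average counts not only runnable tasks but also tasks in uninterruptible sleep (state D in top), which usually means they are waiting on IO. That would be consistent with a high load next to an idle CPU. Here is a minimal sketch, not something I ran during the slowdowns, that prints the 1-minute load and lists the processes currently in state D. The function names are made up for illustration, and it only looks at the entries listed in /proc, not at the individual threads of each process.

```python
import os
from pathlib import Path


def parse_stat_state(stat_line):
    """Return (command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    left, right = stat_line.rsplit(")", 1)
    command = left.split("(", 1)[1]
    state = right.split()[0]
    return command, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    root = Path(proc_root)
    if not root.is_dir():
        return []
    found = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            command, state = parse_stat_state((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((int(entry.name), command))
    return found


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()
    print(f"load (1 min): {one_minute:.2f}")
    for pid, command in uninterruptible_tasks():
        print(f"  D state: {pid} {command}")
```

Running this in a loop while reproducing the slowdown should show which processes, if any, are stuck waiting while the load climbs.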
It would happen on any website, it would happen with chrome or firefox. And I couldn’t find anything relevant to this issue on the internet.
The first step was to find a way to replicate this, as it seemed to happen at random times and on random sites. I finally figured out that using chrome or firefox to go to google maps, then heavily moving around, zooming in and out and navigating in street view, would eventually trigger the problem. At this point I suddenly thought about caching issues.
And then I got it.
/home is on a raid 1 setup: I don’t want to lose all my photos to a hard disk crash (don’t worry, the raid is far from being the only backup of them). But most linux software is in the habit of putting all user-related stuff in ~/.* and chrome and firefox are no exception. For example, chrome keeps its stuff in ~/.config/google-chrome. Writing to the cache is not a cheap operation, and my guess is that it does a lot of random-access writes. The raid system then has to sync all this between the disks, and that increases the load of the system. It does not appear as disk IO in the different monitoring tools as this is not filesystem IO.
From there, the fix was pretty simple: move ~/.config/google-chrome to a non-raid disk (as chrome syncs everything to the cloud, losing the config directory is not an issue) and create a symlink from ~/.config/google-chrome to the new location.
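The move-and-symlink step can be sketched as follows. This is a minimal sketch, assuming the browser is closed while it runs. The helper name `relocate_with_symlink` and the `/scratch/google-chrome` destination are hypothetical, so substitute whatever non-raid location you actually have.

```python
import os
import shutil
from pathlib import Path


def relocate_with_symlink(src, dst):
    """Move directory src to dst, then leave a symlink at src pointing to dst."""
    src = Path(os.path.abspath(Path(src).expanduser()))
    dst = Path(os.path.abspath(Path(dst).expanduser()))
    if src.is_symlink():
        raise ValueError(f"{src} is already a symlink, nothing to do")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists, refusing to overwrite")
    dst.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move falls back to copy-then-delete when src and dst are on
    # different filesystems, which is the case when leaving the raid.
    shutil.move(str(src), str(dst))
    # os.symlink(target, link_name): the link lives at the old path.
    os.symlink(dst, src, target_is_directory=True)


if __name__ == "__main__":
    # Hypothetical destination on a non-raid disk.
    relocate_with_symlink("~/.config/google-chrome", "/scratch/google-chrome")
```

The same helper should work for any other cache directory, as long as nothing has it open during the move.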
Several useful lessons to get from that:
- Using /home for cache is bad. The assumption is often that the data there is worth saving, mirroring etc. and doesn’t require fast access.
- There is probably some room for improvement in the linux raid driver: I can’t believe it could be so inefficient for some write patterns. Or I didn’t select the correct options when setting it up.
- I need to find a better monitoring tool for my raid.
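As one possible starting point for that last item, the kernel exposes per-device counters in /proc/diskstats, including for md devices and their member disks. The sketch below samples the write counters twice and prints the difference. The device names `md0`, `sda` and `sdb` are assumptions, not my actual layout, and I have not checked whether these counters would have revealed the slowdown described above.

```python
import time
from pathlib import Path
from typing import NamedTuple


class WriteStats(NamedTuple):
    writes_completed: int
    sectors_written: int
    in_flight: int


def parse_diskstats(text):
    """Map device name to its write counters from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # too short to be a full statistics line
        # Columns: major, minor, name, then the counters. Counting from 0,
        # index 7 is writes completed, 9 is sectors written, and
        # 11 is the number of I/Os currently in progress.
        stats[fields[2]] = WriteStats(
            int(fields[7]), int(fields[9]), int(fields[11])
        )
    return stats


def sample_writes(devices, interval=1.0, path="/proc/diskstats"):
    """Print sectors written per device over one sampling interval."""
    before = parse_diskstats(Path(path).read_text())
    time.sleep(interval)
    after = parse_diskstats(Path(path).read_text())
    for name in devices:
        if name in before and name in after:
            delta = after[name].sectors_written - before[name].sectors_written
            print(f"{name}: {delta} sectors written, "
                  f"{after[name].in_flight} in flight")


if __name__ == "__main__":
    # Assumed device names: one md array and its two member disks.
    sample_writes(["md0", "sda", "sdb"])
```

Comparing the numbers for the array with those of its member disks might show how much extra work the mirroring generates.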