Alan McKinnon on Tue, 5 May 2009 20:59:26 +0200 (SAST)



[GLUG-tech] Disk space being consumed somewhere


Hi all,

Odd one this. A colleague asked me to look at a RHEL5 server running Oracle 
that was running out of disk space on /. It's a standard RHEL fs layout on 
ext3 with 66G on LogVol00 (mounted at /). df says 62G is used, du says around 
12G is used. First check was "dd bs=1024 count=1000000000 if=/dev/zero 
of=/1", which failed at 2.4G. OK, so there is only 2.4G of space available. 
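
For reference, the df versus du comparison can be sketched roughly as below. 
The function name usage_gap is made up for illustration, and the default of / 
is an assumption; it only prints the two figures and their difference for one 
filesystem, it does not explain the gap.

```shell
# Sketch: compare "used" space as seen by df and by du for one mount point.
# usage_gap is a hypothetical helper name; the default mount point is assumed.
usage_gap() {
    mount_point="${1:-/}"

    # df -P prints one line per filesystem; column 3 is the used space in KiB.
    df_used=$(df -P -k "$mount_point" | awk 'NR==2 {print $3}')

    # du -x stays on one filesystem, -s prints a single total, -k uses KiB.
    # Errors from unreadable directories are discarded.
    du_used=$(du -sxk "$mount_point" 2>/dev/null | awk '{print $1}')

    echo "df used: ${df_used} KiB"
    echo "du used: ${du_used} KiB"
    echo "gap: $((df_used - du_used)) KiB"
}

# Example call (run as root so du can read everything):
# usage_gap /
```

Given a subdirectory rather than a mount point, df still reports the whole 
filesystem while du reports only that directory, so the gap is only 
meaningful for a real mount point.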

I checked the obvious things:

- bind mounted / somewhere else, du still says 12G used (so no log files 
hidden under the var mount for example)
- du -l (count hard links many times) - small increase, and no file is hard 
linked more than 3 times, verified with ls and sort
- du --apparent-size shows a small increase (thinking maybe a runaway java 
process created a million small files < 10 bytes)
- /tmp is about 200k (smaller than expected as tmpwatch isn't being used and 
/tmp is on /, but no worries)
- tune2fs -l shows 5% reserved for root (the default), a mere few % of inodes 
consumed, and 75% of blocks free (consistent with du's numbers)
- [pv|vg|lv]display all normal, set to RH defaults - this is expected, the 
machine owner clicked yes,yes,yes at install, not being familiar with LVM
- a read-only fsck (which doesn't do much as / is mounted) at least shows the 
superblock is consistent and usable (numbers agree with tune2fs, obviously)
- all on-disk filesystems are ext3, mkfs'ed with RH defaults

Checking further, a java parent process (not a child thread) is using 100% of 
one CPU, and lsof shows a huge number of FIFOs in use - looked like >500 but I 
neglected to wc to make sure :-(   Unfortunately, it didn't occur to me to 
check what swap was in use, but swap files show up as regular files AFAIK, and 
like sparse files can consume less space than appears, not more. A reboot put 
things back to normal, but I fully expect it to happen again (the owner says 
he's been watching the machine seeing usage creep up and du staying static).
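
The FIFO count I neglected to take could be obtained along these lines next 
time. This is only a sketch: count_fifos is a made-up helper name, and it 
assumes lsof's field output mode (-F t), where each open file's type appears 
on its own line prefixed with "t".

```shell
# Sketch: count FIFO entries in lsof field output read from stdin.
# count_fifos is a hypothetical helper; it expects "lsof -F t" style input,
# in which a FIFO shows up as a line reading exactly "tFIFO".
count_fifos() {
    # grep -c prints the count; "|| true" hides its non-zero exit on 0 matches.
    grep -c '^tFIFO$' || true
}

# Example call, with PID standing in for the java process id:
# lsof -p "$PID" -F t | count_fifos
```

Swap in use can be read at the same time with "cat /proc/swaps".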

My current working theory: something in the oracle/java/whatever stack is 
buggy, creating temp files of some form and not releasing them. Am I correct 
in saying that if a FIFO blocks, it's backed by disk? And where? /tmp is the 
obvious place but it had only 188k used per du.
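
One way the "created but never released" part of the theory might be tested, 
sketched under the assumption that such files have been unlinked while a 
process still holds them open (their blocks stay allocated, so df counts them 
and du cannot see them). On Linux the symlinks under /proc/PID/fd end in 
" (deleted)" in that case. list_deleted_open is a made-up name, and "lsof +L1" 
should give a similar list.

```shell
# Sketch: list open file descriptors whose target file has been unlinked.
# Linux only; list_deleted_open is a hypothetical helper name.
# Run as root, otherwise other users' processes are silently skipped.
list_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        # The descriptor may vanish or be unreadable; skip it in that case.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

# Example call:
# list_deleted_open
```

If the java process shows up with large deleted files, that would account for 
du staying static while df creeps up, and for the reboot freeing the space.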

It's a nice theory, and if I'm right, what would my next step be?
<Alan concedes to being baffled at this point>

-- 
alan dot mckinnon at gmail dot com