How to free 15GB of disk space in a tenth of a second
One of the MySQL servers I help manage was encountering some problems with a full data directory. It was a bit mysterious, because we couldn’t find any files to account for the increased usage. Here are some things we checked:
- A recursive ls -l didn’t show any more, or larger, files than usual.
- Using lsof and looking at the SIZE column didn’t either.
- There were not enough temporary files or tables open (as shown by lsof) to account for the disk space.
Oddly, someone discovered that FLUSH TABLES would drop disk usage by about 15GB in a fraction of a second, allowing the server to continue running without problems.
I carefully measured all of the items in the above list before and after FLUSH TABLES. No doubt about it: no files went away, no files shrank, yet df and du showed the difference in the space free and space used in the data directory. The changes were isolated to an ‘archive’ database that contains old archived-off data in MyISAM-only tables. Archiving jobs add rows to these tables on an ongoing basis.
I decided to use du to measure the disk usage of each file individually, and got results. Hundreds of MyISAM data and index files showed disk usage differences before and after the FLUSH TABLES. All together, these differences added up to the free space difference observed. Here’s a small sample of before-and-after:
< 131076 /var/lib/mysql/data/archive/tbl1#P#cl638.MYI
< 131076 /var/lib/mysql/data/archive/tbl2#P#cl34.MYI
< 131076 /var/lib/mysql/data/archive/tbl3#P#cl636.MYI
---
> 2652 /var/lib/mysql/data/archive/tbl1#P#cl638.MYI
> 4024 /var/lib/mysql/data/archive/tbl2#P#cl34.MYI
> 8920 /var/lib/mysql/data/archive/tbl3#P#cl636.MYI
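The per-file comparison above can be approximated without du. What follows is only a minimal sketch, not the procedure used in the post: it assumes a Linux system where st_blocks is counted in 512-byte units, and the function names (disk_usage_report, snapshot, allocation_changes) are made up for illustration.

```python
import os
import stat


def disk_usage_report(root):
    """Yield (path, apparent_bytes, allocated_bytes) for regular files under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other special files
            # st_size is what ls -l shows; st_blocks (assumed to be in
            # 512-byte units) is what du bases its figures on.
            yield path, st.st_size, st.st_blocks * 512


def snapshot(root):
    """Map each file path to the number of bytes currently allocated to it."""
    return {path: allocated for path, _size, allocated in disk_usage_report(root)}


def allocation_changes(before, after):
    """Return {path: (before_bytes, after_bytes)} for files whose allocation changed."""
    return {
        path: (before[path], after[path])
        for path in before
        if path in after and before[path] != after[path]
    }
```

Taking one snapshot before FLUSH TABLES and one after, then passing both to allocation_changes, would list the files whose allocated space moved even though their apparent size did not.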
This puzzled me a little bit. I tried to decide: is this a kernel bug? XFS bug? MyISAM bug? LVM bug? Known behavior, not-a-bug?
Then I noticed the “before” sizes seemed to fall into some pretty consistent ranges. The samples above show disk usage of 128MB, and there were many more examples of that. Suspicious. On a hunch, I checked the mount options:
/dev/mapper/shardvg-mysql on /var/lib/mysql type xfs (rw,noatime,allocsize=128M)
A quick read of the documentation for the allocsize mount option explains it. The space is preallocated for buffered I/O. InnoDB is not using buffered I/O, so the .ibd files don’t show this behavior. I think this allocation size might be excessive, and I don’t know why it was chosen, but at least now the problem is clear, and I can see a couple of options for solving it.
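As a rough cross-check of the 128MB hunch, here is a small sketch that pulls allocsize out of the mount line and converts it to kilobytes. It assumes the du figures in the sample are in 1KB units and that the suffix is a binary multiplier; the helper name allocsize_bytes is hypothetical.

```python
import re

# Assumed binary multipliers for the size suffix.
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def allocsize_bytes(mount_line):
    """Return the allocsize value in bytes from one line of mount output, or None."""
    match = re.search(r"allocsize=(\d+)([kmg]?)", mount_line, re.IGNORECASE)
    if match is None:
        return None
    number, suffix = match.groups()
    return int(number) * UNITS[suffix.lower()]


line = "/dev/mapper/shardvg-mysql on /var/lib/mysql type xfs (rw,noatime,allocsize=128M)"
prealloc_kb = allocsize_bytes(line) // 1024
print(prealloc_kb)  # 131072
```

Under those assumptions the result, 131072KB, is within 4KB of the 131076 that du reported for each of the sample files before FLUSH TABLES.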



That’s OK — the default allocsize is 8GB. We ran into this recently, with a bunch of 4GB tables taking up 8GB, causing a system to run completely out of disk space. I don’t think allocsize was very well thought through really.
Jeremy Cole
15 Sep 12 at 1:47 pm
On my systems, “man mount” says allocsize’s default is 64kb. Did someone change your systems’ defaults? 8GB is insane.
Xaprb
17 Sep 12 at 8:14 am
Baron: This went into 2.6.38 and is called “dynamic speculative EOF preallocation”. See:
http://serverfault.com/questions/406069/why-are-my-xfs-filesystems-suddenly-consuming-more-space-and-full-of-sparse-file
http://oss.sgi.com/archives/xfs/2010-12/msg00328.html
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=055388a3188f56676c21e92962fc366ac8b5cb72
Especially note the “freespace >5% == max prealloc size full extent (8GB)”
This caused us big problems with a bunch of MyISAM tables which should grow to around 4GB taking 8GB of preallocated space, causing the filesystem to run out of space. This was on a machine which was mistakenly upgraded to 2.6.39 from its usual 2.6.18. Yay for blind kernel upgrades. :P
Jeremy Cole
17 Sep 12 at 3:23 pm
History trivia… DOS wasting 50% of disk.
Back when DOS reigned king, and before they got beyond FAT-16, and when disk sizes were in the MBs, …
If you put a ‘huge’ 2GB disk on DOS as a single drive, the MAU had to be 32KB since there were only 2**16 possible MAUs. There were lots of tiny files, but each took 32KB. This added up. This led to literally 50% of the disk wasted by a typical user.
64KB might be reasonable today. But 8GB? — insane.
Rick James
17 Sep 12 at 7:49 pm
Thanks for the extra info, Jeremy. That’s truly braindead. Did nobody think about the law of unintended consequences? Is it time to recommend that allocsize always be explicitly specified so that the magic is disabled?
Xaprb
17 Sep 12 at 9:26 pm
Looks like in that case allocsize is set explicitly to 128M, but in the default case it is indeed 64k. It can be decreased to as little as 4k, though.
But dynamic speculative pre-allocation is turned off when allocsize is set.
However, it is quite easy to free up the ‘allocated’ space — dropping filesystem cache (/proc/sys/vm/drop_caches) should fix it since there is no actual I/O done in this case.
Also, another way to free it is to close / open the file again which is what FLUSH TABLES did in this case.
The manual equivalent of preallocation is a fallocate (and is required if managed explicitly with O_DIRECT).
Now, this is beneficial for append-only workloads since it can reduce fragmentation, and it works only with non-direct (buffered) I/O (which explains why InnoDB is not affected if innodb_flush_method was set to O_DIRECT).
The ENOSPC angle to this is interesting indeed. But there is also the other angle of filesystems not performing well close to ENOSPC or at ENOSPC (there are a couple of tests where some filesystems fail / used to fail at this point).
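To illustrate the fallocate remark above, here is a hedged sketch using Python's os.posix_fallocate. It is not what XFS or MySQL does internally; the helper name preallocate is made up, and the sketch assumes a Unix system. Note one difference from speculative preallocation: posix_fallocate also extends the apparent file size, so the reserved space is visible to ls.

```python
import os


def preallocate(path, size_bytes):
    """Reserve size_bytes for path up front; return (apparent_bytes, allocated_bytes)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the byte range [0, size_bytes).
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
    st = os.stat(path)
    # st_blocks is assumed to be in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512
```

Because this space is requested explicitly, closing and reopening the file would not be expected to give it back the way FLUSH TABLES released the speculative preallocation.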
Raghavendra Prabhu
19 Sep 12 at 8:24 am
Raghavendra,
The ENOSPC angle is more subtle than that. Say, for example, that a system normally handles 250 x 4GB MyISAM tables, for a total of 1.0TB, with 1.5TB available. When each file is created (empty) with the new xfs dynamic allocation scheme, each of those files reserves 8GB, and is kept open continuously (since table_cache is large enough). The system will only manage to create about 178 of these tables with 8GB preallocated before the system drops below 5% free space and starts reducing the preallocation.
This will mean that the first 178 files will be able to grow to 4GB (while consuming 8GB each) but the remaining 72 of the 250 files will never be able to grow to even the nominal 4GB they should be, as they have pre-allocated at most 2GB and potentially much less.
So the file system is now “full” but may contain almost no data. Even worse, ls, stat, etc. do not show where the space went.
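The arithmetic in this example can be checked with a few lines. This is only a sketch of the numbers quoted above, assuming 1.5TB means 1500GB (with 1536GB the count comes out slightly higher) and that full 8GB preallocation continues until free space drops to 5%.

```python
total_gb = 1500        # "1.5TB available", assumed to mean 1500GB
prealloc_gb = 8        # space reserved per open table file
reserve_percent = 5    # preallocation is reduced below 5% free space
tables = 250

# Space that can be consumed before free space falls below the 5% threshold.
usable_gb = total_gb * (100 - reserve_percent) // 100

# Tables that get the full 8GB reservation, and those that do not.
full_prealloc_tables = usable_gb // prealloc_gb
short_changed_tables = tables - full_prealloc_tables

print(usable_gb, full_prealloc_tables, short_changed_tables)  # 1425 178 72
```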
Regards,
Jeremy
Jeremy Cole
19 Sep 12 at 2:13 pm
What Jeremy said ^^^. In my case the filesystem is fairly small, and until I figured this out, it was pretty puzzling how it could be “full” and causing the DB server to crash when it was only “half full”.
Xaprb
19 Sep 12 at 2:25 pm
Yes, the ENOSPC angle exists.
However, a few things:
1. The pre-allocated space won’t exceed the “ondisk” file size.
alloc_blocks = XFS_B_TO_FSB(mp, XFS_ISIZE(ip)) + 1;
alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
rounddown_pow_of_two(alloc_blocks));
In the commit, reference to 8GB is made because the file there is larger than MAXEXTLEN, which is 8GB.
So, the preallocated space for a 4G file cannot exceed 4G.
Regarding “ondisk”: XFS does a lot of delayed allocation, so this is the size of the inode (as reported by VFS) on disk after the last flush.
2. As the commit details, xfs_iomap_write_delay also checks for ENOSPC and if
there is one, it disables preallocation and it flushes all the inodes (including
the previously preallocated ones) to free up the preallocated space.
3. The reason tools like ls, stat don’t reveal it is because the space may not
yet be allocated on disk and these tools check that.
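The snippet quoted in point 1 can be mirrored in Python to see what it yields for a given file size. This is a sketch of those quoted lines only, not of the full kernel logic (which, per the commit, also scales the result by free space). It assumes a 4KB filesystem block size and a MAXEXTLEN of 2^21 - 1 blocks; the function name is hypothetical.

```python
# Assumption: XFS maximum extent length in filesystem blocks (21 bits),
# roughly 8GB with 4KB blocks.
MAXEXTLEN_BLOCKS = (1 << 21) - 1


def speculative_prealloc_blocks(file_size_bytes, block_size=4096):
    """Blocks preallocated for a file of the given on-disk size, per the quoted snippet."""
    # XFS_B_TO_FSB: bytes to filesystem blocks, rounding up.
    size_blocks = -(-file_size_bytes // block_size)
    alloc_blocks = size_blocks + 1
    # rounddown_pow_of_two: largest power of two not exceeding alloc_blocks.
    rounded = 1 << (alloc_blocks.bit_length() - 1)
    # XFS_FILEOFF_MIN: cap the result at the maximum extent length.
    return min(MAXEXTLEN_BLOCKS, rounded)


four_gib = 4 * 1024**3
print(speculative_prealloc_blocks(four_gib) * 4096 == four_gib)  # True
```

Under these assumptions a 4GB file gets at most 4GB preallocated, and only files of 8GB or more hit the MAXEXTLEN cap.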
Now, when you say ENOSPC, was it tending towards it (as reported by df) or was an actual ENOSPC returned by any of the system calls? If it is the latter, there may be a bug in how or when the space is freed up.
Raghavendra Prabhu
19 Sep 12 at 6:28 pm
Quick update over what I mentioned earlier.
It seems an ENOSPC can in fact result, and there is work in progress to fix that: http://oss.sgi.com/archives/xfs/2012-09/msg00179.html
Raghavendra Prabhu
19 Sep 12 at 6:47 pm