Thu, 14 Nov 2013 08:58:39 +0100
From: Jan Kara
Re: 3.12 BUG() on ext4, kernel crash on nbd-client when nbd server rebooting
On Wed 13-11-13 05:59:11, Denys Fedoryshchenko wrote:
> On 2013-11-12 23:46, Jan Kara wrote:
> >On Tue 12-11-13 16:34:07, Denys Fedoryshchenko wrote:
> >>I just did some fault testing for test nbd setup, and found that if
> >>i reboot nbd server i will get immediately BUG() message on nbd
> >>client and filesystem that i cannot unmount, and any operations on
> >>it will freeze and lock processes trying to access it.
> > So how exactly did you do the fault testing? Because it seems something
> >has discarded the block device under filesystem's toes and the
> >buffer_head got unmapped. Didn't something call NBD_CLEAR_SOCK ioctl?
> >Because that calls kill_bdev() which would do exactly that...
> Client side:
> modprobe nbd
> nbd-client 126.96.36.199 /dev/nbd0 -name export1
> nbd-client 188.8.131.52 /dev/nbd1 -name export2
> nbd-client 184.108.40.206 /dev/nbd2 -name export3
> mount /dev/nbd0 /mnt/disk1
> mount /dev/nbd1 /mnt/disk2
> mount /dev/nbd2 /mnt/disk3
> On server i have config:
> exportname = /dev/sda1
> exportname = /dev/sdb1
> exportname = /dev/sdc1
> Steps to reproduce:
> 1) Start a large file copy on the client side to /mnt/disk1/
> 2) Reboot the server. It reboots quite fast, in just a few seconds; the
> server system gets its IP before the nbd-server process starts listening,
> so nbd-client will probably see connection refused.
> 3) It seems that when the client gets connection refused, it goes mad.
> I can try to capture traffic dump, or do any other debug operation,
> please let me know, what i should run :)
> P.S. I noticed maybe i should run persist mode, but anyway it should
> not crash like this i think.
OK, no need for further debugging. I see what's going on. In NBD_DO_IT
ioctl() nbd calls kill_bdev() after the kthread returned - and that happens
in your case as we can see from "queue cleared" messages.
Now the question is how to fix this. Filesystems don't really expect
device buffers to disappear from under them, as they do when nbd calls
kill_bdev(). That also never happens with normal block devices - if a
similar situation happens to a SCSI / SATA disk, the corresponding block
devices hang around refusing any IO until the filesystem is unmounted, and
at that point they disappear (the device's refcount - bd_openers - reaches
zero). It would be good if NBD behaved the same way - maybe we should
return from the NBD_DO_IT ioctl only after bd_openers drops to 1 (not zero,
because the nbd client has the device open as well for the ioctl, if I'm
right)?
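
To make the ordering concrete, here is a toy userspace model of the idea.
It is only a sketch, not kernel code and not a patch: struct toy_bdev and
all toy_* functions are made-up names, the "openers" field merely stands in
for bd_openers, and returning false stands in for "keep waiting" where a
real implementation would presumably sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a block device; hypothetical, not the kernel's struct. */
struct toy_bdev {
    int openers;         /* models bd_openers */
    bool buffers_valid;  /* false once the toy kill_bdev has run */
};

/* Models kill_bdev(): discards the cached buffers of the device. */
void toy_kill_bdev(struct toy_bdev *bdev)
{
    bdev->buffers_valid = false;
}

/* Models a filesystem touching its buffers; false is where it would BUG(). */
bool toy_fs_access(const struct toy_bdev *bdev)
{
    return bdev->buffers_valid;
}

/* Behaviour described above: buffers are discarded as soon as the
 * connection is gone, regardless of who else has the device open. */
void toy_do_it_current(struct toy_bdev *bdev)
{
    toy_kill_bdev(bdev);
}

/* Suggested behaviour: discard buffers only once the ioctl caller is the
 * sole remaining opener. Returns true when teardown happened, false when
 * the caller still has to wait (for example for an unmount). */
bool toy_do_it_proposed(struct toy_bdev *bdev)
{
    assert(bdev->openers >= 1); /* the ioctl caller itself holds it open */
    if (bdev->openers > 1)
        return false;
    toy_kill_bdev(bdev);
    return true;
}
```

With openers == 2 (nbd-client plus a mounted filesystem), the "current"
variant invalidates buffers the filesystem still relies on, while the
"proposed" variant leaves them alone until openers has dropped to 1.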
Jan Kara <firstname.lastname@example.org>
SUSE Labs, CR