As a short background, we use mdadm to create striped RAID-0 devices out of Amazon EBS volumes for our Sugarcube analytics (OLAP) servers.
The issue manifested itself as a random failure during device creation:
$ mdadm --create /dev/md0 --level=0 --chunk 256 --raid-devices=4 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
mdadm: Defaulting to version 1.2 metadata
mdadm: ADD_NEW_DISK for /dev/xvdh3 failed: Device or resource busy
I searched and searched the interwebs and tried every trick I found, to no avail. We don't have dmraid installed on our Linux images (Ubuntu 12.04 LTS / Alestic cloud image), so there's no possible conflict there. All devices were clean, as they were freshly created EBS volumes, and I knew none of them were in use.
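If you want to double-check your own devices, something along these lines should confirm that nothing is holding them open and that no stale md metadata is lying around (same device names as in the example above):
$ lsblk /dev/xvdh[1234]                 # no holders or existing partitions
$ sudo mdadm --examine /dev/xvdh[1234]  # should report "No md superblock detected"
$ sudo fuser -v /dev/xvdh[1234]         # should list no processes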
Before running mdadm --create, mdstat was clean:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
And yet after running it, the component devices had been split across two arrays instead of all ending up in /dev/md0:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive xvdh4[3](S) xvdh3[2](S)
1048573952 blocks super 1.2
md0 : inactive xvdh2[1](S) xvdh1[0](S)
1048573952 blocks super 1.2
unused devices: <none>
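To get back to a clean slate before retrying the create, you can stop both half-assembled arrays and clear the metadata they wrote, something like the following (a sketch; --zero-superblock erases the md metadata on the listed devices, so only point it at these components):
$ sudo mdadm --stop /dev/md0 /dev/md127
$ sudo mdadm --zero-superblock /dev/xvdh[1234]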
Looking into dmesg didn't reveal much either, beyond md_import_device failing with -16 (EBUSY):
$ dmesg
...
[3963010.552493] md: bind<xvdh1>
[3963010.553011] md: bind<xvdh2>
[3963010.553040] md: could not open unknown-block(202,115).
[3963010.553052] md: md_import_device returned -16
[3963010.566543] md: bind<xvdh3>
[3963010.731009] md: bind<xvdh4>
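For what it's worth, the unknown-block(202,115) in that trace is a major:minor pair, so you can at least see which component md failed to open by comparing it against the device nodes:
$ ls -l /dev/xvdh[1234]   # the two numbers before the date are each device's major:minor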
And strangely, the creation or assembly would sometimes work and sometimes not:
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh3: Device or resource busy
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh1: Device or resource busy
mdadm: /dev/xvdh1 has no superblock - assembly aborted
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.
I started suspecting I was facing some kind of underlying race condition where the component devices would get opened and locked during array creation. So I started googling for "mdadm create race" and finally found a post that tipped me off. It didn't provide the solution, but it put me on the right track by mentioning udev, and from there it took only a few more minutes to narrow down the fix: disabling udev event processing during device creation to avoid contention on the device handles.
So now our script goes something like:
$ udevadm control --stop-exec-queue
$ mdadm --create /dev/md0 --run --level=0 --raid-devices=4 ...
$ udevadm control --start-exec-queue
And we now have consistent, reliable device creation.
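If you want to be a bit more defensive, here is a sketch of the same idea that restarts the udev queue even when mdadm fails, and then waits for the pending events to drain (device names are the ones from the example above):
#!/bin/bash
set -e

# Pause udev event processing so it can't grab the component devices
# while mdadm is creating the array.
udevadm control --stop-exec-queue
trap 'udevadm control --start-exec-queue' EXIT

mdadm --create /dev/md0 --run --level=0 --chunk 256 --raid-devices=4 \
    /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4

# Resume udev and let it process the events queued for the new array.
udevadm control --start-exec-queue
trap - EXIT
udevadm settle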
Hopefully this blog post will help other passers-by with a similar problem. Good luck!
9 comments:
Thanks a lot for the post! I had been pulling my hair out trying to solve this issue setting up a NAS.
Thank you so much! It took me a while to find this solution via google. I'm surprised it's not more prevalent around the web.
Very useful. I was going mad. Thanks a lot.
I was having this same issue! Thank you!
Wow, seems like you saved my life here. Thanks a lot !
Thank you so much! This has been a huge issue for us. Now it's solved.
Thanks for the udevadm tip. I also couldn't understand why mdadm would report devices as busy, even though nothing was using them (no dmraid, no device mapper).
Wonderful! I was going mad looking for the source of intermittent failures in Chef's mdadm provider on EC2. You've saved what remains of my hair.