UnRAID XFS Metadata Corruption

One thing you never want to see on the screen of your file server is an error about corrupted data, least of all right after one of your disks threw read errors during a parity check. I recently had that sort of fun with my NAS.

It started with some maintenance work on my NAS. It didn’t want to boot, so I plugged in a monitor. The motherboard is aging and needs to be replaced, so that part wasn’t abnormal. The abnormal part came after it finished booting, when the display showed metadata corruption on one of my 8TB disks. Specifically, it was one of the disks that had done fine during the parity check, not the disk that threw errors (this was concerning, as I only have a single parity drive).

I followed the instructions on screen. I first ran df -h so I could determine which physical disk was showing the problem. According to that output, /dev/md1 mapped to disk1, and disk 1 in the GUI maps to /dev/sdf. Since each disk should only have one partition, I figured /dev/sdf1 was the correct target for the commands. I stopped my array from the unRAID GUI, as that unmounts the disks. I then ran xfs_repair on /dev/sdf1 with the following command line flags.

-n : don’t write corrections to the disk
-v : verbose
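
The device lookup described above can be sketched as a small helper. This is only a minimal sketch: mount_point_for is a name made up here, and it assumes the usual df layout where the device is the first column and the mount point (without spaces) is the last.

```shell
# Hypothetical helper: print the mount point of one device, reading `df` output on stdin.
# Assumes the device is the first column and the mount point (no spaces) is the last.
mount_point_for() {
    awk -v dev="$1" '$1 == dev { print $NF }'
}

# On a live system:
#   df -h | mount_point_for /dev/md1
```

In the case described above the lookup pointed at disk1, and the dry run that followed was xfs_repair -nv /dev/sdf1.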

The output from that command is below.

root@ShadowOfIntent:~# xfs_repair -nv /dev/sdf1
Phase 1 - find and verify superblock...
        - block cache size set to 1464296 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1129663 tail block 1129663
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
bad CRC for inode 11120729215
bad CRC for inode 11120729215, would rewrite
would have cleared inode 11120729215
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 4
        - agno = 0
        - agno = 6
        - agno = 3
        - agno = 2
        - agno = 7
        - agno = 5
bad CRC for inode 11120729215, would rewrite
would have cleared inode 11120729215
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Aug  9 20:46:54 2023

Phase           Start           End             Duration
Phase 1:        08/09 20:46:27  08/09 20:46:27
Phase 2:        08/09 20:46:27  08/09 20:46:28  1 second
Phase 3:        08/09 20:46:28  08/09 20:46:43  15 seconds
Phase 4:        08/09 20:46:43  08/09 20:46:43
Phase 5:        Skipped
Phase 6:        08/09 20:46:43  08/09 20:46:54  11 seconds
Phase 7:        08/09 20:46:54  08/09 20:46:54

Total run time: 27 seconds
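
On a longer log the findings are easy to miss between the progress lines, so a filter can help. The sketch below is an assumption on my part rather than anything xfs_repair provides: repair_findings is a made-up name, and its patterns only cover the message shapes that appear in the logs in this post, not every message the tool can print.

```shell
# Hypothetical filter: keep only the lines of xfs_repair output (on stdin)
# that report a finding. Patterns are based solely on the messages in this post.
repair_findings() {
    # Progress lines are indented or start with "Phase", so they never match.
    # "|| true" keeps the function's exit status at 0 when nothing matches.
    grep -E '^(bad |cleared |clearing )|would ' || true
}

# On a live system (dry run, nothing is written):
#   xfs_repair -nv /dev/sdf1 2>&1 | repair_findings
```

Fed the dry-run log above, this should print just the lines that mention inode 11120729215.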

The output shows only one problem, on inode 11120729215. Because the damage to the filesystem wasn’t extensive, I figured I would go straight to repairing it without looking into other options (such as rebuilding from parity) yet. I took the -n option off the command and re-ran it to repair the filesystem.

root@ShadowOfIntent:~# xfs_repair -v /dev/sdf1
Phase 1 - find and verify superblock...
        - block cache size set to 1464296 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1129663 tail block 1129663
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
bad CRC for inode 11120729215
bad CRC for inode 11120729215, will rewrite
cleared inode 11120729215
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 7
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 0
clearing reflink flag on inode 15049157028
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Wed Aug  9 20:55:07 2023

Phase           Start           End             Duration
Phase 1:        08/09 20:54:41  08/09 20:54:41
Phase 2:        08/09 20:54:41  08/09 20:54:42  1 second
Phase 3:        08/09 20:54:42  08/09 20:54:55  13 seconds
Phase 4:        08/09 20:54:55  08/09 20:54:55
Phase 5:        08/09 20:54:55  08/09 20:54:55
Phase 6:        08/09 20:54:55  08/09 20:55:07  12 seconds
Phase 7:        08/09 20:55:07  08/09 20:55:07

Total run time: 26 seconds
done

I then ran the first command again, xfs_repair -nv, to double-check that no problems were reported after the repair. If there had been, I’d have started looking into rebuilding from parity.

root@ShadowOfIntent:~# xfs_repair -nv /dev/sdf1
Phase 1 - find and verify superblock...
        - block cache size set to 1464296 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1129663 tail block 1129663
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 5
        - agno = 1
        - agno = 6
        - agno = 4
        - agno = 7
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Aug  9 20:56:21 2023

Phase           Start           End             Duration
Phase 1:        08/09 20:55:56  08/09 20:55:57  1 second
Phase 2:        08/09 20:55:57  08/09 20:55:57
Phase 3:        08/09 20:55:57  08/09 20:56:10  13 seconds
Phase 4:        08/09 20:56:10  08/09 20:56:10
Phase 5:        Skipped
Phase 6:        08/09 20:56:10  08/09 20:56:21  11 seconds
Phase 7:        08/09 20:56:21  08/09 20:56:21

Total run time: 25 seconds

Everything now looks to be working well. There are no more errors on the filesystem, though I’ll be running the check on the rest of my disks to make sure everything is healthy. Running with the -n option is safe, as it won’t write any repairs to disk, so I’m not worried about running it on all the remaining disks.
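
Checking the remaining disks could be scripted along these lines. It is a rough sketch under a few assumptions: check_disks and the XFS_REPAIR override are names invented here, and it relies on the exit status documented in the xfs_repair man page, where a -n run returns 0 for a clean filesystem and 1 when corruption is detected. Any other failure (a missing device, for example) is simply reported as needing attention too.

```shell
# Hypothetical helper: dry-run check of each partition given as an argument.
# XFS_REPAIR may be set to another command (e.g. a stub); it defaults to xfs_repair.
# The body runs in a subshell so its variables do not leak into the caller.
check_disks() (
    failed=0
    for dev in "$@"; do
        # -n means no modify mode: the filesystem is only inspected, never written.
        if "${XFS_REPAIR:-xfs_repair}" -n "$dev" >/dev/null 2>&1; then
            echo "$dev: clean"
        else
            echo "$dev: needs attention"
            failed=1
        fi
    done
    exit "$failed"
)

# Example call; the device names are placeholders, not my actual layout:
#   check_disks /dev/sdb1 /dev/sdc1
```

As with the single-disk run, the array has to be stopped first so the disks are unmounted, and any disk flagged here still deserves a full xfs_repair -nv run to read the details.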