linux-stable/drivers
colyli@suse.de fd76863e37 RAID1: a new I/O barrier implementation to remove resync window
'Commit 79ef3a8aa1 ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time.

The idea of sliding resync widow is awesome, but code complexity is a
challenge. Sliding resync window requires several variables to work
collectively, this is complexed and very hard to make it work correctly.
Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
to fix the original resync window patch. This is not the end, any further
related modification may easily introduce more regreassion.

Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflicts, regular
   I/O only has to wait for a resync I/O when both them have same barrier
   bucket index, vice versa.
 - I/O barrier can be reduced to an acceptable number if there are enough
   barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
   Bio requests won't go across border of barrier unit size, that means
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
   For random I/O 64MB is large enough for both read and write requests,
   for sequential I/O considering underlying block layer may merge them
   into larger requests, 64MB is still good enough.
   Neil also points out that for resync operation, "we want the resync to
   move from region to region fairly quickly so that the slowness caused
   by having to synchronize with the resync is averaged out over a fairly
   small time frame". For full speed resync, 64MB should take less then 1
   second. When resync is competing with other I/O, it could take up a few
   minutes. Therefore 64MB size is fairly good range for resync.

 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   this patch makes the bellowed members of struct r1conf from integer
   to array of integers,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     *nr_pending;
        +       int                     *nr_waiting;
        +       int                     *nr_queued;
        +       int                     *barrier;
   number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
   kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
   barrier buckets, and each array of integers occupies single memory page.
   1024 means for a request which is smaller than the I/O barrier unit size
   has ~0.1% chance to wait for resync to pause, which is quite a small
   enough fraction. Also requesting single memory page is more friendly to
   kernel page allocator than larger memory size.

 - I/O barrier bucket is indexed by bio start sector
   If multiple I/O requests hit different I/O barrier units, they only need
   to compete I/O barrier with other I/Os which hit the same I/O barrier
   bucket index with each other. The index of a barrier bucket which a
   bio should look for is calculated by sector_to_idx() which is defined
   in raid1.h as an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                BARRIER_BUCKETS_NR_BITS);
        }
   Here sector_nr is the start sector number of a bio.

 - Single bio won't go across boundary of a I/O barrier unit
   If a request goes across boundary of barrier unit, it will be split. A
   bio may be split in raid1_make_request() or raid1_sync_request(), if
   sectors returned by align_to_barrier_unit_end() is smaller than
   original bio size.

Comparing to single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier units. So the I/O behavior is
   similar to single sliding resync window.
 - But a barrier unit bucket is shared by all barrier units with identical
   barrier uinit index, the probability of conflict might be higher
   than single sliding resync window, in condition that writing I/Os
   always hit barrier units which have identical barrier bucket indexs with
   the resync I/Os. This is a very rare condition in real I/O work loads,
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low conflict rate with much
   simpler barrier algorithm and implementation.

There are two changes should be noticed,
 - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
   single loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in next patch of
   this patch set, it will be replaced by a single line code,
        atomic_dec(&conf->nr_queueud[idx]);
   So we don't need to worry about spin lock cost here.
 - Mainline raid1 code split original raid1_make_request() into
   raid1_read_request() and raid1_write_request(). If the original bio
   goes across an I/O barrier unit size, this bio will be split before
   calling raid1_read_request() or raid1_write_request(),  this change
   the code logic more simple and clear.
 - In this patch wait_barrier() is moved from raid1_make_request() to
   raid1_write_request(). In raid_read_request(), original wait_barrier()
   is replaced by raid1_read_request().
   The differnece is wait_read_barrier() only waits if array is frozen,
   using different barrier function in different code path makes the code
   more clean and easy to read.
Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
  - Back to use pointers to replace arraries in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of
    BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and
    get_all_queued()
  - Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the orignal bio if it goes across barrier unit
  bounday, to make the code more simple, by suggestion from Shaohua and
  Neil.
- Use hash_long() to replace original linear hash, to avoid a possible
  confilict between resync I/O and sequential write I/O, by suggestion from
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control number of parallel sync I/O barriers, by suggestion from Shaohua.
- In V1 patch the bellowed barrier buckets related members in r1conf are
  allocated in memory page. To make the code more simple, V2 patch moves
  the memory space into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is by the suggestion from Shaohua.
- Remove some inrelavent code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19 22:04:24 -08:00
..
accessibility
acpi Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 2017-02-06 19:36:04 -08:00
amba
android
ata ata: sata_mv:- Handle return value of devm_ioremap. 2017-01-06 15:45:32 -05:00
atm Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
auxdisplay auxdisplay: fix new ht16k33 build errors 2017-01-11 09:27:30 +01:00
base Merge branches 'pm-core-fixes' and 'pm-cpufreq-fixes' 2017-02-06 14:52:10 +01:00
bcma Revert "bcma: init serial console directly from ChipCommon code" 2017-01-17 14:23:44 +02:00
block xen-blkfront: correct maximum segment accounting 2017-01-23 13:27:42 -05:00
bluetooth Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-12-16 10:24:44 -08:00
bus cpu/hotplug: Cleanup state names 2016-12-25 10:47:44 +01:00
cdrom Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
char Revert "hwrng: core - zeroize buffers with random data" 2017-02-08 18:06:03 -08:00
clk One fix for Samsung Exynos524x SoCs where recent IOMMU patches have 2017-01-21 18:46:45 -08:00
clocksource clocksource/exynos_mct: Clear interrupt when cpu is shut down 2017-01-17 10:08:38 +01:00
connector
cpufreq Merge branches 'pm-core-fixes' and 'pm-cpufreq-fixes' 2017-02-06 14:52:10 +01:00
cpuidle Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2017-02-06 14:16:23 -08:00
dax libnvdimm for 4.10 2016-12-18 15:49:10 -08:00
dca
devfreq PM / devfreq: exynos-bus: Fix the wrong return value 2017-01-03 00:21:45 +01:00
dio Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
dma dmaengine: pl330: fix double lock 2017-01-25 15:35:11 +05:30
dma-buf
edac Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
eisa
extcon extcon: return error code on failure 2017-01-11 09:11:39 +01:00
firewire
firmware efi/fdt: Avoid FDT manipulation after ExitBootServices() 2017-02-01 21:17:49 +01:00
fmc
fpga fpga: Clarify how write_init works streaming modes 2016-11-29 15:51:49 -06:00
gpio gpio: provide lockdep keys for nested/unnested irqchips 2017-01-19 09:57:20 +01:00
gpu Merge tag 'drm-intel-fixes-2017-02-09' of git://anongit.freedesktop.org/git/drm-intel into drm-fixes 2017-02-10 10:14:24 +10:00
hid HID: cp2112: fix gpio-callback error handling 2017-01-31 12:59:33 +01:00
hsi
hv Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read() 2017-01-31 10:59:48 +01:00
hwmon hwmon: (lm90) fix temp1_max_alarm attribute 2017-01-02 10:15:28 -08:00
hwspinlock
hwtracing coresight/etm3/4x: Consolidate hotplug state space 2016-12-25 10:47:44 +01:00
i2c i2c: piix4: Request the SMBUS semaphore inside the mutex 2017-02-09 17:13:01 +01:00
ide Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
idle Power management material for v4.10-rc1 2016-12-13 10:41:53 -08:00
iio iio: dht11: Use usleep_range instead of msleep for start signal 2017-01-22 13:35:40 +00:00
infiniband IB/rxe: Fix mem_check_range integer overflow 2017-02-08 12:28:30 -05:00
input Input: synaptics-rmi4 - select 'SERIO' when needed 2017-02-07 14:31:45 -08:00
iommu IOMMU Fixes for Linux v4.10-rc2 2017-01-06 10:49:36 -08:00
ipack
irqchip irqchip/mxs: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND 2016-12-31 19:06:44 +00:00
isdn ISDN: eicon: silence misleading array-bounds warning 2017-01-27 11:27:34 -05:00
leds cpu/hotplug: Cleanup state names 2016-12-25 10:47:44 +01:00
lguest Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
lightnvm Char/Misc driver patches for 4.10-rc1 2016-12-13 12:11:01 -08:00
macintosh Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mailbox ktime: Cleanup ktime_set() usage 2016-12-25 17:21:22 +01:00
mcb
md RAID1: a new I/O barrier implementation to remove resync window 2017-02-19 22:04:24 -08:00
media media fixes for v4.10-rc8 2017-02-06 14:37:55 -08:00
memory Fixes for drivers already queued to prevent 2016-11-30 14:58:00 +01:00
memstick drivers/memstick/core/memstick.c: avoid -Wnonnull warning 2017-01-24 16:26:14 -08:00
message Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mfd - New Device Support 2016-12-19 08:16:26 -08:00
misc mei: bus: enable OS version only for SPT and newer 2017-01-11 07:43:57 +01:00
mmc mmc: mmci: avoid clearing ST Micro busy end interrupt mistakenly 2017-02-08 12:22:27 +01:00
mtd mtd: nand: lpc32xx: fix invalid error handling of a requested irq 2017-01-04 20:50:18 +01:00
net xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend() 2017-02-10 13:44:49 -05:00
nfc
ntb ntb_transport: Remove unnecessary call to ntb_peer_spad_read 2016-12-23 16:11:07 -05:00
nubus Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
nvdimm libnvdimm, pfn: fix memmap reservation size versus 4K alignment 2017-02-04 14:47:31 -08:00
nvme nvme-fc: use blk_rq_nr_phys_segments 2017-01-26 17:49:14 +02:00
nvmem nvmem: fix nvmem_cell_read() return type doc 2017-01-04 18:22:47 +01:00
of pci-v4.10-changes 2016-12-15 12:46:48 -08:00
oprofile Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
parisc Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
parport parisc, parport_gsc: Fixes for printk continuation lines 2017-01-28 21:54:21 +01:00
pci Revert "PCI: pciehp: Add runtime PM support for PCIe hotplug ports" 2017-02-03 08:53:51 -06:00
pcmcia drivers/pcmcia/m32r_pcc.c: check return from add_pcc_socket 2016-12-12 18:55:06 -08:00
perf cpu/hotplug: Cleanup state names 2016-12-25 10:47:44 +01:00
phy SCSI misc on 20161213 2016-12-14 10:49:33 -08:00
pinctrl pinctrl: baytrail: Add missing spinlock usage in byt_gpio_irq_handler 2017-01-30 15:53:57 +01:00
platform platform/x86: ideapad-laptop: handle ACPI event 1 2017-01-22 12:47:06 +02:00
pnp Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
power ktime: Cleanup ktime_set() usage 2016-12-25 17:21:22 +01:00
powercap powercap / RAPL: Add Knights Mill CPUID 2016-11-30 23:41:33 +01:00
pps
ps3
ptp Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-12-12 19:56:15 -08:00
pwm pwm: Changes for v4.10-rc1 2016-12-15 11:45:13 -08:00
rapidio
ras
regulator regulator: Fixes for v4.10 2017-02-03 13:46:38 -08:00
remoteproc Revert "remoteproc: Merge table_ptr and cached_table pointers" 2016-12-30 03:26:31 -08:00
reset ARM: SoC driver updates for v4.10 2016-12-15 16:03:25 -08:00
rpmsg rpmsg: virtio_rpmsg_bus: fix channel creation 2016-12-30 03:12:11 -08:00
rtc rtc: jz4740: make the driver buildable as a module again 2017-01-26 23:03:21 +01:00
s390 SCSI fixes on 20170210 2017-02-11 09:01:03 -08:00
sbus Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
scsi SCSI fixes on 20170210 2017-02-11 09:01:03 -08:00
sfi
sh lib: radix-tree: check accounting of existing slot replacement users 2016-12-12 18:55:08 -08:00
sn
soc soc: ti: wkup_m3_ipc: Fix error return code in wkup_m3_ipc_probe() 2017-01-12 13:37:49 -08:00
spi spi: Fixes for v4.10 2017-01-20 12:25:11 -08:00
spmi
ssb
staging mm: avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers 2017-02-08 15:41:43 -08:00
target target: Fix COMPARE_AND_WRITE ref leak for non GOOD status 2017-02-08 08:25:39 -08:00
tc
thermal Revert "thermal: thermal_hwmon: Convert to hwmon_device_register_with_info()" 2017-01-25 09:51:08 +08:00
thunderbolt Char/Misc driver patches for 4.10-rc1 2016-12-13 12:11:01 -08:00
tty sysrq: attach sysrq handler correctly for 32-bit kernel 2017-01-11 09:22:54 +01:00
uio uio-hv-generic: store physical addresses instead of virtual 2016-12-10 14:57:58 +01:00
usb USB-serial fixes for v4.10-rc7 2017-02-03 22:19:15 +01:00
uwb
vfio vfio/spapr_tce: Set window when adding additional groups to container 2017-02-07 11:48:16 -07:00
vhost vhost: fix initialization for vq->is_le 2017-02-03 23:38:57 +02:00
video fbdev: color map copying bounds checking 2017-01-24 16:26:14 -08:00
virt
virtio Revert "vring: Force use of DMA API for ARM-based systems with legacy devices" 2017-02-03 23:38:50 +02:00
vlynq
vme vme: Fix wrong pointer utilization in ca91cx42_slave_get 2017-01-11 10:42:16 +01:00
w1
watchdog Watchdog updates for v4.10 2016-12-24 11:27:45 -08:00
xen Merge branch 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb 2017-01-27 12:17:07 -08:00
zorro Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
Kconfig
Makefile