linux-stable/drivers/block/zram/zram_drv.c
Linus Torvalds 902861e34c

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.
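
   A minimal userspace sketch of opting into the new policy (illustrative
   only: the MPOL_WEIGHTED_INTERLEAVE value is an assumption taken from
   uapi headers of this era, and the per-node weights themselves are
   configured separately through sysfs, which is not shown here):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef MPOL_WEIGHTED_INTERLEAVE
	#define MPOL_WEIGHTED_INTERLEAVE 6	/* assumed uapi value */
	#endif

	int main(void)
	{
		/* weighted interleave across NUMA nodes 0 and 1 */
		unsigned long nodemask = 0x3;

		return syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
			       &nodemask, 8 * sizeof(nodemask));
	}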

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process our selftesting results.
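
   For reference, TAP output is line-oriented; an illustrative run report
   (not taken from the actual selftests) looks roughly like:

	TAP version 13
	1..2
	ok 1 # first test passed
	not ok 2 # second test failed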

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on architectures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This is a step along the path to the implementation of the
   merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00


/*
* Compressed RAM block device
*
* Copyright (C) 2008, 2009, 2010 Nitin Gupta
* 2012, 2013 Minchan Kim
*
* This code is released using a dual license strategy: BSD/GPL
* You can choose the licence that better fits your requirements.
*
* Released under the terms of 3-clause BSD License
* Released under the terms of GNU General Public License Version 2.0
*
*/
#define KMSG_COMPONENT "zram"
#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <linux/err.h>
#include <linux/idr.h>
#include <linux/sysfs.h>
#include <linux/debugfs.h>
#include <linux/cpuhotplug.h>
#include <linux/part_stat.h>
#include "zram_drv.h"
static DEFINE_IDR(zram_index_idr);
/* idr index must be protected */
static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
* Pages that compress to a size equal to or greater than this are stored
* uncompressed in memory.
*/
static size_t huge_class_size;
static const struct block_device_operations zram_devops;
static void zram_free_page(struct zram *zram, size_t index);
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
struct bio *parent);
static int zram_slot_trylock(struct zram *zram, u32 index)
{
return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
}
static void zram_slot_lock(struct zram *zram, u32 index)
{
bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
}
static void zram_slot_unlock(struct zram *zram, u32 index)
{
bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
}
static inline bool init_done(struct zram *zram)
{
return zram->disksize;
}
static inline struct zram *dev_to_zram(struct device *dev)
{
return (struct zram *)dev_to_disk(dev)->private_data;
}
static unsigned long zram_get_handle(struct zram *zram, u32 index)
{
return zram->table[index].handle;
}
static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
{
zram->table[index].handle = handle;
}
/* flag operations require table entry bit_spin_lock() being held */
static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
return zram->table[index].flags & BIT(flag);
}
static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
zram->table[index].flags |= BIT(flag);
}
static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
zram->table[index].flags &= ~BIT(flag);
}
static inline void zram_set_element(struct zram *zram, u32 index,
unsigned long element)
{
zram->table[index].element = element;
}
static unsigned long zram_get_element(struct zram *zram, u32 index)
{
return zram->table[index].element;
}
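/*
* The compressed object size is kept in the low ZRAM_FLAG_SHIFT bits of
* table[index].flags; the zram_pageflags live in the bits above it.
*/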
static size_t zram_get_obj_size(struct zram *zram, u32 index)
{
return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
}
static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
{
unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
}
static inline bool zram_allocated(struct zram *zram, u32 index)
{
return zram_get_obj_size(zram, index) ||
zram_test_flag(zram, index, ZRAM_SAME) ||
zram_test_flag(zram, index, ZRAM_WB);
}
#if PAGE_SIZE != 4096
static inline bool is_partial_io(struct bio_vec *bvec)
{
return bvec->bv_len != PAGE_SIZE;
}
#define ZRAM_PARTIAL_IO 1
#else
static inline bool is_partial_io(struct bio_vec *bvec)
{
return false;
}
#endif
static inline void zram_set_priority(struct zram *zram, u32 index, u32 prio)
{
prio &= ZRAM_COMP_PRIORITY_MASK;
/*
* Clear the previous priority value first, in case we further
* recompress an already recompressed page.
*/
zram->table[index].flags &= ~(ZRAM_COMP_PRIORITY_MASK <<
ZRAM_COMP_PRIORITY_BIT1);
zram->table[index].flags |= (prio << ZRAM_COMP_PRIORITY_BIT1);
}
static inline u32 zram_get_priority(struct zram *zram, u32 index)
{
u32 prio = zram->table[index].flags >> ZRAM_COMP_PRIORITY_BIT1;
return prio & ZRAM_COMP_PRIORITY_MASK;
}
static void zram_accessed(struct zram *zram, u32 index)
{
zram_clear_flag(zram, index, ZRAM_IDLE);
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
zram->table[index].ac_time = ktime_get_boottime();
#endif
}
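/*
* Racelessly raise the recorded maximum number of used pages with a
* cmpxchg loop; return early if the current maximum is already higher.
*/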
static inline void update_used_max(struct zram *zram,
const unsigned long pages)
{
unsigned long cur_max = atomic_long_read(&zram->stats.max_used_pages);
do {
if (cur_max >= pages)
return;
} while (!atomic_long_try_cmpxchg(&zram->stats.max_used_pages,
&cur_max, pages));
}
static inline void zram_fill_page(void *ptr, unsigned long len,
unsigned long value)
{
WARN_ON_ONCE(!IS_ALIGNED(len, sizeof(unsigned long)));
memset_l(ptr, value, len / sizeof(unsigned long));
}
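/*
* Return true if the page consists of a single repeating word-sized value
* (e.g. a zero-filled page) and report that value through @element.
*/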
static bool page_same_filled(void *ptr, unsigned long *element)
{
unsigned long *page;
unsigned long val;
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
page = (unsigned long *)ptr;
val = page[0];
if (val != page[last_pos])
return false;
for (pos = 1; pos < last_pos; pos++) {
if (val != page[pos])
return false;
}
*element = val;
return true;
}
static ssize_t initstate_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
u32 val;
struct zram *zram = dev_to_zram(dev);
down_read(&zram->init_lock);
val = init_done(zram);
up_read(&zram->init_lock);
return scnprintf(buf, PAGE_SIZE, "%u\n", val);
}
static ssize_t disksize_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
}
static ssize_t mem_limit_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
u64 limit;
char *tmp;
struct zram *zram = dev_to_zram(dev);
limit = memparse(buf, &tmp);
if (buf == tmp) /* no chars parsed, invalid input */
return -EINVAL;
down_write(&zram->init_lock);
zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
up_write(&zram->init_lock);
return len;
}
static ssize_t mem_used_max_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
int err;
unsigned long val;
struct zram *zram = dev_to_zram(dev);
err = kstrtoul(buf, 10, &val);
if (err || val != 0)
return -EINVAL;
down_read(&zram->init_lock);
if (init_done(zram)) {
atomic_long_set(&zram->stats.max_used_pages,
zs_get_total_pages(zram->mem_pool));
}
up_read(&zram->init_lock);
return len;
}
/*
* Mark all pages which are older than or equal to cutoff as IDLE.
* Callers should hold the zram init lock in read mode
*/
static void mark_idle(struct zram *zram, ktime_t cutoff)
{
int is_idle = 1;
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
int index;
for (index = 0; index < nr_pages; index++) {
/*
* Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
* See the comment in writeback_store.
*/
zram_slot_lock(zram, index);
if (zram_allocated(zram, index) &&
!zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
is_idle = !cutoff || ktime_after(cutoff,
zram->table[index].ac_time);
#endif
if (is_idle)
zram_set_flag(zram, index, ZRAM_IDLE);
}
zram_slot_unlock(zram, index);
}
}
static ssize_t idle_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
ktime_t cutoff_time = 0;
ssize_t rv = -EINVAL;
if (!sysfs_streq(buf, "all")) {
/*
* If it did not parse as 'all' try to treat it as an integer
* when we have memory tracking enabled.
*/
u64 age_sec;
if (IS_ENABLED(CONFIG_ZRAM_TRACK_ENTRY_ACTIME) && !kstrtoull(buf, 0, &age_sec))
cutoff_time = ktime_sub(ktime_get_boottime(),
ns_to_ktime(age_sec * NSEC_PER_SEC));
else
goto out;
}
down_read(&zram->init_lock);
if (!init_done(zram))
goto out_unlock;
/*
* A cutoff_time of 0 marks everything as idle, this is the
* "all" behavior.
*/
mark_idle(zram, cutoff_time);
rv = len;
out_unlock:
up_read(&zram->init_lock);
out:
return rv;
}
#ifdef CONFIG_ZRAM_WRITEBACK
static ssize_t writeback_limit_enable_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
u64 val;
ssize_t ret = -EINVAL;
if (kstrtoull(buf, 10, &val))
return ret;
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->wb_limit_enable = val;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
ret = len;
return ret;
}
static ssize_t writeback_limit_enable_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
bool val;
struct zram *zram = dev_to_zram(dev);
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->wb_limit_enable;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
return scnprintf(buf, PAGE_SIZE, "%d\n", val);
}
static ssize_t writeback_limit_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
u64 val;
ssize_t ret = -EINVAL;
if (kstrtoull(buf, 10, &val))
return ret;
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->bd_wb_limit = val;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
ret = len;
return ret;
}
static ssize_t writeback_limit_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
u64 val;
struct zram *zram = dev_to_zram(dev);
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->bd_wb_limit;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
}
static void reset_bdev(struct zram *zram)
{
if (!zram->backing_dev)
return;
fput(zram->bdev_file);
/* hope filp_close flushes all of the IO */
filp_close(zram->backing_dev, NULL);
zram->backing_dev = NULL;
zram->bdev_file = NULL;
zram->disk->fops = &zram_devops;
kvfree(zram->bitmap);
zram->bitmap = NULL;
}
static ssize_t backing_dev_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct file *file;
struct zram *zram = dev_to_zram(dev);
char *p;
ssize_t ret;
down_read(&zram->init_lock);
file = zram->backing_dev;
if (!file) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
}
p = file_path(file, buf, PAGE_SIZE - 1);
if (IS_ERR(p)) {
ret = PTR_ERR(p);
goto out;
}
ret = strlen(p);
memmove(buf, p, ret);
buf[ret++] = '\n';
out:
up_read(&zram->init_lock);
return ret;
}
static ssize_t backing_dev_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
char *file_name;
size_t sz;
struct file *backing_dev = NULL;
struct inode *inode;
struct address_space *mapping;
unsigned int bitmap_sz;
unsigned long nr_pages, *bitmap = NULL;
struct file *bdev_file = NULL;
int err;
struct zram *zram = dev_to_zram(dev);
file_name = kmalloc(PATH_MAX, GFP_KERNEL);
if (!file_name)
return -ENOMEM;
down_write(&zram->init_lock);
if (init_done(zram)) {
pr_info("Can't setup backing device for initialized device\n");
err = -EBUSY;
goto out;
}
strscpy(file_name, buf, PATH_MAX);
/* ignore trailing newline */
sz = strlen(file_name);
if (sz > 0 && file_name[sz - 1] == '\n')
file_name[sz - 1] = 0x00;
backing_dev = filp_open(file_name, O_RDWR|O_LARGEFILE, 0);
if (IS_ERR(backing_dev)) {
err = PTR_ERR(backing_dev);
backing_dev = NULL;
goto out;
}
mapping = backing_dev->f_mapping;
inode = mapping->host;
/* Only block devices are supported at the moment */
if (!S_ISBLK(inode->i_mode)) {
err = -ENOTBLK;
goto out;
}
bdev_file = bdev_file_open_by_dev(inode->i_rdev,
BLK_OPEN_READ | BLK_OPEN_WRITE, zram, NULL);
if (IS_ERR(bdev_file)) {
err = PTR_ERR(bdev_file);
bdev_file = NULL;
goto out;
}
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
if (!bitmap) {
err = -ENOMEM;
goto out;
}
reset_bdev(zram);
zram->bdev_file = bdev_file;
zram->backing_dev = backing_dev;
zram->bitmap = bitmap;
zram->nr_pages = nr_pages;
up_write(&zram->init_lock);
pr_info("setup backing device %s\n", file_name);
kfree(file_name);
return len;
out:
kvfree(bitmap);
if (bdev_file)
fput(bdev_file);
if (backing_dev)
filp_close(backing_dev, NULL);
up_write(&zram->init_lock);
kfree(file_name);
return err;
}
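/*
* Claim a free slot in the backing device bitmap. Slot 0 is never handed
* out, so a return value of 0 means the backing device is full.
*/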
static unsigned long alloc_block_bdev(struct zram *zram)
{
unsigned long blk_idx = 1;
retry:
/* skip bit 0 to avoid confusion with zram.handle == 0 */
blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, blk_idx);
if (blk_idx == zram->nr_pages)
return 0;
if (test_and_set_bit(blk_idx, zram->bitmap))
goto retry;
atomic64_inc(&zram->stats.bd_count);
return blk_idx;
}
static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
{
int was_set;
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
atomic64_dec(&zram->stats.bd_count);
}
static void read_from_bdev_async(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
struct bio *bio;
bio = bio_alloc(file_bdev(zram->bdev_file), 1, parent->bi_opf, GFP_NOIO);
bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
__bio_add_page(bio, page, PAGE_SIZE, 0);
bio_chain(bio, parent);
submit_bio(bio);
}
#define PAGE_WB_SIG "page_index="
#define PAGE_WRITEBACK 0
#define HUGE_WRITEBACK (1<<0)
#define IDLE_WRITEBACK (1<<1)
#define INCOMPRESSIBLE_WRITEBACK (1<<2)
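/*
* Accepted writeback triggers: "idle", "huge", "huge_idle",
* "incompressible", or "page_index=<n>" to write back a single slot.
*/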
static ssize_t writeback_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
unsigned long index = 0;
struct bio bio;
struct bio_vec bio_vec;
struct page *page;
ssize_t ret = len;
int mode, err;
unsigned long blk_idx = 0;
if (sysfs_streq(buf, "idle"))
mode = IDLE_WRITEBACK;
else if (sysfs_streq(buf, "huge"))
mode = HUGE_WRITEBACK;
else if (sysfs_streq(buf, "huge_idle"))
mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
else if (sysfs_streq(buf, "incompressible"))
mode = INCOMPRESSIBLE_WRITEBACK;
else {
if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
return -EINVAL;
if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
index >= nr_pages)
return -EINVAL;
nr_pages = 1;
mode = PAGE_WRITEBACK;
}
down_read(&zram->init_lock);
if (!init_done(zram)) {
ret = -EINVAL;
goto release_init_lock;
}
if (!zram->backing_dev) {
ret = -ENODEV;
goto release_init_lock;
}
page = alloc_page(GFP_KERNEL);
if (!page) {
ret = -ENOMEM;
goto release_init_lock;
}
for (; nr_pages != 0; index++, nr_pages--) {
spin_lock(&zram->wb_limit_lock);
if (zram->wb_limit_enable && !zram->bd_wb_limit) {
spin_unlock(&zram->wb_limit_lock);
ret = -EIO;
break;
}
spin_unlock(&zram->wb_limit_lock);
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
ret = -ENOSPC;
break;
}
}
zram_slot_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
if (zram_test_flag(zram, index, ZRAM_WB) ||
zram_test_flag(zram, index, ZRAM_SAME) ||
zram_test_flag(zram, index, ZRAM_UNDER_WB))
goto next;
if (mode & IDLE_WRITEBACK &&
!zram_test_flag(zram, index, ZRAM_IDLE))
goto next;
if (mode & HUGE_WRITEBACK &&
!zram_test_flag(zram, index, ZRAM_HUGE))
goto next;
if (mode & INCOMPRESSIBLE_WRITEBACK &&
!zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
goto next;
/*
* Clearing ZRAM_UNDER_WB is the duty of the caller.
* IOW, zram_free_page never clears it.
*/
zram_set_flag(zram, index, ZRAM_UNDER_WB);
/* Needed to detect races during hugepage writeback */
zram_set_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
if (zram_read_page(zram, page, index, NULL)) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_clear_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
continue;
}
bio_init(&bio, file_bdev(zram->bdev_file), &bio_vec, 1,
REQ_OP_WRITE | REQ_SYNC);
bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
__bio_add_page(&bio, page, PAGE_SIZE, 0);
/*
* XXX: Single-page IO is inefficient for writes, but it
* is not a bad starting point.
*/
err = submit_bio_wait(&bio);
if (err) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_clear_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
/*
* BIO errors are not fatal, we continue and simply
* attempt to writeback the remaining objects (pages).
* At the same time we need to signal user-space that
* some writes (at least one, but also could be all of
* them) were not successful and we do so by returning
* the most recent BIO error.
*/
ret = err;
continue;
}
atomic64_inc(&zram->stats.bd_writes);
/*
* We released zram_slot_lock, so we need to check whether the slot
* changed. If the slot was freed, we can catch it easily via
* zram_allocated.
* A subtle case is the slot being freed/reallocated/marked as
* ZRAM_IDLE again. To close that race, idle_store doesn't mark a
* slot ZRAM_IDLE once it finds it is ZRAM_UNDER_WB.
* Thus, checking the ZRAM_IDLE bit here closes the race.
*/
zram_slot_lock(zram, index);
if (!zram_allocated(zram, index) ||
!zram_test_flag(zram, index, ZRAM_IDLE)) {
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_clear_flag(zram, index, ZRAM_IDLE);
goto next;
}
zram_free_page(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_set_flag(zram, index, ZRAM_WB);
zram_set_element(zram, index, blk_idx);
blk_idx = 0;
atomic64_inc(&zram->stats.pages_stored);
spin_lock(&zram->wb_limit_lock);
if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
spin_unlock(&zram->wb_limit_lock);
next:
zram_slot_unlock(zram, index);
}
if (blk_idx)
free_block_bdev(zram, blk_idx);
__free_page(page);
release_init_lock:
up_read(&zram->init_lock);
return ret;
}
struct zram_work {
struct work_struct work;
struct zram *zram;
unsigned long entry;
struct page *page;
int error;
};
static void zram_sync_read(struct work_struct *work)
{
struct zram_work *zw = container_of(work, struct zram_work, work);
struct bio_vec bv;
struct bio bio;
bio_init(&bio, file_bdev(zw->zram->bdev_file), &bv, 1, REQ_OP_READ);
bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
__bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
zw->error = submit_bio_wait(&bio);
}
/*
* The block layer wants one ->submit_bio to be active at a time, so using
* chained IO with the parent IO in the same context would deadlock. To
* avoid that, use a worker thread context.
*/
static int read_from_bdev_sync(struct zram *zram, struct page *page,
unsigned long entry)
{
struct zram_work work;
work.page = page;
work.zram = zram;
work.entry = entry;
INIT_WORK_ONSTACK(&work.work, zram_sync_read);
queue_work(system_unbound_wq, &work.work);
flush_work(&work.work);
destroy_work_on_stack(&work.work);
return work.error;
}
static int read_from_bdev(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
atomic64_inc(&zram->stats.bd_reads);
if (!parent) {
if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
return -EIO;
return read_from_bdev_sync(zram, page, entry);
}
read_from_bdev_async(zram, page, entry, parent);
return 0;
}
#else
static inline void reset_bdev(struct zram *zram) {};
static int read_from_bdev(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
return -EIO;
}
static void free_block_bdev(struct zram *zram, unsigned long blk_idx) {};
#endif
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
static struct dentry *zram_debugfs_root;
static void zram_debugfs_create(void)
{
zram_debugfs_root = debugfs_create_dir("zram", NULL);
}
static void zram_debugfs_destroy(void)
{
debugfs_remove_recursive(zram_debugfs_root);
}
static ssize_t read_block_state(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
char *kbuf;
ssize_t index, written = 0;
struct zram *zram = file->private_data;
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
struct timespec64 ts;
kbuf = kvmalloc(count, GFP_KERNEL);
if (!kbuf)
return -ENOMEM;
down_read(&zram->init_lock);
if (!init_done(zram)) {
up_read(&zram->init_lock);
kvfree(kbuf);
return -EINVAL;
}
for (index = *ppos; index < nr_pages; index++) {
int copied;
zram_slot_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
ts = ktime_to_timespec64(zram->table[index].ac_time);
copied = snprintf(kbuf + written, count,
"%12zd %12lld.%06lu %c%c%c%c%c%c\n",
index, (s64)ts.tv_sec,
ts.tv_nsec / NSEC_PER_USEC,
zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.',
zram_get_priority(zram, index) ? 'r' : '.',
zram_test_flag(zram, index,
ZRAM_INCOMPRESSIBLE) ? 'n' : '.');
if (count <= copied) {
zram_slot_unlock(zram, index);
break;
}
written += copied;
count -= copied;
next:
zram_slot_unlock(zram, index);
*ppos += 1;
}
up_read(&zram->init_lock);
if (copy_to_user(buf, kbuf, written))
written = -EFAULT;
kvfree(kbuf);
return written;
}
static const struct file_operations proc_zram_block_state_op = {
.open = simple_open,
.read = read_block_state,
.llseek = default_llseek,
};
static void zram_debugfs_register(struct zram *zram)
{
if (!zram_debugfs_root)
return;
zram->debugfs_dir = debugfs_create_dir(zram->disk->disk_name,
zram_debugfs_root);
debugfs_create_file("block_state", 0400, zram->debugfs_dir,
zram, &proc_zram_block_state_op);
}
static void zram_debugfs_unregister(struct zram *zram)
{
debugfs_remove_recursive(zram->debugfs_dir);
}
#else
static void zram_debugfs_create(void) {};
static void zram_debugfs_destroy(void) {};
static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {};
#endif
/*
* We switched to per-cpu streams and this attr is not needed anymore.
* However, we will keep it around for some time, because:
* a) we may revert per-cpu streams in the future
* b) it's visible to user space and we need to follow our 2 years
* retirement rule; but we already have a number of 'soon to be
* altered' attrs, so max_comp_streams needs to wait for the next
* layoff cycle.
*/
static ssize_t max_comp_streams_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
return scnprintf(buf, PAGE_SIZE, "%d\n", num_online_cpus());
}
static ssize_t max_comp_streams_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
return len;
}
static void comp_algorithm_set(struct zram *zram, u32 prio, const char *alg)
{
/* Do not free statically defined compression algorithms */
if (zram->comp_algs[prio] != default_compressor)
kfree(zram->comp_algs[prio]);
zram->comp_algs[prio] = alg;
}
static ssize_t __comp_algorithm_show(struct zram *zram, u32 prio, char *buf)
{
ssize_t sz;
down_read(&zram->init_lock);
sz = zcomp_available_show(zram->comp_algs[prio], buf);
up_read(&zram->init_lock);
return sz;
}
static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf)
{
char *compressor;
size_t sz;
sz = strlen(buf);
if (sz >= CRYPTO_MAX_ALG_NAME)
return -E2BIG;
compressor = kstrdup(buf, GFP_KERNEL);
if (!compressor)
return -ENOMEM;
/* ignore trailing newline */
if (sz > 0 && compressor[sz - 1] == '\n')
compressor[sz - 1] = 0x00;
if (!zcomp_available_algorithm(compressor)) {
kfree(compressor);
return -EINVAL;
}
down_write(&zram->init_lock);
if (init_done(zram)) {
up_write(&zram->init_lock);
kfree(compressor);
pr_info("Can't change algorithm for initialized device\n");
return -EBUSY;
}
comp_algorithm_set(zram, prio, compressor);
up_write(&zram->init_lock);
return 0;
}
static ssize_t comp_algorithm_show(struct device *dev,
struct device_attribute *attr,
char *buf)
{
struct zram *zram = dev_to_zram(dev);
return __comp_algorithm_show(zram, ZRAM_PRIMARY_COMP, buf);
}
static ssize_t comp_algorithm_store(struct device *dev,
struct device_attribute *attr,
const char *buf,
size_t len)
{
struct zram *zram = dev_to_zram(dev);
int ret;
ret = __comp_algorithm_store(zram, ZRAM_PRIMARY_COMP, buf);
return ret ? ret : len;
}
#ifdef CONFIG_ZRAM_MULTI_COMP
static ssize_t recomp_algorithm_show(struct device *dev,
struct device_attribute *attr,
char *buf)
{
struct zram *zram = dev_to_zram(dev);
ssize_t sz = 0;
u32 prio;
for (prio = ZRAM_SECONDARY_COMP; prio < ZRAM_MAX_COMPS; prio++) {
if (!zram->comp_algs[prio])
continue;
sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, "#%d: ", prio);
sz += __comp_algorithm_show(zram, prio, buf + sz);
}
return sz;
}
static ssize_t recomp_algorithm_store(struct device *dev,
struct device_attribute *attr,
const char *buf,
size_t len)
{
struct zram *zram = dev_to_zram(dev);
int prio = ZRAM_SECONDARY_COMP;
char *args, *param, *val;
char *alg = NULL;
int ret;
args = skip_spaces(buf);
while (*args) {
args = next_arg(args, &param, &val);
if (!val || !*val)
return -EINVAL;
if (!strcmp(param, "algo")) {
alg = val;
continue;
}
if (!strcmp(param, "priority")) {
ret = kstrtoint(val, 10, &prio);
if (ret)
return ret;
continue;
}
}
if (!alg)
return -EINVAL;
if (prio < ZRAM_SECONDARY_COMP || prio >= ZRAM_MAX_COMPS)
return -EINVAL;
ret = __comp_algorithm_store(zram, prio, alg);
return ret ? ret : len;
}
#endif
static ssize_t compact_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
down_read(&zram->init_lock);
if (!init_done(zram)) {
up_read(&zram->init_lock);
return -EINVAL;
}
zs_compact(zram->mem_pool);
up_read(&zram->init_lock);
return len;
}
static ssize_t io_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu 0 %8llu\n",
(u64)atomic64_read(&zram->stats.failed_reads),
(u64)atomic64_read(&zram->stats.failed_writes),
(u64)atomic64_read(&zram->stats.notify_free));
up_read(&zram->init_lock);
return ret;
}
static ssize_t mm_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
struct zs_pool_stats pool_stats;
u64 orig_size, mem_used = 0;
long max_used;
ssize_t ret;
memset(&pool_stats, 0x00, sizeof(struct zs_pool_stats));
down_read(&zram->init_lock);
if (init_done(zram)) {
mem_used = zs_get_total_pages(zram->mem_pool);
zs_pool_stats(zram->mem_pool, &pool_stats);
}
orig_size = atomic64_read(&zram->stats.pages_stored);
max_used = atomic_long_read(&zram->stats.max_used_pages);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
orig_size << PAGE_SHIFT,
(u64)atomic64_read(&zram->stats.compr_data_size),
mem_used << PAGE_SHIFT,
zram->limit_pages << PAGE_SHIFT,
max_used << PAGE_SHIFT,
(u64)atomic64_read(&zram->stats.same_pages),
atomic_long_read(&pool_stats.pages_compacted),
(u64)atomic64_read(&zram->stats.huge_pages),
(u64)atomic64_read(&zram->stats.huge_pages_since));
up_read(&zram->init_lock);
return ret;
}
#ifdef CONFIG_ZRAM_WRITEBACK
#define FOUR_K(x) ((x) * (1 << (PAGE_SHIFT - 12)))
static ssize_t bd_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu\n",
FOUR_K((u64)atomic64_read(&zram->stats.bd_count)),
FOUR_K((u64)atomic64_read(&zram->stats.bd_reads)),
FOUR_K((u64)atomic64_read(&zram->stats.bd_writes)));
up_read(&zram->init_lock);
return ret;
}
#endif
static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
int version = 1;
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"version: %d\n%8llu %8llu\n",
version,
(u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
return ret;
}
static DEVICE_ATTR_RO(io_stat);
static DEVICE_ATTR_RO(mm_stat);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RO(bd_stat);
#endif
static DEVICE_ATTR_RO(debug_stat);
static void zram_meta_free(struct zram *zram, u64 disksize)
{
size_t num_pages = disksize >> PAGE_SHIFT;
size_t index;
/* Free all pages that are still in this zram device */
for (index = 0; index < num_pages; index++)
zram_free_page(zram, index);
zs_destroy_pool(zram->mem_pool);
vfree(zram->table);
}
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
{
size_t num_pages;
num_pages = disksize >> PAGE_SHIFT;
zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
if (!zram->table)
return false;
zram->mem_pool = zs_create_pool(zram->disk->disk_name);
if (!zram->mem_pool) {
vfree(zram->table);
return false;
}
if (!huge_class_size)
huge_class_size = zs_huge_class_size(zram->mem_pool);
return true;
}
/*
* To protect concurrent access to the same index entry,
* caller should hold this table index entry's bit_spinlock to
* indicate this index entry is accessing.
*/
static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
zram->table[index].ac_time = 0;
#endif
if (zram_test_flag(zram, index, ZRAM_IDLE))
zram_clear_flag(zram, index, ZRAM_IDLE);
if (zram_test_flag(zram, index, ZRAM_HUGE)) {
zram_clear_flag(zram, index, ZRAM_HUGE);
atomic64_dec(&zram->stats.huge_pages);
}
if (zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
zram_clear_flag(zram, index, ZRAM_INCOMPRESSIBLE);
zram_set_priority(zram, index, 0);
if (zram_test_flag(zram, index, ZRAM_WB)) {
zram_clear_flag(zram, index, ZRAM_WB);
free_block_bdev(zram, zram_get_element(zram, index));
goto out;
}
/*
* No memory is allocated for same element filled pages.
* Simply clear same page flag.
*/
if (zram_test_flag(zram, index, ZRAM_SAME)) {
zram_clear_flag(zram, index, ZRAM_SAME);
atomic64_dec(&zram->stats.same_pages);
goto out;
}
handle = zram_get_handle(zram, index);
if (!handle)
return;
zs_free(zram->mem_pool, handle);
atomic64_sub(zram_get_obj_size(zram, index),
&zram->stats.compr_data_size);
out:
atomic64_dec(&zram->stats.pages_stored);
zram_set_handle(zram, index, 0);
zram_set_obj_size(zram, index, 0);
WARN_ON_ONCE(zram->table[index].flags &
~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
}
/*
* Reads (decompresses if needed) a page from zspool (zsmalloc).
* Corresponding ZRAM slot should be locked.
*/
static int zram_read_from_zspool(struct zram *zram, struct page *page,
u32 index)
{
struct zcomp_strm *zstrm;
unsigned long handle;
unsigned int size;
void *src, *dst;
u32 prio;
int ret;
handle = zram_get_handle(zram, index);
if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
unsigned long value;
void *mem;
value = handle ? zram_get_element(zram, index) : 0;
mem = kmap_local_page(page);
zram_fill_page(mem, PAGE_SIZE, value);
kunmap_local(mem);
return 0;
}
size = zram_get_obj_size(zram, index);
if (size != PAGE_SIZE) {
prio = zram_get_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
}
src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
if (size == PAGE_SIZE) {
dst = kmap_local_page(page);
copy_page(dst, src);
kunmap_local(dst);
ret = 0;
} else {
dst = kmap_local_page(page);
ret = zcomp_decompress(zstrm, src, size, dst);
kunmap_local(dst);
zcomp_stream_put(zram->comps[prio]);
}
zs_unmap_object(zram->mem_pool, handle);
return ret;
}
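/*
* Read one page, either from zsmalloc or, for ZRAM_WB slots, from the
* backing device. The slot lock is taken and dropped internally.
*/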
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
struct bio *parent)
{
int ret;
zram_slot_lock(zram, index);
if (!zram_test_flag(zram, index, ZRAM_WB)) {
/* The slot should stay locked throughout the function call */
ret = zram_read_from_zspool(zram, page, index);
zram_slot_unlock(zram, index);
} else {
/*
* The slot should be unlocked before reading from the backing
* device.
*/
zram_slot_unlock(zram, index);
ret = read_from_bdev(zram, page, zram_get_element(zram, index),
parent);
}
/* Should NEVER happen. Return bio error if it does. */
if (WARN_ON(ret < 0))
pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
return ret;
}
/*
* Use a temporary buffer to decompress the page, as the decompressor
* always expects a full page for the output.
*/
static int zram_bvec_read_partial(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset)
{
struct page *page = alloc_page(GFP_NOIO);
int ret;
if (!page)
return -ENOMEM;
ret = zram_read_page(zram, page, index, NULL);
if (likely(!ret))
memcpy_to_bvec(bvec, page_address(page) + offset);
__free_page(page);
return ret;
}
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
if (is_partial_io(bvec))
return zram_bvec_read_partial(zram, bvec, index, offset);
return zram_read_page(zram, bvec->bv_page, index, bio);
}
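/*
* Store one page: same-filled pages are recorded by value only, pages
* that do not compress below huge_class_size are stored uncompressed
* (and later flagged ZRAM_HUGE), everything else is stored compressed.
*/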
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
int ret = 0;
unsigned long alloced_pages;
unsigned long handle = -ENOMEM;
unsigned int comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
unsigned long element = 0;
enum zram_pageflags flags = 0;
mem = kmap_local_page(page);
if (page_same_filled(mem, &element)) {
kunmap_local(mem);
/* Free memory associated with this sector now. */
flags = ZRAM_SAME;
atomic64_inc(&zram->stats.same_pages);
goto out;
}
kunmap_local(mem);
compress_again:
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
src = kmap_local_page(page);
ret = zcomp_compress(zstrm, src, &comp_len);
kunmap_local(src);
if (unlikely(ret)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
pr_err("Compression failed! err=%d\n", ret);
zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
* per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
* since we can't sleep;
* b) slow path enables preemption and attempts to allocate
* the page with __GFP_DIRECT_RECLAIM bit set. we have to
* put per-cpu compression stream and, thus, to re-do
* the compression once handle is allocated.
*
* if we have a 'non-null' handle here then we are coming
* from the slow path and handle has already been allocated.
*/
if (IS_ERR_VALUE(handle))
handle = zs_malloc(zram->mem_pool, comp_len,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle))
return PTR_ERR((void *)handle);
if (comp_len != PAGE_SIZE)
goto compress_again;
/*
* If the page is not compressible, we still need to acquire
* the per-cpu stream and execute the code below. The
* zcomp_stream_get() call disables cpu hotplug and grabs the
* zstrm buffer back, so that the dereference of the zstrm
* variable below is valid.
*/
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
}
alloced_pages = zs_get_total_pages(zram->mem_pool);
update_used_max(zram, alloced_pages);
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
src = zstrm->buffer;
if (comp_len == PAGE_SIZE)
src = kmap_local_page(page);
memcpy(dst, src, comp_len);
if (comp_len == PAGE_SIZE)
kunmap_local(src);
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
zs_unmap_object(zram->mem_pool, handle);
atomic64_add(comp_len, &zram->stats.compr_data_size);
out:
/*
* Free memory associated with this sector
* before overwriting unused sectors.
*/
zram_slot_lock(zram, index);
zram_free_page(zram, index);
if (comp_len == PAGE_SIZE) {
zram_set_flag(zram, index, ZRAM_HUGE);
atomic64_inc(&zram->stats.huge_pages);
atomic64_inc(&zram->stats.huge_pages_since);
}
if (flags) {
zram_set_flag(zram, index, flags);
zram_set_element(zram, index, element);
} else {
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, comp_len);
}
zram_slot_unlock(zram, index);
/* Update stats */
atomic64_inc(&zram->stats.pages_stored);
return ret;
}
/*
* This is a partial IO. Read the full page before writing the changes.
*/
static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
struct page *page = alloc_page(GFP_NOIO);
int ret;
if (!page)
return -ENOMEM;
ret = zram_read_page(zram, page, index, bio);
if (!ret) {
memcpy_from_bvec(page_address(page) + offset, bvec);
ret = zram_write_page(zram, page, index);
}
__free_page(page);
return ret;
}
static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
if (is_partial_io(bvec))
return zram_bvec_write_partial(zram, bvec, index, offset, bio);
return zram_write_page(zram, bvec->bv_page, index);
}
#ifdef CONFIG_ZRAM_MULTI_COMP
/*
* This function will decompress (unless it's ZRAM_HUGE) the page and then
* attempt to compress it using provided compression algorithm priority
* (which is potentially more effective).
*
* Corresponding ZRAM slot should be locked.
*/
static int zram_recompress(struct zram *zram, u32 index, struct page *page,
u32 threshold, u32 prio, u32 prio_max)
{
struct zcomp_strm *zstrm = NULL;
unsigned long handle_old;
unsigned long handle_new;
unsigned int comp_len_old;
unsigned int comp_len_new;
unsigned int class_index_old;
unsigned int class_index_new;
u32 num_recomps = 0;
void *src, *dst;
int ret;
handle_old = zram_get_handle(zram, index);
if (!handle_old)
return -EINVAL;
comp_len_old = zram_get_obj_size(zram, index);
/*
* Do not recompress objects that are already "small enough".
*/
if (comp_len_old < threshold)
return 0;
ret = zram_read_from_zspool(zram, page, index);
if (ret)
return ret;
class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
/*
* Iterate the secondary comp algorithms list (in order of priority)
* and try to recompress the page.
*/
for (; prio < prio_max; prio++) {
if (!zram->comps[prio])
continue;
/*
* Skip if the object is already re-compressed with a higher
* priority algorithm (or same algorithm).
*/
if (prio <= zram_get_priority(zram, index))
continue;
num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
ret = zcomp_compress(zstrm, src, &comp_len_new);
kunmap_local(src);
if (ret) {
zcomp_stream_put(zram->comps[prio]);
return ret;
}
class_index_new = zs_lookup_class_index(zram->mem_pool,
comp_len_new);
/* Continue until we make progress */
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
zcomp_stream_put(zram->comps[prio]);
continue;
}
/* Recompression was successful so break out */
break;
}
/*
* We did not try to recompress, e.g. when we have only one
* secondary algorithm and the page is already recompressed
* using that algorithm
*/
if (!zstrm)
return 0;
if (class_index_new >= class_index_old) {
/*
* Secondary algorithms failed to re-compress the page
* in a way that would save memory, mark the object as
* incompressible so that we will not try to compress
* it again.
*
* We need to make sure that all secondary algorithms have
* failed, so we test if the number of recompressions matches
* the number of active secondary algorithms.
*/
if (num_recomps == zram->num_active_comps - 1)
zram_set_flag(zram, index, ZRAM_INCOMPRESSIBLE);
return 0;
}
/* Successful recompression but above threshold */
if (threshold && comp_len_new >= threshold)
return 0;
/*
* No direct reclaim (slow path) for handle allocation and no
* re-compression attempt (unlike in zram_write_bvec()) since
* we already have stored that object in zsmalloc. If we cannot
* alloc memory for recompressed object then we bail out and
* simply keep the old (existing) object in zsmalloc.
*/
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zram->comps[prio]);
return PTR_ERR((void *)handle_new);
}
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
zcomp_stream_put(zram->comps[prio]);
zs_unmap_object(zram->mem_pool, handle_new);
zram_free_page(zram, index);
zram_set_handle(zram, index, handle_new);
zram_set_obj_size(zram, index, comp_len_new);
zram_set_priority(zram, index, prio);
atomic64_add(comp_len_new, &zram->stats.compr_data_size);
atomic64_inc(&zram->stats.pages_stored);
return 0;
}
#define RECOMPRESS_IDLE (1 << 0)
#define RECOMPRESS_HUGE (1 << 1)
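/*
* Accepted "key=value" arguments: type= (idle, huge or huge_idle),
* threshold= (only objects at least this large are recompressed) and
* algo= (restrict recompression to one registered secondary algorithm).
*/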
static ssize_t recompress_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
u32 prio = ZRAM_SECONDARY_COMP, prio_max = ZRAM_MAX_COMPS;
struct zram *zram = dev_to_zram(dev);
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
char *args, *param, *val, *algo = NULL;
u32 mode = 0, threshold = 0;
unsigned long index;
struct page *page;
ssize_t ret;
args = skip_spaces(buf);
while (*args) {
args = next_arg(args, &param, &val);
if (!val || !*val)
return -EINVAL;
if (!strcmp(param, "type")) {
if (!strcmp(val, "idle"))
mode = RECOMPRESS_IDLE;
if (!strcmp(val, "huge"))
mode = RECOMPRESS_HUGE;
if (!strcmp(val, "huge_idle"))
mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
continue;
}
if (!strcmp(param, "threshold")) {
/*
* We will re-compress only idle objects equal to or
* greater in size than the watermark.
*/
ret = kstrtouint(val, 10, &threshold);
if (ret)
return ret;
continue;
}
if (!strcmp(param, "algo")) {
algo = val;
continue;
}
}
if (threshold >= huge_class_size)
return -EINVAL;
down_read(&zram->init_lock);
if (!init_done(zram)) {
ret = -EINVAL;
goto release_init_lock;
}
if (algo) {
bool found = false;
for (; prio < ZRAM_MAX_COMPS; prio++) {
if (!zram->comp_algs[prio])
continue;
if (!strcmp(zram->comp_algs[prio], algo)) {
prio_max = min(prio + 1, ZRAM_MAX_COMPS);
found = true;
break;
}
}
if (!found) {
ret = -EINVAL;
goto release_init_lock;
}
}
page = alloc_page(GFP_KERNEL);
if (!page) {
ret = -ENOMEM;
goto release_init_lock;
}
ret = len;
for (index = 0; index < nr_pages; index++) {
int err = 0;
zram_slot_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
if (mode & RECOMPRESS_IDLE &&
!zram_test_flag(zram, index, ZRAM_IDLE))
goto next;
if (mode & RECOMPRESS_HUGE &&
!zram_test_flag(zram, index, ZRAM_HUGE))
goto next;
if (zram_test_flag(zram, index, ZRAM_WB) ||
zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
zram_test_flag(zram, index, ZRAM_SAME) ||
zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
goto next;
err = zram_recompress(zram, index, page, threshold,
prio, prio_max);
next:
zram_slot_unlock(zram, index);
if (err) {
ret = err;
break;
}
cond_resched();
}
__free_page(page);
release_init_lock:
up_read(&zram->init_lock);
return ret;
}
#endif
static void zram_bio_discard(struct zram *zram, struct bio *bio)
{
size_t n = bio->bi_iter.bi_size;
u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (bio->bi_iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
/*
* zram manages data in physical block size units. Because the logical
* block size isn't identical to the physical block size on some
* architectures, we could get a discard request pointing to a specific
* offset within a certain physical block. Although we could handle such
* a request by reading that physical block, decompressing it, partially
* zeroing it, re-compressing it and storing it again, this isn't
* reasonable because our intent with a discard request is to save
* memory. So skipping this logical block is appropriate here.
*/
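/*
* For example, assuming PAGE_SIZE == 4096: a discard of 4096 bytes
* starting at byte offset 512 into a page spans the tail of one page
* (3584 bytes) and the head of the next (512 bytes). Neither part
* covers a whole page, so no slot is freed for such a request.
*/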
if (offset) {
if (n <= (PAGE_SIZE - offset))
return;
n -= (PAGE_SIZE - offset);
index++;
}
while (n >= PAGE_SIZE) {
zram_slot_lock(zram, index);
zram_free_page(zram, index);
zram_slot_unlock(zram, index);
atomic64_inc(&zram->stats.notify_free);
index++;
n -= PAGE_SIZE;
}
bio_endio(bio);
}
static void zram_bio_read(struct zram *zram, struct bio *bio)
{
unsigned long start_time = bio_start_io_acct(bio);
struct bvec_iter iter = bio->bi_iter;
do {
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
struct bio_vec bv = bio_iter_iovec(bio, iter);
bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
if (zram_bvec_read(zram, &bv, index, offset, bio) < 0) {
atomic64_inc(&zram->stats.failed_reads);
bio->bi_status = BLK_STS_IOERR;
break;
}
flush_dcache_page(bv.bv_page);
zram_slot_lock(zram, index);
zram_accessed(zram, index);
zram_slot_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
bio_end_io_acct(bio, start_time);
bio_endio(bio);
}
static void zram_bio_write(struct zram *zram, struct bio *bio)
{
unsigned long start_time = bio_start_io_acct(bio);
struct bvec_iter iter = bio->bi_iter;
do {
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
struct bio_vec bv = bio_iter_iovec(bio, iter);
bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
atomic64_inc(&zram->stats.failed_writes);
bio->bi_status = BLK_STS_IOERR;
break;
}
zram_slot_lock(zram, index);
zram_accessed(zram, index);
zram_slot_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
bio_end_io_acct(bio, start_time);
bio_endio(bio);
}
/*
* Handler function for all zram I/O requests.
*/
static void zram_submit_bio(struct bio *bio)
{
struct zram *zram = bio->bi_bdev->bd_disk->private_data;
switch (bio_op(bio)) {
case REQ_OP_READ:
zram_bio_read(zram, bio);
break;
case REQ_OP_WRITE:
zram_bio_write(zram, bio);
break;
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, bio);
break;
default:
WARN_ON_ONCE(1);
bio_endio(bio);
}
}
static void zram_slot_free_notify(struct block_device *bdev,
unsigned long index)
{
struct zram *zram;
zram = bdev->bd_disk->private_data;
atomic64_inc(&zram->stats.notify_free);
if (!zram_slot_trylock(zram, index)) {
atomic64_inc(&zram->stats.miss_free);
return;
}
zram_free_page(zram, index);
zram_slot_unlock(zram, index);
}
static void zram_destroy_comps(struct zram *zram)
{
u32 prio;
for (prio = 0; prio < ZRAM_MAX_COMPS; prio++) {
struct zcomp *comp = zram->comps[prio];
zram->comps[prio] = NULL;
if (!comp)
continue;
zcomp_destroy(comp);
zram->num_active_comps--;
}
}
static void zram_reset_device(struct zram *zram)
{
down_write(&zram->init_lock);
zram->limit_pages = 0;
if (!init_done(zram)) {
up_write(&zram->init_lock);
return;
}
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
/* All pending I/O on all CPUs is done, so it's safe to free */
zram_meta_free(zram, zram->disksize);
zram->disksize = 0;
zram_destroy_comps(zram);
memset(&zram->stats, 0, sizeof(zram->stats));
reset_bdev(zram);
comp_algorithm_set(zram, ZRAM_PRIMARY_COMP, default_compressor);
up_write(&zram->init_lock);
}
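/*
* Typical usage sketch (the device name and size are examples only): the
* device must be sized via sysfs before it can be used, e.g.
*
*	echo 1G > /sys/block/zram0/disksize
*
* memparse() accepts plain byte counts as well as K/M/G suffixes, and the
* resulting size is rounded up to a multiple of PAGE_SIZE.
*/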
static ssize_t disksize_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
u64 disksize;
struct zcomp *comp;
struct zram *zram = dev_to_zram(dev);
int err;
u32 prio;
disksize = memparse(buf, NULL);
if (!disksize)
return -EINVAL;
down_write(&zram->init_lock);
if (init_done(zram)) {
pr_info("Cannot change disksize for initialized device\n");
err = -EBUSY;
goto out_unlock;
}
disksize = PAGE_ALIGN(disksize);
if (!zram_meta_alloc(zram, disksize)) {
err = -ENOMEM;
goto out_unlock;
}
for (prio = 0; prio < ZRAM_MAX_COMPS; prio++) {
if (!zram->comp_algs[prio])
continue;
comp = zcomp_create(zram->comp_algs[prio]);
if (IS_ERR(comp)) {
pr_err("Cannot initialise %s compressing backend\n",
zram->comp_algs[prio]);
err = PTR_ERR(comp);
goto out_free_comps;
}
zram->comps[prio] = comp;
zram->num_active_comps++;
}
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
return len;
out_free_comps:
zram_destroy_comps(zram);
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
return err;
}
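/*
* Writing any non-zero value resets the device back to its initial,
* un-sized state, e.g. (example device name):
*
*	echo 1 > /sys/block/zram0/reset
*
* The reset fails with -EBUSY while the device is still open or claimed.
*/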
static ssize_t reset_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
int ret;
unsigned short do_reset;
struct zram *zram;
struct gendisk *disk;
ret = kstrtou16(buf, 10, &do_reset);
if (ret)
return ret;
if (!do_reset)
return -EINVAL;
zram = dev_to_zram(dev);
disk = zram->disk;
mutex_lock(&disk->open_mutex);
/* Do not reset an active device or claimed device */
if (disk_openers(disk) || zram->claim) {
mutex_unlock(&disk->open_mutex);
return -EBUSY;
}
/* From now on, no one can open /dev/zram[0-9] */
zram->claim = true;
mutex_unlock(&disk->open_mutex);
/* Make sure all pending I/O is finished */
sync_blockdev(disk->part0);
zram_reset_device(zram);
mutex_lock(&disk->open_mutex);
zram->claim = false;
mutex_unlock(&disk->open_mutex);
return len;
}
static int zram_open(struct gendisk *disk, blk_mode_t mode)
{
struct zram *zram = disk->private_data;
WARN_ON(!mutex_is_locked(&disk->open_mutex));
/* zram was claimed for reset, so the open request fails */
if (zram->claim)
return -EBUSY;
return 0;
}
static const struct block_device_operations zram_devops = {
.open = zram_open,
.submit_bio = zram_submit_bio,
.swap_slot_free_notify = zram_slot_free_notify,
.owner = THIS_MODULE
};
static DEVICE_ATTR_WO(compact);
static DEVICE_ATTR_RW(disksize);
static DEVICE_ATTR_RO(initstate);
static DEVICE_ATTR_WO(reset);
static DEVICE_ATTR_WO(mem_limit);
static DEVICE_ATTR_WO(mem_used_max);
static DEVICE_ATTR_WO(idle);
static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
static DEVICE_ATTR_WO(writeback);
static DEVICE_ATTR_RW(writeback_limit);
static DEVICE_ATTR_RW(writeback_limit_enable);
#endif
#ifdef CONFIG_ZRAM_MULTI_COMP
static DEVICE_ATTR_RW(recomp_algorithm);
static DEVICE_ATTR_WO(recompress);
#endif
static struct attribute *zram_disk_attrs[] = {
&dev_attr_disksize.attr,
&dev_attr_initstate.attr,
&dev_attr_reset.attr,
&dev_attr_compact.attr,
&dev_attr_mem_limit.attr,
&dev_attr_mem_used_max.attr,
&dev_attr_idle.attr,
&dev_attr_max_comp_streams.attr,
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
&dev_attr_writeback.attr,
&dev_attr_writeback_limit.attr,
&dev_attr_writeback_limit_enable.attr,
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_bd_stat.attr,
#endif
&dev_attr_debug_stat.attr,
#ifdef CONFIG_ZRAM_MULTI_COMP
&dev_attr_recomp_algorithm.attr,
&dev_attr_recompress.attr,
#endif
NULL,
};
ATTRIBUTE_GROUPS(zram_disk);
/*
* Allocate and initialize a new zram device. The function returns
* a device_id >= 0 upon success, and a negative value otherwise.
*/
static int zram_add(void)
{
struct queue_limits lim = {
.logical_block_size = ZRAM_LOGICAL_BLOCK_SIZE,
/*
* To ensure that we always get PAGE_SIZE-aligned and
* n*PAGE_SIZE-sized I/O requests.
*/
.physical_block_size = PAGE_SIZE,
.io_min = PAGE_SIZE,
.io_opt = PAGE_SIZE,
.max_hw_discard_sectors = UINT_MAX,
/*
* zram_bio_discard() will clear all logical blocks if the logical
* block size is identical to the physical block size (PAGE_SIZE).
* But if they differ, we skip discarding the parts of logical
* blocks in the portion of the request range that isn't aligned to
* the physical block size, so we can't guarantee that all discarded
* logical blocks are zeroed.
*/
#if ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE
.max_write_zeroes_sectors = UINT_MAX,
#endif
};
struct zram *zram;
int ret, device_id;
zram = kzalloc(sizeof(struct zram), GFP_KERNEL);
if (!zram)
return -ENOMEM;
ret = idr_alloc(&zram_index_idr, zram, 0, 0, GFP_KERNEL);
if (ret < 0)
goto out_free_dev;
device_id = ret;
init_rwsem(&zram->init_lock);
#ifdef CONFIG_ZRAM_WRITEBACK
spin_lock_init(&zram->wb_limit_lock);
#endif
/* gendisk structure */
zram->disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
if (IS_ERR(zram->disk)) {
pr_err("Error allocating disk structure for device %d\n",
device_id);
ret = PTR_ERR(zram->disk);
goto out_free_idr;
}
zram->disk->major = zram_major;
zram->disk->first_minor = device_id;
zram->disk->minors = 1;
zram->disk->flags |= GENHD_FL_NO_PART;
zram->disk->fops = &zram_devops;
zram->disk->private_data = zram;
snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
/* Actual capacity set using sysfs (/sys/block/zram<id>/disksize) */
set_capacity(zram->disk, 0);
/* zram devices sort of resemble non-rotational disks */
blk_queue_flag_set(QUEUE_FLAG_NONROT, zram->disk->queue);
blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue);
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
ret = device_add_disk(NULL, zram->disk, zram_disk_groups);
if (ret)
goto out_cleanup_disk;
comp_algorithm_set(zram, ZRAM_PRIMARY_COMP, default_compressor);
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
out_cleanup_disk:
put_disk(zram->disk);
out_free_idr:
idr_remove(&zram_index_idr, device_id);
out_free_dev:
kfree(zram);
return ret;
}
static int zram_remove(struct zram *zram)
{
bool claimed;
mutex_lock(&zram->disk->open_mutex);
if (disk_openers(zram->disk)) {
mutex_unlock(&zram->disk->open_mutex);
return -EBUSY;
}
claimed = zram->claim;
if (!claimed)
zram->claim = true;
mutex_unlock(&zram->disk->open_mutex);
zram_debugfs_unregister(zram);
if (claimed) {
/*
* If we were claimed by reset_store(), del_gendisk() will
* wait until reset_store() is done, so there is nothing to do.
*/
;
} else {
/* Make sure all pending I/O is finished */
sync_blockdev(zram->disk->part0);
zram_reset_device(zram);
}
pr_info("Removed device: %s\n", zram->disk->disk_name);
del_gendisk(zram->disk);
/* del_gendisk drains pending reset_store */
WARN_ON_ONCE(claimed && zram->claim);
/*
* disksize_store() may be called in between zram_reset_device()
* and del_gendisk(), so run the last reset to avoid leaking
* anything allocated with disksize_store()
*/
zram_reset_device(zram);
put_disk(zram->disk);
kfree(zram);
return 0;
}
/* zram-control sysfs attributes */
/*
* NOTE: hot_add attribute is not the usual read-only sysfs attribute, in
* the sense that reading from this file does alter the state of your
* system: it creates a new un-initialized zram device and returns this
* device's device_id (or an error code if it fails to create a new
* device).
*/
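/*
* Example (illustrative output; the returned number is the new device's id):
*
*	cat /sys/class/zram-control/hot_add
*	2
*
* which in this hypothetical case creates /dev/zram2.
*/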
static ssize_t hot_add_show(const struct class *class,
const struct class_attribute *attr,
char *buf)
{
int ret;
mutex_lock(&zram_index_mutex);
ret = zram_add();
mutex_unlock(&zram_index_mutex);
if (ret < 0)
return ret;
return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
}
/* This attribute must be set to 0400, so CLASS_ATTR_RO() cannot be used */
static struct class_attribute class_attr_hot_add =
__ATTR(hot_add, 0400, hot_add_show, NULL);
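/*
* Example (removes the zram device with the given id, here assumed to be
* an existing, unused device):
*
*	echo 2 > /sys/class/zram-control/hot_remove
*/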
static ssize_t hot_remove_store(const struct class *class,
const struct class_attribute *attr,
const char *buf,
size_t count)
{
struct zram *zram;
int ret, dev_id;
/* dev_id is gendisk->first_minor, which is `int' */
ret = kstrtoint(buf, 10, &dev_id);
if (ret)
return ret;
if (dev_id < 0)
return -EINVAL;
mutex_lock(&zram_index_mutex);
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
if (!ret)
idr_remove(&zram_index_idr, dev_id);
} else {
ret = -ENODEV;
}
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
static CLASS_ATTR_WO(hot_remove);
static struct attribute *zram_control_class_attrs[] = {
&class_attr_hot_add.attr,
&class_attr_hot_remove.attr,
NULL,
};
ATTRIBUTE_GROUPS(zram_control_class);
static struct class zram_control_class = {
.name = "zram-control",
.class_groups = zram_control_class_groups,
};
static int zram_remove_cb(int id, void *ptr, void *data)
{
WARN_ON_ONCE(zram_remove(ptr));
return 0;
}
static void destroy_devices(void)
{
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}
static int __init zram_init(void)
{
int ret;
BUILD_BUG_ON(__NR_ZRAM_PAGEFLAGS > BITS_PER_LONG);
ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
zcomp_cpu_up_prepare, zcomp_cpu_dead);
if (ret < 0)
return ret;
ret = class_register(&zram_control_class);
if (ret) {
pr_err("Unable to register zram-control class\n");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return ret;
}
zram_debugfs_create();
zram_major = register_blkdev(0, "zram");
if (zram_major <= 0) {
pr_err("Unable to get major number\n");
class_unregister(&zram_control_class);
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return -EBUSY;
}
while (num_devices != 0) {
mutex_lock(&zram_index_mutex);
ret = zram_add();
mutex_unlock(&zram_index_mutex);
if (ret < 0)
goto out_error;
num_devices--;
}
return 0;
out_error:
destroy_devices();
return ret;
}
static void __exit zram_exit(void)
{
destroy_devices();
}
module_init(zram_init);
module_exit(zram_exit);
module_param(num_devices, uint, 0);
MODULE_PARM_DESC(num_devices, "Number of pre-created zram devices");
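/*
* Example (illustrative): pre-create four devices at module load time with
*
*	modprobe zram num_devices=4
*/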
MODULE_LICENSE("Dual BSD/GPL");
MODULE_AUTHOR("Nitin Gupta <ngupta@vflare.org>");
MODULE_DESCRIPTION("Compressed RAM Block Device");