Hugh notes in response to commit 4cb19355ea "device-dax: fail all
private mapping attempts":
"I think that is more restrictive than you intended: haven't tried, but I
believe it rejects a PROT_READ, MAP_SHARED, O_RDONLY fd mmap, leaving no
way to mmap /dev/dax without write permission to it."
Indeed it does restrict read-only mappings, switch to checking
VM_MAYSHARE, not VM_SHARED.
Cc: <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Fixes: 4cb19355ea ("device-dax: fail all private mapping attempts")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Here is an example /proc/iomem listing for a system with 2 namespaces,
one in "sector" mode and one in "memory" mode:
1fc000000-2fbffffff : Persistent Memory (legacy)
1fc000000-2fbffffff : namespace1.0
340000000-34fffffff : Persistent Memory
340000000-34fffffff : btt0.1
Here is the corresponding ndctl listing:
# ndctl list
[
{
"dev":"namespace1.0",
"mode":"memory",
"size":4294967296,
"blockdev":"pmem1"
},
{
"dev":"namespace0.0",
"mode":"sector",
"size":267091968,
"uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
"sector_size":4096,
"blockdev":"pmem0s"
}
]
Notice that the ndctl listing is purely in terms of namespace devices,
while the iomem listing leaks the internal "btt0.1" implementation
detail. Given that ndctl requires the namespace device name to change
the mode, for example:
# ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
...use the namespace name in the iomem listing to keep the claiming
device name consistent across different mode settings.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The device-dax implementation originally tried to be tricky and allow
private read-only mappings, but in the process allowed writable
MAP_PRIVATE + MAP_NORESERVE mappings. For simplicity and predictability
just fail all private mapping attempts since device-dax memory is
statically allocated and will never support overcommit.
Cc: <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: dee4107924 ("/dev/dax, core: file operations and dax-mmap")
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If the dax_pmem driver is passed a resource that is already busy the
driver probe attempt should fail with a message like the following:
dax_pmem dax0.1: could not reserve region [mem 0x100000000-0x11fffffff]
However, if we do not catch the error we crash for the obvious reason of
accessing memory that is not mapped.
BUG: unable to handle kernel paging request at ffffc90020001000
IP: [<ffffffff81496712>] __memcpy+0x12/0x20
[..]
Call Trace:
[<ffffffff815c4960>] ? nsio_rw_bytes+0x60/0x180
[<ffffffff815c6045>] nd_pfn_validate+0x75/0x320
[<ffffffff815c63a9>] nvdimm_setup_pfn+0xb9/0x5d0
[<ffffffff815c48ef>] ? devm_nsio_enable+0xff/0x110
[<ffffffff815cb699>] dax_pmem_probe+0x59/0x260
Cc: <stable@vger.kernel.org>
Fixes: ab68f26221 ("/dev/dax, pmem: direct access to persistent memory")
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We need to wait until the percpu_ref is released before exit. Otherwise,
we sometimes lose the race and trigger this new warning that was added
in v4.9 (commit a67823c1ed "percpu-refcount: init ->confirm_switch
member properly"):
WARNING: CPU: 0 PID: 3629 at lib/percpu-refcount.c:107 percpu_ref_exit+0x51/0x60
[..]
Call Trace:
[<ffffffff814bf093>] dump_stack+0x85/0xc2
[<ffffffff810b15db>] __warn+0xcb/0xf0
[<ffffffff810b170d>] warn_slowpath_null+0x1d/0x20
[<ffffffff814d70c1>] percpu_ref_exit+0x51/0x60
[<ffffffffa005706a>] dax_pmem_percpu_exit+0x1a/0x50 [dax_pmem]
[<ffffffff81615f1f>] devm_action_release+0xf/0x20
Cc: <stable@vger.kernel.org>
Fixes: ab68f26221 ("/dev/dax, pmem: direct access to persistent memory")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A bugfix just tried to address a randconfig build problem and introduced
a variant of the same problem: with CONFIG_LIBNVDIMM=y and
CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link:
drivers/nvdimm/built-in.o: In function `to_nd_device_type':
bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2':
region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax'
region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_probe':
region.c:(.text+0x70f3): undefined reference to `nd_dax_create'
drivers/nvdimm/built-in.o: In function `mode_show':
namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa55f): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa56e): undefined reference to `to_nd_dax'
This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again
as it should be (it gets linked into the libnvdimm module). To fix
the original problem, I'm adding a dependency on LIBNVDIMM to
DEV_DAX_PMEM, which ensures we can't have that one built-in if the
rest is a module.
Fixes: 4e65e9381c ("/dev/dax: fix Kconfig dependency build breakage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The dev_t variable in devm_create_dax_dev() is used before it's
first set:
drivers/dax/dax.c: In function 'devm_create_dax_dev':
drivers/dax/dax.c:205:39: error: 'dev_t' may be used uninitialized in this function [-Werror=maybe-uninitialized]
inode = iget5_locked(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/dax/dax.c:688:8: note: 'dev_t' was declared here
This reorders the code to how it looks correct to me.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 3bc52c45ba ("dax: define a unified inode/address_space for device-dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
pgoff_to_phys() validates that both the starting address and the length
of the mapping against the resource list. We need to check for a
mapping size of PMD_SIZE not PAGE_SIZE in the pmd fault path.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The data offset for a dax region needs to account for a reservation in
the resource range. Otherwise, device-dax is allowing mappings directly
into the memmap or device-info-block area with crash signatures like the
following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: get_zone_device_page+0x11/0x30
Call Trace:
follow_devmap_pmd+0x298/0x2c0
follow_page_mask+0x275/0x530
__get_user_pages+0xe3/0x750
__gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
tdp_page_fault+0x130/0x280 [kvm]
kvm_mmu_page_fault+0x5f/0xf0 [kvm]
handle_ept_violation+0x94/0x180 [kvm_intel]
vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
kvm_vcpu_ioctl+0x33c/0x620 [kvm]
do_vfs_ioctl+0xa2/0x5d0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1a/0xa4
Fixes: ab68f26221 ("/dev/dax, pmem: direct access to persistent memory")
Link: http://lkml.kernel.org/r/147205536732.1606.8994275381938837346.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the extents of a dax-device must match the alignment of the region.
Otherwise, we are unable to guarantee fault semantics of a given page
size. The region must be self-consistent itself as well.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In support of enabling resize / truncate of device-dax instances, define
a pseudo-fs to provide a unified inode/address space for vm operations.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A goal of the device-DAX interface is to be able to support many
exclusive allocations (partitions) of performance / feature
differentiated memory. This count may exceed the default minors limit
of 256.
As a result of switching to an embedded cdev the inode-to-dax_dev
conversion is simplified, as well as reference counting which can switch
to the cdev kobject lifetime.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The kref in dax_dev can be made redundant if the final put_device() on
the device associated with the dax_dev frees the dax_dev. This can be
accomplished by embedding a struct device in struct dax_dev, open coding
device_create() and specifying a custom release method.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Shorten the prefix of the file operations to distinguish them from
operations on the struct device associated with the dax_dev.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In order to convert devm_create_dax_dev() to use cdev, it will need
access to dax_fops. Move dax_fops and related function definitions
before devm_create_dax_dev().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
drivers/dax/dax.c:75:6: warning: symbol 'dax_region_put' was not declared.
drivers/dax/dax.c:95:19: warning: symbol 'alloc_dax_region' was not declared.
drivers/dax/dax.c:173:5: warning: symbol 'devm_create_dax_dev' was not declared.
drivers/dax/pmem.c:27:17: warning: symbol 'to_dax_pmem' was not declared.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If devm_add_action() fails, we are explicitly calling the cleanup to free
the resources allocated. Use the helper devm_add_action_or_reset()
and return directly in case of error, since the cleanup function
has been already called by the helper if there was any error.
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Vikas C Sajjan <vikas.cha.sajjan@hpe.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The "Device DAX" core enables dax mappings of performance / feature
differentiated memory. An open mapping or file handle keeps the backing
struct device live, but new mappings are only possible while the device
is enabled. Faults are handled under rcu_read_lock to synchronize
with the enabled state of the device.
Similar to the filesystem-dax case the backing memory may optionally
have struct page entries. However, unlike fs-dax there is no support
for private mappings, or mappings that are not backed by media (see
use of zero-page in fs-dax).
Mappings are always guaranteed to match the alignment of the dax_region.
If the dax_region is configured to have a 2MB alignment, all mappings
are guaranteed to be backed by a pmd entry. Contrast this determinism
with the fs-dax case where pmd mappings are opportunistic. If userspace
attempts to force a misaligned mapping, the driver will fail the mmap
attempt. See dax_dev_check_vma() for other scenarios that are rejected,
like MAP_PRIVATE mappings.
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
without need of an intervening file system. Device DAX is strict,
precise and predictable. Specifically this interface:
1/ Guarantees fault granularity with respect to a given page size (pte,
pmd, or pud) set at configuration time.
2/ Enforces deterministic behavior by being strict about what fault
scenarios are supported.
For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE
support device-dax guarantees that a mapping always behaves/performs the
same once established. It is the "what you see is what you get" access
mechanism to differentiated memory vs filesystem DAX which has
filesystem specific implementation semantics.
Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance differentiated memory
ranges.
This commit is limited to the base device driver infrastructure to
associate a dax device with pmem range.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>