linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-23 11:01:15 +00:00

Author	SHA1	Message	Date
David Sterba	907d2710d7	btrfs: add cancellable chunk relocation support Add support code that will allow canceling relocation on the chunk granularity. This is different and independent of balance, that also uses relocation but is a higher level operation and manages it's own state and pause/cancellation requests. Relocation is used for resize (shrink) and device deletion so this will be a common point to implement cancellation for both. The context is entirely in btrfs_relocate_block_group and btrfs_recover_relocation, enclosing one chunk relocation. The status bit is set and unset between the chunks. As relocation can take long, the effects may not be immediate and the request and actual action can slightly race. The fs_info::reloc_cancel_req is only supposed to be increased and does not pair with decrement like fs_info::balance_cancel_req. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:07 +02:00
David Sterba	0d7ed32c1e	btrfs: protect exclusive_operation by super_lock The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	24880be59c	btrfs: clean up header members offsets in write helpers Move header offsetof() to the expression that calculates the address so it's part of get_eb_offset_in_page where the 2nd parameter is the member offset. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	dfd29eed4a	btrfs: simplify eb checksum verification in btrfs_validate_metadata_buffer The verification copies the calculated checksum bytes to a temporary buffer but this is not necessary. We can map the eb header on the first page and use the checksum bytes directly. This saves at least one function call and boundary checks so it could lead to a minor performance improvement. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	ff14aa7987	btrfs: remove extra sb::s_id from message in btrfs_validate_metadata_buffer The s_id is already printed by message helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	282ab3ff16	btrfs: reduce compressed_bio members' types Several members of compressed_bio are of type that's unnecessarily big for the values that they'd hold: - the size of the uncompressed and compressed data is 128K now, we can keep is as int - same for number of pages - the compress type fits to a byte - the errors is 0/1 The size of the unpatched structure is 80 bytes with several holes. Reordering nr_pages next to the pages the hole after pending_bios is filled and the resulting size is 56 bytes. This keeps the csums array aligned to 8 bytes, which is nice. Further size optimizations may be possible but right now it looks good to me: struct compressed_bio { refcount_t pending_bios; /* 0 4 / unsigned int nr_pages; / 4 4 / struct page * compressed_pages; /* 8 8 / struct inode inode; /* 16 8 / u64 start; / 24 8 / unsigned int len; / 32 4 / unsigned int compressed_len; / 36 4 / u8 compress_type; / 40 1 / u8 errors; / 41 1 / / XXX 2 bytes hole, try to pack / int mirror_num; / 44 4 / struct bio orig_bio; /* 48 8 / u8 sums[]; / 56 0 / / size: 56, cachelines: 1, members: 12 / / sum members: 54, holes: 1, sum holes: 2 / / last cacheline: 56 bytes */ }; Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	49547068f6	btrfs: document byte swap optimization of root_item::flags accessors Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	7735cd755b	btrfs: scrub: factor out common scrub_stripe constraints There are common values set for the stripe constraints, some of them are already factored out. Do that for increment and mirror_num as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	1aeb6b563a	btrfs: clear log tree recovering status if starting transaction fails When a log recovery is in progress, lots of operations have to take that into account, so we keep this status per tree during the operation. Long time ago error handling revamp patch `79787eaab4` ("btrfs: replace many BUG_ONs with proper error handling") removed clearing of the status in an error branch. Add it back as was intended in `e02119d5a7` ("Btrfs: Add a write ahead tree log to optimize synchronous operations"). There are probably no visible effects, log replay is done only during mount and if it fails all structures are cleared so the stale status won't be kept. Fixes: `79787eaab4` ("btrfs: replace many BUG_ONs with proper error handling") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	6819703f5a	btrfs: clear defrag status of a root if starting transaction fails The defrag loop processes leaves in batches and starting transaction for each. The whole defragmentation on a given root is protected by a bit but in case the transaction fails, the bit is not cleared In case the transaction fails the bit would prevent starting defragmentation again, so make sure it's cleared. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
David Sterba	8c5ec99561	btrfs: sysfs: fix format string for some discard stats The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the format should be %llu, though the actual values would hardly ever overflow to negative values. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
Josef Bacik	5963ffcaf3	btrfs: always abort the transaction if we abort a trans handle While stress testing our error handling I noticed that sometimes we would still commit the transaction even though we had aborted the transaction. Currently we track if a trans handle has dirtied any metadata, and if it hasn't we mark the filesystem as having an error (so no new transactions can be started), but we will allow the current transaction to complete as we do not mark the transaction itself as having been aborted. This sounds good in theory, but we were not properly tracking IO errors in btrfs_finish_ordered_io, and thus committing the transaction with bogus free space data. This isn't necessarily a problem per-se with the free space cache, as the other guards in place would have kept us from accepting the free space cache as valid, but highlights a real world case where we had a bug and could have corrupted the filesystem because of it. This "skip abort on empty trans handle" is nice in theory, but assumes we have perfect error handling everywhere, which we clearly do not. Also we do not allow further transactions to be started, so all this does is save the last transaction that was happening, which doesn't necessarily gain us anything other than the potential for real corruption. Remove this particular bit of code, if we decide we need to abort the transaction then abort the current one and keep us from doing real harm to the file system, regardless of whether this specific trans handle dirtied anything or not. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:06 +02:00
Filipe Manana	0d7d316597	btrfs: don't set the full sync flag when truncation does not touch extents At btrfs_truncate() where we truncate the inode either to the same size or to a smaller size, we always set the full sync flag on the inode. This is needed in case the truncation drops or trims any file extent items that start beyond or cross the new inode size, so that the next fsync drops all inode items from the log and scans again the fs/subvolume tree to find all items that must be logged. However if the truncation does not drop or trims any file extent items, we do not need to set the full sync flag and force the next fsync to use the slow code path. So do not set the full sync flag in such cases. One use case where it is frequent to do truncations that do not change the inode size and do not drop any extents (no prealloc extents beyond i_size) is when running Microsoft's SQL Server inside a Docker container. One example workload is the one Philipp Fent reported recently, in the thread with a link below. In this workload a large number of fsyncs are preceded by such truncate operations. After this change I constantly get the runtime for that workload from Philipp to be reduced by about -12%, for example from 184 seconds down to 162 seconds. Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Filipe Manana	4f7e67378e	btrfs: fix misleading and incomplete comment of btrfs_truncate() The comment at the top of btrfs_truncate() mentions that csum items are dropped or truncated to the new i_size, but this is wrong and non sense, as they are unrelated to the i_size and are located in the csums tree and not on a tree with inode items (fs/subvolume tree or a log tree). Instead that claim applies to file extent items, so fix the comment to refer to them instead. While at it make the whole comment for the function more descriptive and follow the kernel doc style. Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Josef Bacik	04587ad9be	btrfs: abort transaction if we fail to update the delayed inode If we fail to update the delayed inode we need to abort the transaction, because we could leave an inode with the improper counts or some other such corruption behind. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Josef Bacik	bb385bedde	btrfs: fix error handling in __btrfs_update_delayed_inode If we get an error while looking up the inode item we'll simply bail without cleaning up the delayed node. This results in this style of warning happening on commit: WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90 CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90 RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000 RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060 RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000 R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00 R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90 FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0 Call Trace: btrfs_commit_transaction+0x43c/0xb00 ? finish_wait+0x80/0x80 ? vfs_fsync_range+0x90/0x90 iterate_supers+0x8c/0x100 ksys_sync+0x50/0x90 __do_sys_sync+0xa/0x10 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Because the iref isn't dropped and this leaves an elevated node->count, so any release just re-queues it onto the delayed inodes list. Fix this by going to the out label to handle the proper cleanup of the delayed node. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Josef Bacik	a4cb90dc01	btrfs: make btrfs_release_delayed_iref handle the !iref case Right now we only cleanup the delayed iref if we have BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error conditions that need to cleanup the iref if it still exists, so to make this code cleaner move the test_bit into btrfs_release_delayed_iref itself and unconditionally call it in each of the cases instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
David Sterba	eb3b505366	btrfs: scrub: per-device bandwidth control Add sysfs interface to limit io during scrub. We relied on the ionice interface to do that, eg. the idle class let the system usable while scrub was running. This has changed when mq-deadline got widespread and did not implement the scheduling classes. That was a CFQ thing that got deleted. We've got numerous complaints from users about degraded performance. Currently only BFQ supports that but it's not a common scheduler and we can't ask everybody to switch to it. Alternatively the cgroup io limiting can be used but that also a non-trivial setup (v2 required, the controller must be enabled on the system). This can still be used if desired. Other ideas that have been explored: piggy-back on ionice (that is set per-process and is accessible) and interpret the class and classdata as bandwidth limits, but this does not have enough flexibility as there are only 8 allowed and we'd have to map fixed limits to each value. Also adjusting the value would need to lookup the process that currently runs scrub on the given device, and the value is not sticky so would have to be adjusted each time scrub runs. Running out of options, sysfs does not look that bad: - it's accessible from scripts, or udev rules - the name is similar to what MD-RAID has (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max) - the value is sticky at least for filesystem mount time - adjusting the value has immediate effect - sysfs is available in constrained environments (eg. system rescue) - the limit also applies to device replace Sysfs: - raw value is in bytes - values written to the file accept suffixes like K, M - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max - 0 means use default priority of IO The scheduler is a simple deadline one and the accuracy is up to nearest 128K. Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Johannes Thumshirn	e7ff9e6b8e	btrfs: zoned: factor out zoned device lookup To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Tian Tao	50535db8fb	btrfs: return EAGAIN if defrag is canceled When inode defrag is canceled, the error is set to EAGAIN but then overwritten by number of defragmented bytes. As this would hide the error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it will print the error and continue to next file (as it does in for any other error). Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	1245835d24	btrfs: remove io_failure_record::in_validation The io_failure_record::in_validation was introduced to handle failed bio which cross several sectors. In such case, we still need to verify which sectors are corrupted. But since we've changed the way how we handle corrupted sectors, by only submitting repair for each corrupted sector, there is no need for extra validation any more. This patch will cleanup all io_failure_record::in_validation related code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	150e4b0597	btrfs: submit read time repair only for each corrupted sector Currently btrfs_submit_read_repair() has some extra check on whether the failed bio needs extra validation for repair. But we can avoid all these extra mechanisms if we submit the repair for each sector. By this, each read repair can be easily handled without the need to verify which sector is corrupted. This will also benefit subpage, as one subpage bvec can contain several sectors, making the extra verification more complex. So this patch will: - Introduce repair_one_sector() The main code submitting repair, which is more or less the same as old btrfs_submit_read_repair(). But this time, it only repairs one sector. - Make btrfs_submit_read_repair() to handle sectors differently There are 3 different cases: * Good sector We need to release the page and extent, set the range uptodate. * Bad sector and failed to submit repair bio We need to release the page and extent, but not set the range uptodate. * Bad sector but repair bio submitted The page and extent release will be handled by the submitted repair bio. Nothing needs to be done. Since btrfs_submit_read_repair() will handle the page and extent release now, we need to skip to next bvec even we hit some error. - Change the lifespan of @uptodate in end_bio_extent_readpage() Since now btrfs_submit_read_repair() will handle the full bvec which contains any corruption, we don't need to bother updating @uptodate bit anymore. Just let @uptodate to be local variable inside the main loop, so that any error from one bvec won't affect later bvec. - Only export btrfs_repair_one_sector(), unexport btrfs_submit_read_repair() The only outside caller for read repair is DIO, which already submits its repair for just one sector. Only export btrfs_repair_one_sector() for DIO. This patch will focus on the change on the repair path, the extra validation code is still kept as is, and will be cleaned up later. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	08508fea07	btrfs: make btrfs_verify_data_csum() to return a bitmap This will provide the basis for later per-sector repair for subpage, while still keeping the existing code happy. As if all csums match, the return value will be 0, same as now. Only when csum mismatches, the return value is different. The new return value will be a bitmap, for 4K sectorsize and 4K page size, it will be either 1, instead of the -EIO (which is not used directly by the callers, no effective change). But for 4K sectorsize and 64K page size, aka subpage case, since the bvec can contain multiple sectors, knowing which sectors are corrupted will allow us to submit repair only for corrupted sectors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Johannes Thumshirn	f4dcfb3045	btrfs: rename check_async_write and let it return bool The 'check_async_write' function is a helper used in 'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be used for metadata. Make the function return bool and get rid of the local variable async in btrfs_submit_metadata_bio storing the result of check_async_write's tests. As this is touching all function call sites, also rename it to should_async_write as this is more in line with the naming we use. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Johannes Thumshirn	06e1e7f422	btrfs: zoned: bail out if we can't read a reliable write pointer If we can't read a reliable write pointer from a sequential zone fail creating the block group with an I/O error. Also if the read write pointer is beyond the end of the respective zone, fail the creation of the block group on this zone with an I/O error. While this could also happen in real world scenarios with misbehaving drives, this issue addresses a problem uncovered by fstests' test case generic/475. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Naohiro Aota	47cdfb5e1d	btrfs: zoned: print message when zone sanity check type fails This extends patch `784daf2b96` ("btrfs: zoned: sanity check zone type"), the message was supposed to be there but was lost during merge. We want to make the error noticeable so add it. Fixes: `784daf2b96` ("btrfs: zoned: sanity check zone type") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	385f421f18	btrfs: handle preemptive delalloc flushing slightly differently If we decide to flush delalloc from the preemptive flusher, we really do not want to wait on ordered extents, as it gains us nothing. However there was logic to go ahead and wait on ordered extents if there was more ordered bytes than delalloc bytes. We do not want this behavior, so pass through whether this flushing is for preemption, and do not wait for ordered extents if that's the case. Also break out of the shrink loop after the first flushing, as we just want to one shot shrink delalloc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	3e10156997	btrfs: only ignore delalloc if delalloc is much smaller than ordered While testing heavy delalloc workloads I noticed that sometimes we'd just stop preemptively flushing when we had loads of delalloc available to flush. This is because we skip preemptive flushing if delalloc <= ordered. However if we start with say 4gib of delalloc, and we flush 2gib of that, we'll stop flushing there, when we still have 2gib of delalloc to flush. Instead adjust the ordered bytes down by half, this way if 2/3 of our outstanding delalloc reservations are tied up by ordered extents we don't bother preemptive flushing, as we're getting close to the state where we need to wait on ordered extents. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	30acce4eb0	btrfs: don't include the global rsv size in the preemptive used amount When deciding if we should preemptively flush space, we will add in the amount of space used by all block rsvs. However this also includes the global block rsv, which isn't flushable so shouldn't be accounted for in this calculation. If we decide to use ->bytes_may_use in our used calculation we need to subtract the global rsv size from this amount so it most closely matches the flushable space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	1239e2da16	btrfs: use the global rsv size in the preemptive thresh calculation We calculate the amount of "free" space available for normal reservations by taking the total space and subtracting out the hard used space, which is readonly, used, and reserved space. However we weren't taking into account the global block rsv, which is essentially hard used space. Handle this by subtracting it from the available free space, so that our threshold more closely mirrors reality. We need to do the check because it's possible that the global_rsv_size + used is > total_bytes, sometimes the global reserve can end up being calculated as larger than the available size (think small filesystems where we only have the original 8MiB chunk of metadata). It doesn't usually happen, but that can get us into trouble so this is safer. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	610a6ef44e	btrfs: take into account global rsv in need_preemptive_reclaim Global rsv can't be used for normal allocations, and for very full file systems we can decide to try and async flush constantly even though there's really not a lot of space to reclaim. Deal with this by including the global block rsv size in the "total used" calculation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	0aae4ca9e9	btrfs: only clamp the first time we have to start flushing We were clamping the threshold for preemptive reclaim any time we added a ticket to wait on, which if we have a lot of threads means we'd essentially max out the clamp the first time we start to flush. Instead of doing this, simply do it every time we have to start flushing, this will make us ramp up gradually instead of going to max clamping as soon as we start needing to do flushing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	ed738ba7f9	btrfs: check worker before need_preemptive_reclaim need_preemptive_reclaim() does some calculations, which aren't heavy, but if we're already running preemptive reclaim there's no reason to do them at all, so re-order the checks so that we don't do the calculation if we're already doing reclaim. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Su Yue	94358c35d8	btrfs: remove stale comment for argument seed of btrfs_find_device Commit `b2598edf8b` ("btrfs: remove unused argument seed from btrfs_find_device") removed the argument seed from btrfs_find_device but forgot the comment, so remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Goldwyn Rodrigues	dc56219fe2	btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage() try_lock_extent() returns 1 on success or 0 for failure and not an error code. If try_lock_extent() fails, read_extent_buffer_subpage() returns zero indicating subpage extent read success. Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the extent. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Linus Torvalds	13311e7425	Linux 5.13-rc7	2021-06-20 15:03:15 -07:00
Linus Torvalds	cba5e97280	- A single fix to restore fairness between control groups with equal priority -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO65oACgkQEsHwGGHe VUqEuw//R3hYrPE6luEeDb8V3er2QXJn320G0Jv2yqycmZlZYzN0tuCg0heRUWGS Mh8jnKlNZ0GWh/9CWApxQFbqMnt/G3kkLhMZRbIfNm/dvRxdVVB8MP98+5sgcMv0 mbuLNrrANB6HANMxs0lM0IvzH24YDDVI92J17zSZEi4mKjDPmHAvtZgjfhzlNU4p EZEMx7UZLyBsP4/cMXpq2SEmY8A3evjKkSjVIXhBMf929mpbGYzSM5RSPy+abGDt ai/E5644RT27HWo/g7mh/szC9OZbg6TNbsF9J6msInh6kCDLBv6Awh/0NUM3bDu4 r7H2qgxTIv3oisTZNf9qjyx1uStcJJDGF66t7NhYXRVqChQbOYNBoYJ85kVSh/FD jjiow9WtwYrbjQ8bT/+NkGu0poUL5gxnGqUfFyucofq46ct48+I36pnr+3V12OTj BxJb0sIDAJNPxv1NODpOMtEJJeiumtROU0usFHb9wnTz8jGhU/PFyKdKk8wa41f2 kG3fQp39gI6T+D0Va3tMCY3UNOFocDmDYhsYZfRtUPp+d1T2jTPP23Yx5XCQ6gew cxliIwttQAO6W4dJe4H5Txm5jmjzDfWJ9zrAXuABkVghRxXswiGHhhfUn2AhJK0w 9Sz+E7xYA/G5ht6pCvXpFfsJ0S0R4YFHddu7rS8S/EGNLe2729o= =KnMk -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "A single fix to restore fairness between control groups with equal priority" * tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Correctly insert cfs_rq's to list on unthrottle	2021-06-20 09:44:52 -07:00
Linus Torvalds	9df7f15ee9	A single fix for GICv3 to not take an interrupt in an NMI context. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO6X8ACgkQEsHwGGHe VUpz8hAAwaEfQmTDwJ+4PKOklmEdn7sgIeuIVXFrpCn1psDB0btqkrzAPkzPAM63 ISnvKuySId3uDQ5mmgLwaQDsa3j1yciOhdyA3dGBCidtR6zcm/hCM48y3iUs1kRH CS8Ai/MpUQzi8Y/bFDMkQ3yedQG5CMApy63xk3MVlNh+jUBZkQ3fSPynrVN+jVR0 nBbJXkKcMD7CGFgQnNO7weqnYJrcxWuQZSHALotJDBoVas0sgj97CLDDLmA5n8NW 42QLW7OUxEUkMfRWb/iCqkzZ7vrKVUHZC2d/rBzWCRIQy5SHIwwVg4FFeAr51cTt 72+MTf6lnA8aXQffVyvMPnHuhSp1ynin5NOMsu3YXMbF1lIU8ptKy5V3ttvF+Bcb cktI5i076PjScbvxbikTrI0QmLoeb2QTEnDgErrUN7CZLcVZQFLXtYbjrQvI9ycF 8Ezw3a76tIzb4uSaxar8e4Sn1lc8VpEDUMNlhAu1/g/mFzlHF86QLuh4y4mjrReV 9P1hMqtlfbDvLpQVu8S6KlrwXv+znqpRg9utA6SgJzt6yjnywOzlcptIdjHg3fE2 gxhGs7/edp3NkN936o5Bh5vydtrCGUMtAS2KcxuRXKyEdusOgYJUgo85/QX2mI1h xNzn1yVjQuvU/Hp91nM7V/9boeDYgGqzRCQKR9frVTDHilezn9Q= =Ss4I -----END PGP SIGNATURE----- Merge tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Borislav Petkov: "A single fix for GICv3 to not take an interrupt in an NMI context" * tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Workaround inconsistent PMR setting on NMI entry	2021-06-20 09:38:14 -07:00
Linus Torvalds	8363e795eb	A first set of urgent fixes to the FPU/XSTATE handling mess^W code. (There's a lot more in the pipe): - Prevent corruption of the XSTATE buffer in signal handling by validating what is being copied from userspace first. - Invalidate other task's preserved FPU registers on XRSTOR failure (#PF) because latter can still modify some of them. - Restore the proper PKRU value in case userspace modified it - Reset FPU state when signal restoring fails Other: - Map EFI boot services data memory as encrypted in a SEV guest so that the guest can access it and actually boot properly - Two SGX correctness fixes: proper resources freeing and a NUMA fix -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO5vQACgkQEsHwGGHe VUrUjw//fRU8BPZ3/SWNQO188QhHdFpm3jqtjRJsZD1FfnnLdxIg2SCP4RjFxv+Y eFyN0nYLekG8a3CMV081H9Rhr5tt3bflk0oTcGAar7m2qQiCiqaAH0wptIlQonSu nQCSs+PeaaK4nRCtW+TUJnwG0ZU/y7fEXa3pxJ6hSMnxZjz3lj70zKhpA1nQtqRZ OOStvBNtaWcDdTTE4r8XuFIxuMUUEuwHlQQmkAVHQYUf6vxGYfnDYEg83Wddvq1E 1leSRNFlLcCAbPUV/fax3KGvaekeJ1U411uWqXlain6m105+mk+irmrLxtur/lJ5 cWTVb5CbIHFZnJvC5jzNPv/03GbIIQaVm4jPI2qB1AZbjcVlAPKj1Ne+U1fzvmDT wNUob/rnIXiGptvtUMNYGURxBTj65Nnom3iAJV+AdMOThDwYMvsJJjFkMnC5wO2n ZAexumWPnUzWoxSMTraT7a6b/kilFUrcPljxSrFd9yVeU8E6a1OSW35oWoQ3itrc xx/ne8RodLmCPC9DjecFcQR+qUuXsF+XCCj07QpfKNTAObr17e9nsKJneR6MX79C Lpc7Ka/CiTGYcebWX7tqtjwGPfa6iqekswxYRRp7j54bQ4sHmKyordZy0Q8+c079 gmMlPdNbqQg3YwHyXW2yeJETDS1HBp61RRojAP15BsL73wyYQNE= =AuXr -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "A first set of urgent fixes to the FPU/XSTATE handling mess^W code. (There's a lot more in the pipe): - Prevent corruption of the XSTATE buffer in signal handling by validating what is being copied from userspace first. - Invalidate other task's preserved FPU registers on XRSTOR failure (#PF) because latter can still modify some of them. - Restore the proper PKRU value in case userspace modified it - Reset FPU state when signal restoring fails Other: - Map EFI boot services data memory as encrypted in a SEV guest so that the guest can access it and actually boot properly - Two SGX correctness fixes: proper resources freeing and a NUMA fix" * tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Avoid truncating memblocks for SGX memory x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed x86/fpu: Reset state for all signal restore failures x86/pkru: Write hardware init value to PKRU when xstate is init x86/process: Check PF_KTHREAD and not current->mm for kernel threads x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer x86/fpu: Prevent state corruption in __fpu__restore_sig() x86/ioremap: Map EFI-reserved memory as encrypted for SEV	2021-06-20 09:09:58 -07:00
Linus Torvalds	b84a7c286c	powerpc fixes for 5.13 #6 Fix initrd corruption caused by our recent change to use relative jump labels. Fix a crash using perf record on systems without a hardware PMU backend. Rework our 64-bit signal handling slighty to make it more closely match the old behaviour, after the recent change to use unsafe user accessors. Thanks to: Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel Axtens, Greg Kurz, Roman Bolshakov. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmDOeA0THG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgHZgD/9NbskRdhx9Vj+lWBCa8K37Cckf+aYu bQxszcVDA65xwhASk9CotSy6NC1HgyxB7n3VO7FbCku50JNapT85/Onl07R/Aiiv PhHOuDs5Gj8hB8rdpxYQjas3C2XW/UJR6ebogMcNxf4BN6fjHHoLbGmig1o+X2Jm rd9l5dWiiK27o8McqCsTESW1VKVtjov7owX7xh/HW/U6hbDkuLdVyMSViaisIwi2 I1wfzmMWcN8JpBUv1G7pWFuKgatsTfr2p3bsdVFmPl3LjUXXcyJ8zQS5yoV6uD5F laFx/BG1T06y1ny1yvEL3sTlNHE0tQPF+FVZ75hYmPKnE7tjo5rxeGFiul4RWo5q oiSVDOrjt2urNeRVv10oSCUgs2epRosIaTqXJx1JyK/yYF2oI4FvqkTMBjPxeGot ZHqFV8QNm19ZxlMtvzNSpagp6FX8kEOVxJaGersfmzBhSZudcR/VxX0086rWguw4 RA+0+qY6KGBi+qxylmyvzpJjsp59houykDNhhbED7tpQJACex7JMDJiiH88VP8l3 vf9h1z+NbBoiR3y0a/uv1nDTMLsTQ5PbcnNbZr9u+Oc5vwu8DP1gCq41lm6BOMbz F6IxdcOqBHn3HGM11ZtAt5u6ep3ZDfPx8lMth1z3kaAXjm3nlDAJPC4N//p6vz+a IzL1Iv0r2X7qvQ== =Qrh9 -----END PGP SIGNATURE----- Merge tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Fix initrd corruption caused by our recent change to use relative jump labels. Fix a crash using perf record on systems without a hardware PMU backend. Rework our 64-bit signal handling slighty to make it more closely match the old behaviour, after the recent change to use unsafe user accessors. Thanks to Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel Axtens, Greg Kurz, and Roman Bolshakov" * tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set powerpc: Fix initrd corruption with relative jump labels powerpc/signal64: Copy siginfo before changing regs->nip powerpc/mem: Add back missing header to fix 'no previous prototype' error	2021-06-19 16:50:23 -07:00
Linus Torvalds	913ec3c22e	perf tools fixes for v5.13: 6th batch - Fix refcount usage when processing PERF_RECORD_KSYMBOL. - 'perf stat' metric group fixes. - Fix 'perf test' non-bash issue with stat bpf counters. - Update unistd, in.h and socket.h with the kernel sources, silencing perf build warnings. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCYM4gswAKCRCyPKLppCJ+ J/iZAP9xGK0IMWv8UI9bD3Npmy/nswU6+aCuRQTHBTceiu1MDAD/a0LhcBVVbXdC Y60AZbUg0vlOB14GbURACIuW3kh/Ng8= =vzwg -----END PGP SIGNATURE----- Merge tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix refcount usage when processing PERF_RECORD_KSYMBOL. - 'perf stat' metric group fixes. - Fix 'perf test' non-bash issue with stat bpf counters. - Update unistd, in.h and socket.h with the kernel sources, silencing perf build warnings. * tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: tools headers UAPI: Sync linux/in.h copy with the kernel sources tools headers UAPI: Sync asm-generic/unistd.h with the kernel original perf beauty: Update copy of linux/socket.h with the kernel sources perf test: Fix non-bash issue with stat bpf counters perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL perf metricgroup: Return error code from metricgroup__add_metric_sys_event_iter() perf metricgroup: Fix find_evsel_group() event selector	2021-06-19 14:50:43 -07:00
Linus Torvalds	d9403d307d	RISC-V Fixes for 5.13-rc7 * A build fix to always build modules with the medany code model, as the module loader doesn't support medlow. * A Kconfig warning fix for the SiFive errata. * A pair of fixes that for regressions to the recent memory layout changes. * A fix for the FU740 device tree. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmDOEBITHHBhbG1lckBk YWJiZWx0LmNvbQAKCRAuExnzX7sYiZhsD/90A+I0pI0MWsjHYbzpSIK/jWKQ6kgY Y6HtzZrt31BX3Cq5IshOxsOlHyQkGiu8rHXH2kWPId3WX1bM0q2bSO3EoaTch5/s OJ2KKxlp2L758VRaK41ec098QHxjs4Iv1YwnNmcKhCkYvYNV1sq/Vp2RgBXVIoS+ Jk7sWPN+5CCCcDjKuig+mwZILwHlMsLtm9w61bB4+GWCz3OHKHSIWmPkSfmQT7Sk n/KvKrLAonVQxTuBI3syEywy5uHcBooSSNf8kMi8BGwbqmo6CI0CMO9dddeubizK ehmColwQWnke3fR02Mt2ATurGeE0epZHTND1ZTyOSCLUSbJecPsAhmGkUkjPoFoH wufhc+1+KHAT731/o9KQ4TMt0SQHIZHeMOxIBjnw9evg6pVx5iYistvKmfTvsf/j YsRGdrm7//HSZ406Who2qkiTPRYnKJdKpN4wac/XZ/NXN65LL05e5UPxjrH8RShs 2PpIdyNdfBx7d9VlRkHnNNNF7EFC1S5lg7SkW6CXqOTbfVHRO4szH6OUd6QdD83Y MqKRjT2VitnW+tFkX+iYOMEj4bAfFKlmVgjWXUMBJrgnDg6KiZEXCwNbynR/nZA+ 8ssbJqQ2oDZsR/G4s+ePzrNz8Pd2R/9gfmnx1nY016p3JJ/dvX8pBzjQN/X9r8SS 3dKmB8XZ9i/LUA== =HsNS -----END PGP SIGNATURE----- Merge tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A build fix to always build modules with the 'medany' code model, as the module loader doesn't support 'medlow'. - A Kconfig warning fix for the SiFive errata. - A pair of fixes that for regressions to the recent memory layout changes. - A fix for the FU740 device tree. * tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: dts: fu740: fix cache-controller interrupts riscv: Ensure BPF_JIT_REGION_START aligned with PMD size riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name riscv: sifive: fix Kconfig errata warning riscv32: Use medany C model for modules	2021-06-19 08:45:34 -07:00
Linus Torvalds	e14c779ade	- Fix zcrypt ioctl hang due to AP queue msg counter dropping below 0 when pending requests are purged. - Two fixes for the machine check handler in the entry code. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmDNvpgACgkQjYWKoQLX FBiuvQf/VrYPeFQotoit8ZQm5EUUMHeryxp3DVOeHFilPdANbVh1Q++NxZvf2XSw ejDUwTBuWaMmb2U8/7Yp+dbLYRu6XGKVKHLIArkXzsDzUNaJ+tMszAVKDK+SRHYe zaBYMG6Lfn/ByoME1rqQZ6e/dx+4f0rzorR5A8IlslsSFFyTzt8CSLnrDcu6aj4i bGSf8lS7dMoy49OYzuNYaPl2KcPEyrWMB//vuwj5jDru3eJcqW/m/q/7JRskTicm YsRw0x2waDs2x+SOQJ+haSrNRGY9YKSVUISVgDPX0Yz4PF9noNq1iOCRfu5JcrcR vlm3rJI8uFdZ8p4ZUMJHMS7T7Z5bRg== =GRT1 -----END PGP SIGNATURE----- Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix zcrypt ioctl hang due to AP queue msg counter dropping below 0 when pending requests are purged. - Two fixes for the machine check handler in the entry code. * tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/ap: Fix hanging ioctl caused by wrong msg counter s390/mcck: fix invalid KVM guest condition check s390/mcck: fix calculation of SIE critical section size	2021-06-19 08:39:13 -07:00
Arnaldo Carvalho de Melo	1792a59eab	tools headers UAPI: Sync linux/in.h copy with the kernel sources To pick the changes in: `3218274773` ("icmp: don't send out ICMP messages with a source address of 0.0.0.0") That don't result in any change in tooling, as INADDR_ are not used to generate id->string tables used by 'perf trace'. This addresses this build warning: Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h' diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h Cc: David S. Miller <davem@davemloft.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:15:22 -03:00
Arnaldo Carvalho de Melo	17d27fc314	tools headers UAPI: Sync asm-generic/unistd.h with the kernel original To pick the changes in: `8b1462b67f` ("quota: finish disable quotactl_path syscall") Those headers are used in some arches to generate the syscall table used in 'perf trace' to translate syscall numbers into strings. This addresses this perf build warning: Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h' diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h Cc: Jan Kara <jack@suse.cz> Cc: Marcin Juszkiewicz <marcin@juszkiewicz.com.pl> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:12:30 -03:00
Arnaldo Carvalho de Melo	ef83f9efe8	perf beauty: Update copy of linux/socket.h with the kernel sources To pick the changes in: `ea6932d70e` ("net: make get_net_ns return error if NET_NS is disabled") That don't result in any changes in the tables generated from that header. This silences this perf build warning: Warning: Kernel ABI header at 'tools/perf/trace/beauty/include/linux/socket.h' differs from latest version at 'include/linux/socket.h' diff -u tools/perf/trace/beauty/include/linux/socket.h include/linux/socket.h Cc: Changbin Du <changbin.du@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:09:08 -03:00
Ian Rogers	482698c2f8	perf test: Fix non-bash issue with stat bpf counters $(( .. )) is a bash feature but the test's interpreter is !/bin/sh, switch the code to use expr. Signed-off-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: bpf@vger.kernel.org Link: http://lore.kernel.org/lkml/20210617184216.2075588-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:06:46 -03:00
Riccardo Mancini	c087e9480c	perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL ASan reported a memory leak of BPF-related ksymbols map and dso. The leak is caused by refount never reaching 0, due to missing __put calls in the function machine__process_ksymbol_register. Once the dso is inserted in the map, dso__put() should be called (map__new2() increases the refcount to 2). The same thing applies for the map when it's inserted into maps (maps__insert() increases the refcount to 2). $ sudo ./perf record -- sleep 5 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ] ================================================================= ==297735==ERROR: LeakSanitizer: detected memory leaks Direct leak of 6992 byte(s) in 19 object(s) allocated from: #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7) #1 0x8e4e53 in map__new2 /home/user/linux/tools/perf/util/map.c:216:20 #2 0x8cf68c in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:778:10 [...] Indirect leak of 8702 byte(s) in 19 object(s) allocated from: #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7) #1 0x8728d7 in dso__new_id /home/user/linux/tools/perf/util/dso.c:1256:20 #2 0x872015 in dso__new /home/user/linux/tools/perf/util/dso.c:1295:9 #3 0x8cf623 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:774:21 [...] Indirect leak of 1520 byte(s) in 19 object(s) allocated from: #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7) #1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23 #2 0x888954 in map__process_kallsym_symbol /home/user/linux/tools/perf/util/symbol.c:710:8 [...] Indirect leak of 1406 byte(s) in 19 object(s) allocated from: #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7) #1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23 #2 0x8cfbd8 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:803:8 [...] Signed-off-by: Riccardo Mancini <rickyman7@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tommi Rantala <tommi.t.rantala@nokia.com> Link: http://lore.kernel.org/lkml/20210612173751.188582-1-rickyman7@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:06:46 -03:00
John Garry	fe7a98b9d9	perf metricgroup: Return error code from metricgroup__add_metric_sys_event_iter() The error code is not set at all in the sys event iter function. This may lead to an uninitialized value of "ret" in metricgroup__add_metric() when no CPU metric is added. Fix by properly setting the error code. It is not necessary to init "ret" to 0 in metricgroup__add_metric(), as if we have no CPU or sys event metric matching, then "has_match" should be 0 and "ret" is set to -EINVAL. However gcc cannot detect that it may not have been set after the map_for_each_metric() loop for CPU metrics, which is strange. Fixes: `be335ec28e` ("perf metricgroup: Support adding metrics for system PMUs") Signed-off-by: John Garry <john.garry@huawei.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/1623335580-187317-3-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:06:46 -03:00
John Garry	fc96ec4d5d	perf metricgroup: Fix find_evsel_group() event selector The following command segfaults on my x86 broadwell: $ ./perf stat -M frontend_bound,retiring,backend_bound,bad_speculation sleep 1 WARNING: grouped events cpus do not match, disabling group: anon group { raw 0x10e } anon group { raw 0x10e } perf: util/evsel.c:1596: get_group_fd: Assertion `!(!leader->core.fd)' failed. Aborted (core dumped) The issue shows itself as a use-after-free in evlist__check_cpu_maps(), whereby the leader of an event selector (evsel) has been deleted (yet we still attempt to verify for an evsel). Fundamentally the problem comes from metricgroup__setup_events() -> find_evsel_group(), and has developed from the previous fix attempt in commit `9c880c24cb` ("perf metricgroup: Fix for metrics containing duration_time"). The problem now is that the logic in checking if an evsel is in the same group is subtly broken for the "cycles" event. For the "cycles" event, the pmu_name is NULL; however the logic in find_evsel_group() may set an event matched against "cycles" as used, when it should not be. This leads to a condition where an evsel is set, yet its leader is not. Fix the check for evsel pmu_name by not matching evsels when either has a NULL pmu_name. There is still a pre-existing metric issue whereby the ordering of the metrics may break the 'stat' function, as discussed at: https://lore.kernel.org/lkml/49c6fccb-b716-1bf0-18a6-cace1cdb66b9@huawei.com/ Fixes: `9c880c24cb` ("perf metricgroup: Fix for metrics containing duration_time") Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> # On a Thinkpad T450S Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/1623335580-187317-2-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2021-06-19 10:06:46 -03:00

1 2 3 4 5 ...

1015492 commits