block: make blk_init_free_list and elevator_init idempotent blk_init_allocated_queue_node may fail and the caller _could_ retry. Accommodate the unlikely event that blk_init_allocated_queue_node is called on an already initialized (possibly partially) request_queue. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
block: avoid unconditionally freeing previously allocated request_queue On blk_init_allocated_queue_node failure, only free the request_queue if it is wasn't previously allocated outside the block layer (e.g. blk_init_queue_node was blk_init_allocated_queue_node caller). This addresses an interface bug introduced by the following commit: 01effb0 block: allow initialization of previously allocated request_queue Otherwise the request_queue may be free'd out from underneath a caller that is managing the request_queue directly (e.g. caller uses blk_alloc_queue + blk_init_allocated_queue_node). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
block: allow initialization of previously allocated request_queue blk_init_queue() allocates the request_queue structure and then initializes it as needed (request_fn, elevator, etc). Split initialization out to blk_init_allocated_queue_node. Introduce blk_init_allocated_queue wrapper function to model existing blk_init_queue and blk_init_queue_node interfaces. Export elv_register_queue to allow a newly added elevator to be registered with sysfs. Export elv_unregister_queue for symmetry. These changes allow DM to initialize a device's request_queue with more precision. In particular, DM no longer unconditionally initializes a full request_queue (elevator et al). It only does so for a request-based DM device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blkio: Add io_merged stat This includes both the number of bios merged into requests belonging to this cgroup as well as the number of requests merged together. In the past, we've observed different merging behavior across upstream kernels, some by design some actual bugs. This stat helps a lot in debugging such problems when applications report decreased throughput with a new kernel version. This needed adding an extra elevator function to capture bios being merged as I did not want to pollute elevator code with blkiocg knowledge and hence needed the accounting invocation to come from CFQ. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
laptop-mode: Make flushes per-device One of the features of laptop-mode is that it forces a writeout of dirty pages if something else triggers a physical read or write from a device. The current implementation flushes pages on all devices, rather than only the one that triggered the flush. This patch alters the behaviour so that only the recently accessed block device is flushed, preventing other disks being spun up for no terribly good reason. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blkio: Increment the blkio cgroup stats for real now We also add start_time_ns and io_start_time_ns fields to struct request here to record the time when a request is created and when it is dispatched to device. We use ns uints here as ms and jiffies are not very useful for non-rotational media. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: Consolidate phys_segment and hw_segment limits Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Merge branch 'master' into for-2.6.34 Conflicts: include/linux/blkdev.h Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk-core: use BIO list management functions Now that the bio list management stuff is generic, convert generic_make_request to use bio lists instead of its own private bio list implementation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Revert "block: improve queue_should_plug() by looking at IO depths" This reverts commit fb1e75389bd06fd5987e9cda1b4e0305c782f854. "Benjamin S." <sbenni@gmx.de> reports that the patch in question causes a big drop in sequential throughput for him, dropping from 200MB/sec down to only 70MB/sec. Needs to be investigated more fully, for now lets just revert the offending commit. Conflicts: include/linux/blkdev.h Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: add helpers to run flush_dcache_page() against a bio and a request's pages Mtdblock driver doesn't call flush_dcache_page for pages in request. So, this causes problems on architectures where the icache doesn't fill from the dcache or with dcache aliases. The patch fixes this. The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid pointless empty cache-thrashing loops on architectures for which flush_dcache_page() is a no-op. Every architecture was provided with this flush pages on architectires where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is equal 1 or do nothing otherwise. See "fix mtd_blkdevs problem with caches on some architectures" discussion on LKML for more information. Signed-off-by: Ilya Loginov <isloginov@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Peter Horton <phorton@bitbox.co.uk> Cc: "Ed L. Cashin" <ecashin@coraid.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: silently error unsupported empty barriers too With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the following errors: end_request: I/O error, dev vda, sector 0 end_request: I/O error, dev vda, sector 0 The errors go away if dm stops submitting empty barriers, by reverting: commit 52b1fd5a27c625c78373e024bf570af3c9d44a79 Author: Mikulas Patocka <mpatocka@redhat.com> dm: send empty barriers to targets in dm_flush We should silently error all barriers, even empty barriers, on devices like virtio_blk which don't support them. See also: https://bugzilla.redhat.com/514901 Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: Seperate read and write statistics of in_flight requests v2 Commit a9327cac440be4d8333bba975cbbf76045096275 added seperate read and write statistics of in_flight requests. And exported the number of read and write requests in progress seperately through sysfs. But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange output from "iostat -kx 2". Global values for service time and utilization were garbage. For interval values, utilization was always 100%, and service time is higher than normal. So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b The problem was in part_round_stats_single(), I missed the following: if (now == part->stamp) return; - if (part->in_flight) { + if (part_in_flight(part)) { __part_stat_add(cpu, part, time_in_queue, part_in_flight(part) * (now - part->stamp)); __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); With this chunk included, the reported regression gets fixed. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> -- Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: get rid of kblock_schedule_delayed_work() It was briefly introduced to allow CFQ to to delayed scheduling, but we ended up removing that feature again. So lets kill the function and export, and just switch CFQ back to the normal work schedule since it is now passing in a '0' delay from all call sites. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Revert "Seperate read and write statistics of in_flight requests" This reverts commit a9327cac440be4d8333bba975cbbf76045096275. Corrado Zoccolo <czoccolo@gmail.com> reports: "with 2.6.32-rc1 I started getting the following strange output from "iostat -kx 2": Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 10,70 0,00 3,16 15,75 0,00 70,38 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 18,22 0,00 0,67 0,01 14,77 0,02 43,94 0,01 10,53 39043915,03 2629219,87 sdb 60,89 9,68 50,79 3,04 1724,43 50,52 65,95 0,70 13,06 488437,47 2629219,87 avg-cpu: %user %nice %system %iowait %steal %idle 2,72 0,00 0,74 0,00 0,00 96,53 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 6,68 0,00 0,99 0,00 0,00 92,33 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 4,40 0,00 0,73 1,47 0,00 93,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 4,00 0,00 3,00 0,00 28,00 18,67 0,06 19,50 333,33 100,00 Global values for service time and utilization are garbage. For interval values, utilization is always 100%, and service time is higher than normal. I bisected it down to: [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write statistics of in_flight requests and verified that reverting just that commit indeed solves the issue on 2.6.32-rc1." So until this is debugged, revert the bad commit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq-iosched: implement slower async initiate and queue ramp up This slowly ramps up the async queue depth based on the time passed since the sync IO, and doesn't allow async at all until a sync slice period has passed. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add a tracepoint for block request remapping Since 2.6.31 now has request-based device-mapper, it's useful to have a tracepoint for request-remapping as well as bio-remapping. This patch adds a tracepoint for request-remapping, trace_block_rq_remap(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: allow large discard requests Currently we set the bio size to the byte equivalent of the blocks to be trimmed when submitting the initial DISCARD ioctl. That means it is subject to the max_hw_sectors limitation of the HBA which is much lower than the size of a DISCARD request we can support. Add a separate max_discard_sectors tunable to limit the size for discard requests. We limit the max discard request size in bytes to 32bit as that is the limit for bio->bi_size. This could be much larger if we had a way to pass that information through the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
block: use normal I/O path for discard requests prepare_discard_fn() was being called in a place where memory allocation was effectively impossible. This makes it inappropriate for all but the most trivial translations of Linux's DISCARD operation to the block command set. Additionally adding a payload there makes the ownership of the bio backing unclear as it's now allocated by the device driver and not the submitter as usual. It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether the queue supports discard operations or not. blkdev_issue_discard now allocates a one-page, sector-length payload which is the right thing for the common ATA and SCSI implementations. The mtd implementation of prepare_discard_fn() is replaced with simply checking for the request being a discard. Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx> which did the prepare_discard_fn but not the different payload allocation yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...