From: Alexander Atanasov <alexander.atana...@virtuozzo.com>

Create threads to execute pios in parallel - call them pio runners.
Use the number of CPUs to determine the number of threads started.

From the worker, each pio is sent to a runner thread in round-robin
fashion through work_llist. Maintain a count of the pios sent so we
can wait for them to be processed - note that we only want to keep
the order of execution of the different pio types, which can differ
from the order of their completion. We send a batch of pios to the
runners and, if necessary, wait for them to be processed before
moving forward - we need this for metadata writeback and flushes.
https://virtuozzo.atlassian.net/browse/VSTOR-91821
Signed-off-by: Alexander Atanasov <alexander.atana...@virtuozzo.com>

======
Patchset description:
ploop: optimisations and scaling

Ploop processes requests in different threads in parallel where
possible, which results in a significant improvement in performance
and makes further optimisations possible.

Known bugs:
 - delayed metadata writeback is not working and is missing error
   handling - patch to disable it until fixed
 - fast path is not working - causes rcu lockups - patch to disable it

Further improvements:
 - optimize md pages lookups

Alexander Atanasov (50):
  dm-ploop: md_pages map all pages at creation time
  dm-ploop: Use READ_ONCE/WRITE_ONCE to access md page data
  dm-ploop: fsync after all pios are sent
  dm-ploop: move md status to use proper bitops
  dm-ploop: convert wait_list and wb_batch_llist to use lockless lists
  dm-ploop: convert enospc handling to use lockless lists
  dm-ploop: convert suspended_pios list to use lockless list
  dm-ploop: convert the rest of the lists to use llist variant
  dm-ploop: combine processing of pios thru prepare list and remove fsync worker
  dm-ploop: move from wq to kthread
  dm-ploop: move preparations of pios into the caller from worker
  dm-ploop: fast path execution for reads
  dm-ploop: do not use a wrapper for set_bit to make a page writeback
  dm-ploop: BAT use only one list for writeback
  dm-ploop: make md writeback timeout to be per page
  dm-ploop: add interface to disable bat writeback delay
  dm-ploop: convert wb_batch_list to lockless variant
  dm-ploop: convert high_prio to status
  dm-ploop: split cow processing into two functions
  dm-ploop: convert md page rw lock to spin lock
  dm-ploop: convert bat_rwlock to bat_lock spinlock
  dm-ploop: prepare bat updates under bat_lock
  dm-ploop: make ploop_bat_write_complete ready for parallel pio completion
  dm-ploop: make ploop_submit_metadata_writeback return number of requests sent
  dm-ploop: introduce pio runner threads
  dm-ploop: add pio list ids to be used when passing pios to runners
  dm-ploop: process pios via runners
  dm-ploop: disable metadata writeback delay
  dm-ploop: disable fast path
  dm-ploop: use lockless lists for chained cow updates list
  dm-ploop: use lockless lists for data ready pios
  dm-ploop: give runner threads better name
  dm-ploop: resize operation - add holes bitmap locking
  dm-ploop: remove unnecessary operations
  dm-ploop: use filp per thread
  dm-ploop: catch if we try to advance pio past bio end
  dm-ploop: support REQ_FUA for data pios
  dm-ploop: proplerly access nr_bat_entries
  dm-ploop: fix locking and improve error handling when submitting pios
  dm-ploop: fix how ENOTBLK is handled
  dm-ploop: sync when suspended or stopping
  dm-ploop: rework bat completion logic
  dm-ploop: rework logic in pio processing
  dm-ploop: end fsync pios in parallel
  dm-ploop: make filespace preallocations async
  dm-ploop: resubmit enospc pios from dispatcher thread
  dm-ploop: dm-ploop: simplify discard completion
  dm-ploop: use GFP_ATOMIC instead of GFP_NOIO
  dm-ploop: fix locks used in mixed context
  dm-ploop: fix how current flags are managed inside threads

Andrey Zhadchenko (13):
  dm-ploop: do not flush after metadata writes
  dm-ploop: set IOCB_DSYNC on all FUA requests
  dm-ploop: remove extra ploop_cluster_is_in_top_delta()
  dm-ploop: introduce per-md page locking
  dm-ploop: reduce BAT accesses on discard completion
  dm-ploop: simplify llseek
  dm-ploop: speed up ploop_prepare_bat_update()
  dm-ploop: make new allocations immediately visible in BAT
  dm-ploop: drop ploop_cluster_is_in_top_delta()
  dm-ploop: do not wait for BAT update for non-FUA requests
  dm-ploop: add delay for metadata writeback
  dm-ploop: submit all postponed metadata on REQ_OP_FLUSH
  dm-ploop: handle REQ_PREFLUSH

Feature: dm-ploop: ploop target driver
---
 drivers/md/dm-ploop-map.c    | 136 ++++++++++++++++++++++++++++++++---
 drivers/md/dm-ploop-target.c |  44 ++++++++++-
 drivers/md/dm-ploop.h        |  13 +++-
 3 files changed, 175 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-ploop-map.c b/drivers/md/dm-ploop-map.c
index 818ebd4f052b..46e2ab8cfa1b 100644
--- a/drivers/md/dm-ploop-map.c
+++ b/drivers/md/dm-ploop-map.c
@@ -20,6 +20,8 @@
 #include "dm-ploop.h"
 #include "dm-rq.h"
 
+static inline int ploop_runners_add_work(struct ploop *ploop, struct pio *pio);
+
 #define PREALLOC_SIZE (128ULL * 1024 * 1024)
 
 static void ploop_handle_cleanup(struct ploop *ploop, struct pio *pio);
@@ -1892,6 +1894,11 @@ static void ploop_process_resubmit_pios(struct ploop *ploop,
 	}
 }
 
+static inline int ploop_runners_have_pending(struct ploop *ploop)
+{
+	return atomic_read(&ploop->kt_worker->inflight_pios);
+}
+
 static int ploop_submit_metadata_writeback(struct ploop *ploop)
 {
 	unsigned long flags;
@@ -1956,6 +1963,33 @@ static void process_ploop_fsync_work(struct ploop *ploop, struct llist_node *llf
 	}
 }
 
+static inline int ploop_runners_add_work(struct ploop *ploop, struct pio *pio)
+{
+	struct ploop_worker *wrkr;
+
+	wrkr = READ_ONCE(ploop->last_used_runner)->next;
+	WRITE_ONCE(ploop->last_used_runner, wrkr);
+
+	atomic_inc(&ploop->kt_worker->inflight_pios);
+	llist_add(&pio->llist, &wrkr->work_llist);
+	wake_up_process(wrkr->task);
+
+	return 0;
+}
+
+static inline int ploop_runners_add_work_list(struct ploop *ploop, struct llist_node *llist)
+{
+	struct llist_node *pos, *t;
+	struct pio *pio;
+
+	llist_for_each_safe(pos, t, llist) {
+		pio = llist_entry(pos, typeof(*pio), llist);
+		ploop_runners_add_work(ploop, pio);
+	}
+
+	return 0;
+}
+
 void do_ploop_run_work(struct ploop *ploop)
 {
 	LLIST_HEAD(deferred_pios);
@@ -2015,30 +2049,110 @@ void do_ploop_work(struct work_struct *ws)
 	do_ploop_run_work(ploop);
 }
 
-int ploop_worker(void *data)
+int ploop_pio_runner(void *data)
 {
 	struct ploop_worker *worker = data;
 	struct ploop *ploop = worker->ploop;
+	struct llist_node *llwork;
+	struct pio *pio;
+	struct llist_node *pos, *t;
+	unsigned int old_flags = current->flags;
+	int did_process_pios = 0;
 
 	for (;;) {
+		current->flags = old_flags;
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (kthread_should_stop()) {
-			__set_current_state(TASK_RUNNING);
-			break;
+check_for_more:
+		llwork = llist_del_all(&worker->work_llist);
+		if (!llwork) {
+			if (did_process_pios) {
+				did_process_pios = 0;
+				wake_up_interruptible(&ploop->dispatcher_wq_data);
+			}
+			/* Only stop when there are no more pios */
+			if (kthread_should_stop()) {
+				__set_current_state(TASK_RUNNING);
+				break;
+			}
+			schedule();
+			continue;
 		}
+		__set_current_state(TASK_RUNNING);
+		old_flags = current->flags;
+		current->flags |= PF_IO_THREAD|PF_LOCAL_THROTTLE|PF_MEMALLOC_NOIO;
+
+		llist_for_each_safe(pos, t, llwork) {
+			pio = llist_entry(pos, typeof(*pio), llist);
+			INIT_LIST_HEAD(&pio->list);
+			switch (pio->queue_list_id) {
+			case PLOOP_LIST_FLUSH:
+				WARN_ON_ONCE(1); /* We must not see flushes here */
+				break;
+			case PLOOP_LIST_PREPARE:
+				// fsync pios can come here for endio
+				// XXX: make it a FSYNC list
+				ploop_pio_endio(pio);
+				break;
+			case PLOOP_LIST_DEFERRED:
+				ploop_process_one_deferred_bio(ploop, pio);
+				break;
+			case PLOOP_LIST_COW:
+				ploop_process_one_delta_cow(ploop, pio);
+				break;
+			case PLOOP_LIST_DISCARD:
+				ploop_process_one_discard_pio(ploop, pio);
+				break;
+			// XXX: make it list MDWB
+			case PLOOP_LIST_INVALID: /* resubmit sets the list id to invalid */
+				ploop_submit_rw_mapped(ploop, pio);
+				break;
+			default:
+				WARN_ON_ONCE(1);
+			}
+			atomic_dec(&ploop->kt_worker->inflight_pios);
+		}
+		cond_resched();
+		did_process_pios = 1;
+		goto check_for_more;
+	}
+	return 0;
+}
+
+int ploop_worker(void *data)
+{
+	struct ploop_worker *worker = data;
+	struct ploop *ploop = worker->ploop;
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+
 		if (llist_empty(&ploop->pios[PLOOP_LIST_FLUSH]) &&
-		    llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]) &&
-		    llist_empty(&ploop->pios[PLOOP_LIST_DEFERRED]) &&
-		    llist_empty(&ploop->pios[PLOOP_LIST_DISCARD]) &&
-		    llist_empty(&ploop->pios[PLOOP_LIST_COW]) &&
-		    llist_empty(&ploop->llresubmit_pios)
-		)
+		    llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]) &&
+		    llist_empty(&ploop->pios[PLOOP_LIST_DEFERRED]) &&
+		    llist_empty(&ploop->pios[PLOOP_LIST_DISCARD]) &&
+		    llist_empty(&ploop->pios[PLOOP_LIST_COW]) &&
+		    llist_empty(&ploop->llresubmit_pios) &&
+		    !ploop->force_md_writeback) {
+			if (kthread_should_stop()) {
+				wait_event_interruptible(ploop->dispatcher_wq_data,
						(!ploop_runners_have_pending(ploop)));
+				__set_current_state(TASK_RUNNING);
+				break;
+			}
 			schedule();
+			/* now check for pending work */
+		}
 		__set_current_state(TASK_RUNNING);
 		do_ploop_run_work(ploop);
-		cond_resched();
+		cond_resched(); /* give other processes chance to run */
+		if (kthread_should_stop()) {
+			wait_event_interruptible(ploop->dispatcher_wq_data,
					(!ploop_runners_have_pending(ploop)));
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
 	}
 	return 0;
 }
diff --git a/drivers/md/dm-ploop-target.c b/drivers/md/dm-ploop-target.c
index dc63c18cece8..3fed26137831 100644
--- a/drivers/md/dm-ploop-target.c
+++ b/drivers/md/dm-ploop-target.c
@@ -164,6 +164,7 @@ static void ploop_destroy(struct ploop *ploop)
 	int i;
 
 	if (ploop->kt_worker) {
+		ploop->force_md_writeback = 1;
 		wake_up_process(ploop->kt_worker->task);
 		/* try to send all pending - if we have partial io and enospc end bellow */
 		while (!llist_empty(&ploop->pios[PLOOP_LIST_FLUSH]) ||
@@ -175,9 +176,22 @@ static void ploop_destroy(struct ploop *ploop)
 			schedule();
 		}
 
+		if (ploop->kt_runners) {
+			for (i = 0; i < ploop->nkt_runners; i++) {
+				if (ploop->kt_runners[i]) {
+					wake_up_process(ploop->kt_runners[i]->task);
+					kthread_stop(ploop->kt_runners[i]->task);
+					kfree(ploop->kt_runners[i]);
+				}
+			}
+		}
+
 		kthread_stop(ploop->kt_worker->task); /* waits for the thread to stop */
+
 		WARN_ON(!llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]));
 		WARN_ON(!llist_empty(&ploop->llresubmit_pios));
+		WARN_ON(!llist_empty(&ploop->enospc_pios));
+		kfree(ploop->kt_runners);
 		kfree(ploop->kt_worker);
 	}
 
@@ -347,7 +361,8 @@ ALLOW_ERROR_INJECTION(ploop_add_deltas_stack, ERRNO);
 		argv++;							\
 	} while (0);
 
-static struct ploop_worker *ploop_worker_create(struct ploop *ploop)
+static struct ploop_worker *ploop_worker_create(struct ploop *ploop,
+		int (*worker_fn)(void *), const char *pref, int id)
 {
 	struct ploop_worker *worker;
 	struct task_struct *task;
@@ -357,12 +372,13 @@ static struct ploop_worker *ploop_worker_create(struct ploop *ploop)
 		return NULL;
 
 	worker->ploop = ploop;
-	task = kthread_create(ploop_worker, worker, "ploop-%d-0",
-			      current->pid);
+	task = kthread_create(worker_fn, worker, "ploop-%d-%s-%d",
+			      current->pid, pref, id);
 	if (IS_ERR(task))
 		goto out_err;
 
 	worker->task = task;
+	init_llist_head(&worker->work_llist);
 
 	wake_up_process(task);
 
@@ -521,10 +537,30 @@ static int ploop_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto err;
 
-	ploop->kt_worker = ploop_worker_create(ploop);
+	init_waitqueue_head(&ploop->dispatcher_wq_data);
+
+	ploop->kt_worker = ploop_worker_create(ploop, ploop_worker, "d", 0);
 	if (!ploop->kt_worker)
 		goto err;
 
+/* make it a param = either module or cpu based or dev req queue */
+#define PLOOP_PIO_RUNNERS nr_cpu_ids
+	ploop->kt_runners = kcalloc(PLOOP_PIO_RUNNERS, sizeof(struct kt_worker *), GFP_KERNEL);
+	if (!ploop->kt_runners)
+		goto err;
+
+	ploop->nkt_runners = PLOOP_PIO_RUNNERS;
+	for (i = 0; i < ploop->nkt_runners; i++) {
+		ploop->kt_runners[i] = ploop_worker_create(ploop, ploop_pio_runner, "r", i+1);
+		if (!ploop->kt_runners[i])
+			goto err;
+	}
+
+	for (i = 0; i < ploop->nkt_runners-1; i++)
+		ploop->kt_runners[i]->next = ploop->kt_runners[i+1];
+	ploop->kt_runners[ploop->nkt_runners-1]->next = ploop->kt_runners[0];
+	ploop->last_used_runner = ploop->kt_runners[0];
+
 	ret = ploop_add_deltas_stack(ploop, &argv[0], argc);
 	if (ret)
 		goto err;
diff --git a/drivers/md/dm-ploop.h b/drivers/md/dm-ploop.h
index 10c8cf2e154a..de3987977d24 100644
--- a/drivers/md/dm-ploop.h
+++ b/drivers/md/dm-ploop.h
@@ -146,14 +146,17 @@ enum {
 struct ploop_worker {
 	struct ploop *ploop;
 	struct task_struct *task;
-	u64 kcov_handle;
+	struct llist_head work_llist;
+	atomic_t inflight_pios;
+	struct ploop_worker *next;
 };
 
 struct ploop {
+	struct wait_queue_head dispatcher_wq_data;
 	struct dm_target *ti;
 #define PLOOP_PRQ_POOL_SIZE 512 /* Twice nr_requests from blk_mq_init_sched() */
 	mempool_t *prq_pool;
-#define PLOOP_PIO_POOL_SIZE 256
+#define PLOOP_PIO_POOL_SIZE 512
 	mempool_t *pio_pool;
 
 	struct rb_root bat_entries;
@@ -198,7 +201,10 @@ struct ploop {
 	struct work_struct worker;
 	struct work_struct event_work;
 
-	struct ploop_worker *kt_worker;
+	struct ploop_worker *kt_worker; /* dispatcher thread */
+	struct ploop_worker **kt_runners; /* pio runners */
+	unsigned int nkt_runners;
+	struct ploop_worker *last_used_runner;
 	struct completion inflight_bios_ref_comp;
 	struct percpu_ref inflight_bios_ref[2];
 	bool inflight_ref_comp_pending;
@@ -611,6 +617,7 @@ extern void ploop_enospc_timer(struct timer_list *timer);
 extern loff_t ploop_llseek_hole(struct dm_target *ti, loff_t offset, int whence);
 
 extern int ploop_worker(void *data);
+extern int ploop_pio_runner(void *data);
 
 extern void ploop_disable_writeback_delay(struct ploop *ploop);
 extern void ploop_enable_writeback_delay(struct ploop *ploop);
-- 
2.43.5

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel