Arnon Warshavsky <ar...@qwilt.com> writes: > Thanks Aaron > > Previously, it wasn't possible for mem_cfg_fd to be reused after a > > failure. Now it is - please reset it to -1. in these close conditions. > > Will do. > > Again, previously this would have aborted on a failure. So it needs to > be reset to a value that allows retry. > > Same here > > > switch (rte_config.process_type){ > > case RTE_PROC_PRIMARY: > > - rte_eal_config_create(); > > + if (rte_eal_config_create()) > > + return -1; > > break; > > case RTE_PROC_SECONDARY: > > - rte_eal_config_attach(); > > + if (rte_eal_config_attach()) > > + return -1; > > rte_eal_mcfg_wait_complete(rte_config.mem_config); > > break; > > case RTE_PROC_AUTO: > > case RTE_PROC_INVALID: > > Not for this patch, but I just noticed that this should probably use a > 'default' case. > > Will add this while Im here > > > Use rte_eal_init_alert to indicate why you are failing the init. > > Will do > > > if (rte_mp_channel_init() < 0) { > > rte_eal_init_alert("failed to init mp channel\n"); > > @@ -652,7 +676,8 @@ static void rte_eal_init_alert(const char *msg) > > > > eal_check_mem_on_local_socket(); > > > > - eal_thread_init_master(rte_config.master_lcore); > > + if (eal_thread_init_master(rte_config.master_lcore) != 0) > > + return -1; > > Is it ever possible to recover from this? > > > Definitely not recoverable, but not different than the other cases where > panic propagate all the way > up rather than aborting
Looking at the eal_thread_init_master, I think it's probably a recoverable condition. For instance, perhaps the core mask was wrong, and could be corrected by re-attempting the initialization. Just suggesting that it's probably okay to allow a re-attempt here. I would suggest: - eal_thread_init_master(rte_config.master_lcore); + if (eal_thread_init_master(rte_config.master_lcore) != 0) { + rte_eal_init_alert("Cannot assign master lcore\n"); + rte_errno = EINVAL; + return -1; + } if you agree. > Still needs > rte_eal_init_alert() call. > > Will do > > How are you cleaning up the threads that were spawned? Lets say this > loop will execute 5 times, and on the 3rd entry, these errors happen. > You now leave DPDK 'half-initialized' - you've spun up threads and > allocated memory. > > ... > > I don't see how any of this is better for the user. In fact, I think > this is worse because it will make portions of the application stop > working without any way to move forward. rte_panic() will at least give > the process a chance to recover from a potentially ephemeral condition. > > As I wrote in a different reply on this patch > I was probably too eager to get rid of this panic taking some wrong > assumptions on the way the > library will be called. Okay - guess emails got crossed in flight :) > Removing the panic from the thread is obviously more complex and also ABI > breaking. > From my own bw, I will not make it with a proper change to this version, so I > will revert back to > panicing on this patchset > and aim for the thread in the next build. I see. Most likely you'll need a proper initialization protocol both ways. As a brief example, you'll need something to guarantee the thread state (just a general outline): -- global_initial_state = UNINIT pthread_init_cond = PTHREAD_COND_INIT; global_spawned_ctr = atomic_ctr(0); rte_eal_init() ... global_initial_state = INITIALIZING thread_ctr = 0; ... for_each_lcore() spawn_thread() if (spawn_fails) global_initial_state = UNINIT; pthread_cond_broadcast() return error thread_ctr++; while (thread_ctr != global_spawned_ctr) /* wait? figure out when to declare extreme failure */ global_initial_state = THREAD_INITIALIZED pthread_cond_broadcast(pthread_init_cond) while (thread_ctr) wait_some_grace_period() for_each_lcore_thread() thread_state = lcore_state[thr]; /* probably needs a mem barrier*/ if (thread_state != THREAD_READY && thread_state != THREAD_STARTING) /* failure - message all threads to clean up */ if (thread_state == THREAD_READY) thread_ctr--; in eal_thread_loop(): /* before even the set_affinity */ atomic_inc(global_spawned_ctr); lcore_state[thr] = THREAD_STARTING; pthread_cond_wait(pthread_init_cond) if (global_initial_state != THREAD_INITIALIZED) lcore_state[thr] = THREAD_FAILED; return...; /* do all the normal checks... instead of the panic_state, just set lcore_state[thr] to THREAD_FAILED, clean up any additional allocated resources, and return... which will exit the thread */ lcore_state[thr] = THREAD_READY; -- In the above I hope it illustrates what you'll need - a way to signal to each side that initialization phase is happening, and that initialization was successful / failed, and to clean up in the failure case. Just meant for illustration so feel free to ignore / flame, but that's how I would go about removing the rte_panic() calls. > This seems to only exist as a way of triggering the run_once check in > the eal_init. It doesn't add anything except one more state variable to > check against. What is the purpose? > > > Actually this is not a run-once in purpose, rather an attempt to define a > state for the device > and on the way work around breaking abi on the the void function called > before that. I think it's a way to try and track state for initialization and to prevent future inits. Which ABI are you worried about? rte_panic()? I'm not sure how this is an ABI work around, but I'm probably not thinking about it hard enough. >> + if (rte_get_panic_state()) >> + return -1; >> + > > Please just use run_once. That's a better way of preventing this. > > > As stated above - no a run-once > > All of the comments from the bsd side apply here.