Thank you for the feedback. Comments inline.
On Wed, Apr 8, 2015 at 3:05 PM, Andi Kleen <a...@firstfloor.org> wrote:
>
> Nuno Diegues <n...@ist.utl.pt> writes:
>
> What workloads did you test this on?

On the STAMP suite of benchmarks for transactional memory (described in
[1]). I ran an unmodified GCC 5.0.0 against a GCC patched with these
modifications and obtained the following speedups in STAMP with 4
threads (on a Haswell with 4 cores, averaged over 10 runs):

benchmark: speedup
genome: 1.32
intruder: 1.66
labyrinth: 1.00
ssca2: 1.02
yada: 1.00
kmeans-high: 1.13
kmeans-low: 1.10
vacation-high: 2.27
vacation-low: 1.88

[1] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, Kunle Olukotun:
STAMP: Stanford Transactional Applications for Multi-Processing.
IISWC 2008: 35-46

> Are you sure you need floating point here? If the program does not
> use it in any other ways faulting in the floating point state can be
> quite expensive.
>
> I bet fixed point would work for such simple purposes too.

That is a good point. While I have never used fixed-point arithmetic, a
cursory inspection suggests that it makes sense and is applicable to
this case. Are you aware of some place where this is already done
within GCC that I could use as inspiration, or should I craft some
macros from scratch?

> > + serial_lock.read_unlock(tx);
> > +
> > + // Obtain the delta performance with respect to the last period.
> > + uint64_t current_cycles = rdtsc();
> > + uint64_t cycles_used = current_cycles - optimizer.last_cycles;
>
> It may be worth pointing out that rdtsc does not return cycles.
> In fact the ratio to real cycles is variable depending on the changing
> frequency.
> I hope your algorithms can handle that.

The intent here is to obtain a low-cost notion of elapsed time, and
RDTSC seemed to be the best choice available. It is not critical that
frequency scaling changes how the returned value relates to actual CPU
cycles.
> > +
> > + // Compute gradient descent for the number of retries.
> > + double change_for_better = current_throughput / optimizer.last_throughput;
> > + double change_for_worse = optimizer.last_throughput / current_throughput;
> > + int32_t last_attempts = optimizer.last_attempts;
> > + int32_t current_attempts = optimizer.optimized_attempts;
> > + int32_t new_attempts = current_attempts;
> > + if (unlikely(change_for_worse > 1.40))
> > +   {
> > +     optimizer.optimized_attempts = optimizer.best_ever_attempts;
> > +     optimizer.last_throughput = current_throughput;
> > +     optimizer.last_attempts = current_attempts;
> > +     return;
> > +   }
> > +
> > + if (unlikely(random() % 100 < 1))
> > +   {
>
> So where is the seed for that random stored? Could you corrupt some
> user's random state? Is the state per thread or global?
> If it's per thread how do you initialize so that the threads do
> start with different seeds.
> If it's global what synchronizes it?

As I do not specify any seed, I was under the impression that there
would be a default initialization. Furthermore, the POSIX documentation
specifies random() to be MT-Safe, so I assumed its internal state to be
per thread. Did I misinterpret this?

With regard to the self-tuning state, it is kept in the
"gtm_global_optimizer optimizer" struct, which is in essence
multi-reader (any thread running transactions can check the struct to
use the parameters optimized in it) and single-writer (notice that the
"void GTM::gtm_thread::reoptimize_htm_execution()" function is called
only by one thread, the one at the end of the list of threads, i.e.,
the one whose tx->next_thread == NULL).

> Overall the algorithm looks very complicated with many mysterious magic
> numbers. Are there simplifications possible? While the retry path is not
> extremely critical it should be at least somewhat optimized, otherwise
> it will dominate the cost of short transactions.
>
> One problem with so many magic numbers is that they may be good on one
> system, but bad on another.

Notice that the retry path is barely changed in the common case: only a
designated thread (the last one in the list of threads registered in
libitm) periodically executes the re-optimization. Hence, most of the
code you can see in this patch is executed only in that uncommon case.

I understand the concern with magic numbers: we could self-tune them as
well, but that would surely increase the complexity of the patch :)
In essence, we have the following numbers at the moment:

* how often we re-optimize (every 500 successful transactions of the
  designated thread)
* the maximum number of attempts in hardware (20)
* how much better or worse the performance must change for the gradient
  descent to move to a new configuration (5%)
* how badly the performance must degrade for the gradient descent to
  roll back to the best configuration known so far (40%)
* how often the gradient descent explores randomly to avoid local
  optima (1%)

Once again, thank you for your time on this matter. Looking forward to
your comments.

--
Nuno Diegues

> -Andi
> --
> a...@linux.intel.com -- Speaking for myself only