Hi Ufuk, here is our preliminary input formar implementation: https://gist.github.com/anonymous/dbf05cad2a6cc07b8aa88e74a2c23119
if you need a running project, I will have to create a test one cause I cannot share the current configuration. thanks a lot in advance! 2016-03-30 10:13 GMT+02:00 Ufuk Celebi <u...@apache.org>: > Do you have the code somewhere online? Maybe someone can have a quick > look over it later. I'm pretty sure that is indeed a problem with the > custom input format. > > – Ufuk > > On Tue, Mar 29, 2016 at 3:50 PM, Stefano Bortoli <s.bort...@gmail.com> > wrote: > > Perhaps there is a misunderstanding on my side over the parallelism and > > split management given a data source. > > > > We started from the current JDBCInputFormat to make it multi-thread. > Then, > > given a space of keys, we create the splits based on a fetchsize set as a > > parameter. In the open, we get a connection from the pool, and execute a > > query using the split interval. This sets the 'resultSet', and then the > > DatasourceTask iterates between reachedEnd, next and close. On close, the > > connection is returned to the pool. We set parallelism to 32, and we > would > > expect 32 connection opened but the connections opened are just 8. > > > > We tried to make an example with the textinputformat, but being a > > delimitedinpurformat, the open is called sequentially when statistics are > > built, and then the processing is executed in parallel just after all the > > open are executed. This is not feasible in our case, because there would > be > > millions of queries before the statistics are collected. > > > > Perhaps we are doing something wrong, still to figure out what. :-/ > > > > thanks a lot for your help. > > > > saluti, > > Stefano > > > > > > 2016-03-29 13:30 GMT+02:00 Stefano Bortoli <s.bort...@gmail.com>: > >> > >> That is exactly my point. I should have 32 threads running, but I have > >> only 8. 32 Task are created, but only only 8 are run concurrently. > Flavio > >> and I will try to make a simple program to produce the problem. If we > solve > >> our issues on the way, we'll let you know. > >> > >> thanks a lot anyway. > >> > >> saluti, > >> Stefano > >> > >> 2016-03-29 12:44 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > >>> > >>> Then it shouldn’t be a problem. The ExeuctionContetxt is used to run > >>> futures and their callbacks. But as Ufuk said, each task will spawn > it’s own > >>> thread and if you set the parallelism to 32 then you should have 32 > threads > >>> running. > >>> > >>> > >>> On Tue, Mar 29, 2016 at 12:29 PM, Stefano Bortoli <s.bort...@gmail.com > > > >>> wrote: > >>>> > >>>> In fact, I don't use it. I just had to crawl back the runtime > >>>> implementation to get to the point where parallelism was switching > from 32 > >>>> to 8. > >>>> > >>>> saluti, > >>>> Stefano > >>>> > >>>> 2016-03-29 12:24 GMT+02:00 Till Rohrmann <till.rohrm...@gmail.com>: > >>>>> > >>>>> Hi, > >>>>> > >>>>> for what do you use the ExecutionContext? That should actually be > >>>>> something which you shouldn’t be concerned with since it is only used > >>>>> internally by the runtime. > >>>>> > >>>>> Cheers, > >>>>> Till > >>>>> > >>>>> > >>>>> On Tue, Mar 29, 2016 at 12:09 PM, Stefano Bortoli < > s.bort...@gmail.com> > >>>>> wrote: > >>>>>> > >>>>>> Well, in theory yes. Each task has a thread, but only a number is > run > >>>>>> in parallel (the job of the scheduler). Parallelism is set in the > >>>>>> environment. However, whereas the parallelism parameter is set and > read > >>>>>> correctly, when it comes to actual starting of the threads, the > number is > >>>>>> fix to 8. We run a debugger to get to the point where the thread was > >>>>>> started. As Flavio mentioned, the ExecutionContext has the > parallelims set > >>>>>> to 8. We have a pool of connections to a RDBS and il logs the > creation of > >>>>>> just 8 connections although parallelism is much higher. > >>>>>> > >>>>>> My question is whether this is a bug (or a feature) of the > >>>>>> LocalMiniCluster. :-) I am not scala expert, but I see some variable > >>>>>> assignment in setting up of the MiniCluster, involving parallelism > and > >>>>>> 'default values'. Default values in terms of parallelism are based > on the > >>>>>> number of cores. > >>>>>> > >>>>>> thanks a lot for the support! > >>>>>> > >>>>>> saluti, > >>>>>> Stefano > >>>>>> > >>>>>> 2016-03-29 11:51 GMT+02:00 Ufuk Celebi <u...@apache.org>: > >>>>>>> > >>>>>>> Hey Stefano, > >>>>>>> > >>>>>>> this should work by setting the parallelism on the environment, > e.g. > >>>>>>> > >>>>>>> env.setParallelism(32) > >>>>>>> > >>>>>>> Is this what you are doing? > >>>>>>> > >>>>>>> The task threads are not part of a pool, but each submitted task > >>>>>>> creates its own Thread. > >>>>>>> > >>>>>>> – Ufuk > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Mar 25, 2016 at 9:10 PM, Flavio Pompermaier > >>>>>>> <pomperma...@okkam.it> wrote: > >>>>>>> > Any help here? I think that the problem is that the JobManager > >>>>>>> > creates the > >>>>>>> > executionContext of the scheduler with > >>>>>>> > > >>>>>>> > val executionContext = ExecutionContext.fromExecutor(new > >>>>>>> > ForkJoinPool()) > >>>>>>> > > >>>>>>> > and thus the number of concurrently running threads is limited to > >>>>>>> > the number > >>>>>>> > of cores (using the default constructor of the ForkJoinPool). > >>>>>>> > What do you think? > >>>>>>> > > >>>>>>> > > >>>>>>> > On Wed, Mar 23, 2016 at 6:55 PM, Stefano Bortoli > >>>>>>> > <s.bort...@gmail.com> > >>>>>>> > wrote: > >>>>>>> >> > >>>>>>> >> Hi guys, > >>>>>>> >> > >>>>>>> >> I am trying to test a job that should run a number of tasks to > >>>>>>> >> read from a > >>>>>>> >> RDBMS using an improved JDBC connector. The connection and the > >>>>>>> >> reading run > >>>>>>> >> smoothly, but I cannot seem to be able to move above the limit > of > >>>>>>> >> 8 > >>>>>>> >> concurrent threads running. 8 is of course the number of cores > of > >>>>>>> >> my > >>>>>>> >> machine. > >>>>>>> >> > >>>>>>> >> I have tried working around configurations and settings, but the > >>>>>>> >> Executor > >>>>>>> >> within the ExecutionContext keeps on having a parallelism of 8. > >>>>>>> >> Although, of > >>>>>>> >> course, the parallelism of the execution environment is much > >>>>>>> >> higher (in fact > >>>>>>> >> I have many more tasks to be allocated). > >>>>>>> >> > >>>>>>> >> I feel it may be an issue of the LocalMiniCluster configuration > >>>>>>> >> that may > >>>>>>> >> just override/neglect my wish for higher degree of parallelism. > Is > >>>>>>> >> there a > >>>>>>> >> way for me to work around this issue? > >>>>>>> >> > >>>>>>> >> please let me know. Thanks a lot for you help! :-) > >>>>>>> >> > >>>>>>> >> saluti, > >>>>>>> >> Stefano > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >