Re: threads, parallelism and task managers

Stefano Bortoli Wed, 13 Apr 2016 02:45:36 -0700

Sounds like you are damn right! Thanks for the insight; dumb of us for not
checking this before.


saluti,
Stefano

2016-04-13 11:05 GMT+02:00 Stephan Ewen <se...@apache.org>:

> This actually does not sound like a Flink issue. I would look into the
> commons pool docs.
> Maybe they size their pools by default with the number of cores, so the
> pool has only 8 threads, and other requests are queued?
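A minimal sketch of the effect suspected above, using only the JDK rather than commons-pool2 itself. The class and method names (PoolCapDemo, connectionsCreated) are hypothetical, and the Semaphore plus idle queue only stand in for a pool with a fixed upper bound. As far as I know, commons-pool2's GenericObjectPool defaults its maxTotal to 8 regardless of the core count and lets you raise it with setMaxTotal, but that is worth checking against the pool docs. With a cap of 8 and 16 tasks, no more than 8 connections are ever created, and the remaining tasks wait for one to be returned:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolCapDemo {

    /** Runs `tasks` threads against a pool capped at `maxTotal`; returns how many connections were created. */
    public static int connectionsCreated(int tasks, int maxTotal) {
        Semaphore permits = new Semaphore(maxTotal);          // the pool's size limit
        Queue<Object> idle = new ConcurrentLinkedQueue<>();   // connections returned to the pool
        AtomicInteger created = new AtomicInteger();

        Thread[] workers = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            workers[i] = new Thread(() -> {
                try {
                    permits.acquire();                        // waits while the pool is exhausted
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                Object conn = idle.poll();                    // reuse a returned connection if any
                if (conn == null) {
                    conn = new Object();                      // the "CREATING NEW CONNECTION!" case
                    created.incrementAndGet();
                }
                try {
                    Thread.sleep(50);                         // pretend to run the split's query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    idle.offer(conn);                         // return the connection first,
                    permits.release();                        // then let a waiting task in
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println("connections created: " + connectionsCreated(16, 8));
    }
}
```

All 16 tasks still run in their own threads here; only the number of pooled objects is bounded, which would match seeing 16 tasks but 8 connection log lines.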
>
> On Wed, Apr 13, 2016 at 10:29 AM, Flavio Pompermaier <pomperma...@okkam.it
> > wrote:
>
>> Any feedback about our JDBC InputFormat issue..?
>>
>> On Thu, Apr 7, 2016 at 12:37 PM, Flavio Pompermaier <pomperma...@okkam.it
>> > wrote:
>>
>>> We've finally created a running example (for Flink 0.10.2) of our
>>> improved JDBC InputFormat that you can run from an IDE (it creates an
>>> in-memory Derby database with 1000 rows and batches of 10) at
>>> https://gist.github.com/fpompermaier/bcd704abc93b25b6744ac76ac17ed351.
>>> The first time you run the program you have to comment out the following
>>> line:
>>>
>>>         stmt.executeUpdate("Drop Table users ");
>>>
>>> In your pom declare the following dependencies:
>>>
>>> <dependency>
>>> <groupId>org.apache.derby</groupId>
>>> <artifactId>derby</artifactId>
>>> <version>10.10.1.1</version>
>>> </dependency>
>>> <dependency>
>>> <groupId>org.apache.commons</groupId>
>>> <artifactId>commons-pool2</artifactId>
>>> <version>2.4.2</version>
>>> </dependency>
>>>
>>> On my laptop I have 8 cores, and if I set the parallelism to 16 I expect to
>>> see 16 calls to the connection pool (i.e. '==================== CREATING
>>> NEW CONNECTION!'), but I see only 8 (up to my maximum number of cores).
>>> The number of created tasks, however, is correct (16).
>>>
>>> I hope this helps in understanding where the problem is!
>>>
>>> Best and thanks in advance,
>>> Flavio
>>>
>>> On Wed, Mar 30, 2016 at 11:01 AM, Stefano Bortoli <s.bort...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ufuk,
>>>>
>>>> here is our preliminary input format implementation:
>>>> https://gist.github.com/anonymous/dbf05cad2a6cc07b8aa88e74a2c23119
>>>>
>>>> If you need a running project, I will have to create a test one because I
>>>> cannot share the current configuration.
>>>>
>>>> thanks a lot in advance!
>>>>
>>>>
>>>>
>>>> 2016-03-30 10:13 GMT+02:00 Ufuk Celebi <u...@apache.org>:
>>>>
>>>>> Do you have the code somewhere online? Maybe someone can have a quick
>>>>> look over it later. I'm pretty sure that is indeed a problem with the
>>>>> custom input format.
>>>>>
>>>>> – Ufuk
>>>>>
>>>>> On Tue, Mar 29, 2016 at 3:50 PM, Stefano Bortoli <s.bort...@gmail.com>
>>>>> wrote:
>>>>> > Perhaps there is a misunderstanding on my side about parallelism and
>>>>> > split management for a given data source.
>>>>> >
>>>>> > We started from the current JDBCInputFormat to make it multi-threaded.
>>>>> > Then, given a space of keys, we create the splits based on a fetch size
>>>>> > set as a parameter. In open, we get a connection from the pool and
>>>>> > execute a query using the split interval. This sets the 'resultSet', and
>>>>> > then the DataSourceTask iterates between reachedEnd, next and close. On
>>>>> > close, the connection is returned to the pool. We set the parallelism to
>>>>> > 32, and we would expect 32 connections to be opened, but only 8 are.
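The split creation described above could be sketched roughly as follows. This is not the code from the gist: KeyRangeSplits and its split method are hypothetical names, the keys are assumed to be a dense range of long values, and each returned pair is an inclusive [from, to] interval that one open call would turn into a single query.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyRangeSplits {

    /**
     * Cuts the inclusive key range [minKey, maxKey] into consecutive
     * intervals of at most fetchSize keys. Each element is {from, to}.
     */
    public static List<long[]> split(long minKey, long maxKey, long fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long from = minKey; from <= maxKey; from += fetchSize) {
            long to = Math.min(from + fetchSize - 1, maxKey);  // last interval may be shorter
            splits.add(new long[] {from, to});
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1000 keys in batches of 10, as in the Derby example from this thread
        List<long[]> splits = split(1, 1000, 10);
        System.out.println(splits.size() + " splits, first = ["
                + splits.get(0)[0] + ", " + splits.get(0)[1] + "]");
    }
}
```

Each interval would then be bound into something like a "WHERE id BETWEEN ? AND ?" query (the column name is an assumption), so the number of splits, not the parallelism, determines how many queries are issued in total.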
>>>>> >
>>>>> > We tried to make an example with the TextInputFormat, but since it is a
>>>>> > DelimitedInputFormat, open is called sequentially while the statistics
>>>>> > are built, and the processing is executed in parallel only after all the
>>>>> > opens have been executed. This is not feasible in our case, because there
>>>>> > would be millions of queries before the statistics are collected.
>>>>> >
>>>>> > Perhaps we are doing something wrong, still to figure out what. :-/
>>>>> >
>>>>> > thanks a lot for your help.
>>>>> >
>>>>> > saluti,
>>>>> > Stefano
>>>>> >
>>>>> >
>>>>> > 2016-03-29 13:30 GMT+02:00 Stefano Bortoli <s.bort...@gmail.com>:
>>>>> >>
>>>>> >> That is exactly my point. I should have 32 threads running, but I have
>>>>> >> only 8. 32 tasks are created, but only 8 are run concurrently. Flavio
>>>>> >> and I will try to make a simple program to reproduce the problem. If we
>>>>> >> solve our issues on the way, we'll let you know.
>>>>> >>
>>>>> >> thanks a lot anyway.
>>>>> >>
>>>>> >> saluti,
>>>>> >> Stefano
>>>>> >>
>>>>> >> 2016-03-29 12:44 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>>>>> >>>
>>>>> >>> Then it shouldn’t be a problem. The ExecutionContext is used to run
>>>>> >>> futures and their callbacks. But as Ufuk said, each task will spawn its
>>>>> >>> own thread, and if you set the parallelism to 32 then you should have 32
>>>>> >>> threads running.
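The point that one thread per task is not bounded by the core count can be checked with a small JDK-only sketch. TaskThreadsDemo and allRunConcurrently are hypothetical names and this is not how Flink starts its tasks; it only shows that 32 plain threads can all be alive at the same time on an 8-core machine, since each thread waits at a latch until every other thread has started.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskThreadsDemo {

    /** Starts one thread per task; returns true if all of them were running at the same time. */
    public static boolean allRunConcurrently(int parallelism) {
        CountDownLatch allStarted = new CountDownLatch(parallelism);
        CountDownLatch finished = new CountDownLatch(parallelism);
        AtomicInteger sawEveryone = new AtomicInteger();

        for (int i = 0; i < parallelism; i++) {
            new Thread(() -> {
                allStarted.countDown();                       // announce that this task is running
                try {
                    // Succeeds only if every other task thread has also started.
                    if (allStarted.await(5, TimeUnit.SECONDS)) {
                        sawEveryone.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            }).start();
        }
        try {
            finished.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sawEveryone.get() == parallelism;
    }

    public static void main(String[] args) {
        System.out.println("32 threads concurrent: " + allRunConcurrently(32));
    }
}
```

If the same check were run through a fixed pool of 8 worker threads instead, the latch would time out, which is the behaviour one would expect from a bounded pool rather than from per-task threads.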
>>>>> >>>
>>>>> >>>
>>>>> >>> On Tue, Mar 29, 2016 at 12:29 PM, Stefano Bortoli <
>>>>> s.bort...@gmail.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> In fact, I don't use it. I just had to trace back through the runtime
>>>>> >>>> implementation to get to the point where parallelism was switching
>>>>> >>>> from 32 to 8.
>>>>> >>>>
>>>>> >>>> saluti,
>>>>> >>>> Stefano
>>>>> >>>>
>>>>> >>>> 2016-03-29 12:24 GMT+02:00 Till Rohrmann <till.rohrm...@gmail.com
>>>>> >:
>>>>> >>>>>
>>>>> >>>>> Hi,
>>>>> >>>>>
>>>>> >>>>> what do you use the ExecutionContext for? That should actually be
>>>>> >>>>> something you shouldn’t be concerned with, since it is only used
>>>>> >>>>> internally by the runtime.
>>>>> >>>>>
>>>>> >>>>> Cheers,
>>>>> >>>>> Till
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Tue, Mar 29, 2016 at 12:09 PM, Stefano Bortoli <
>>>>> s.bort...@gmail.com>
>>>>> >>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Well, in theory yes. Each task has a thread, but only a certain
>>>>> >>>>>> number are run in parallel (the job of the scheduler). Parallelism is
>>>>> >>>>>> set in the environment. However, whereas the parallelism parameter is
>>>>> >>>>>> set and read correctly, when it comes to actually starting the
>>>>> >>>>>> threads, the number is fixed at 8. We ran a debugger to get to the
>>>>> >>>>>> point where the thread was started. As Flavio mentioned, the
>>>>> >>>>>> ExecutionContext has the parallelism set to 8. We have a pool of
>>>>> >>>>>> connections to an RDBMS, and it logs the creation of just 8
>>>>> >>>>>> connections although the parallelism is much higher.
>>>>> >>>>>>
>>>>> >>>>>> My question is whether this is a bug (or a feature) of the
>>>>> >>>>>> LocalMiniCluster. :-) I am not a Scala expert, but I see some variable
>>>>> >>>>>> assignments in the setup of the MiniCluster involving parallelism and
>>>>> >>>>>> 'default values'. Default values in terms of parallelism are based on
>>>>> >>>>>> the number of cores.
>>>>> >>>>>>
>>>>> >>>>>> thanks a lot for the support!
>>>>> >>>>>>
>>>>> >>>>>> saluti,
>>>>> >>>>>> Stefano
>>>>> >>>>>>
>>>>> >>>>>> 2016-03-29 11:51 GMT+02:00 Ufuk Celebi <u...@apache.org>:
>>>>> >>>>>>>
>>>>> >>>>>>> Hey Stefano,
>>>>> >>>>>>>
>>>>> >>>>>>> this should work by setting the parallelism on the environment, e.g.
>>>>> >>>>>>>
>>>>> >>>>>>> env.setParallelism(32)
>>>>> >>>>>>>
>>>>> >>>>>>> Is this what you are doing?
>>>>> >>>>>>>
>>>>> >>>>>>> The task threads are not part of a pool, but each submitted task
>>>>> >>>>>>> creates its own Thread.
>>>>> >>>>>>>
>>>>> >>>>>>> – Ufuk
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> On Fri, Mar 25, 2016 at 9:10 PM, Flavio Pompermaier
>>>>> >>>>>>> <pomperma...@okkam.it> wrote:
>>>>> >>>>>>> > Any help here? I think that the problem is that the JobManager
>>>>> >>>>>>> > creates the executionContext of the scheduler with
>>>>> >>>>>>> >
>>>>> >>>>>>> >        val executionContext = ExecutionContext.fromExecutor(new ForkJoinPool())
>>>>> >>>>>>> >
>>>>> >>>>>>> > and thus the number of concurrently running threads is limited to
>>>>> >>>>>>> > the number of cores (using the default constructor of the
>>>>> >>>>>>> > ForkJoinPool). What do you think?
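The claim about the default constructor is easy to check in isolation. The sketch below uses only java.util.concurrent.ForkJoinPool; the class name ForkJoinDefaults and its helper methods are hypothetical. It confirms that a no-argument ForkJoinPool takes its parallelism from the number of available processors, which explains the 8 seen in the debugger. Whether that pool limits the task threads is a separate question: according to the replies in this thread, it only runs futures and their callbacks.

```java
import java.util.concurrent.ForkJoinPool;

public class ForkJoinDefaults {

    /** Parallelism of a pool built with the no-argument constructor. */
    public static int defaultParallelism() {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.getParallelism();
        } finally {
            pool.shutdown();
        }
    }

    /** Parallelism of a pool built with an explicit target. */
    public static int explicitParallelism(int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.getParallelism();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cores:    " + Runtime.getRuntime().availableProcessors());
        System.out.println("default:  " + defaultParallelism());
        System.out.println("explicit: " + explicitParallelism(32));
    }
}
```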
>>>>> >>>>>>> >
>>>>> >>>>>>> >
>>>>> >>>>>>> > On Wed, Mar 23, 2016 at 6:55 PM, Stefano Bortoli
>>>>> >>>>>>> > <s.bort...@gmail.com>
>>>>> >>>>>>> > wrote:
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> Hi guys,
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> I am trying to test a job that should run a number of tasks to
>>>>> >>>>>>> >> read from an RDBMS using an improved JDBC connector. The connection
>>>>> >>>>>>> >> and the reading run smoothly, but I cannot seem to get above the
>>>>> >>>>>>> >> limit of 8 concurrently running threads. 8 is of course the number
>>>>> >>>>>>> >> of cores of my machine.
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> I have tried working around configurations and settings, but the
>>>>> >>>>>>> >> Executor within the ExecutionContext keeps having a parallelism of
>>>>> >>>>>>> >> 8, although the parallelism of the execution environment is of
>>>>> >>>>>>> >> course much higher (in fact I have many more tasks to be allocated).
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> I feel it may be an issue with the LocalMiniCluster configuration,
>>>>> >>>>>>> >> which may just override/neglect my wish for a higher degree of
>>>>> >>>>>>> >> parallelism. Is there a way for me to work around this issue?
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> Please let me know. Thanks a lot for your help! :-)
>>>>> >>>>>>> >>
>>>>> >>>>>>> >> saluti,
>>>>> >>>>>>> >> Stefano
>>>>> >>>>>>> >
>>>>> >>>>>>> >
>>>>> >>>>>>> >
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>
