Re: Performance and large numbers of servers

Pavel Tupitsyn Tue, 28 Jun 2022 22:53:17 -0700

Thank you for tracking this down! An additional map by name is a good idea
there.
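
To illustrate why the extra map helps, here is a rough sketch that counts the work done by each approach. It assumes one by-name lookup per service being started, which is a guess about the call pattern and is not confirmed anywhere in this thread. The class LookupCost and its methods are hypothetical and are not Ignite code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Ignite.
final class LookupCost {
    /** Counts name comparisons when every new name is first searched for by scanning all values. */
    static long scanComparisons(int n) {
        Map<Integer, String> byId = new LinkedHashMap<>();
        long comparisons = 0;

        for (int i = 0; i < n; i++) {
            String name = "svc-" + i;

            // Same shape as the loop in lookupInRegisteredServices: walk every value.
            for (String existing : byId.values()) {
                comparisons++;

                if (existing.equals(name))
                    break;
            }

            byId.put(i, name);
        }

        return comparisons;
    }

    /** Counts map lookups for the same workload when a by-name index is kept. */
    static long indexedLookups(int n) {
        Map<String, Integer> byName = new HashMap<>();
        long lookups = 0;

        for (int i = 0; i < n; i++) {
            String name = "svc-" + i;

            lookups++;
            byName.get(name); // one hash lookup instead of a full scan

            byName.put(name, i);
        }

        return lookups;
    }
}
```

Under that assumption the scan performs n(n-1)/2 comparisons in total, while the indexed version performs n lookups, so the gap widens quickly as the number of services grows.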


> CONCURRENCY NOTE: these two maps need to update concurrently
All updates are triggered by discovery events, which are raised under
"synchronized (discoEvtMux)" in GridDiscoveryManager,
so it is safe to update two maps together.
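
As a minimal sketch of that pattern (not the actual IgniteServiceProcessor code): ServiceRegistry, its mux field, and the simplified ServiceInfo below are all hypothetical stand-ins. The mux object plays the role that discoEvtMux plays in GridDiscoveryManager, so every writer updates both maps while holding the same monitor.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the registry inside the service processor.
final class ServiceRegistry {
    /** Simplified stand-in for Ignite's ServiceInfo (assumption: only id and name matter here). */
    static final class ServiceInfo {
        private final UUID id;
        private final String name;

        ServiceInfo(UUID id, String name) {
            this.id = id;
            this.name = name;
        }

        UUID serviceId() { return id; }

        String name() { return name; }
    }

    /** Single monitor guarding all writes, analogous to discoEvtMux. */
    private final Object mux = new Object();

    private final Map<UUID, ServiceInfo> byId = new ConcurrentHashMap<>();
    private final Map<String, ServiceInfo> byName = new ConcurrentHashMap<>();

    /** Adds the descriptor to both indexes under one lock. */
    void register(ServiceInfo desc) {
        synchronized (mux) {
            byId.put(desc.serviceId(), desc);
            byName.put(desc.name(), desc);
        }
    }

    /** Removes the descriptor from both indexes under the same lock. */
    void unregister(UUID id) {
        synchronized (mux) {
            ServiceInfo desc = byId.remove(id);

            // Only drop the name entry if it still points at this descriptor.
            if (desc != null)
                byName.remove(desc.name(), desc);
        }
    }

    /** Lock-free read; returns null when no service has that name. */
    ServiceInfo lookup(String name) {
        return byName.get(name);
    }

    int size() {
        return byId.size();
    }
}
```

Any real change would also need to keep the by-name map in step wherever the original map has entries removed or replaced, which the shortened snippet quoted below does not show.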

>  is desc.name() unique?
Yes



On Wed, Jun 29, 2022 at 2:06 AM Arthur Naseef <artnas...@apache.org> wrote:

> The following is taking most of the time:
>
> @Nullable private ServiceInfo lookupInRegisteredServices(String name) {
>     for (ServiceInfo desc : registeredServices.values()) {
>         if (desc.name().equals(name))
>             return desc;
>     }
>
>
>     return null;
> }
>
> After changing that to use a Map lookup:
>
>    - 50,000 service startup in *8s* (down from around 70s)
>    - 100,000 service startup in *14s* (right around 2x of the 50K timing)
>
>
> Here's the change I tested (note it's shortened) - it's not 100%, but fine
> for my test case, I believe:
>
> private final ConcurrentMap<String, ServiceInfo> registeredServicesByName
> = new ConcurrentHashMap<>();
>
>
> @Nullable private ServiceInfo lookupInRegisteredServices(String name) {
>     return registeredServicesByName.get(name);
> }
>
> private void registerService(ServiceInfo desc) {
>     desc.context(ctx);
>
>
>     // (CONCURRENCY NOTE: these two maps need to update concurrently)
>     registeredServices.put(desc.serviceId(), desc);
>     registeredServicesByName.put(desc.name(), desc);
> }
>
>
> That's in IgniteServiceProcessor.java.
>
> Any thoughts?  I'll gladly clean this up and make a PR - would appreciate
> feedback to help address possible questions with this change (e.g. is
> desc.name() unique?).
>
> Art
>
>
> On Tue, Jun 28, 2022 at 12:27 PM Arthur Naseef <artnas...@apache.org>
> wrote:
>
>> Yes.  The "services" in our case will be schedules that periodically
>> perform fast operations.
>>
>> For example a service could be, "ping this device every <x> seconds".
>>
>> Art
>>
>> On Tue, Jun 28, 2022 at 12:20 PM Pavel Tupitsyn <ptupit...@apache.org>
>> wrote:
>>
>>> > we do not plan to make cross-cluster calls into the services
>>>
>>> If you are making local calls, I think there is no point in using Ignite
>>> services.
>>> Can you describe the use case - what are you trying to achieve?
>>>
>>> On Tue, Jun 28, 2022 at 8:55 PM Arthur Naseef <artnas...@apache.org>
>>> wrote:
>>>
>>>> Hello - I'm getting started with Ignite and looking seriously at using
>>>> it for a specific use-case.
>>>>
>>>> Working on a Proof-Of-Concept (POC), I am finding a question related to
>>>> performance, and wondering if the solution, using Ignite Services, is a
>>>> good fit for the use-case.
>>>>
>>>> In my testing, I am getting the following timings:
>>>>
>>>>    - Startup of 20,000 ignite services takes 30 seconds
>>>>    - Startup of 50,000 ignite services takes 250 seconds
>>>>    - The 2.5x increase from 20,000 to 50,000 yielded > 8x cost in
>>>>    startup time (appears to be superlinear growth)
>>>>
>>>> Watching the JVM during this time, I see the following:
>>>>
>>>>    - Heap usage is not significant (do not see signs of GC)
>>>>    - CPU usage is only slightly increased - on the order of 20% total
>>>>    (system has 12 cores/24 threads)
>>>>    - Network utilization is reasonable
>>>>    - Futex system call (measured with "strace -r") appears to be
>>>>    taking the most time by far.
>>>>
>>>> The use-case involves the following:
>>>>
>>>>    - Startup of up-to hundreds-of-thousands of services at cluster
>>>>    spin-up
>>>>    - Frequent, small adjustments to the services running over time
>>>>    - Need to rebalance when a new node joins the cluster, or an old
>>>>    one leaves the cluster
>>>>    - Once the services are deployed, we do not plan to make
>>>>    cross-cluster calls into the services (i.e. we do *not* plan to use
>>>>    ignite's services().serviceProxy() on these)
>>>>    - Jobs don't look like a fit because these (1) are "long-running"
>>>>    (actually periodically scheduled tasks) and (2) they need to
>>>>    redistribute even after they start running
>>>>
>>>> This is starting to get long.  I have more details to share.  Here is
>>>> the repo with the code being used to test, and a link to a wiki page with
>>>> some of the details:
>>>>
>>>> https://github.com/opennms-forge/distributed-scheduling-poc/
>>>>
>>>>
>>>> https://github.com/opennms-forge/distributed-scheduling-poc/wiki/Ignite-Startup-Performance
>>>>
>>>>
>>>> Questions I have in mind:
>>>>
>>>>    - Are services a good fit here?  We expect to reach upwards of
>>>>    500,000 services in a cluster with multiple nodes.
>>>>    - Any thoughts on tracking down the bottleneck and alleviating it?
>>>>    (I have started taking timing measurements in the Ignite code)
>>>>
>>>> Stopping here - please ask questions and I'll gladly fill in details.
>>>> Any tips are welcome, including ideas for tracking down just where the
>>>> bottleneck exists.
>>>>
>>>> Art
>>>>
>>>>
