The following is taking most of the time:

@Nullable private ServiceInfo lookupInRegisteredServices(String name) {
    for (ServiceInfo desc : registeredServices.values()) {
        if (desc.name().equals(name))
            return desc;
    }


    return null;
}

After changing that to use a Map lookup:

   - 50,000 service startup in *8s* (down from around 70s)
   - 100,000 service startup in *14s* (right around 2x of the 50K timing)
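For a rough sense of why the scan hurts, here is a small standalone sketch (not Ignite code). It assumes each registration triggers one lookup by name over the services registered so far; that call pattern is my assumption and I have not confirmed it in the Ignite source. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only: counts the work done by each lookup strategy. */
class LookupCostSketch {
    /** Total name comparisons when every new name is first searched for by linear scan. */
    static long scanComparisons(int n) {
        List<String> names = new ArrayList<>();
        long comparisons = 0;
        for (int i = 0; i < n; i++) {
            String name = "svc-" + i;
            // Linear scan over everything registered so far (never matches here).
            for (String existing : names) {
                comparisons++;
                if (existing.equals(name))
                    break;
            }
            names.add(name);
        }
        return comparisons;
    }

    /** Total map lookups for the same pattern: exactly one per registration. */
    static long mapLookups(int n) {
        Map<String, Integer> byName = new HashMap<>();
        long lookups = 0;
        for (int i = 0; i < n; i++) {
            String name = "svc-" + i;
            lookups++;
            byName.get(name); // constant-time on average
            byName.put(name, i);
        }
        return lookups;
    }
}
```

Under that assumption the scan performs n(n-1)/2 comparisons in total, so going from 20,000 to 50,000 services multiplies the scan work by about 6.25, while the map version only grows 2.5x. That is at least in the same ballpark as the superlinear growth I reported earlier.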


Here's the change I tested (note it's shortened) - it's not 100% complete, but
fine for my test case, I believe:

private final ConcurrentMap<String, ServiceInfo> registeredServicesByName =
    new ConcurrentHashMap<>();


@Nullable private ServiceInfo lookupInRegisteredServices(String name) {
    return registeredServicesByName.get(name);
}

private void registerService(ServiceInfo desc) {
    desc.context(ctx);


    // (CONCURRENCY NOTE: these two maps need to be updated together, atomically)
    registeredServices.put(desc.serviceId(), desc);
    registeredServicesByName.put(desc.name(), desc);
}


That's in IgniteServiceProcessor.java.

Any thoughts?  I'll gladly clean this up and make a PR - would appreciate
feedback to help address possible questions with this change (e.g. is
desc.name() unique?).
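To make the two open questions concrete (keeping both maps in step, and what happens if names are not unique), here is a minimal standalone sketch. It is not the Ignite implementation: ServiceRegistrySketch, Info, and the single-lock approach are all hypothetical, and it simply rejects a duplicate name, which may or may not be the right policy for IgniteServiceProcessor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical stand-in for a service registry with two indexes; not Ignite code. */
class ServiceRegistrySketch {
    /** Minimal stand-in for ServiceInfo: just an id and a name. */
    static final class Info {
        final UUID id;
        final String name;

        Info(UUID id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    private final Object lock = new Object();
    private final Map<UUID, Info> byId = new HashMap<>();
    private final Map<String, Info> byName = new HashMap<>();

    /** Adds the service to both indexes, or returns false if the name is already taken. */
    boolean register(Info info) {
        synchronized (lock) {
            if (byName.containsKey(info.name))
                return false; // assumed policy: names must be unique
            byId.put(info.id, info);
            byName.put(info.name, info);
            return true;
        }
    }

    /** Removes the service from both indexes; returns false if the id is unknown. */
    boolean unregister(UUID id) {
        synchronized (lock) {
            Info removed = byId.remove(id);
            if (removed == null)
                return false;
            byName.remove(removed.name);
            return true;
        }
    }

    /** Name lookup backed by the second index instead of a scan over all values. */
    Info lookupByName(String name) {
        synchronized (lock) {
            return byName.get(name);
        }
    }
}
```

A single lock keeps the two maps consistent at the cost of some concurrency; two ConcurrentHashMaps, as in my test change, avoid the lock but allow a window where one map has the entry and the other does not. Whichever way it goes, every place that removes from registeredServices would also need to remove from the by-name map.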

Art


On Tue, Jun 28, 2022 at 12:27 PM Arthur Naseef <artnas...@apache.org> wrote:

> Yes.  The "services" in our case will be schedules that periodically
> perform fast operations.
>
> For example a service could be, "ping this device every <x> seconds".
>
> Art
>
> On Tue, Jun 28, 2022 at 12:20 PM Pavel Tupitsyn <ptupit...@apache.org>
> wrote:
>
>> > we do not plan to make cross-cluster calls into the services
>>
>> If you are making local calls, I think there is no point in using Ignite
>> services.
>> Can you describe the use case - what are you trying to achieve?
>>
>> On Tue, Jun 28, 2022 at 8:55 PM Arthur Naseef <artnas...@apache.org>
>> wrote:
>>
>>> Hello - I'm getting started with Ignite and looking seriously at using
>>> it for a specific use-case.
>>>
>>> Working on a Proof-Of-Concept (POC), I am finding a question related to
>>> performance, and wondering if the solution, using Ignite Services, is a
>>> good fit for the use-case.
>>>
>>> In my testing, I am getting the following timings:
>>>
>>>    - Startup of 20,000 ignite services takes 30 seconds
>>>    - Startup of 50,000 ignite services takes 250 seconds
>>>    - The 2.5x increase from 20,000 to 50,000 yielded > 8x cost in
>>>    startup time (appears to be exponential growth)
>>>
>>> Watching the JVM during this time, I see the following:
>>>
>>>    - Heap usage is not significant (do not see signs of GC)
>>>    - CPU usage is only slightly increased - on the order of 20% total
>>>    (system has 12 cores/24 threads)
>>>    - Network utilization is reasonable
>>>    - Futex system call (measured with "strace -r") appears to be taking
>>>    the most time by far.
>>>
>>> The use-case involves the following:
>>>
>>>    - Startup of up-to hundreds-of-thousands of services at cluster
>>>    spin-up
>>>    - Frequent, small adjustments to the services running over time
>>>    - Need to rebalance when a new node joins the cluster, or an old one
>>>    leaves the cluster
>>>    - Once the services are deployed, we do not plan to make
>>>    cross-cluster calls into the services (i.e. we do *not* plan to use
>>>    ignite's services().serviceProxy() on these)
>>>    - Jobs don't look like a fit because these (1) are "long-running"
>>>    (actually periodically scheduled tasks) and (2) they need to redistribute
>>>    even after they start running
>>>
>>> This is starting to get long.  I have more details to share.  Here is
>>> the repo with the code being used to test, and a link to a wiki page with
>>> some of the details:
>>>
>>> https://github.com/opennms-forge/distributed-scheduling-poc/
>>>
>>>
>>> https://github.com/opennms-forge/distributed-scheduling-poc/wiki/Ignite-Startup-Performance
>>>
>>>
>>> Questions I have in mind:
>>>
>>>    - Are services a good fit here?  We expect to reach upwards of
>>>    500,000 services in a cluster with multiple nodes.
>>>    - Any thoughts on tracking down the bottleneck and alleviating it?
>>>    (I have started taking timing measurements in the Ignite code)
>>>
>>> Stopping here - please ask questions and I'll gladly fill in details.
>>> Any tips are welcome, including ideas for tracking down just where the
>>> bottleneck exists.
>>>
>>> Art
>>>
>>>
