Re: Binary recovery for a very long time

Ilya Kasnacheev Mon, 18 May 2020 06:59:44 -0700

Hello!

The Direct IO module is experimental and should not be used unless you have
first tested its performance in your specific use case.


Regards,
-- 
Ilya Kasnacheev


On Mon, 18 May 2020 at 16:47, 38797715 <[email protected]> wrote:

> Hi,
>
> With direct IO disabled, startup is about twice as fast, and the same holds
> for some of our other tests. We find that direct IO has a large impact on
> read performance.
> On 2020/5/14 at 5:16 AM, Evgenii Zhuravlev wrote:
>
> Can you share full logs from all nodes?
>
> On Tue, 12 May 2020 at 18:24, 38797715 <[email protected]> wrote:
>
>> Hi Evgenii,
>>
>> The storage used is not SSD.
>>
>> We will run further tests with other versions of Ignite, such as
>> Ignite 2.8.
>> Ignite is configured as follows:
>> <?xml version="1.0" encoding="UTF-8"?>
>> <beans xmlns="http://www.springframework.org/schema/beans"
>>        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>        xsi:schemaLocation="
>>            http://www.springframework.org/schema/beans
>>            http://www.springframework.org/schema/beans/spring-beans.xsd">
>>     <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
>>         <property name="peerClassLoadingEnabled" value="true"/>
>>         <property name="consistentId" value="20"/>
>>         <property name="failureDetectionTimeout" value="120000"/>
>>         <property name="workDirectory" value="/appdata/ignite"/>
>>         <property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
>>         <property name="rebalanceThrottle" value="100"/>
>>         <property name="rebalanceThreadPoolSize" value="4"/>
>>         <property name="gridLogger">
>>             <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
>>                 <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
>>             </bean>
>>         </property>
>>         <property name="cacheConfiguration">
>>             <list>
>>                 <bean id="partitioned-cache-template" abstract="true"
>>                       class="org.apache.ignite.configuration.CacheConfiguration">
>>                     <property name="name" value="cache-partitioned*"/>
>>                     <property name="cacheMode" value="PARTITIONED"/>
>>                     <property name="backups" value="1"/>
>>                     <property name="queryParallelism" value="16"/>
>>                     <property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
>>                 </bean>
>>                 <bean id="replicated-cache-template" abstract="true"
>>                       class="org.apache.ignite.configuration.CacheConfiguration">
>>                     <property name="name" value="cache-replicated*"/>
>>                     <property name="cacheMode" value="REPLICATED"/>
>>                     <property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
>>                 </bean>
>>             </list>
>>         </property>
>>         <!-- Enabling Apache Ignite Persistent Store. -->
>>         <property name="dataStorageConfiguration">
>>             <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>                 <property name="defaultDataRegionConfiguration">
>>                     <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>                         <property name="persistenceEnabled" value="true"/>
>>                         <property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
>>                     </bean>
>>                 </property>
>>             </bean>
>>         </property>
>>     </bean>
>> </beans>
>> On 2020/5/13 at 4:45 AM, Evgenii Zhuravlev wrote:
>>
>> Hi,
>>
>> Can you share full logs and configuration? What disk do you use?
>>
>> Evgenii
>>
>> On Tue, 12 May 2020 at 06:49, 38797715 <[email protected]> wrote:
>>
>>> Broken down by cache:
>>> CO_CO_NEW: ~48 minutes (partitioned, backup=1, 33M)
>>>
>>> ignite-sys-cache: ~27 minutes
>>>
>>> PLM_ITEM: ~3 minutes (replicated, 1.9K)
>>>
>>>
>>> On 2020/5/12 at 9:08 PM, 38797715 wrote:
>>>
>>> Hi community,
>>>
>>> We have 5 servers, each with 16 cores, 256 GB of memory, and 200 GB of
>>> off-heap memory. We have 7 tables to test, and their data volumes are,
>>> respectively: 31.8M, 495.2M, 552.3M, 33M, 873.3K, 28M, and 1.9K. The
>>> last one is replicated; the others are partitioned (backup = 1).
>>>
>>> VM args: -server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC
>>> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails
>>> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
>>> -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true
>>> -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy
>>>
>>> Today, one of the servers had to be restarted (kill, then start
>>> ignite.sh), and the node took 1.5 hours to start, which was much
>>> longer than expected.
>>>
>>> Analyzing the log, we found the following:
>>> [2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager]
>>> Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb,
>>> pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
>>> [2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager]
>>> Binary memory state restored at node startup [restoredPtr=FileWALPointer
>>> [idx=10511, fileOff=51410110, len=0]]
>>> [2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager]
>>> Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/
>>> 0000000000000001.wal, offset=51410110, ver=2]
>>> [2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page
>>> memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB,
>>> checkpointBuffer=2.0 GiB]
>>> [2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=CO_CO_NEW, id=-189779360,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=CO_CO_LINE, id=-1588248812,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=ignite-sys-cache, id=-2100569601,
>>> dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=
>>> 2147483647, mvcc=false]
>>> [2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=PI_COM_DAY, id=-1904194728,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=PLM_ITEM, id=-1283854143,
>>> dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=
>>> 2147483647, mvcc=false]
>>> [2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=CO_CO, id=64322847,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started
>>> cache in recovery mode [name=CO_CUST, id=1684722246,
>>> dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1,
>>> mvcc=false]
>>> [2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager]
>>> Binary recovery performed in 4970233 ms.
>>>
>>> Binary recovery alone took 4970 seconds.
>>>
>>> Our questions are:
>>>
>>> 1. Why is the start time so long?
>>>
>>> 2. In Ignite's current state, will the restart time keep getting longer
>>> as the data volume on a single node grows?
>>>
>>> 3. Do you have any suggestions for optimizing the restart time?
>>>
>>>
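
The per-cache figures quoted in this thread (~48, ~27 and ~3 minutes) can be
reproduced from the timestamps of the "Started cache in recovery mode"
entries. Below is a minimal Python sketch of that arithmetic, not an Ignite
tool. It assumes each wrapped log entry has been joined back onto one line,
and it follows the poster's reading that the gap after an entry belongs to
the cache named in that entry, which the log itself does not confirm. The
name `recovery_gaps` and the trimmed sample lines are illustrative only.

```python
import re
from datetime import datetime

# Sample entries taken from the log quoted above. Each wrapped entry is joined
# onto one line, and the attributes after "id" are trimmed for brevity.
PREFIX = "[INFO ][main][GridCacheProcessor] Started cache in recovery mode "
LOG_LINES = [
    "[2020-05-12T17:02:08,528]" + PREFIX + "[name=CO_CO_NEW, id=-189779360]",
    "[2020-05-12T17:50:44,341]" + PREFIX + "[name=CO_CO_LINE, id=-1588248812]",
    "[2020-05-12T17:50:44,366]" + PREFIX + "[name=ignite-sys-cache, id=-2100569601]",
    "[2020-05-12T18:17:57,071]" + PREFIX + "[name=CO_CO_LINE_NEW, id=1742991829]",
    "[2020-05-12T18:19:54,910]" + PREFIX + "[name=PI_COM_DAY, id=-1904194728]",
    "[2020-05-12T18:19:54,949]" + PREFIX + "[name=PLM_ITEM, id=-1283854143]",
    "[2020-05-12T18:22:53,662]" + PREFIX + "[name=CO_CO, id=64322847]",
    "[2020-05-12T18:22:54,876]" + PREFIX + "[name=CO_CUST, id=1684722246]",
]

# Captures the leading timestamp and the cache name of each matching entry.
PATTERN = re.compile(
    r"^\[([^\]]+)\].*Started cache in recovery mode \[name=([^,\]]+)"
)


def recovery_gaps(lines):
    """Return [(cache_name, seconds)]: time from each entry to the next one.

    The last entry has no successor, so it does not appear in the result.
    """
    events = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
            events.append((match.group(2), stamp))
    return [
        (name, (next_stamp - stamp).total_seconds())
        for (name, stamp), (_, next_stamp) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    # Print the longest gaps first.
    for name, seconds in sorted(recovery_gaps(LOG_LINES), key=lambda g: -g[1]):
        print(f"{name:18s} {seconds / 60:6.1f} min")
```

Sorting by gap length puts the few long steps at the top, which makes it
easier to see which part of the full logs to look at first.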
