The stack trace is below, but someone told me this is a known issue that will be patched in a couple of weeks (in EMR 4.1), so don't mind it; I'll wait until it's patched.
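For reference, the repro boils down to the following sketch (the bucket, prefix, and file names are placeholders for our real S3 layout):

// spark-shell on EMR (Spark 1.x with a Hive 0.13 metastore); paths are placeholders.
val path = "s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/" +
  "75e91844-2a87-4d8f-af9f-9268e34daef6-000000"

// Loading the ORC file succeeds and the schema is resolved correctly...
val ORCFile = sqlContext.read.format("orc").load(path)

// ...but the first action that forces ORC split generation dies with
// java.lang.ArrayIndexOutOfBoundsException: 3 in OrcInputFormat$SplitGenerator.
ORCFile.head

The full spark-shell session, with logs, follows: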
scala> val ORCFile = sqlContext.read.format("orc").load("s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000")
2015-09-13 07:33:29,228 INFO [main] fs.EmrFileSystem (EmrFileSystem.java:initialize(107)) - Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2015-09-13 07:33:29,314 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[CF49E1372BEF2E81], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[85.608], HttpRequestTime=[85.101], HttpClientReceiveResponseTime=[13.891], RequestSigningTime=[0.259], ResponseProcessingTime=[0.007], HttpClientSendRequestTime=[0.305],
2015-09-13 07:33:29,351 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[55B8C5E6009F0246], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[32.776], HttpRequestTime=[13.17], HttpClientReceiveResponseTime=[10.961], RequestSigningTime=[0.28], ResponseProcessingTime=[19.042], HttpClientSendRequestTime=[0.295],
2015-09-13 07:33:29,421 INFO [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1159)) - Opening 's3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000' for reading
2015-09-13 07:33:29,477 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[F698A6A43297754E], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[53.698], HttpRequestTime=[50.815], HttpClientReceiveResponseTime=[48.774], RequestSigningTime=[0.372], ResponseProcessingTime=[0.861], HttpClientSendRequestTime=[0.362],
2015-09-13 07:33:29,478 INFO [main] metrics.MetricsSaver (MetricsSaver.java:<init>(915)) - Thread 1 created MetricsLockFreeSaver 1
2015-09-13 07:33:29,479 INFO [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:retrievePair(292)) - Stream for key 'S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000' seeking to position '217260502'
2015-09-13 07:33:29,590 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[AD631A8AE229AFE7], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[109.859], HttpRequestTime=[109.204], HttpClientReceiveResponseTime=[58.468], RequestSigningTime=[0.286], ResponseProcessingTime=[0.133], HttpClientSendRequestTime=[0.327],
2015-09-13 07:33:29,753 INFO [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000 with recursive false
2015-09-13 07:33:29,877 INFO [main] hive.HiveContext (Logging.scala:logInfo(59)) - Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
2015-09-13 07:33:30,593 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-09-13 07:33:30,622 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(493)) - 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
2015-09-13 07:33:30,641 INFO [main] metastore.ObjectStore (ObjectStore.java:initialize(246)) - ObjectStore, initialize called
2015-09-13 07:33:30,782 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property datanucleus.cache.level2 unknown - will be ignored
2015-09-13 07:33:30,782 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2015-09-13 07:33:31,208 INFO [main] metastore.ObjectStore (ObjectStore.java:getPMF(315)) - Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
2015-09-13 07:33:32,375 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-09-13 07:33:32,376 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-09-13 07:33:32,470 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-09-13 07:33:32,470 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-09-13 07:33:32,558 INFO [main] DataNucleus.Query (Log4JLogger.java:info(77)) - Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
2015-09-13 07:33:32,561 INFO [main] metastore.ObjectStore (ObjectStore.java:setConf(229)) - Initialized ObjectStore
2015-09-13 07:33:32,816 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(551)) - Added admin role in metastore
2015-09-13 07:33:32,819 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(560)) - Added public role in metastore
2015-09-13 07:33:32,888 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:addAdminUsers(588)) - No user is added in admin role, since config is empty
2015-09-13 07:33:33,343 INFO [main] session.SessionState (SessionState.java:start(360)) - No Tez session required at this point. hive.execution.engine=mr.
ORCFile: org.apache.spark.sql.DataFrame = [h_header1: string, h_header2: string, h_header3: string, h_header4: string, h_header5: string, h_header6: string, h_header7: string, h_header8: string, h_header9: string, body: map<string,string>, yymmdd: int, country: string]

scala> ORCFile.head
2015-09-13 07:33:41,080 INFO [main] sources.DataSourceStrategy (Logging.scala:logInfo(59)) - Selected 1 partitions out of 1, pruned 0.0% partitions.
2015-09-13 07:33:41,169 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(243112) called with curMem=0, maxMem=280248975
2015-09-13 07:33:41,171 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 237.4 KB, free 267.0 MB)
2015-09-13 07:33:41,214 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(22100) called with curMem=243112, maxMem=280248975
2015-09-13 07:33:41,215 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 267.0 MB)
2015-09-13 07:33:41,216 INFO [sparkDriver-akka.actor.default-dispatcher-3] storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Added broadcast_0_piece0 in memory on 10.0.0.112:48218 (size: 21.6 KB, free: 267.2 MB)
2015-09-13 07:33:41,221 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 0 from head at <console>:22
2015-09-13 07:33:41,396 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(244448) called with curMem=265212, maxMem=280248975
2015-09-13 07:33:41,396 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_1 stored as values in memory (estimated size 238.7 KB, free 266.8 MB)
2015-09-13 07:33:41,422 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(22567) called with curMem=509660, maxMem=280248975
2015-09-13 07:33:41,422 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_1_piece0 stored as bytes in memory (estimated size 22.0 KB, free 266.8 MB)
2015-09-13 07:33:41,423 INFO [sparkDriver-akka.actor.default-dispatcher-3] storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Added broadcast_1_piece0 in memory on 10.0.0.112:48218 (size: 22.0 KB, free: 267.2 MB)
2015-09-13 07:33:41,426 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 1 from head at <console>:22
2015-09-13 07:33:41,495 INFO [main] log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
2015-09-13 07:33:41,497 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1049)) - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
2015-09-13 07:33:41,501 INFO [ORC_GET_SPLITS #0] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000 with recursive false
2015-09-13 07:33:41,504 INFO [ORC_GET_SPLITS #1] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1159)) - Opening 's3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000' for reading
2015-09-13 07:33:41,593 INFO [ORC_GET_SPLITS #1] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[8DFE404E45BFD9CD], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[88.129], HttpRequestTime=[86.932], HttpClientReceiveResponseTime=[42.613], RequestSigningTime=[0.539], ResponseProcessingTime=[0.142], HttpClientSendRequestTime=[0.337],
2015-09-13 07:33:41,594 INFO [ORC_GET_SPLITS #1] s3n.S3NativeFileSystem (S3NativeFileSystem.java:retrievePair(292)) - Stream for key 'S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000' seeking to position '217260502'
2015-09-13 07:33:41,674 INFO [ORC_GET_SPLITS #1] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[040D77B7E7E76AA5], ServiceEndpoint=[https://S3bucketName.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[79.608], HttpRequestTime=[79.064], HttpClientReceiveResponseTime=[36.843], RequestSigningTime=[0.222], ResponseProcessingTime=[0.11], HttpClientSendRequestTime=[0.343],
2015-09-13 07:33:41,681 ERROR [ORC_GET_SPLITS #1] orc.OrcInputFormat (OrcInputFormat.java:run(826)) - Unexpected Exception
java.lang.ArrayIndexOutOfBoundsException: 3
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.createSplit(OrcInputFormat.java:694)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:822)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:466)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:919)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:944)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.getPartitions(HadoopRDD.scala:375)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:121)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:125)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1269)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1203)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1210)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC.<init>(<console>:37)
    at $iwC.<init>(<console>:39)
    at <init>(<console>:41)
    at .<init>(<console>:45)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.createSplit(OrcInputFormat.java:694)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:822)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
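In case it helps anyone hitting the same thing before the patch lands: a quick way to check whether the failure is specific to the EMRFS/S3NativeFileSystem read path, rather than the ORC split generator itself, is to copy the file to HDFS and read it from there. This is only a sketch of an isolation test, not a confirmed workaround; all paths below are placeholders:

// From a shell on the master node, copy the ORC object from S3 into HDFS:
//   hadoop fs -mkdir -p hdfs:///tmp/orc-test/
//   hadoop fs -cp s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91844-2a87-4d8f-af9f-9268e34daef6-000000 hdfs:///tmp/orc-test/

// Then, in the spark-shell, read the same file through HDFS instead of s3n:
val localORC = sqlContext.read.format("orc")
  .load("hdfs:///tmp/orc-test/75e91844-2a87-4d8f-af9f-9268e34daef6-000000")

// If head succeeds here, the problem is in the S3 read path; if it still
// throws the same ArrayIndexOutOfBoundsException, the ORC split code is at fault.
localORC.head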