Maximum limit for akka.frame.size be greater than 500 MB ?

2017-01-29 Thread aravasai
I have a Spark job running on 2 terabytes of data which creates more than
30,000 partitions. As a result, the job fails with the error
"Map output statuses were 170415722 bytes which exceeds spark.akka.frameSize
52428800 bytes" (for 1 TB of data).
However, when I increase the akka.frame.size to around 500 MB, the job hangs
with no further progress.

So, what is the ideal or maximum value that I can assign to akka.frame.size so
that I do not get the error about map output statuses exceeding the limit for
large chunks of data?

Is coalescing the data into a smaller number of partitions the only solution
to this problem? Is there a better way than coalescing many intermediate
RDDs in the program?

My driver memory: 10G
Executor memory: 36G 
Executor memory overhead : 3G
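
For reference, this is roughly how I am passing the setting (a minimal sketch,
assuming Spark 1.x, where spark.akka.frameSize is given in MB; the app name and
value shown are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the Akka frame size on a Spark 1.x job (value in MB, illustrative)
val conf = new SparkConf()
  .setAppName("UserStatsJob")            // hypothetical application name
  .set("spark.akka.frameSize", "512")    // roughly the ~500 MB I experimented with
val sc = new SparkContext(conf)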










Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
Okay, so you are saying that 2.0.0 does not have that patch? (cc @dev)
I don't like having to change service versions every time!

Thanks.

On Mon, Jan 30, 2017 at 1:10 AM, Jacek Laskowski  wrote:

> Hi,
>
> I think you have to upgrade to 2.1.0. There have been a few changes related
> to that error since 2.0.0.
>
> Jacek
>
>
> On 29 Jan 2017 9:24 a.m., "Chetan Khatri" 
> wrote:
>
> Hello Spark Users,
>
> I am getting an error while saving a Spark DataFrame to a Hive table:
> Hive 1.2.1
> Spark 2.0.0
> Local environment.
> Note: the job executes successfully and produces the output I want, but the
> exception below is still raised.
> *Source Code:*
>
> package com.chetan.poc.hbase
>
> /**
>   * Created by chetan on 24/1/17.
>   */
> import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.KeyValue.Type
> import org.apache.spark.sql.SparkSession
> import scala.collection.JavaConverters._
> import java.util.Date
> import java.text.SimpleDateFormat
>
>
> object IncrementalJob {
> val APP_NAME: String = "SparkHbaseJob"
> var HBASE_DB_HOST: String = null
> var HBASE_TABLE: String = null
> var HBASE_COLUMN_FAMILY: String = null
> var HIVE_DATA_WAREHOUSE: String = null
> var HIVE_TABLE_NAME: String = null
>   def main(args: Array[String]) {
> // Initializing HBASE Configuration variables
> HBASE_DB_HOST="127.0.0.1"
> HBASE_TABLE="university"
> HBASE_COLUMN_FAMILY="emp"
> // Initializing Hive Metastore configuration
> HIVE_DATA_WAREHOUSE = "/usr/local/hive/warehouse"
> // Initializing Hive table name - Target table
> HIVE_TABLE_NAME = "employees"
> // setting spark application
> // val sparkConf = new SparkConf().setAppName(APP_NAME).setMaster("local")
> //initialize the spark context
> //val sparkContext = new SparkContext(sparkConf)
> //val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
> // Enable Hive with Hive warehouse in SparkSession
> val spark = 
> SparkSession.builder().appName(APP_NAME).config("hive.metastore.warehouse.dir",
>  HIVE_DATA_WAREHOUSE).config("spark.sql.warehouse.dir", 
> HIVE_DATA_WAREHOUSE).enableHiveSupport().getOrCreate()
> import spark.implicits._
> import spark.sql
>
> val conf = HBaseConfiguration.create()
> conf.set(TableInputFormat.INPUT_TABLE, HBASE_TABLE)
> conf.set(TableInputFormat.SCAN_COLUMNS, HBASE_COLUMN_FAMILY)
> // Load an RDD of rowkey, result(ImmutableBytesWritable, Result) tuples 
> from the table
> val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, 
> classOf[TableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
>
> println(hBaseRDD.count())
> //hBaseRDD.foreach(println)
>
> // keyValue is an RDD[java.util.List[hbase.KeyValue]]
> val keyValue = hBaseRDD.map(x => x._2).map(_.list)
>
> // outPut is a DataFrame of HBaseResult rows, one row per HBase cell
> val outPut = keyValue.flatMap(x =>  x.asScala.map(cell =>
>
>   HBaseResult(
> Bytes.toInt(CellUtil.cloneRow(cell)),
> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
> cell.getTimestamp,
> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new 
> Date(cell.getTimestamp.toLong)),
> Bytes.toStringBinary(CellUtil.cloneValue(cell)),
> Type.codeToType(cell.getTypeByte).toString
> )
>   )
> ).toDF()
> // Output dataframe
> outPut.show
>
> // get timestamp
> val datetimestamp_threshold = "2016-08-25 14:27:02:001"
> val datetimestampformat = new SimpleDateFormat("yyyy-MM-dd 
> HH:mm:ss:SSS").parse(datetimestamp_threshold).getTime()
>
> // Result set filtering based on timestamp
> val filtered_output_timestamp = outPut.filter($"colDatetime" >= 
> datetimestampformat)
> // Result set filtering based on rowkey
> val filtered_output_row = 
> outPut.filter($"colDatetime".between(1668493360,1668493365))
>
>
> // Saving Dataframe to Hive Table Successfully.
> 
> filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)
>   }
>   case class HBaseResult(rowkey: Int, colFamily: String, colQualifier: 
> String, colDatetime: Long, colDatetimeStr: String, colValue: String, colType: 
> String)
> }
>
>
> Error:
>
> 17/01/29 13:51:53 INFO metastore.HiveMetaStore: 0: create_database: 
> Database(name:default, description:default database, 
> locationUri:hdfs://localhost:9000/usr/local/hive/warehouse, parameters:{})
> 17/01/29 13:51:53 INFO HiveMetaStore.audit: ugi=hduser
> ip=unknown-ip-addr  cmd=create_database: Database(name:default, 
> description:default database, 
> locationUri:hdfs://localhost:9000/usr/local/hive/warehouse, parameters:{})
> 17/01/29 13:51:53 ERROR metastore.Retrying

Re: Maximum limit for akka.frame.size be greater than 500 MB ?

2017-01-29 Thread Jörn Franke
Which Spark version are you using? What exactly are you trying to do, and what
is the input data? As far as I know, Akka has been dropped in recent Spark
versions.
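In those recent versions, the roughly corresponding setting is
spark.rpc.message.maxSize (in MiB). A minimal sketch of setting it, assuming
Spark 2.x, with an illustrative value and app name:

import org.apache.spark.sql.SparkSession

// Sketch only: rough Spark 2.x counterpart of spark.akka.frameSize (value in MiB, illustrative)
val spark = SparkSession.builder()
  .appName("UserStatsJob")                      // hypothetical application name
  .config("spark.rpc.message.maxSize", "512")
  .getOrCreate()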

> On 30 Jan 2017, at 00:44, aravasai  wrote:
> 
> I have a Spark job running on 2 terabytes of data which creates more than
> 30,000 partitions. As a result, the job fails with the error
> "Map output statuses were 170415722 bytes which exceeds spark.akka.frameSize
> 52428800 bytes" (for 1 TB of data).
> However, when I increase the akka.frame.size to around 500 MB, the job hangs
> with no further progress.
> 
> So, what is the ideal or maximum value that I can assign to akka.frame.size so
> that I do not get the error about map output statuses exceeding the limit for
> large chunks of data?
> 
> Is coalescing the data into a smaller number of partitions the only solution
> to this problem? Is there a better way than coalescing many intermediate
> RDDs in the program?
> 
> My driver memory: 10G
> Executor memory: 36G 
> Executor memory overhead : 3G




Re: Maximum limit for akka.frame.size be greater than 500 MB ?

2017-01-29 Thread aravasai
Currently, I am using version 1.6.1. I continue to use it because my current
code relies heavily on RDDs rather than DataFrames, and because 1.6.1 has been
more stable for me than newer versions.

The input data is user behavior data with 20 fields and about 1 billion
records (~1.5 TB). I am trying to group by user id and calculate some per-user
statistics, but I guess the number of map tasks is too high, resulting in the
akka.frame.size error (a rough sketch of the aggregation is included below).

1) Does akka.frame.size have to be increased in proportion to the size of the
data, which indirectly affects the number of partitions?
2) Or does the huge number of mappers in the job (which may be unavoidable)
cause the frame-size error?
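
The aggregation, as a minimal sketch with a hypothetical record type and field
names (assuming an RDD[Event] of per-event records, Spark 1.6 RDD API):

import org.apache.spark.rdd.RDD

// Sketch only: hypothetical record type and field names
case class Event(userId: Long, value: Double)

def perUserStats(events: RDD[Event]): RDD[(Long, Double)] =
  events
    .map(e => (e.userId, (e.value, 1L)))
    // reduceByKey combines map-side, and the explicit partition count caps the number
    // of reducers, which should keep the map output status data sent to the driver smaller
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2), 2000)  // 2000 partitions: illustrative
    .mapValues { case (sum, count) => sum / count }           // e.g. mean value per user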

On Sun, Jan 29, 2017 at 11:15 PM, Jörn Franke [via Apache Spark Developers
List]  wrote:

> Which Spark version are you using? What are you trying to do exactly and
> what is the input data? As far as I know, akka has been dropped in recent
> Spark versions.



