Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
Hello Asmath, we had a similar challenge recently. When you write back to Hive you are creating files on HDFS, and how many depends on your batch window. If you increase your batch window, let's say from 1 min to 5 mins, you will end up creating 5x fewer files. The other factor is your partitioning. F…
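A minimal sketch of the batch-window tuning described above, assuming a Scala Spark Streaming job (the app name is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // One batch every 5 minutes instead of every minute: each batch flushes
    // one set of files per Hive partition, so roughly 5x fewer, larger files.
    val conf = new SparkConf().setAppName("streaming-to-hive")
    val ssc  = new StreamingContext(conf, Minutes(5))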

Re: Orc predicate pushdown with Spark Sql

2017-10-27 Thread Siva Gudavalli
…ld not read all the content, which is probably also not happening. > On 24 Oct 2017, at 18:16, Siva Gudavalli <gudavalli.s...@yahoo.com.INVALID> wrote: >> Hello, I have an update here. Spark SQL is push…

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Siva Gudavalli
…xplain at :33

    == Physical Plan ==
    TakeOrderedAndProject(limit=10, orderBy=[id#192 DESC], output=[id#192])
    +- ConvertToSafe
       +- Project [id#192]
          +- Filter (usr#199 = AA0YP)
             +- HiveTableScan [id#192,usr#199], MetastoreRelation default, hlogsv5, None, [(cdt#189 = 20171003),(usrpartkey#191 = hhhUsers)]
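For reference, a sketch of how a plan like the one above can be produced; the table and column names are taken from the quoted plan, everything else is assumed:

    // Enable ORC predicate pushdown before running the query.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    val df = sqlContext.sql(
      """SELECT id FROM hlogsv5
        |WHERE cdt = 20171003 AND usrpartkey = 'hhhUsers' AND usr = 'AA0YP'
        |ORDER BY id DESC LIMIT 10""".stripMargin)
    df.explain() // prints the physical plan, as quoted above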

Orc predicate pushdown with Spark Sql

2017-10-23 Thread Siva Gudavalli
Hello, I am working with Spark SQL to query a Hive managed table (in ORC format). I have my data organized by partitions and was asked to set indexes for every 50,000 rows by setting ('orc.row.index.stride'='50000'). Let's say -> after evaluating the partition there are around 50 files in which data is…
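A hedged sketch of how that stride can be declared when the ORC table is created (table and column names are illustrative; requires a Hive-enabled SQLContext):

    sqlContext.sql("""
      CREATE TABLE logs_orc (id INT, usr STRING)
      PARTITIONED BY (cdt INT)
      STORED AS ORC
      TBLPROPERTIES ('orc.row.index.stride' = '50000')
    """)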

Partition and Sort by together

2017-10-12 Thread Siva Gudavalli
Hello, I have my data stored in Parquet file format. My data is already partitioned by date and key. Now I want the data in each file to be sorted by a new Code column.

    date1
      -> key1
           -> paqfile1
           -> paqfile2
      -> key2
           -> paqfile1
           -> paqfile2
    date2 …
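One way to get that layout (a sketch, assuming Spark 1.6+; column names and the output path are illustrative):

    import org.apache.spark.sql.functions.col

    // Keep the existing date/key partitioning on disk and sort the rows
    // inside each partition by the new code column before writing.
    df.repartition(col("date"), col("key"))
      .sortWithinPartitions(col("code"))
      .write
      .partitionBy("date", "key")
      .parquet("/data/out")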

Re: how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
…k > Java serialization. > On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli wrote: >> hello, i am writing a spark streaming application to read data from kafka. I am using no receiver approach and enabled checkpointing to make sure I am not rea…

how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
Hello, I am writing a Spark Streaming application to read data from Kafka. I am using the no-receiver (direct) approach and have enabled checkpointing to make sure I am not reading messages again in case of failure (exactly-once semantics). I have a quick question: how does checkpointing need to be configured to handle…
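The standard checkpoint-recovery pattern looks like the sketch below (paths and the batch interval are illustrative). Note the usual caveat: a checkpoint stores Java-serialized objects, so recovering it after deploying changed code can fail.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      // ... create the Kafka direct stream and wire up processing here ...
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()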

Spark sql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append

2015-11-24 Thread Siva Gudavalli
Ref: https://issues.apache.org/jira/browse/SPARK-11953. In Spark 1.3.1 we have 2 methods, i.e. createJDBCTable and insertIntoJDBC. They are replaced with write.jdbc() in Spark 1.4.1. createJDBCTable allows performing CREATE TABLE ... i.e. DDL on the table, followed by INSERT (DML). insertIntoJDBC…
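A sketch of the 1.4.1 replacement API (connection details are placeholders; roughly, SaveMode.Append corresponds to the old insertIntoJDBC, and creating or overwriting a table to createJDBCTable):

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "app_user")
    props.setProperty("password", "secret")

    df.write.mode(SaveMode.Append)
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "MY_TABLE", props)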

spark 1.4.1 to oracle 11g write to an existing table

2015-11-23 Thread Siva Gudavalli
Hi, I am trying to write a DataFrame from Spark 1.4.1 to Oracle 11g. I am using dataframe.write.mode(SaveMode.Append).jdbc(url, tablename, properties). This is always trying to create a table. I would like to insert records into an existing table instead of creating a new one each single time. Plea…
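One workaround sketch until the behaviour referenced in SPARK-11953 is addressed (not a confirmed fix; connection details and the table layout are illustrative): bypass DataFrameWriter and insert with plain JDBC from each partition, so Spark never attempts a CREATE TABLE.

    import java.sql.DriverManager

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"

    df.foreachPartition { rows =>
      val conn = DriverManager.getConnection(url, "app_user", "secret")
      val stmt = conn.prepareStatement("INSERT INTO MY_TABLE (ID, NAME) VALUES (?, ?)")
      try {
        rows.foreach { row =>
          stmt.setInt(1, row.getInt(0))
          stmt.setString(2, row.getString(1))
          stmt.addBatch()
        }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }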