Hi Community,
Can someone please help validate the idea below and suggest pros/cons?
Most of our jobs end up with a shuffle stage based on a partition column
value before writing into Parquet, and most of the time we have data
skewness in partitions.
Currently most of the problems happen at shu
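For reference, one common way to spread a skewed partition column over more shuffle tasks before a partitioned Parquet write is to repartition on the column plus a random salt. A minimal sketch, assuming Spark 2.x; the DataFrame df, the column part_col, and the output path are all illustrative:

import org.apache.spark.sql.functions.{col, rand}

// Add a salt so hot partition-column values are split across several tasks,
// then shuffle by (partition column, salt) and drop the salt before writing.
val salted = df
  .withColumn("salt", (rand() * 10).cast("int"))
  .repartition(col("part_col"), col("salt"))
  .drop("salt")

salted.write
  .partitionBy("part_col")      // the on-disk layout is still by the real column
  .parquet("/tmp/output")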
Hi All,
Could someone please advise on my issues below.
Below is the command I am using:
spark-submit --class AerospikeDynamicProtoMessageGenerator --master yarn
--deploy-mode cluster --num-executors 10 --conf
'spark.driver.extraJavaOptions=-verbose:class' --conf
'spark.executor.extraJavaOptions
Hi All,
Please help with the question below.
I am trying to build my own data source to connect to CustomAerospike.
Now I am almost done with everything, but still not sure how to implement
projection pushdown while selecting nested columns.
Spark does implicit column projection pushdown, but lo
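For top-level columns, the hook Spark calls in the DataSource V1 API is PrunedScan; nested struct sub-fields are not listed individually there, so the source has to prune them itself. A minimal sketch with hypothetical names (AerospikeRelation, the schema, and the scan body are placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.StructType

class AerospikeRelation(override val sqlContext: SQLContext,
                        override val schema: StructType)
    extends BaseRelation with PrunedScan {

  // Spark hands over only the top-level column names referenced by the query.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // A real implementation would push requiredColumns down to the Aerospike scan;
    // nested struct fields still arrive whole and must be pruned by the source.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}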
Hi All,
Can anyone help me with the error below?
Exception in thread "main" java.lang.AbstractMethodError
at
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
at org.apache.spark.sql.types.StructType.filterNot(StructType.scala:98)
at org.apache.spark.sql.DataFrameReader.json
Hi All,
How to iterate over the StructFields inside "after"?
StructType(StructField(after,StructType(StructField(Alarmed,LongType,true),
StructField(CallDollarLimit,StringType,true),
StructField(CallRecordWav,StringType,true),
StructField(CallTimeLimit,LongType,true),
StructField(Signature,Stri
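A minimal sketch of walking the nested fields, assuming a DataFrame df whose schema contains the struct column "after" shown above:

import org.apache.spark.sql.types.StructType

// Pull the nested StructType out of the top-level schema and iterate its fields.
val afterStruct = df.schema("after").dataType.asInstanceOf[StructType]

afterStruct.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType}")
}

// Individual nested columns can then be selected as e.g. col("after.Alarmed").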
Hi All,
I am working on a real-time reporting project and I have a question about a
structured streaming job that is going to stream a particular table's
records and will have to join them to an existing table.
Stream -> query/join to another DF/DS -> update the stream data record.
Now I have a problem
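A minimal sketch of a stream-static join in Structured Streaming, assuming Spark 2.x; the Kafka topic, the lookup path, and the join key "id" are all illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Static table that the stream should be enriched against (placeholder path).
val staticDf = spark.read.parquet("/data/lookup")

// Streaming source; here the Kafka value is assumed to carry just the join key.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS id")

// Each micro-batch of the stream is joined against the static DataFrame.
val enriched = streamDf.join(staticDf, Seq("id"), "left_outer")

val query = enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()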
Hi All,
I would like to infer the JSON schema from a sample of data that I receive
from Kafka (a specific topic), and I have to infer the schema because I am
going to receive arbitrary JSON strings with a different schema for each topic,
so I chose to go ahead with the steps below,
a. readStream from Kafka(la
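One way to get a schema without hard-coding it is to read a small batch sample from the topic first and infer the schema from that, then apply it to the stream. A minimal sketch, assuming Spark 2.2+ (for spark.read.json on a Dataset[String]); servers and topic name are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().appName("infer-json-schema").getOrCreate()
import spark.implicits._

// 1. Batch-read a sample of the topic and infer a schema from it.
val sample = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "my-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .as[String]

val inferredSchema = spark.read.json(sample).schema

// 2. Apply the inferred schema to the streaming read of the same topic.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "my-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", inferredSchema).as("data"))
  .select("data.*")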
Hi All,
Is there a way to convert an RDD[InternalRow] to a Dataset from outside the
spark sql package?
Regards,
Satyajit.
e with id key.
I would like to automate this process for all keys in the JSON, as I am
going to get a dynamically generated JSON schema.
On Wed, Dec 6, 2017 at 4:37 PM, ayan guha wrote:
>
> On Thu, 7 Dec 2017 at 11:37 am, ayan guha wrote:
>
>> You can use get_json function
>>
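The function the reply presumably refers to is get_json_object in org.apache.spark.sql.functions. A minimal sketch of pulling several keys out of a JSON string column; df, its "json" column, and the key names are illustrative:

import org.apache.spark.sql.functions.get_json_object

// Extract one new column per key from the raw JSON string column.
val keys = Seq("id", "name", "age")

val extracted = keys.foldLeft(df) { (acc, k) =>
  acc.withColumn(k, get_json_object(acc("json"), s"$$.$k"))
}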
Does Spark support automatic detection of the schema from a JSON string in a
dataframe?
I am trying to parse a JSON string and do some transformations on it
(I would like to append new columns to the dataframe), from the data I stream
from Kafka.
But I am not very sure how I can parse the JSON in str
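If Spark 2.4+ is available, one option is to infer the schema from a single sample string with schema_of_json and then parse the column with from_json; df, its "json" column, and the spark session are illustrative here, and the sample must be representative of the rest of the data:

import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}
import spark.implicits._

// Take one sample JSON string, infer its schema, and parse the whole column with it.
val sample = df.select(col("json")).as[String].head()

val parsed = df.withColumn("parsed", from_json(col("json"), schema_of_json(lit(sample))))

// Nested fields are then reachable as e.g. col("parsed.id") and can be
// appended as ordinary columns with withColumn.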
e:
> Did you follow the guide in `IDE Setup` -> `IntelliJ` section of
> http://spark.apache.org/developer-tools.html ?
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>> Hi All,
>>
>>
Hi All,
When I try to build the Apache Spark source code from
https://github.com/apache/spark.git, I am getting the errors below,
Error:(9, 14) EventBatch is already defined as object EventBatch
public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
implements org.apache.avro
Hi All,
I am trying to build the kafka-0-10-sql module under the external folder in
the Apache Spark source code.
Once I generate the jar file using
build/mvn package -DskipTests -pl external/kafka-0-10-sql
I get a jar file created under external/kafka-0-10-sql/target,
and try to run spark-shell with the jars create
Hi,
val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd
def comp(zipcode: String): Unit = {
  val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
  val data = spark.sql(zipval)
  data.write.parquet(
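For what the snippet appears to be doing (one Parquet output per zip code), a single partitioned write is usually simpler and avoids the per-value loop; a minimal sketch, with placeholder input/output paths:

// Write the whole DataFrame once, producing one directory per zip_code value.
val df = spark.read.parquet("/path/to/input")   // placeholder path

df.write
  .partitionBy("zip_code")
  .mode("overwrite")
  .parquet("/path/to/output")                   // placeholder path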
Hi All,
PFB sample code,
val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd
def comp(zipcode: String): Unit = {
  val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
  val data = spark.sql(zip
Hi Liang,
The problem is that when I take a huge data set, I get a matrix of size
1616160 * 1616160.
PFB code,
val exact = mat.columnSimilarities(0.5)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
case class output(label1: Long, label2: Long, score: Double)
val fin
Hi All,
I am trying to implement an MLlib Spark job to find the similarity between
documents (which in my case are basically home addresses).
I believe I cannot use DIMSUM for my use case, as DIMSUM works well only
with matrices that have thin (few) columns and many rows.
matrix example format, for my us
Hi All,
PFB code.
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by satyajit
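For context, a rough sketch of how these imports typically fit together for address similarity (tokenize, hash to term frequencies, weight with IDF, compare with DIMSUM); the input DataFrame addresses and its columns are hypothetical:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tokenize the address text, hash to term frequencies, and weight with IDF.
val tokens   = new Tokenizer().setInputCol("address").setOutputCol("words").transform(addresses)
val tf       = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 18).transform(tokens)
val idfModel = new IDF().setInputCol("tf").setOutputCol("features").fit(tf)
val features = idfModel.transform(tf)

// RowMatrix is an mllib type, so the ml vectors have to be converted first.
val rows = features.select("features").rdd.map { r =>
  Vectors.fromML(r.getAs[MLVector]("features"))
}
val mat = new RowMatrix(rows)

// Note: columnSimilarities compares COLUMNS (here the hashed terms), not rows,
// which is why DIMSUM gets awkward when every document needs to be compared.
val sims = mat.columnSimilarities(0.5)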
>
>
> I am trying to compile code using Maven, which was working with Spark
> 1.6.2, but when I try Spark 2.0.0 I get the error below,
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on
> project
Hi All,
I am trying to compile code using Maven, which was working with Spark
1.6.2, but when I try Spark 2.0.0 I get the error below,
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on
project Nginx
Hi All,
I am trying to run a Spark job on YARN, and I specify an --executor-cores
value of 20.
But when I check the "nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each node in the cluster,
but can only see 1 vcore getting assigne
Hi All,
I have been trying to access tables from schemas other than default,
to pull data into a dataframe.
I was successful in doing it using the default schema in the Hive database,
but when I try any other schema/database in Hive, I am getting the error
below. (Have also not seen any examples r
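A minimal sketch of the usual ways to reach a non-default Hive database, assuming a Hive-enabled context and an illustrative database "sales" with table "orders":

// Fully qualify the table with its database name...
val df1 = sqlContext.sql("SELECT * FROM sales.orders")

// ...or switch the current database first and use the bare table name.
sqlContext.sql("USE sales")
val df2 = sqlContext.table("orders")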
Hi All,
I have written a Spark program on my dev box,
IDE: IntelliJ
Scala version: 2.11.7
Spark version: 1.6.1
It runs fine from the IDE, by providing proper input and output paths
including the master.
But when I try to deploy the code on my cluster, made up of the below,
Spark version:
Hi,
Scala version: 2.11.7 (had to upgrade the Scala version to enable case
classes to accept more than 22 parameters.)
Spark version: 1.6.1.
PFB pom.xml.
Getting the error below when trying to set up Spark in the IntelliJ IDE,
16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
Exception
Hi,
I am trying to create a separate val reference to the object DATA (as shown
below),
case class data(name: String, age: String)
Creation of this object is done separately, and the reference to the object
is stored in val data.
I use val samplerdd = sc.parallelize(Seq(data)) to create the RDD.
org.apa
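For reference, a minimal, self-contained version of this pattern that avoids the ambiguity between the case class name and the val holding an instance; the class and values are illustrative:

// Define the case class at the top level (not inside a method), then
// parallelize concrete instances of it.
case class Data(name: String, age: String)

val instances = Seq(Data("alice", "30"), Data("bob", "42"))
val samplerdd = sc.parallelize(instances)

// With SQL implicits in scope the RDD can also be turned into a DataFrame:
// import sqlContext.implicits._
// val sampledf = samplerdd.toDF()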
Hi All,
I am trying to run the HdfsWordCount example from GitHub:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
I am using Ubuntu to run the program, but don't see any data getting printed
after,
--
Hi,
I am new to using Spark and Parquet files.
Below is what I am trying to do, in spark-shell,
val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet")
Have also tried the command below,
val df = sqlContext.read.format("parquet").load("/data/LM/Parquet/Segment/pages/