Hi Community,
Can someone please help validate the idea below and suggest pros/cons?
Most of our jobs end up with a shuffle stage based on a partition column
value before writing into Parquet, and most of the time we have data
skewness in partitions.
Currently most of the problems happen at shu
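For reference, one common way to spread a skewed partition column over more shuffle tasks before a partitioned Parquet write is to repartition on the column plus a random salt. A minimal sketch, assuming Spark 2.x; the DataFrame df, the column part_col, and the output path are all illustrative:

import org.apache.spark.sql.functions.{col, rand}

// Add a salt so hot partition-column values are split across several tasks,
// then shuffle by (partition column, salt) and drop the salt before writing.
val salted = df
  .withColumn("salt", (rand() * 10).cast("int"))
  .repartition(col("part_col"), col("salt"))
  .drop("salt")

salted.write
  .partitionBy("part_col")      // the on-disk layout is still by the real column
  .parquet("/tmp/output")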
Hi All,
Could someone please advise on my issues below.
Below is the command I am using:
spark-submit --class AerospikeDynamicProtoMessageGenerator --master yarn
--deploy-mode cluster --num-executors 10 --conf
'spark.driver.extraJavaOptions=-verbose:class' --conf
'spark.executor.extraJavaOptions
Hi All,
Please help with the question below.
I am trying to build my own data source to connect to CustomAerospike.
Now I am almost done with everything, but still not sure how to implement
projection pushdown while selecting nested columns.
Spark does implicit column projection pushdown, but lo
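For top-level columns, the hook Spark calls in the DataSource V1 API is PrunedScan; nested struct sub-fields are not listed individually there, so the source has to prune them itself. A minimal sketch with hypothetical names (AerospikeRelation, the schema, and the scan body are placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.StructType

class AerospikeRelation(override val sqlContext: SQLContext,
                        override val schema: StructType)
    extends BaseRelation with PrunedScan {

  // Spark hands over only the top-level column names referenced by the query.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // A real implementation would push requiredColumns down to the Aerospike scan;
    // nested struct fields still arrive whole and must be pruned by the source.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}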
Hi All,
Can anyone help me with the error below?
Exception in thread "main" java.lang.AbstractMethodError
at
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
at org.apache.spark.sql.types.StructType.filterNot(StructType.scala:98)
at org.apache.spark.sql.DataFrameReader.json
Hi All,
How to iterate over the StructFields inside "after"?
StructType(StructField(after,StructType(StructField(Alarmed,LongType,true),
StructField(CallDollarLimit,StringType,true),
StructField(CallRecordWav,StringType,true),
StructField(CallTimeLimit,LongType,true),
StructField(Signature,Stri
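A minimal sketch of walking the nested fields, assuming a DataFrame df whose schema contains the struct column "after" shown above:

import org.apache.spark.sql.types.StructType

// Pull the nested StructType out of the top-level schema and iterate its fields.
val afterStruct = df.schema("after").dataType.asInstanceOf[StructType]

afterStruct.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType}")
}

// Individual nested columns can then be selected as e.g. col("after.Alarmed").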
Hi All,
I am working on a real-time reporting project and I have a question about a
structured streaming job that is going to stream a particular table's
records and will have to join them to an existing table.
Stream -> query/join to another DF/DS -> update the stream data record.
Now I have a problem
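A minimal sketch of a stream-static join in Structured Streaming, assuming Spark 2.x; the Kafka topic, the lookup path, and the join key "id" are all illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Static table that the stream should be enriched against (placeholder path).
val staticDf = spark.read.parquet("/data/lookup")

// Streaming source; here the Kafka value is assumed to carry just the join key.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS id")

// Each micro-batch of the stream is joined against the static DataFrame.
val enriched = streamDf.join(staticDf, Seq("id"), "left_outer")

val query = enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()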
Hi All,
I would like to infer the JSON schema from a sample of data that I receive
from Kafka (a specific topic), and I have to infer the schema because I am
going to receive arbitrary JSON strings with a different schema for each topic,
so I chose to go ahead with the steps below,
a. readStream from Kafka(la
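One way to get a schema without hard-coding it is to read a small batch sample from the topic first and infer the schema from that, then apply it to the stream. A minimal sketch, assuming Spark 2.2+ (for spark.read.json on a Dataset[String]); servers and topic name are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().appName("infer-json-schema").getOrCreate()
import spark.implicits._

// 1. Batch-read a sample of the topic and infer a schema from it.
val sample = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "my-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .as[String]

val inferredSchema = spark.read.json(sample).schema

// 2. Apply the inferred schema to the streaming read of the same topic.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "my-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", inferredSchema).as("data"))
  .select("data.*")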
Hi All,
Is there a way to convert an RDD[InternalRow] to a Dataset from outside the
spark sql package?
Regards,
Satyajit.
e with id key.
I would like to automate this process for all keys in the JSON, as I am
going to get a dynamically generated JSON schema.
On Wed, Dec 6, 2017 at 4:37 PM, ayan guha wrote:
>
> On Thu, 7 Dec 2017 at 11:37 am, ayan guha wrote:
>
>> You can use get_json function
>>
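The function the reply presumably refers to is get_json_object in org.apache.spark.sql.functions. A minimal sketch of pulling several keys out of a JSON string column; df, its "json" column, and the key names are illustrative:

import org.apache.spark.sql.functions.get_json_object

// Extract one new column per key from the raw JSON string column.
val keys = Seq("id", "name", "age")

val extracted = keys.foldLeft(df) { (acc, k) =>
  acc.withColumn(k, get_json_object(acc("json"), s"$$.$k"))
}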
Does Spark support automatic detection of the schema from a JSON string in a
dataframe?
I am trying to parse a JSON string and do some transformations on it
(I would like to append new columns to the dataframe), from the data I stream
from Kafka.
But I am not very sure how I can parse the JSON in str
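If Spark 2.4+ is available, one option is to infer the schema from a single sample string with schema_of_json and then parse the column with from_json; df, its "json" column, and the spark session are illustrative here, and the sample must be representative of the rest of the data:

import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}
import spark.implicits._

// Take one sample JSON string, infer its schema, and parse the whole column with it.
val sample = df.select(col("json")).as[String].head()

val parsed = df.withColumn("parsed", from_json(col("json"), schema_of_json(lit(sample))))

// Nested fields are then reachable as e.g. col("parsed.id") and can be
// appended as ordinary columns with withColumn.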
e:
> Did you follow the guide in `IDE Setup` -> `IntelliJ` section of
> http://spark.apache.org/developer-tools.html ?
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>> Hi All,
>>
>>
Hi All,
When I try to build the Apache Spark source code from
https://github.com/apache/spark.git, I am getting the errors below,
Error:(9, 14) EventBatch is already defined as object EventBatch
public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
implements org.apache.avro
Hi All,
I am trying to build the kafka-0-10-sql module under the external folder in
the Apache Spark source code.
Once I generate the jar file using
build/mvn package -DskipTests -pl external/kafka-0-10-sql
I get a jar file created under external/kafka-0-10-sql/target,
and try to run spark-shell with the jars create
Hi,
val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd
def comp(zipcode: String): Unit = {
  val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
  val data = spark.sql(zipval)
  data.write.parquet(
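For what the snippet appears to be doing (one Parquet output per zip code), a single partitioned write is usually simpler and avoids the per-value loop; a minimal sketch, with placeholder input/output paths:

// Write the whole DataFrame once, producing one directory per zip_code value.
val df = spark.read.parquet("/path/to/input")   // placeholder path

df.write
  .partitionBy("zip_code")
  .mode("overwrite")
  .parquet("/path/to/output")                   // placeholder path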
Hi All,
PFB sample code,
val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd
def comp(zipcode: String): Unit = {
  val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
  val data = spark.sql(zip
Hi Liang,
The problem is that when I take a huge data set, I get a matrix of size
1616160 * 1616160.
PFB code,
val exact = mat.columnSimilarities(0.5)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
case class output(label1: Long, label2: Long, score: Double)
val fin
Hi All,
I am trying to implement an MLlib Spark job to find the similarity between
documents (which in my case are basically home addresses).
I believe I cannot use DIMSUM for my use case, as DIMSUM works well only
with matrices that have thin (few) columns and many rows.
matrix example format, for my us
Hi All,
PFB code.
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by satyajit
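For context, a rough sketch of how these imports typically fit together for address similarity (tokenize, hash to term frequencies, weight with IDF, compare with DIMSUM); the input DataFrame addresses and its columns are hypothetical:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tokenize the address text, hash to term frequencies, and weight with IDF.
val tokens   = new Tokenizer().setInputCol("address").setOutputCol("words").transform(addresses)
val tf       = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 18).transform(tokens)
val idfModel = new IDF().setInputCol("tf").setOutputCol("features").fit(tf)
val features = idfModel.transform(tf)

// RowMatrix is an mllib type, so the ml vectors have to be converted first.
val rows = features.select("features").rdd.map { r =>
  Vectors.fromML(r.getAs[MLVector]("features"))
}
val mat = new RowMatrix(rows)

// Note: columnSimilarities compares COLUMNS (here the hashed terms), not rows,
// which is why DIMSUM gets awkward when every document needs to be compared.
val sims = mat.columnSimilarities(0.5)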
>
>
> I am trying to compile code using Maven, which was working with Spark
> 1.6.2, but when I try Spark 2.0.0 I get the error below,
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on
> project
Hi All,
I am trying to compile code using Maven, which was working with Spark
1.6.2, but when I try Spark 2.0.0 I get the error below,
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on
project Nginx
Hi All,
I am trying to run a Spark job on YARN, and I specify an --executor-cores
value of 20.
But when I check the "nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each node in the cluster,
but can only see 1 vcore getting assigne
Hi All,
I have been trying to access tables from schemas other than default,
to pull data into a dataframe.
I was successful in doing it using the default schema in the Hive database,
but when I try any other schema/database in Hive, I am getting the error
below. (Have also not seen any examples r
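A minimal sketch of the usual ways to reach a non-default Hive database, assuming a Hive-enabled context and an illustrative database "sales" with table "orders":

// Fully qualify the table with its database name...
val df1 = sqlContext.sql("SELECT * FROM sales.orders")

// ...or switch the current database first and use the bare table name.
sqlContext.sql("USE sales")
val df2 = sqlContext.table("orders")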
Hi All,
I have written a Spark program on my dev box,
IDE: IntelliJ
Scala version: 2.11.7
Spark version: 1.6.1
It runs fine from the IDE, by providing proper input and output paths
including the master.
But when I try to deploy the code on my cluster, made up of the below,
Spark version:
Hi,
Scala version: 2.11.7 (had to upgrade the Scala version to enable case
classes to accept more than 22 parameters.)
Spark version: 1.6.1.
PFB pom.xml.
Getting the error below when trying to set up Spark in the IntelliJ IDE,
16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
Exception
Hi,
I am trying to create a separate val reference to the object DATA (as shown
below),
case class data(name: String, age: String)
Creation of this object is done separately, and the reference to the object
is stored in val data.
I use val samplerdd = sc.parallelize(Seq(data)) to create the RDD.
org.apa
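For reference, a minimal, self-contained version of this pattern that avoids the ambiguity between the case class name and the val holding an instance; the class and values are illustrative:

// Define the case class at the top level (not inside a method), then
// parallelize concrete instances of it.
case class Data(name: String, age: String)

val instances = Seq(Data("alice", "30"), Data("bob", "42"))
val samplerdd = sc.parallelize(instances)

// With SQL implicits in scope the RDD can also be turned into a DataFrame:
// import sqlContext.implicits._
// val sampledf = samplerdd.toDF()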
Hi All,
I am trying to run the HdfsWordCount example from GitHub:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
I am using Ubuntu to run the program, but don't see any data getting printed
after,
--
Hi,
I am new to using Spark and Parquet files.
Below is what I am trying to do, in spark-shell,
val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet")
Have also tried the command below,
val df = sqlContext.read.format("parquet").load("/data/LM/Parquet/Segment/pages/