Somewhat related, though this JIRA is on 1.6.
https://issues.apache.org/jira/browse/SPARK-13288#
Hello,
I am learning SparkR by myself and have little computing background.
I am following the examples on
http://spark.apache.org/docs/latest/sparkr.html
and running
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
people <- read.df(sqlC
terminal_type = 0: 260,000,000 rows, covering almost half of the whole
data. terminal_type = 25066: just 3,800 rows.
orc
tblproperties("orc.compress"="SNAPPY","orc.compress.size"="262141","orc.stripe.size"="268435456","orc.row.index.stride"="10","orc.create.index"="true","orc.bloom.filter.columns"
In Spark SQL, a timestamp is the number of microseconds since the epoch, so
it has nothing to do with the timezone.
When you compare it against unix_timestamp or a string, it's better to
convert these into timestamps and then compare them.
In your case, the where clause should be:
where (created > cast('{0}' as timesta
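A minimal sketch of that pattern in Scala (the DataFrame `df`, the `created` column, and the literal value are illustrative assumptions):

    import org.apache.spark.sql.functions.lit

    // Compare a timestamp column against a timestamp, not a raw string:
    // cast the string literal to timestamp before the comparison.
    val recent = df.where(df("created") > lit("2016-03-17 00:00:00").cast("timestamp"))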
Hi all,
I had a couple of questions.
1. Is there documentation on how to add the graphframes or any other
package for that matter on the google dataproc managed spark clusters ?
2. Is there a way to add a package to an existing pyspark context through a
jupyter notebook ?
--aj
I am using the following code snippet in Scala:
val dict: RDD[String] = sc.textFile("path/to/csv/file")
val dict_broadcast = sc.broadcast(dict.collectAsMap())
On compiling, it generates this error:
scala:42: value collectAsMap is not a member of
org.apache.spark.rdd.RDD[String]
val dict_broad
Hi,
I am dynamically doing union all and adding new column too
val dfresult =
> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
> val schemaL = dfresult.schema
> var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL)
> for ((key,values) <- lcrMap) {
bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Do you mind showing more of your code involving the map() ?
On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:
> Hello,
> I found a strange behavior after executing a prediction with MLlib.
> My code
Please refrain from posting such messages on this email thread.
This is specific to the Spark ecosystem and not an avenue to advertise an
entity/company.
Thank you.
-
Neelesh S. Salian
Cloudera
One way is to rename the columns using toDF.
For eg:
val df = Seq((1, 2),(1,4),(2,3) ).toDF("a","b")
df.printSchema()
val renamedf = df.groupBy('a).agg(sum('b)).toDF("mycola", "mycolb")
renamedf.printSchema()
Best regards,
Sunitha
> On Mar 18, 2016, at 9:10 AM, andres.fernan...@wellsfargo.c
> But I guess I cannot add a package once i launch the pyspark context right ?
Correct. Potentially, if you really really wanted to, you could maybe
(with lots of pain) load packages dynamically with some class-loader
black magic, but Spark does not provide that functionality.
On Thu, Mar 17, 201
There's 1 topic per partition, so you're probably better off dealing
with topics that way rather than at the individual message level.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
Look at the discussion of "HasOffsetRanges"
If you r
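A short sketch of that pattern (assuming `stream` is a direct Kafka stream created with KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // Each RDD partition maps to exactly one Kafka topic/partition
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
      }
    }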
It turned out that Col1 appeared twice in the select :-)
> On Mar 16, 2016, at 7:29 PM, Divya Gehlot wrote:
>
> Hi,
> I am dynamically doing union all and adding new column too
>
>> val dfresult =
>> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
>> val schem
Hi,
I tried to replicate the example of joining DStream with lookup RDD from
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation.
It works fine, but when I enable checkpointing for the StreamingContext
and let the application recover from a previously cr
OK. I did take a look at them. So once I have an accumulator for a HashSet,
how can I check if a particular key is already present in the HashSet
accumulator? I don't see any .contains method there. My requirement is that
I need to keep accumulating the keys in the HashSet across all the tasks in
v
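For what it's worth, a Spark 1.x set-valued accumulator can be sketched with AccumulableParam as below; note, however, that accumulator values are only reliably readable on the driver, so a contains() check from inside running tasks is not something accumulators support. (This is an illustrative sketch, not the poster's code; `sc` and an RDD[String] of keys are assumed.)

    import scala.collection.mutable
    import org.apache.spark.AccumulableParam

    // Accumulates String keys into a mutable.HashSet[String]
    object HashSetParam extends AccumulableParam[mutable.HashSet[String], String] {
      def addAccumulator(acc: mutable.HashSet[String], elem: String) = acc += elem
      def addInPlace(a: mutable.HashSet[String], b: mutable.HashSet[String]) = a ++= b
      def zero(initial: mutable.HashSet[String]) = mutable.HashSet.empty[String]
    }

    val keysAcc = sc.accumulable(mutable.HashSet.empty[String])(HashSetParam)
    keysRdd.foreach(k => keysAcc += k)   // keysRdd: RDD[String], accumulated across all tasks
    val seen = keysAcc.value             // driver side only
    val present = seen.contains("someKey")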
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-
pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
which starts and gives me a REPL, but when I try
from graphframes import *
I get
No module named graphframes
without '--master yarn' it
Hi all,
I'm running a query that looks like the following:
Select col1, count(1)
From (Select col2, count(1) from tab2 group by col2)
Inner join tab1 on (col1=col2)
Group by col1
This creates a very large shuffle, 10 times the data size, as if the subquery
was executed for each row.
Anything ca
I would recommend against writing unit tests for Spark programs, and
instead focus on integration tests of jobs or pipelines of several
jobs. You can still use a unit test framework to execute them. Perhaps
this is what you meant.
You can use any of the popular unit test frameworks to drive your
t
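A minimal sketch of that idea (ScalaTest driving a job end to end against a local master; the job body and names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FunSuite

    class WordCountPipelineSpec extends FunSuite {
      test("pipeline produces the expected counts end to end") {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("it-test"))
        try {
          // stand-in for the real job/pipeline under test
          val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _).collectAsMap()
          assert(counts("a") == 2)
        } finally {
          sc.stop()
        }
      }
    }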
Any objections? Please articulate your use case. SparkEnv is a weird one
because it was documented as "private" but not marked as so in class
visibility.
* NOTE: This is not intended for external use. This is exposed for Shark
and may be made private
* in a future release.
I do see Hive
I am using python spark 1.6 and the --packages
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10
I need to convert a timestamp string into a Unix epoch timestamp. The
unix_timestamp() function assumes the current time zone. However, my
string data is UTC and encodes the time zone as zero. I
Is that happening only at startup, or during processing? If that's
happening during normal operation of the stream, you don't have enough
resources to process the stream in time.
There's not a clean way to deal with that situation, because it's a
violation of preconditions. If you want to modify
Hi guys, I'm new to MLlib on Spark. After reading the documentation, it seems
that MLlib does not support deep learning. I want to know: is there any way
to implement deep learning on Spark?
Must I use a third-party package like Caffe or TensorFlow?
or
Is a deep learning module listed in the MLlib de
Good morning. I have a dataframe and would like to group by on two fields, and
perform a sum aggregation on more than 500 fields, though I would like to keep
the same name for the 500 fields (instead of sum(Field)). I do have the
field names in an array. Could anybody help with this ques
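One possible approach (a sketch; `df`, the grouping keys and `fieldNames` are illustrative) is to build the aggregate expressions from the array and alias each one back to its original name:

    import org.apache.spark.sql.functions.sum

    val fieldNames: Array[String] = Array("f1", "f2", "f3")  // in practice, the ~500 names
    val aggExprs = fieldNames.map(c => sum(c).alias(c))      // keep the original column name
    val result = df.groupBy("key1", "key2").agg(aggExprs.head, aggExprs.tail: _*)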
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote:
> Hi, Alexander,
>
> that's awesome, and when will that feature be released ? Since I want to
> know the opportunity cost between waiting for that release and use caffe or
> tensorFlow ?
>
I don't expect MLlib will be able to compete with major
Regarding bloom filters,
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417
From: Joseph
Sent: Wednesday, March 16, 2016 9:46:25 AM
To: user
Cc: user; user
Subject: Re: Re: The build-in indexes in ORC file does
Hi, Alexander,
that's awesome, and when will that feature be released? I want to
know the opportunity cost between waiting for that release and using Caffe or
TensorFlow.
great thanks again
On Thu, Mar 17, 2016 at 10:32 AM, Ulanov, Alexander <
alexander.ula...@hpe.com> wrote:
> Hi Charles
Thank you for the info, Steve.
I always believed (IMO) that there is an optimal point where one can
plot the projected NN memory (assuming 1GB --> 40TB of data) against the number
of nodes. For example, heuristically, how many nodes would be sufficient for
1PB of storage with nodes each having 512GB of mem
Hi,
I am adding a new column and renaming it at the same time, but the renaming
doesn't have any effect.
dffiltered =
> dffiltered.unionAll(dfresult.withColumn("Col1",lit("value1").withColumn("Col2",lit("value2")).cast("int")).withColumn("Col3",lit("values3")).withColumnRenamed("Col1","Col1Rename").drop
Any way to cache the subquery or force a broadcast join without persisting it?
y
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: March-17-16 8:59 PM
To: Younes Naguib
Cc: user@spark.apache.org
Subject: Re: Subquery performance
Try running EXPLAIN on both version of the query.
Likel
Hi,
I'm testing collaborative filtering with MLlib.
Making a model by ALS.trainImplicit (or train) seems scalable as far as I
have tested,
but I'm wondering how I can get all the recommendation results efficiently.
The predictAll method can get all the results,
but it needs the whole user-product
This is the line where NPE came from:
if (conf.get(SCAN) != null) {
So Configuration instance was null.
On Fri, Mar 18, 2016 at 9:58 AM, Lubomir Nerad
wrote:
> The HBase version is 1.0.1.1.
>
> Thanks,
> Lubo
>
>
> On 18.3.2016 17:29, Ted Yu wrote:
>
> I looked at the places in SparkContex
On 11 Mar 2016, at 16:25, Mich Talebzadeh
mailto:mich.talebza...@gmail.com>> wrote:
Hi Steve,
My argument has always been that if one is going to use Solid State Disks
(SSD), it makes sense to have them for the NN disks, for start-up from fsimage etc.
Obviously the NN lives in memory. Would you also reromme
CALL FOR PAPERS
11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC
'16)
held in conjunction with the International Supercomputing Conference - High
Performance,
June 19-23, 2016, Frankfurt, Germany.
===
That's a networking error when the driver is attempting to contact
leaders to get the latest available offsets.
If it's a transient error, you can look at increasing the value of
spark.streaming.kafka.maxRetries, see
http://spark.apache.org/docs/latest/configuration.html
If it's not a transient
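For example, a hedged sketch of bumping that setting when building the conf (the value is arbitrary, and it can equally be passed via --conf on spark-submit):

    import org.apache.spark.SparkConf

    // spark.streaming.kafka.maxRetries controls how often the driver retries
    // fetching offsets from the Kafka leaders (default is 1)
    val conf = new SparkConf()
      .setAppName("my-streaming-app")                // name is a placeholder
      .set("spark.streaming.kafka.maxRetries", "5")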
I am using Spark streaming and reading data from Kafka using
KafkaUtils.createDirectStream. I have the "auto.offset.reset" set to
smallest.
But in some Kafka partitions, I get kafka.common.OffsetOutOfRangeException
and my spark job crashes.
I want to understand if there is a graceful way to handl
My suspicion is that your input file partitions are small, hence a small number of
tasks are started. Can you provide some more details, like how you load the
files and how the result size gets to around 500GB?
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/ri
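If small inputs do turn out to be the cause, one hedged workaround is to raise the parallelism before or during the join; the paths, key parsing and partition count below are illustrative only:

    // Build pair RDDs and ask the join for more partitions so the work
    // spreads across the cluster instead of 1-2 tasks.
    val left  = sc.textFile("hdfs:///path/to/fileA").map(l => (l.split(",")(0), l))
    val right = sc.textFile("hdfs:///path/to/fileB").map(l => (l.split(",")(0), l))
    val joined = left.join(right, numPartitions = 40)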
Err, whoops, looks like this is a user app and not building Spark itself,
so you'll have to change your deps to use the 2.11 versions of Spark.
e.g. spark-streaming_2.10 -> spark-streaming_2.11.
On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen wrote:
> See the instructions in the Spark documentation:
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan
wrote:
> b) Shuffle manager (to get shuffle reader)
>
What's the use case for shuffle manager/reader? This seems like using super
internal APIs in applications.
I've been trying to get log4j2 and logback to play nicely with Spark
1.6.0 so I can properly offload my logs to a remote server.
I've attempted the following things:
1. Setting logback/log4j2 on the class path for both the driver and worker
nodes
2. Passing -Dlog4j.configurationFile= and -Dl
Hi,
I have a spark application for batch processing in standalone cluster. The
job is to query the database and then do some transformation, aggregation,
and several actions such as indexing the result into the elasticsearch.
If I don't call sc.stop(), the Spark application won't stop and take
Hi,
regarding 1, packages are resolved locally. That means that when you
specify a package, spark-submit will resolve the dependencies and
download any jars on the local machine, before shipping* them to the
cluster. So, without a priori knowledge of dataproc clusters, it
should be no different to
How much data are you querying? What is the query? How selective is it supposed
to be? What is the block size?
> On 16 Mar 2016, at 11:23, Joseph wrote:
>
> Hi all,
>
> I know that ORC provides three levels of indexes within each file: file
> level, stripe level, and row level.
> The fi
Hi all,
The Apache Beam Spark runner is now available at:
https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out!
The Apache Beam (http://beam.incubator.apache.org/) project is a unified model
for building data pipelines using Google’s Dataflow programming model, and now
I just wrote a blog post on Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b
If you prefer using the py.test framework, then it might be useful.
-vikas
On Wed, Mar 2, 2016 at 10:59 AM, radoburansky
wrote:
> I am sure you ha
Cody et al.,
I am seeing a similar error. I've increased the number of retries. Once
I've got a job up and running I'm seeing it retry correctly. However, I am
having trouble getting the job started - number of retries does not seem to
help with startup behavior.
Thoughts?
Regards,
Bryan Jeff
You need to use wholeTextFiles to read the whole file at once. Otherwise,
it can be split.
DB Tsai - Sent From My Phone
On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" wrote:
> Hi.
>
> We have json data stored in S3 (json record per line). When reading the
> data from s3 using the following code we sta
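A hedged sketch of that suggestion for Spark 1.x, assuming each file holds one (possibly multi-line) JSON document; the bucket and path are placeholders:

    // Read each file as a single (path, content) record so a document is never split across tasks
    val files = sc.wholeTextFiles("s3n://my-bucket/json-data/*")
    val df = sqlContext.read.json(files.values)   // one JSON document per record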
bq. .drop("Col9")
Could it be due to the above ?
On Wed, Mar 16, 2016 at 7:29 PM, Divya Gehlot
wrote:
> Hi,
> I am dynamically doing union all and adding new column too
>
> val dfresult =
>> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
>> val schemaL = dfresu
Experts,
Please share your valued advice.
I have spark 1.5.2 set up as standalone for now and I have started the master
as below
start-master.sh
I also have modified the conf/slaves file to have
# A Spark Worker will be started on each of the machines listed below.
localhost
workerhost
On the localhost I
It is defined in:
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman wrote:
> I am using following code snippet in scala:
>
>
> *val dict: RDD[String] = sc.textFile("path/to/csv/file")*
> *val dict_broadcast=sc.broadcast(dict.collect
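In other words, collectAsMap only exists on RDDs of key/value pairs (via PairRDDFunctions). A minimal sketch, assuming each CSV line is "key,value":

    import org.apache.spark.rdd.RDD

    val dict: RDD[(String, String)] = sc.textFile("path/to/csv/file").map { line =>
      val fields = line.split(",")
      (fields(0), fields(1))            // a pair RDD, so collectAsMap is available
    }
    val dict_broadcast = sc.broadcast(dict.collectAsMap())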
On 19 Mar 2016, at 02:25, Koert Kuipers
mailto:ko...@tresata.com>> wrote:
Spark on YARN is nice because I can bring my own Spark. I am worried that the
shuffle service forces me to use some "sanctioned" Spark version that is
officially "installed" on the cluster.
So... can I safely install th
The HBase version is 1.0.1.1.
Thanks,
Lubo
On 18.3.2016 17:29, Ted Yu wrote:
I looked at the places in SparkContext.scala where NewHadoopRDD is
constructed.
It seems the Configuration object shouldn't be null.
Which hbase release are you using (so that I can see which line the
NPE came from)
Hi All,
On running concurrent Spark jobs (with a huge number of tasks) against the same
SparkContext, there is high scheduler delay. We have the FAIR scheduling policy set
and we also tried a different pool for each job, but still no
improvement. What are the tuning options to improve scheduler delay?
Thanks,
Doesn't FileInputFormat require type parameters? Like so:
class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat[LW, RD]
I haven't verified this but it could be related to the compile error
you're getting.
On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang wrote:
>
Hello, guys!
I've been developing a kind of framework on top of Spark, and my idea is to
bundle the framework jars and some extra configs with Spark and pass it
to other developers for their needs, so that devs can use this bundle and
run the usual Spark stuff but with the extra flavor that the framewor
Hi All,
I have to join 2 files, both not very big (say a few MBs only), but the result can be
huge, say 500GB to TBs of data. Now I have tried using Spark's join()
function, but I'm noticing that the join executes on only 1 or 2 nodes at
most. Since I have a cluster of 5 nodes, I tri
Hello,
Looking at
https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html
and can't wrap my head around how to convert string data type names to
actual pyspark.sql.types data types.
Does pyspark.sql.types have an interface to return StringType() for "string",
IntegerType()
Hi,
Scala version: 2.11.7 (had to upgrade the Scala version to enable case
classes to accept more than 22 parameters).
Spark version: 1.6.1.
PFB pom.xml
Getting the below error when trying to set up Spark in the IntelliJ IDE:
16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
Exception
We are considering deploying a notebook server for use by two kinds of users:
1. Interactive dashboard
   1. I.e. forms allow users to select data sets and visualizations
   2. Review real time graphs of data captured by our Spark streams
2. General notebooks for Data Scientists
My concern is inter
If you prefer the py.test framework, I just wrote a blog post with some
examples:
Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b
On Fri, Feb 5, 2016 at 11:43 AM, Steve Annessa
wrote:
> Thanks for all of the responses.
>
>
Hello,
I am trying to figure out how to unzip zip files in Spark Streaming. Within
each zip file will be a series of xml files which will also need parsing.
Are there libraries that work with DStream that parse a zip or parse an XML
file? I have seen the Databricks XML library, but I do not think
On 17 Mar 2016, at 16:01, Allen George
mailto:allen.geo...@gmail.com>> wrote:
Hi guys,
I'm having a problem where respawning a failed executor during a job that
reads/writes parquet on S3 causes subsequent tasks to fail because of missing
AWS keys.
Setup:
I'm using Spark 1.5.2 with Hadoop 2
See the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna
wrote:
>
>
> Hi,
>
> Scala version:2.11.7(had to upgrade the scala verison to enable case
> clasess to accept more tha
Hi,
I am testing some parallel processing of Spark applications.
I have a two node spark cluster and currently running two worker processes
on each in Yarn-client mode. The master has 12 cores and 24GB of RAM. The
worker node has 4GB of RAM and 2 cores (well an old 32 bit host). The OS on
both is
Btw, here is a great article about accumulators and all their related
traps!
http://imranrashid.com/posts/Spark-Accumulators/ (I'm not the author)
On 16 March 2016 at 18:24, swetha kasireddy
wrote:
> OK. I did take a look at them. So once I have an accumulater for a
> HashSet, how can I check if
Sorry for the late reply. Yes, RDRawDataRecord is my object. It is defined in
another Java project (jar); I get it with Maven. My MapReduce program also
uses it and works.
On Fri, Mar 18, 2016 at 12:48 AM, Mich Talebzadeh wrote:
> Hi Tony,
>
> Is
>
> com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord
>
> O
Hi,
I have a Spark Streaming (with Kafka) job; after running for several days, it
failed with the following errors:
ERROR DirectKafkaInputDStream:
ArrayBuffer(java.nio.channels.ClosedChannelException)
Any idea what could be wrong? Could it be a Spark Streaming buffer overflow
issue?
Regards
*** from the
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson
wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
>
>
Try to troubleshoot why it is happening; maybe some messages are too big to
be read from the topic? I remember getting that error, and that was the cause.
On Fri, Mar 18, 2016 at 11:16 AM Ramkumar Venkataraman <
ram.the.m...@gmail.com> wrote:
> I am using Spark streaming and reading data from Kafka
Hi,
I am facing an issue while deduplicating the keys in RDD (Code Snippet
below).
I have a few sequence files; some of them have duplicate entries. I am trying
to drop duplicate values for each key.
Here are two methods with code snippets:
val path = "path/to/sequence/file"
val rdd1 = ctx.sequenc
Solved the issue by setting up the same heartbeat interval and pauses in
both actor systems
akka {
loggers = ["akka.event.slf4j.Slf4jLogger"]
loglevel = DEBUG
logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
log-dead-letters = on
log-dead-letters-during-shutdown = on
daemonic =
I looked at the places in SparkContext.scala where NewHadoopRDD is
constructed.
It seems the Configuration object shouldn't be null.
Which hbase release are you using (so that I can see which line the NPE
came from) ?
Thanks
On Fri, Mar 18, 2016 at 8:05 AM, Lubomir Nerad
wrote:
> Hi,
>
> I tri
Thanks Jakob, Felix. I am aware you can do it with --packages, but I was
wondering if there is a way to do something like "!pip install "
like I do for other packages from a Jupyter notebook for Python. But I guess
I cannot add a package once I launch the pyspark context, right?
On Thu, Mar 17, 2016
For some reason, writing data from the Spark shell to CSV using the `csv
package` takes almost an hour to dump to disk. Am I going crazy or did I do
this wrong? I tried writing to Parquet first and it's fast as normal.
On my MacBook Pro (16GB, 2.2 GHz Intel Core i7, 1TB) the machine's CPUs go
crazy and i
Have you tried df.saveAsParquetFile? I think that method is in the DataFrame API.
HTH
Marco
On 19 Mar 2016 7:18 pm, "Vincent Ohprecio" wrote:
>
> For some reason writing data from Spark shell to csv using the `csv
> package` takes almost an hour to dump to disk. Am I going crazy or did I do
> this wrong? I tri
Could you try to cast the timestamp as long?
Internally, timestamps are stored as microseconds in UTC; you will get
seconds in UTC if you cast it to long.
On Thu, Mar 17, 2016 at 1:28 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> I am using python spark 1.6 and the --packages
> data
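A small sketch of that suggestion (the DataFrame and column name are illustrative):

    // Casting a timestamp column to long yields seconds since the Unix epoch, in UTC
    val withEpoch = df.select(df("event_ts").cast("long").alias("epoch_seconds"))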
Try running EXPLAIN on both version of the query.
Likely when you cache the subquery, we know that it's going to be small, so we
use a broadcast join instead of shuffling the data.
On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:
> Hi all,
>
>
>
> I’m running
Sounds like you're using one of the KafkaUtils.createDirectStream
overloads that needs to do some broker communication in order to even
construct the stream, because you aren't providing topicpartitions?
Just wrap your construction attempt in a try / catch and retry in
whatever way makes sense for
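A hedged sketch of that retry idea (the retry count and sleep are placeholders, and `ssc`, `kafkaParams` and `topics` are assumed to exist in the application):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkException
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    def createStreamWithRetry(ssc: StreamingContext,
                              kafkaParams: Map[String, String],
                              topics: Set[String],
                              attempts: Int = 5): InputDStream[(String, String)] = {
      try {
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      } catch {
        case e: SparkException if attempts > 1 =>
          Thread.sleep(5000) // give the brokers a moment before retrying
          createStreamWithRetry(ssc, kafkaParams, topics, attempts - 1)
      }
    }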
> On 17 Mar 2016, at 12:28, Mich Talebzadeh wrote:
>
> Thanks Steve,
>
> For NN it all depends how fast you want a start-up. 1GB of NameNode memory
> accommodates around 42T so if you are talking about 100GB of NN memory then
> SSD may make sense to speed up the start-up. Raid 10 is the best
Thanks Steve,
For NN it all depends how fast you want a start-up. 1GB of NameNode
memory accommodates around 42T so if you are talking about 100GB of NN
memory then SSD may make sense to speed up the start-up. Raid 10 is the
best one that one can get assuming all internal disks.
In general it is
No, sadly, it's not an option.
End users are not my team members, it's for customers, so I have to bundle
the framework and ship it.
There is more to my project than just libs, so end users will have to use
bundle anyway.
On Wed, Mar 16, 2016 at 6:41 PM, Silvio Fiorito <
silvio.fior...@granturing.
Can you call sc.stop() after indexing into Elasticsearch?
> On Mar 16, 2016, at 9:17 PM, Imre Nagi wrote:
>
> Hi,
>
> I have a spark application for batch processing in standalone cluster. The
> job is to query the database and then do some transformation, aggregation,
> and several actions
It complains about the file path "./examples/src/main/resources/people.json".
You can try to use an absolute path instead of a relative path, and make sure the
absolute path is correct.
If that still does not work, you can prefix the path with "file://" in case the
default file scheme for Hadoop is HD
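For instance, a hedged Scala equivalent of that suggestion (the absolute path is a placeholder):

    // Force the local filesystem explicitly when HDFS is the cluster's default scheme
    val people = sqlContext.read.json("file:///opt/spark/examples/src/main/resources/people.json")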
Unfortunately I can't share any snippet quickly as the code is generated,
but for now at least can share the plan. (See it here -
http://pastebin.dqd.cz/RAhm/)
After I've increased spark.sql.autoBroadcastJoinThreshold to 30 from
10 it went through without any problems. With 10 it was a
Hi Ted, thanks for answering.
The map is just that, whenever I try inside the map it throws this
ClassNotFoundException, even if I do map(f => f) it throws the exception.
What is bothering me is that when I do a take or a first it returns the
result, which makes me conclude that the previous code is
I like using the new DataFrame APIs on Spark ML, compared to using RDDs in
the older SparkMLlib. But it seems some of the older APIs are missing. In
particular, '*.mllib.clustering.DistributedLDAModel' had two APIs that I
need now:
topDocumentsPerTopic
topTopicsPerDocument
How can I get at the
I am using pyspark 1.6.0 and
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
data
The data is originally captured by a spark streaming app and written to
Cassandra. The value of the timestamp comes from
Rdd.foreachRDD(new VoidFunction2, Time>()
});
I am
Hi Vince,
We had a similar case a while back. I tried two solutions in both Spark on
Hive metastore and Hive on Spark engine.
Hive version 2
Spark as Hive engine 1.3.1
Basically
--1 Move .CSV data into HDFS:
--2 Create an external table (all columns as string)
--3 Create the ORC table (majority
You are running pyspark in Spark client deploy mode. I have run into the same
error as well, and I'm not sure if this is graphframes-specific - the Python
process can't find the graphframes Python code when it is loaded as a Spark
package.
To work around this, I extract the graphframes Python dire
Running the following:
#fix schema for gaid which should not be Double
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col, typ) in tsp_orig.dtypes:
>     if col == 'Agility_GAID':
>         typ = 'string'
>     customSchema.add(col, typ, True)
Getting
ValueError: Could not parse
Thanks - I'll give that a try
cheers
On 20 March 2016 at 09:42, Felix Cheung wrote:
> You are running pyspark in Spark client deploy mode. I have ran into the
> same error as well and I'm not sure if this is graphframes specific - the
> python process can't find the graphframes Python code when
Slight update I suppose?
For some reason, sometimes it will connect and continue and the job will be
completed. But most of the time I still run into this error and the job is
killed and the application doesn't finish.
Still have no idea why this is happening. I could really use some help here.
Spark 1.5 is the latest that I have access to and where this problem
happens.
I don't see it fixed in master, but I might be wrong. Diff attached.
https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py
https://raw.githubusercontent.com/apache/spark/d57daf1f7732a7ac
Can you give a bit more detail?
- Release of Spark
- Symptom of the renamed column not being recognized
Please take a look at "withColumnRenamed" test in:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
On Thu, Mar 17, 2016 at 2:02 AM, Divya Gehlot
wrote:
> Hi,
> I am adding a new
I have data that I pull in using a SQL context and then convert to an RDD.
The problem is that the type in the RDD is [Any, Iterable[Any]],
and I need to have the type RDD[Array[String]] -- i.e., convert the Iterable to an
Array.
Here’s more detail:
val zdata = sqlContext.read.parquet("s3://.. pa
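A hedged sketch of the conversion itself, assuming `grouped` has type RDD[(Any, Iterable[Any])]:

    import org.apache.spark.rdd.RDD

    // Drop the key and turn each group's Iterable[Any] into an Array[String]
    val asArrays: RDD[Array[String]] = grouped.map { case (_, values) =>
      values.map(_.toString).toArray
    }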
For some, like graphframes, that are Spark packages, you could also use
--packages in the command line of spark-submit or pyspark.
See http://spark.apache.org/docs/latest/submitting-applications.html
_
From: Jakob Odersky
Sent: Thursday, March 17, 2016 6:40 PM
Subj
For your last point, spark-submit has:
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
Meaning the script would determine the proper SPARK_HOME variable.
FYI
On Wed, Mar 16, 2016 at 4:22 AM, Леонид Поляков wrote:
> Hello, guys!
>
>
>
> I’ve been develop
I would say change
class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat
to
class RawDataInputFormat[LongWritable, RDRawDataRecord] extends FileInputFormat
On Thu, Mar 17, 2016 at 9:48 AM, Mich Talebzadeh
wrote:
> Hi Tony,
>
> Is
>
> com.kiisoo.aegis.b
I did some tests on Hive running on MR to get rid of Spark effects.
In an ORC table that has been partitioned, partition elimination with
predicate push down works and the query is narrowed to the partition
itself. I can see that from the number of rows within that partition.
For example below sa
It makes no sense for the worker; the issue is with the executor classpath, not the
driver classpath.
Please answer the actual question, which is not in the "P.S." - that one is just a
note about the driver.
Thanks, Leonid
On Wed, Mar 16, 2016 at 6:21 PM, Ted Yu wrote:
> For your last point, spark-submit has:
>
>