Hi,
I have a Twitter Spark stream initialized in the following way:
val ssc: StreamingContext = SparkLauncher.getSparkScalaStreamingContext()
val config = getTwitterConfigurationBuilder.build()
val auth: Option[twitter4j.auth.Authorization] = Some(
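For reference, a minimal sketch of how such a stream is usually wired together with spark-streaming-twitter; the SparkLauncher and getTwitterConfigurationBuilder helpers above are specific to the poster's code, so the names below are assumptions:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.auth.{Authorization, OAuthAuthorization}
import twitter4j.conf.ConfigurationBuilder

// Assumed setup: build twitter4j OAuth from a ConfigurationBuilder whose
// consumer/access keys were set elsewhere, then create the tweet DStream.
def tweetStream(ssc: StreamingContext) = {
  val config = new ConfigurationBuilder().build()
  val auth: Option[Authorization] = Some(new OAuthAuthorization(config))
  TwitterUtils.createStream(ssc, auth)
}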
Hi,
I have built a Spark application with IDEA. When I run SparkPi, IDEA throws
an exception like this:
Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration
  at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
  at org.sp
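This error usually means the Servlet 3.0 API classes are missing from (or shadowed on) the IDE's run classpath. The thread is cut off before any answer, so the sbt line below is only a commonly suggested fix, not one confirmed here:

// Hypothetical build.sbt addition: provides javax.servlet.FilterRegistration (Servlet 3.0)
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"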
I wanted to ask a general question about Hadoop/YARN and Apache Spark
integration. I know that Hadoop on a physical cluster has rack awareness,
i.e. it attempts to minimise network traffic by saving replicated blocks
within a rack. I wondered whether, when Spark is configured to use YARN
This is likely due to data skew. If you are using key-value pairs, one key
may have far more records than the other keys. Do you have any groupBy
operations? (A quick way to check for skew is sketched below.)
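A minimal sketch of one way to check, assuming a key-value RDD; the names and types are illustrative, not from the thread:

import org.apache.spark.rdd.RDD

// Print the ten heaviest keys; a very lopsided count here usually explains
// one straggling task in a groupBy/reduceByKey stage.
def showHeaviestKeys(pairs: RDD[(String, String)]): Unit = {
  val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
  keyCounts.sortBy(_._2, ascending = false).take(10).foreach(println)
}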
David
On Tue, Jul 14, 2015 at 9:43 AM, shahid wrote:
> hi
>
> I have a 10 node cluster. I loaded the data onto HDFS, so the no. of
> par
It might be a network issue. The error states that it failed to bind the
server IP address.
Chester
Sent from my iPhone
> On Jul 18, 2015, at 11:46 AM, Amjad ALSHABANI wrote:
>
> Does anybody have any idea about the error I'm having? I am really
> clueless... and would appreciate any idea :)
>
> Thanks i
Does anybody have any idea about the error I'm having? I am really
clueless... and would appreciate any idea :)
Thanks in advance
Amjad
On Jul 17, 2015 5:37 PM, "Amjad ALSHABANI" wrote:
> Hello,
>
> First of all, I'm a newbie in Spark.
>
> I'm trying to start the spark-shell with a YARN cluster by run
Try this (replace ... with the appropriate values for your environment):
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
val sc = new SparkContext(...)
val documents = sc.wholeTextFile
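The snippet is cut off above; based on the HashingTF and Vector imports, a hedged guess at how it continues (the path and whitespace tokenization are placeholders, not from the thread):

// Read whole files, tokenize each document, and hash the terms into TF vectors.
val documents: RDD[Seq[String]] =
  sc.wholeTextFiles("hdfs:///path/to/docs")
    .map { case (_, text) => text.split("\\s+").toSeq }
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)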
Hi all,
I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via SparkSQL
CLI, how would I deal with Parquet schema evolution?
Thanks,
J
I am facing the same issue; I tried this but got a compilation error for
the "$" in the explode function, so I had to modify it as below to make it
work:
df.select(explode(new Column("entities.user_mentions")).as("mention"))
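For what it's worth, the "$" syntax only compiles when the SQLContext implicits are in scope. A small hedged sketch of that variant; the df name and column path come from the snippet above, everything else is an assumption:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.explode

// The $"..." string-to-Column syntax comes from the SQLContext's implicits.
def explodeMentions(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  import sqlContext.implicits._
  df.select(explode($"entities.user_mentions").as("mention"))
}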
On Wed, Jun 24, 2015 at 2:48 PM, Michael Armbrust
wrote:
> Star
Here is a related thread:
http://search-hadoop.com/m/q3RTtPmjSJ1Dod92
> On Jul 15, 2015, at 7:41 AM, k0ala wrote:
>
> Hi,
>
> I have been working a bit with RDDs, and am now taking a look at DataFrames.
> The schema definition using case classes looks very attractive:
>
> https://spark.apache.
Hi
I am trying to use a DataFrame and save it to Elasticsearch using the new
Hadoop API (because I am using Python). Can anyone guide me on whether this
is even possible?
--
Best Regards,
Ayan Guha
I have a large dataset stored in a BigQuery table and I would like to load
it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and pyspark should be able to use this int
Hi.
What I would do in your case would be something like this.
Let's call the two datasets qs and ds, where qs is an array of vectors and
ds is an RDD[(dsID: Long, Vector)].
Do the following:
1) Create a k-NN class that can keep track of the k nearest neighbors seen
so far. It must have a qsID a
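The message is cut off here; a rough sketch of what such a k-NN tracker might look like (the class and field names and the max-heap eviction strategy are assumptions, not the author's actual design):

import scala.collection.mutable

// Keeps the k closest (distance, dsID) pairs seen so far for one query vector.
class KNNTracker(val qsID: Long, k: Int) extends Serializable {
  // Max-heap on distance, so the current worst neighbor is cheap to evict.
  private val heap =
    mutable.PriorityQueue.empty[(Double, Long)](Ordering.by[(Double, Long), Double](_._1))

  def offer(distance: Double, dsID: Long): Unit = {
    if (heap.size < k) heap.enqueue((distance, dsID))
    else if (distance < heap.head._1) { heap.dequeue(); heap.enqueue((distance, dsID)) }
  }

  def neighbors: Seq[(Double, Long)] = heap.toSeq.sortBy(_._1)
}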
Hi.
You can use a broadcast variable to make data available to all the nodes in
your cluster, and it can live longer than just the current distributed task.
For example, if you need to access a large structure in multiple sub-tasks,
instead of sending that structure again and again with each sub-t
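A minimal sketch of the pattern being described, with a made-up lookup table standing in for the large structure:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Ship the lookup table to every executor once, then reuse it from any task.
def enrich(sc: SparkContext, ids: RDD[Int]): RDD[String] = {
  val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")          // imagine this is large
  val lookupBc = sc.broadcast(lookup)                      // broadcast once per node
  ids.map(id => lookupBc.value.getOrElse(id, "unknown"))   // read it in each task
}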
Even if I remove the numpy calls (no matrices loaded), the same exception
occurs.
Can anyone tell me what createDataFrame does internally? Are there any
alternatives to it?
On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das
wrote:
> I suspect it's numpy filling up memory.
>
> Thanks
> Best Regards
>
> On F
Hi,
On Windows, in local mode, using PySpark, I got an error about "excessively
deep recursion".
I'm using a module for lemmatizing/stemming, which uses a DLL and some
binary files (the module is a Python wrapper around C code).
Spark version 1.4.0.
Any idea what is going on?
Hi.
Assuming you have the data in an RDD, you can save the RDD (regardless of
structure) with myRDD.saveAsObjectFile("path"), where "path" can be
"hdfs:///myfolderonHDFS" or a path on the local file system.
Alternatively, you can also use .saveAsTextFile().
Regards,
Gylfi.
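A small sketch of both options with placeholder paths (adjust them to your environment):

import org.apache.spark.SparkContext

def saveBothWays(sc: SparkContext): Unit = {
  val data = sc.parallelize(Seq((1, "a"), (2, "b")))
  // Java-serialized binary files; read back with sc.objectFile[(Int, String)]("...")
  data.saveAsObjectFile("hdfs:///myfolderonHDFS/objects")
  // Plain text, one record's toString per line
  data.saveAsTextFile("hdfs:///myfolderonHDFS/text")
}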
You could even try changing the block size of the input data on HDFS (it can
be done on a per-file basis), and that would get all workers going right from
the get-go in Spark.
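One hedged way to do that from Spark itself, assuming the data is rewritten (dfs.blocksize only affects files written after it is set, and this is not the only way to change a file's block size):

import org.apache.spark.SparkContext

// Illustrative only: rewrite the input with 64 MB HDFS blocks so a later job
// sees more input splits and therefore starts with more tasks.
def rewriteWithSmallerBlocks(sc: SparkContext, in: String, out: String): Unit = {
  sc.hadoopConfiguration.set("dfs.blocksize", (64L * 1024 * 1024).toString)
  sc.textFile(in).saveAsTextFile(out)
}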
You may want to look into using the pipe command:
http://blog.madhukaraphatak.com/pipe-in-spark/
http://spark.apache.org/docs/0.6.0/api/core/spark/rdd/PipedRDD.html
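A minimal sketch of RDD.pipe, assuming the external command reads lines from stdin and writes lines to stdout (the tr command is just an example):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Each partition's elements are written to the command's stdin, one per line;
// the command's stdout lines become the elements of the resulting RDD.
def shout(sc: SparkContext, words: Seq[String]): RDD[String] =
  sc.parallelize(words).pipe("tr 'a-z' 'A-Z'")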
Hi.
"All transformations in Spark are lazy, in that they do not compute their
results right away. Instead, they just remember the transformations applied
to some base dataset (e.g. a file). The transformations are only computed
when an action requires a result to be returned to the driver program
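To make the quoted point concrete, a tiny sketch with a placeholder path: nothing is read until the action on the last line.

import org.apache.spark.SparkContext

def countErrors(sc: SparkContext): Long = {
  val lines  = sc.textFile("hdfs:///logs/app.log")   // transformation: nothing runs yet
  val errors = lines.filter(_.contains("ERROR"))     // still only a recorded lineage
  errors.count()                                     // action: the job executes here
}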
Hi.
If I just look at the two pics, I see that there is only one sub-task that
takes all the time: the flatMapToPair at Coef... line 52.
I also see that there are only two partitions making up the input, and thus
probably only two workers active.
Try repartitioning the data into mo
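A hedged sketch of the repartitioning suggestion; the partition count is an arbitrary example, not a recommendation from the thread:

import org.apache.spark.rdd.RDD

// Spread the input over more partitions so more than two tasks can run in parallel.
def spreadOut[T](input: RDD[T]): RDD[T] =
  input.repartition(32)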
I think you have a mistake in the call to jdbc(); it should be:
jdbc(self, url, table, mode, properties)
You had used properties as the third parameter.
On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T
wrote:
> Hello,
>
> I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s
>
Hi.
To be honest, I don't really understand your problem statement :( but let's
just talk about how .flatMap works.
Unlike .map(), which only allows a one-to-one transformation, .flatMap()
allows 0, 1 or many outputs per item processed, but the output must take the
form of a sequence of the same
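A tiny sketch of the map vs. flatMap difference being described:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def mapVsFlatMap(sc: SparkContext): (RDD[Array[String]], RDD[String]) = {
  val lines = sc.parallelize(Seq("a b c", "d e", "f"))
  val mapped    = lines.map(_.split(" "))        // exactly one output per line (an array)
  val flattened = lines.flatMap(_.split(" "))    // 0..n outputs per line (individual words)
  (mapped, flattened)
}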