Hi All,
I want to read MySQL from Spark. Please let me know how many threads will be
used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is
the best practice for reading a large dataset from an RDBMS into Spark?
Thanks,
Jingyu
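With partitionColumn, lowerBound and upperBound set, Spark JDBC builds
numPartitions range queries, and each partition is read by its own task over
its own JDBC connection, so numPartitions = 10 means at most 10 concurrent
reads (bounded by the executor cores actually available); without a partition
column the whole table comes through a single connection. A sketch, with
made-up table, column and bounds:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")                    // placeholder credentials
props.setProperty("password", "dbpass")
props.setProperty("driver", "com.mysql.jdbc.Driver")

val df = sqlContext.read.jdbc(
  "jdbc:mysql://dbhost:3306/mydb",   // placeholder URL
  "mytable",                         // placeholder table
  "id",                              // partitionColumn: a numeric column to split on
  1L,                                // lowerBound: rough minimum of the column
  1000000L,                          // upperBound: rough maximum (not a filter)
  10,                                // numPartitions: up to 10 parallel JDBC connections
  props)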
GraphX does not support Python yet.
http://spark.apache.org/docs/latest/graphx-programming-guide.html
The workaround is to use GraphFrames (a 3rd-party API),
https://issues.apache.org/jira/browse/SPARK-3789
but some features in Python are not the same as in Scala,
https://github.com/graphframes/gr
Hi All,
I am using Eclipse with Maven for developing Spark applications. I got an
error when reading from S3 in Scala, but it works fine in Java when I run
them in the same project in Eclipse. The Scala/Java code and the error are
below.
Scala
val uri = URI.create("s3a://" + key + ":" + seckey +
>>> don't put your secret in the URI, it'll only creep out in the logs.
>>>
>>> Use the specific properties covered in
>>> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html,
>>> which you can set in your spark context by p
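A sketch of what this reply suggests (property names from the hadoop-aws page
above; the bucket and path are placeholders): set the credentials on the
Hadoop configuration instead of embedding them in the URI.

sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", seckey)

val lines = sc.textFile("s3a://" + bucket + "/path/to/file")   // no credentials in the URI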
I am trying to filter out vertices that do not have any connecting links to
others. How do I filter those isolated vertices in GraphX?
Thanks,
Jingyu
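One way to do this, as a sketch (the toy vertex and edge data are made up):
graph.degrees only contains vertices that appear in at least one edge, so
joining on it and keeping degree > 0 drops the isolated vertices.

import org.apache.spark.graphx.{Edge, Graph}

// Toy graph: vertex 4 has no edges, so it is isolated.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph    = Graph(vertices, edges)

val nonIsolated = graph
  .outerJoinVertices(graph.degrees) { (_, attr, deg) => (attr, deg.getOrElse(0)) }
  .subgraph(vpred = (_, v) => v._2 > 0)        // keep only vertices with degree > 0
  .mapVertices((_, v) => v._1)                 // restore the original vertex attribute

nonIsolated.vertices.collect().foreach(println)  // (1,a), (2,b), (3,c)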
Hi, what is the best way to unpersist the RDDs in GraphX to release memory?
RDD.unpersist
or
RDD.unpersistVertices and RDD.edges.unpersist
I studied the source code of Pregel.scala; both of the above are used between
line 148 and line 150. Can anyone please tell me what the difference is? In
addition, what
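For reference, the two options look like this in code (a sketch, assuming a
cached Graph g):

// Option 1: uncache the vertex and edge RDDs in one call.
g.unpersist(blocking = false)

// Option 2: what Pregel.scala does between iterations -- release the old
// vertices (and their replicated copies) and the old edges separately, once
// the next graph in the iteration has been materialised. (Pregel also
// unpersists the old message RDD in the same place.)
g.unpersistVertices(blocking = false)
g.edges.unpersist(blocking = false)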
I am running a Pregel function on 37 nodes in EMR Hadoop. After an hour the
logs show the following. Can anyone please tell me what the problem is and
how I can solve it? Thanks
16/02/05 14:02:46 WARN BlockManagerMaster: Failed to remove broadcast
2 with removeFromMaster = true - Cannot receive any reply in 120
You can set "maximizeResourceAllocation"; then EMR will work out the best
configuration for you.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
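For reference, this is a cluster-level setting supplied when the EMR cluster
is created; per the page above, the configuration JSON would look roughly
like this (a sketch, not tested here):

[
  {
    "Classification": "spark",
    "Properties": { "maximizeResourceAllocation": "true" }
  }
]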
Jingyu
On 18 February 2016 at 12:02, wrote:
> Hi All,
>
> I have been facing memory issues in Spark. I'm using Spark SQL on AWS
It is not a problem to use JavaRDD.cache() for 200M of data (all objects read
from JSON format). But when I try to use DataFrame.cache(), it throws the
exception below.
My machine can cache 1G of data in Avro format without any problem.
15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531
ith just a single row?
> Do you rows have any columns with null values?
> Can you post a code snippet here on how you load/generate the dataframe?
> Does dataframe.rdd.cache work?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Thu, Oct 29, 2015 at 4:
Try s3://aws_key:aws_secret@bucketName/folderName with your access key and
secret to save the data.
On 30 October 2015 at 10:55, William Li wrote:
> Hi – I have a simple app running fine with Spark, it reads data from S3
> and performs calculation.
>
> When reading data from S3, I use hadoopConf
There is no problem in Spark SQL 1.5.1, but the error "key not found:
sportingpulse.com" shows up when I use 1.5.0.
I have to use version 1.5.0 because that is the one AWS EMR supports.
Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
Thanks.
Caused by: java.util.
hing due to the columnar compression. I've seen similar
> intermittent issues when caching DataFrames. "sportingpulse.com" is a
> value in one of the columns of the DF.
> --
> From: Ted Yu
> Sent: 10/30/2015 6:33 PM
> To: Zhang, Jingyu
A small CSV file in S3. I use s3a://key:seckey@bucketname/a.csv
It works with the SparkContext:
pixelsStr = ctx.textFile(s3pathOrg);
It works for the Java spark-csv reader as well.
Java code: DataFrame careerOneDF = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema",
I want to pass two parameters into a new Java class from rdd.mapPartitions();
the code is like the following.
---Source Code
Main method:
/*the parameters that I want to pass into the PixelGenerator.class for
selecting any items between the startTime and the endTime.
*/
int startTime, endTime;
Ja
Thanks, that worked in the local environment but not in the Spark cluster.
On 16 November 2015 at 16:05, Fengdong Yu wrote:
> Can you try : new PixelGenerator(startTime, endTime) ?
>
>
>
> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu
> wrote:
>
> I want to pass two parame
e:
> If you get a "cannot be serialized" exception, then you need to make
> PixelGenerator a static class.
>
> On Nov 16, 2015, at 1:10 PM, Zhang, Jingyu
> wrote:
>
> Thanks, that worked in the local environment but not in the Spark cluster.
>
>
> On 16 Novembe
On 16 November 2015 at 16:34, Fengdong Yu wrote:
> Just make PixelGenerator a nested static class?
>
> On Nov 16, 2015, at 1:22 PM, Zhang, Jingyu
> wrote:
>
> Fengdong
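The underlying rule, sketched here in Scala for brevity (Pixel, the field
names and the filter logic are hypothetical stand-ins for the code in this
thread): whatever object is handed to mapPartitions must be serializable and
must not capture a non-serializable enclosing instance, which is exactly what
making the Java class a static nested class achieves.

// Hypothetical record type standing in for the real data.
case class Pixel(time: Int, value: String)

// A top-level function class: serializable, and it captures only its own
// constructor parameters, not an enclosing object.
class PixelGenerator(startTime: Int, endTime: Int)
    extends (Iterator[Pixel] => Iterator[Pixel]) with Serializable {
  override def apply(pixels: Iterator[Pixel]): Iterator[Pixel] =
    pixels.filter(p => p.time >= startTime && p.time <= endTime)
}

// Assuming pixels: RDD[Pixel] and the two Int bounds from the main method.
val selected = pixels.mapPartitions(new PixelGenerator(startTime, endTime))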
I am using spark-csv to save files to S3, and it fails with "Size exceeds".
Please let me know how to fix it. Thanks.
df.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("s3://newcars.csv");
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at
rk.sql.execution.Sort$$anonfun$doExecute$5.apply(basicOperators.scala:190)
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 21 more
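The "Size exceeds Integer.MAX_VALUE" error usually means a single block (one
partition's worth of data) has grown past 2 GB. A common workaround, as a
sketch (Scala here; the partition count is a placeholder to tune and the
bucket is made up), is to spread the rows over more partitions before
writing:

df.repartition(200)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3a://" + bucket + "/newcars.csv")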
On 16 November 2015 at 21:16, Zhang, Jingyu
wrote:
> I am using spark-csv to save files in s3, it shown Size e
Can anyone please let me know how to print all the nodes in each connected
component, one by one?
graph.connectedComponents()
e.g.
Connected component ID    Node IDs
1                         1,2,3
6                         6,7,8,9
Thanks
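One way to do this, as a sketch: the vertices of the connected-components
graph are (vertexId, componentId) pairs, so grouping by the component id
gives the membership of each component.

val cc = graph.connectedComponents().vertices   // RDD of (vertexId, componentId)

cc.map { case (vertexId, componentId) => (componentId, vertexId) }
  .groupByKey()
  .collect()
  .foreach { case (componentId, members) =>
    println("component " + componentId + ": " + members.mkString(","))
  }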
Spark spills data to disk when there is more data shuffled onto a single
executor machine than can fit in memory. However, it flushes out the data
to disk one key at a time - so if a single key has more key-value pairs
than can fit in memory, an out of memory exception occurs.
Cheers,
Jingyu
On
Hi All,
I want to extract the "value" RDD from a PairRDD in Java.
Please let me know how I can get it easily.
Thanks
Jingyu
I have A and B DataFrames
A has columns a11,a12, a21,a22
B has columns b11,b12, b21,b22
I persist them in cache:
1. A.cache()
2. B.cache()
Then, I persist the subsets in cache later:
3. DataFrame A1 (a11,a12).cache()
4. DataFrame B1 (b11,b12).cache()
5. DataFrame AB1 (a11,a12,b11,b12).cah
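A sketch of steps 1-4 in code (column names as above; how AB1 is built is cut
off, so it is left out). Each cache() call gets its own entry in the Storage
tab, so the cached projections take memory on top of the cached A and B.

A.cache()
B.cache()

val A1 = A.select("a11", "a12").cache()   // a separate cached copy of just these columns
val B1 = B.select("b11", "b12").cache()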
Does your SQL work if it does not run a regex on the strings and concatenate
them, I mean if you just select the columns without string operations?
On 24 September 2015 at 10:11, java8964 wrote:
> Try to increase the partition count; that will make each partition hold
> less data.
>
> Yong
>
> ---
>> you would need extra memory that would be equivalent to half the memory of
>> A/B.
>>
>> You can check the storage that a dataFrame is consuming in the Spark UI's
>> Storage tab. http://host:4040/storage/
>>
>>
>>
>> On Thu, Sep 24, 2015 at 5:
I got the following exception when I run
JavaPairRDD.values().saveAsTextFile("s3n://bucket");
Can anyone help me out? Thanks
15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError:
org/jets3t/service/ServiceException
at org.apa