join but I
want to use broadcast join).
Or is there any ticket or roadmap for this feature?
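(For reference, what I mean by forcing a broadcast join -- a minimal sketch,
assuming the Spark 1.5 Java DataFrame API; bigDf, smallDf and the "key" column
are placeholder names, not from my actual job:)

import static org.apache.spark.sql.functions.broadcast;
import org.apache.spark.sql.DataFrame;

// bigDf and smallDf are assumed to be already-loaded DataFrames.
// Wrapping the small side in broadcast() hints the planner to use a
// broadcast join instead of a shuffle join, even for a left outer join.
DataFrame joined = bigDf.join(broadcast(smallDf),
        bigDf.col("key").equalTo(smallDf.col("key")),
        "left_outer");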
Regards,
Shuai
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Saturday, December 05, 2015 4:11 PM
To: Shuai Zheng
Cc: Jitesh chandra Mishra; user
Subject: Re: Broadcasting a
Hi all,
Sorry to re-open this thread.
I have a similar issue: one big parquet file left outer joined with quite a few
smaller parquet files. But it runs extremely slowly and even OOMs sometimes
(with 300M , I have two questions here:
1, If I use outer join, will Spark SQL auto use broadc
[mailto:jonathaka...@gmail.com]
Sent: Thursday, November 19, 2015 6:54 PM
To: Shuai Zheng
Cc: user
Subject: Re: Spark Tasks on second node never return in Yarn when I have more
than 1 task node
I don't know if this actually has anything to do with why your job is hanging,
but since you are using EM
Hi All,
I face a very weird case. I have already simplified the scenario as much as
possible so everyone can reproduce it.
My env:
AWS EMR 4.1.0, Spark 1.5
My code runs without any problem in local mode, and it also runs fine on an
EMR cluster with one mas
Spark.
So I want to know whether there is any new setting I should configure in Spark
1.5 to make it work.
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, November 04, 2015 3:22 PM
To: user@spark.apache.org
Subject: [Spark 1.5]: Exception in thread "broadcast-hash-j
Hi All,
I have a program which runs some fairly complex business logic (joins) in Spark.
And I get the exception below:
I am running on Spark 1.5, with these parameters:
spark-submit --deploy-mode client --executor-cores=24 --driver-memory=2G
--executor-memory=45G -class .
Some other setup:
, 2015 11:07 AM
To: Shuai Zheng
Cc: user
Subject: Re: How to Take the whole file as a partition
Your situation is special. It seems to me Spark may not fit well in your case.
You want to process the individual files (500M~2G) as a whole, and you want good
performance.
You may want to write our
Hi All,
I have 1000 files, from 500M to 1-2GB at this moment. And I want Spark to
read them as partitions at the file level, which means I want the FileSplit
turned off.
I know there are some solutions, but they are not very good in my case:
1, I can't use WholeTextFiles method, because my file is t
Hi All,
I am trying to access an S3 file in Hadoop file format:
Below is my code:
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("fs.s3n.awsAccessKeyId",
this.getAwsAccessKeyId());
hadoopConf.set("fs.s
Hi,
I have exactly the same issue on Spark 1.4.1 (on the latest default EMR AMI
4.0), running as a YARN client. After wrapping it with another Java HashMap,
the exception disappears.
But may I know what the right solution is? Has any JIRA ticket been created for
this? I want to monitor it, even though it could be bypass b
to replicate my issue locally (the code doesn’t
need to run on EC2; I run it directly from my local Windows PC).
Regards,
Shuai
From: Aaron Davidson [mailto:ilike...@gmail.com]
Sent: Wednesday, June 10, 2015 12:28 PM
To: Shuai Zheng
Subject: Re: [SPARK-6330] 1.4.0/1.5.0 Bug to access
Hi All,
I have some code to access s3 from Spark. The code is as simple as:
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = ctx.hadoopConfiguration();
// aws.secretKey=Zqhjim3GB69hMBvfjh+7NX84p8sMF39BHfXwO3Hs
Hi All,
I want to ask how to use a UDF with the join function on DataFrames. It
always gives me the "cannot resolve the column name" error.
Can anyone give me an example of how to do this in Java?
My code is like:
edmData.join(yb_lookup,
edmData.col("year(YEARBUILT)").equalTo(
Hi All,
Basically I try to define a simple UDF and use it in a query, but it gives
me a "Task not serializable" error:
public void test() {
RiskGroupModelDefinition model =
registeredRiskGroupMap.get(this.modelId);
RiskGroupModelDefinition edm = this.createEdm(
PM
To: Dean Wampler
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Slower performance when bigger memory?
FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller
executors. Another observation was that one large executor results in less
overall read throughput from
Got it. Thanks! :)
From: Yin Huai [mailto:yh...@databricks.com]
Sent: Thursday, April 23, 2015 2:35 PM
To: Shuai Zheng
Cc: user
Subject: Re: Bug? Can't reference to the column by name after join two
DataFrame on a same name key
Hi Shuai,
You can use "as" to creat
Hi All,
I use 1.3.1
When I have two DFs and join them on a same-named key, afterwards I can't
reference the common key by name.
Basically:
select * from t1 inner join t2 on t1.col1 = t2.col1
And I am using purely DataFrame, spark SqlContext not HiveContext
DataFrame df3 = df1.join(df2
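(What I understood the "as" suggestion to mean, as a minimal Java sketch --
df1/df2/col1 are from my example above, the aliases t1/t2 and the col2 column
are just placeholders:)

import org.apache.spark.sql.DataFrame;

// Give each side an alias so the shared column name can be qualified later.
DataFrame t1 = df1.as("t1");
DataFrame t2 = df2.as("t2");

DataFrame df3 = t1.join(t2, t1.col("col1").equalTo(t2.col("col1")));

// The common key can now be referenced unambiguously through its alias.
df3.select(t1.col("col1"), t2.col("col2")).show();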
Hi All,
I am running some benchmarks on an r3.8xlarge instance. I have a cluster with
one master (no executor on it) and one slave (r3.8xlarge).
My job has 1000 tasks in stage 0.
An r3.8xlarge has 244G memory and 32 cores.
If I create 4 executors, each with 8 cores + 50G memory, each task will
Below is my code to access s3n without problems (only for 1.3.1; there is a bug
in 1.3.0).
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl",
"org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3n
I have a similar issue (I failed on the spark core project but with the same
exception as you). Then I followed the steps below (I am working on Windows):
delete the Maven repository and re-download all dependencies. The issue sounds
like a corrupted jar that can’t be opened by Maven.
Other than this,
Hi All,
In some cases, I get the exception below when I run Spark in local mode (I
haven't seen this in a cluster). This is weird and also affects my local unit
test cases (it does not always happen, but usually once per 4-5 runs). From
the stack, it looks like the error happens when creating the context, bu
Hi All,
I am a bit confused about spark.storage.memoryFraction. This is used to set the
area for RDD usage, but does this mean only cached and persisted RDDs?
So if my program has no cached RDD at all (meaning I have no .cache() or
.persist() call on any RDD), then can I set this
spark.stor
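(In case it helps, this is how I would lower it when nothing is cached -- a
minimal sketch, assuming Spark 1.x where spark.storage.memoryFraction still
exists, default 0.6; the app name is a placeholder:)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("NoCacheJob")   // hypothetical app name
        // With no .cache()/.persist() calls the storage region goes unused,
        // so it can be shrunk to leave more memory for shuffles and user code.
        .set("spark.storage.memoryFraction", "0.1");

JavaSparkContext ctx = new JavaSparkContext(conf);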
re what
happens here; I have no time to dig into it. But if you want me to provide more
information, I will be happy to do that.
Regards,
Shuai
-Original Message-
From: Bozeman, Christopher [mailto:bozem...@amazon.com]
Sent: Wednesday, April 01, 2015 4:59 PM
To: Shuai Zheng; 'Sean
, not NULL). I am not sure why this happens (the code runs without any
problem with the default Java serializer). So I think this is a bug,
but I am not sure whether it is a bug in Spark or a bug in Kryo.
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, April 06, 2015 5
Hi All,
I have tested my code without problems on EMR YARN (Spark 1.3.0) with the
default serializer (Java).
But when I switch to org.apache.spark.serializer.KryoSerializer, the
broadcast value doesn't give me the right result (it actually returns an empty
custom class on the inner object).
Basically I bro
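(The workaround I would try first is registering the broadcast's classes
explicitly with Kryo -- a minimal sketch, assuming Spark 1.2+; MyLookup and
MyInner are hypothetical stand-ins for my custom classes:)

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("KryoBroadcastTest")   // hypothetical app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Explicitly register the classes carried by the broadcast value so
        // Kryo serializes all of their fields.
        .registerKryoClasses(new Class<?>[]{MyLookup.class, MyInner.class});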
,
Shuai
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 01, 2015 10:51 AM
To: Shuai Zheng
Cc: Akhil Das; user@spark.apache.org
Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn?
I feel like I recognize that problem, and
rom: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, April 01, 2015 2:40 AM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn?
Once you submit the job do a ps aux | grep spark-submit and see how much is
Hi All,
Below is my shell script:
/home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G
--master yarn-client --class com.***.FinancialEngineExecutor
/home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties
My driver will load some resources and then broa
Hi All,
I am waiting for Spark 1.3.1 to fix the bug so it works with the S3 file system.
Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because
there is a jar compatibility issue with the AWS SDK.
Regards,
Shuai
Below is the output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967947
max locked memory (kbytes, -l) 64
max memory
Hi All,
I try to run a simple sort-by on 1.2.1, and it always gives me the two
errors below:
1, 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID
35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException:
/tmp/spark-e40bb112-3a08-4f62-9eaa-cd094fcfa624/spark-58f72d53
From: Aaron Davidson [mailto:ilike...@gmail.com]
Sent: Tuesday, March 17, 2015 3:06 PM
To: Imran Rashid
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Spark will process _temporary folder on S3 is very slow and always
cause failure
Actually, this is the more relevant JIRA (which
Hi,
I am curious: when I start a Spark program in local mode, which parameter
will be used to decide the JVM memory size for the executor?
In theory it should be:
--executor-memory 20G
But I remember local mode will only start the Spark executor in the same
process as the driver, so it should be:
--driv
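(A minimal sketch of how I would test it, assuming that in local mode the
executor runs inside the driver JVM, so --driver-memory is the knob that
matters and --executor-memory is effectively ignored; the class and jar names
are placeholders:)

spark-submit --master local[*] --driver-memory 20G --class com.example.MyJob my.jar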
Hi Imran,
I am a bit confused here. Assume I have RDD a with 1000 partitions which has
also been sorted. When creating RDD b (with 20 partitions), how can I control
it to make sure partitions 1-50 of RDD a map to the 1st partition of RDD b? I
don’t see any control code/logic here.
Your code below:
rsion.
Regards,
Shuai
From: Kelly, Jonathan [mailto:jonat...@amazon.com]
Sent: Monday, March 16, 2015 2:54 PM
To: Shuai Zheng; user@spark.apache.org
Subject: Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0
See https://issues.apache.org/jira/browse/SPARK-6351
Hi All,
I just upgraded the system to version 1.3.0, but now sqlContext.parquetFile
doesn't work with s3n. I have tested the same code with 1.2.1 and it works.
A simple test running in spark-shell:
val parquetFile = sqlContext.parquetFile("""s3n:///test/2.parq """)
java.lang.
...@gmail.com]
Sent: Monday, March 16, 2015 1:06 PM
To: Shuai Zheng
Cc: user
Subject: Re: [SPARK-3638 ] java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.
From my local maven repo:
$ jar tvf
~/.m2/repository/org/apache/httpcomponents/httpclient/4.
Hi All,
I am running Spark 1.2.1 and the AWS SDK. To make sure AWS is compatible with
httpclient 4.2 (which I assume Spark uses?), I have already downgraded it to
version 1.9.0.
But even then, I still get an error:
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.co
?
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Friday, March 13, 2015 6:51 PM
To: user@spark.apache.org
Subject: Spark will process _temporary folder on S3 is very slow and always
cause failure
Hi All,
I try to run a sorting on a r3.2xlarge instance on AWS
Hi All,
I am trying to run a sort on an r3.2xlarge instance on AWS. I just try to run
it as a single-node cluster for testing. The data I sort is around 4GB and sits
on S3; the output will also be on S3.
I just connect spark-shell to the local cluster and run the code in the
script (because I just
Hi All,
I am running Spark to deal with AWS.
The latest AWS SDK version works with httpclient 3.4+, but the
spark-assembly-*.jar file has packaged an old httpclient version, which causes
a ClassNotFoundException for
org/apache/http/client/methods/HttpPatch
Even when I put the rig
Hi All,
I am trying to pass parameters to spark-shell when I do some tests:
spark-shell --driver-memory 512M --executor-memory 4G --master
spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf
spark.sql.parquet.binaryAsString=true
This works fine on my local PC. And whe
Hi All,
I am processing some time series data. For one day, it might be 500GB, so for
each hour it is around 20GB of data.
I need to sort the data before I start processing. Assume I can sort it
successfully with
dayRDD.sortByKey
but after that, I might have thousands of partitions (to m
Hi All,
I have a lot of parquet files, and I am trying to open them directly instead of
loading them into an RDD in the driver (so I can optimize performance through
some special logic).
But I have done some research online and can't find any example of accessing
parquet directly from Scala. Has anyone done this befor
Hi All,
I have a set of time series data files in parquet format, and the data for
each day is stored following a naming convention, but I will not know how many
files there are for one day.
20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
.
201501010a.parq
.
N
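(A minimal sketch of how I would pick up a whole day without knowing the file
count -- assuming an existing sqlContext and that Hadoop-style glob patterns
are honored when the paths are listed; the bucket and prefix are placeholders:)

import org.apache.spark.sql.DataFrame;

// One day's worth of files, however many suffixes (a, b, c, ...) exist for
// that day. The glob is resolved at listing time, so the exact count does
// not need to be known up front.
DataFrame day = sqlContext.parquetFile("s3n://my-bucket/data/20150102*.parq");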
, 2015 6:00 PM
To: Shuai Zheng
Cc: Shao, Saisai; user@spark.apache.org
Subject: Re: Union and reduceByKey will trigger shuffle even same partition?
I think you're getting tripped up lazy evaluation and the way stage boundaries
work (admittedly its pretty confusing in this case).
It is
small
enough to put in memory), how can I access another RDD's local partition in
the mapPartitions method? Is there any way to do this in Spark?
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and r
: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger shuffle even same partition?
If you call reduceByKey(), internally Spark will introduce a shuffle
operation, no matter whether the data is already partitioned locally. Spark
Hi All,
I am running a simple PageRank program, but it is slow. And I dug out that part
of the reason is that a shuffle happens when I call a union action, even though
both RDDs share the same partitioner:
Below is my test code in spark shell:
import org.apache.spark.HashPartitioner
sc.getConf.set("
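(For reference, the layout I am testing -- a minimal Java sketch assuming both
inputs are JavaPairRDDs pre-partitioned with the same HashPartitioner; rawA and
rawB are placeholder names:)

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

HashPartitioner part = new HashPartitioner(8);

// Both sides are partitioned with the same partitioner up front (assumed to be
// JavaPairRDD<String, Double>, standing in for the rank/contribution RDDs).
JavaPairRDD<String, Double> a = rawA.partitionBy(part).cache();
JavaPairRDD<String, Double> b = rawB.partitionBy(part).cache();

// Passing the same partitioner to reduceByKey keeps the aggregation local;
// whether union itself preserves the partitioner is exactly what I am unsure about.
JavaPairRDD<String, Double> merged = a.union(b).reduceByKey(part, (x, y) -> x + y);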
on it? How does Spark control/maintain/detect the liveness of the client Spark
context?
Do I need to set up something special?
Regards,
Shuai
From: Eugen Cepoi [mailto:cepoi.eu...@gmail.com]
Sent: Thursday, February 05, 2015 5:39 PM
To: Shuai Zheng
Cc: Corey Nolet; Charles Feduke; user
method.
Can anyone confirm this? This is not something I can easily test with code.
Thanks!
Regards,
Shuai
From: Corey Nolet [mailto:cjno...@gmail.com]
Sent: Thursday, February 05, 2015 11:55 AM
To: Charles Feduke
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: How to design a lon
Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, February 05, 2015 10:53 AM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Use Spark as multi-threading library and deprecate web UI
Do you mean disable the web UI? spark.ui.enabled=false
Sure, it's useful with m
Hi All,
It might sound weird, but I think Spark is perfect to be used as a
multi-threading library in some cases. Local mode will naturally spin up
multiple threads when required. It is also more restrictive, with less chance
of potential bugs in the code (because it is more data-oriented,
Hi All,
I want to develop a server-side application:
User submits a request -> the server runs a Spark application and returns
(this might take a few seconds).
So I want the server to keep a long-lived context; I don't know whether this
is reasonable or not.
Basically I try to have a g
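(The rough shape I have in mind is a singleton holding one context for the
lifetime of the server -- a minimal sketch, all names hypothetical:)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// One JavaSparkContext per JVM, created lazily and reused by every request handler.
public final class SparkContextHolder {
    private static JavaSparkContext ctx;

    public static synchronized JavaSparkContext get() {
        if (ctx == null) {
            SparkConf conf = new SparkConf().setAppName("LongLivedServer");
            ctx = new JavaSparkContext(conf);
        }
        return ctx;
    }

    public static synchronized void shutdown() {
        if (ctx != null) {
            ctx.stop();
            ctx = null;
        }
    }
}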
Hi Tobias,
Can you share more information about how you do that? I also have a similar
question about this.
Thanks a lot,
Regards,
Shuai
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, November 26, 2014 12:25 AM
To: Sandy Ryza
Cc: Yanbo Liang; user
Subject:
> In any normal case, you have 1 executor per machine per application.
> There are cases where you would make more than 1, but these are
> unusual.
>
> On Thu, Jan 15, 2015 at 8:16 PM, Shuai Zheng
> wrote:
> > Hi All,
> >
> >
> >
> > I try to clarify some
utors
first, then run the real behavior - because the executor will live for the
whole lifecycle of the application? Although this may not have any real value
in practice :)
But I still need help with my first question.
Thanks a lot.
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail
Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, YARN 2.4. Spark is set up
by the standard script:
s3://support.elasticmapreduce/spark/install-spark
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Thursday, January 15, 2015 3:52 PM
To: user@spark.apache.org
Subject: Executor
Hi All,
I am testing Spark on an EMR cluster. The env is a one-node r3.8xlarge cluster,
with 32 vCores and 244G memory.
But the command line I use to start up spark-shell doesn't work. For
example:
~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-execut
Hi All,
I am trying to clarify some executor behavior in Spark. Because I am from a
Hadoop background, I try to compare it to the mapper (or reducer) in Hadoop.
1, Each node can have multiple executors, each running in its own process?
This is the same as a mapper process.
2, I thought the sp
Thanks a lot!
I just realized that Spark is not really an in-memory version of MapReduce :)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it
Hi All,
I am trying with a small data set. It is only 200M, and all I am doing is a
distinct count on it.
But there is a lot of spilling happening in the log (attached at the end of
the email).
Basically I use 10G memory and run on a one-node EMR cluster with an r3.8xlarge
instance t
Is it possible that too many connections are open to read from S3 from one
node? I have had this issue before because I opened a few hundred files on S3
to read from one node. It just blocked itself without error until a timeout
later.
On Monday, December 22, 2014, durga wrote:
> Hi All,
>
> I am facing a stra
PM, Shuai Zheng wrote:
> Hi All,
>
> When I try to load a folder into the RDDs, any way for me to find the
> input file name of particular partitions? So I can track partitions from
> which file.
>
> In the hadoop, I can find this information through the code:
>
> FileS
Hi All,
When I try to load a folder into RDDs, is there any way for me to find the
input file name of a particular partition, so I can track which file each
partition came from?
In the hadoop, I can find this information through the code:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String strFil
Agree. I did something similar last week. The only issue is creating a subclass
of Configuration to implement the Serializable interface.
Demi's solution is a bit overkill for this simple requirement.
On Tuesday, December 16, 2014, Gerard Maas wrote:
> Hi Demi,
>
> Thanks for sharing.
>
> What we usual
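(A minimal sketch of what such a subclass can look like -- it relies on
Configuration implementing Writable, so the Hadoop config can be re-serialized
by hand inside Spark closures; the class name is arbitrary:)

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.apache.hadoop.conf.Configuration;

// Hadoop's Configuration is Writable but not Serializable; this wrapper
// re-serializes it through the Writable methods.
public class SerializableConfiguration extends Configuration implements Serializable {

    public SerializableConfiguration() {
        super();
    }

    public SerializableConfiguration(Configuration conf) {
        super(conf);
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        write(out);          // Writable.write(DataOutput)
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        readFields(in);      // Writable.readFields(DataInput)
    }
}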
Hi,
I am running code which takes a network file location (not HDFS) as input.
But sc.textFile("networklocation\\README.md") can't recognize the network
location starting with "" as a valid location, because it only accepts HDFS-
and local-style file name formats?
Does anyone have an idea how I can use
Hi All,
I notice that if we create a Spark context in the driver, we need to call the
stop method to clean it up.
SparkConf sparkConf = new
SparkConf().setAppName("FinancialEngineExecutor");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
.
String inputP
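(The pattern I am asking about, concretely -- a minimal sketch with the
processing body omitted:)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("FinancialEngineExecutor");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
try {
    // ... load input, run the job ... (omitted)
} finally {
    // Always release the context (UI port, executors, temp dirs) on exit.
    ctx.stop();
}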
]
Sent: Wednesday, December 17, 2014 11:04 AM
To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS
Why is it not a good option to create an RDD per 200Mb file and then apply
the pre-calculations before merging them? I
Nice, that is the answer I want.
Thanks!
From: Sun, Rui [mailto:rui@intel.com]
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS
Hi, Shuai,
How did you turn off the file split in
Hi All,
My application loads 1000 files, each from 200M to a few GB, and combines them
with other data to do calculations.
Some pre-calculation must be done at the individual file level; after that,
the results need to be combined for further calculation.
In Hadoop, this is simple because I can turn off
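(On the Spark side, the closest thing I can see to "turning off FileSplit" is
an input format that refuses to split -- a minimal sketch, assuming the new
Hadoop API, text input, and an existing JavaSparkContext ctx; the class name
and path are placeholders:)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

// An input format that never splits a file, so each file becomes one partition.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// Usage in the driver: one partition per input file.
JavaPairRDD<LongWritable, Text> perFile = ctx.newAPIHadoopFile(
        "s3n://my-bucket/input/*",            // placeholder path
        WholeFileTextInputFormat.class,
        LongWritable.class,
        Text.class,
        ctx.hadoopConfiguration());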
Thanks!
Shuai
-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Wednesday, November 05, 2014 6:27 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Any "Replicated" RDD in Spark?
If you start with an RDD, you do have to collect to the driver
!
-Original Message-
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, November 05, 2014 3:32 PM
To: 'Matei Zaharia'
Cc: 'user@spark.apache.org'
Subject: RE: Any "Replicated" RDD in Spark?
Nice.
Then I have another question, if I have a file (o
here (in theory, either way works, but in the real world, which one is
better?).
Regards,
Shuai
-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Monday, November 03, 2014 4:15 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Any "Replicated" RDD
Hi,
I am planning to run Spark on EMR, and my application might take a lot of
memory. On EMR, I know there is a hard limit of 16G physical memory on an
individual mapper/reducer (otherwise I will get an exception; this is
confirmed by the AWS EMR team, at least that is the spec at this moment).
A
Hi All,
I have spent the last two years on Hadoop but am new to Spark.
I am planning to move one of my existing systems to Spark to get some
enhanced features.
My question is:
If I try to do a map-side join (something similar to the "Replicated" keyword
in Pig), how can I do it? Is there any way to declare a R
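(The kind of thing I mean, as a minimal Java sketch of a map-side join via a
broadcast variable -- the small side is a plain HashMap, and ctx, bigRdd and
all keys/values are made-up placeholders:)

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

// Ship the small side to every executor once, like Pig's replicated join.
Map<String, String> small = new HashMap<>();
small.put("k1", "v1");
Broadcast<Map<String, String>> smallBc = ctx.broadcast(small);

// bigRdd is a hypothetical JavaPairRDD<String, String> holding the large side.
// Each record is joined locally against the broadcast map, with no shuffle.
JavaPairRDD<String, Tuple2<String, String>> joined = bigRdd.mapToPair(kv ->
        new Tuple2<String, Tuple2<String, String>>(
                kv._1(),
                new Tuple2<String, String>(kv._2(), smallBc.value().get(kv._1()))));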