Is the pandas version in doc of using pyarrow in spark wrong

2021-08-09 Thread Jeff Zhang
/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions -- Best Regards Jeff Zhang

Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-16 Thread Jeff Zhang
to get created, often timing out. > > Any ideas on how to resolve this ? > Any other alternatives to databricks notebook ? > > -- Best Regards Jeff Zhang

[ANNOUNCE] Apache Zeppelin 0.10.0 is released, Spark on Zeppelin Improved

2021-08-26 Thread Jeff Zhang
/interpreter/spark.html Download it here: https://zeppelin.apache.org/download.html -- Best Regards Jeff Zhang Twitter: zjffdu

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
1268 return args_command, temp_args > > ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py > in (.0) >1264 > 1265 args_command = "".join( > -> 1266 [get_command_part(arg, self.pool) for arg in new_args]) >1267 >1268 return args_command, temp_args > > ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py > in get_command_part(parameter, python_proxy_pool) > 296 command_part += ";" + interface > 297 else: > --> 298 command_part = REFERENCE_TYPE + parameter._get_object_id() > 299 > 300 command_part += "\n" > > > > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
You can first try it via Docker: http://zeppelin.apache.org/download.html#using-the-official-docker-image Jeff Zhang wrote on Mon, Sep 27, 2021 at 6:49 AM: > Hi kumar, > > You can try Zeppelin, which supports UDF sharing across languages > > http://zeppelin.apache.org/ > > > > >

Re: Choice of IDE for Spark

2021-09-30 Thread Jeff Zhang
roperty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards Jeff Zhang

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread Jeff Zhang
This will run two different stages can this be done in one stage ? >> >> val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession. >> *magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE) >> >> >> >> >> -- >> Deepak >> >> -- Best Regards Jeff Zhang

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
tch at once. > > On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang wrote: > >> As far as I know, spark don't support multiple outputs >> >> On Wed, Jun 3, 2015 at 2:15 PM, ayan guha wrote: >> >>> Why do you need to do that if filter and content of the resulting r

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Jeff Zhang
Best, > Patcharee > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Why does driver transfer application jar to executors?

2015-06-17 Thread Jeff Zhang
re already serialized with TaskDescription. > > > Regards. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: The auxService:spark_shuffle does not exist

2015-07-07 Thread Jeff Zhang
mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Console log file of CoarseGrainedExecutorBackend

2015-07-16 Thread Jeff Zhang
By default it is in ${SPARK_HOME}/work/${APP_ID}/${EXECUTOR_ID} On Thu, Jul 16, 2015 at 3:43 PM, Tao Lu wrote: > Hi, Guys, > > Where can I find the console log file of CoarseGrainedExecutorBackend > process? > > Thanks! > > Tao > > -- Best Regards Jeff Zhang

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
ed: > container_1456905762620_0002_01_02 on host: bold-x.rice.edu. Exit status: > 1. Diagnostics: Exception from container-launch. > > > Is there anybody know what is the problem here? > Best, > Xiaoye > -- Best Regards Jeff Zhang

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
+ temp). > > I am not sure if this is a known issue, or we should file a JIRA for it. > We originally came across this bug in the SciSpark project. > > Best, > > Rahul P > -- Best Regards Jeff Zhang

Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) > > > I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark > property maximizeResourceAllocation is set to true (executor.memory = 48G > according to spark ui environment). We're also using kryo serialization and > Yarn is the resource manager. > > Any ideas as what might be going wrong and how to debug this? > > Thanks, > Arash > > -- Best Regards Jeff Zhang

Re: Setting PYSPARK_PYTHON in spark-env.sh vs from driver program

2016-03-07 Thread Jeff Zhang
as > I can't find any mention of the environment of the driver program > overriding the environment in the workers, also that environment variable > was previously completely unset in the driver program anyway. > > Is there an explanation for this to help me understand how to do things > properly? We run Spark 1.6.0 on Ubuntu 14.04. > > Thanks > > Kostas > -- Best Regards Jeff Zhang

Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
want in Spark, which is to > > have some control over which saves go into which jobs, and then execute > the > > jobs directly. I can envision a new version of the various save functions > > which take an extra job argument, or something, or some way to defer and > > unblock job creation in the spark context. > > > > Ideas? > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1") >>> sqlContext.dropTempTable("table1") On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks > > Andy > -- Best Regards Jeff Zhang

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
wrote: > Thanks Jeff > > I was looking for something like ‘unregister’ > > > In SQL you use drop to delete a table. I always though register was a > strange function name. > > Register **-1 = unregister > createTable **-1 == dropTable > > Andy > > From: Jeff Zhang

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
ty? Can we restrict certain > users to access certain dataframes and not the others? > > -- > Best Regards, > Ayan Guha > -- Best Regards Jeff Zhang

Re: Spark UI Completed Jobs

2016-03-15 Thread Jeff Zhang
19841/19788 > *(41405 skipped)* > Thanks, > Prabhu Joseph > -- Best Regards Jeff Zhang

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
r as a better version > of hive one (with Spark as execution engine instead of MR/Tez) OR should I > see it as a JDBC server? > > On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang wrote: > >> spark thrift server is very similar with hive thrift server. You can use >> hive jdbc dri

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the Hive thrift server. I believe Kerberos is supported. On Wed, Mar 16, 2016 at 10:48 AM, ayan guha wrote: > so, how about implementing security? Any pointer will be helpful > > On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang wrote: > >> The spark thrift serve

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang
> --- > TSMC PROPERTY > This email communication (and any attachments) is proprietary information > for the sole use of its > intended recipient. Any unauthorized review, use or distribution by anyone > other than the intended > recipient is strictly prohibited. If you are not the intended recipient, > please notify the sender by > replying to this email, and then delete this email and any copies of it > immediately. Thank you. > > --- > > > -- Best Regards Jeff Zhang

Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
; python2.7 couldn't found. But i m using vertual env python 2.7 > {{{ > [ram@test-work workspace]$ python > Python 2.7.8 (default, Mar 15 2016, 04:37:00) > [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> > }}} > > Can anyone help me with this? > Thanks > -- Best Regards Jeff Zhang

Re: DataFrame vs RDD

2016-03-22 Thread Jeff Zhang
-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
> > Thanks in advance for your help! > > -- > Thamme > -- Best Regards Jeff Zhang

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I found the root cause: you can use Text.toString() to solve this issue. Because the Text object is reused, the last record is displayed multiple times. On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang wrote: > Looks like a spark bug. I can reproduce it for sequence file, but it works > for tex
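A minimal Scala sketch of the workaround described above (the input path is an assumption): Hadoop's record reader reuses a single mutable Text instance for every record, so copying keys and values to immutable Strings before caching keeps each cached element from pointing at the last record read.

    import org.apache.hadoop.io.Text
    import org.apache.spark.{SparkConf, SparkContext}

    object SequenceFileCopy {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("seqfile-copy"))
        // The raw (Text, Text) pairs all share the reader's mutable Text objects.
        val raw = sc.sequenceFile("hdfs:///tmp/input", classOf[Text], classOf[Text])
        // Copy to Strings before caching so every cached record keeps its own value.
        val safe = raw.map { case (k, v) => (k.toString, v.toString) }.cache()
        println(safe.count())
        sc.stop()
      }
    }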

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang wrote: > I think I got the root cause, you can use Text.toString() to solve this > issue. Because the Text is shared so the last record display multiple > times. > > On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang wrote: > >> L

Re: run spark job

2016-03-29 Thread Jeff Zhang
scr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-29 Thread Jeff Zhang
legatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > > > -- Best Regards Jeff Zhang

Re: sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Jeff Zhang
he local Spark driver for some SQL code and was wondering if the > local cache load could be the culprit. > > Appreciate any thoughts. BTW, we're running Spark 1.6.0 on this particular > cluster. > > Regards, > > Soam > -- Best Regards Jeff Zhang

Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
op.proxyuser.spark.groups > * > > > > hadoop.proxyuser.spark.hosts > * > > > ... > > > hadoop.security.auth_to_local > > RULE:[1:$1@$0](spark-pantagr...@contactlab.lan)s/.*/spark/ > DEFAULT > > > > > "spark" is present as local user in all servers. > > > What does is missing here ? > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang
rk User List mailing list archive at Nabble.com. > > > >----- > >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional > >commands, e-mail: user-h...@spark.apache.org > > > >Information transmitted by this e-mail is proprietary to Mphasis, its > >associated companies and/ or its customers and is intended > >for use only by the individual or entity to which it is addressed, and may > >contain information that is privileged, confidential or > >exempt from disclosure under applicable law. If you are not the intended > >recipient or it appears that this mail has been forwarded > >to you without proper authority, you are notified that any use or > >dissemination of this information in any manner is strictly > >prohibited. In such cases, please notify us immediately at > >mailmas...@mphasis.com and delete this mail from your records. > > > > > > > > > > > > -- Best Regards Jeff Zhang

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
d, Apr 20, 2016 at 3:55 PM, 李明伟 wrote: > Hi Jeff > > The total size of my data is less than 10M. I already set the driver > memory to 4GB. > > > > > > > > On 2016-04-20 13:42:25, "Jeff Zhang" wrote: > > Seems it is OOM in driver side when fetching

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
;>> > Regards, >>>>> > Raghava. >>>>> > >>>>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar >>>>> wrote: >>>>> > >>>>> >> If the data file is same then it should have similar distribution of >>>>> >> keys. >>>>> >> Few queries- >>>>> >> >>>>> >> 1. Did you compare the number of partitions in both the cases? >>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala >>>>> >> Program being submitted? >>>>> >> >>>>> >> Also, can you please share the details of Spark Context, >>>>> Environment and >>>>> >> Executors when you run via Scala program? >>>>> >> >>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < >>>>> >> m.vijayaragh...@gmail.com> wrote: >>>>> >> >>>>> >>> Hello All, >>>>> >>> >>>>> >>> We are using HashPartitioner in the following way on a 3 node >>>>> cluster (1 >>>>> >>> master and 2 worker nodes). >>>>> >>> >>>>> >>> val u = >>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, >>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => >>>>> (y.toInt, >>>>> >>> x.toInt) } }).partitionBy(new >>>>> HashPartitioner(8)).setName("u").persist() >>>>> >>> >>>>> >>> u.count() >>>>> >>> >>>>> >>> If we run this from the spark shell, the data (52 MB) is split >>>>> across >>>>> >>> the >>>>> >>> two worker nodes. But if we put this in a scala program and run >>>>> it, then >>>>> >>> all the data goes to only one node. We have run it multiple times, >>>>> but >>>>> >>> this >>>>> >>> behavior does not change. This seems strange. >>>>> >>> >>>>> >>> Is there some problem with the way we use HashPartitioner? >>>>> >>> >>>>> >>> Thanks in advance. >>>>> >>> >>>>> >>> Regards, >>>>> >>> Raghava. >>>>> >>> >>>>> >> >>>>> >> >>>>> > >>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Raghava >>>>> > http://raghavam.github.io >>>>> > >>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Mike >>>>> >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Raghava >>>> http://raghavam.github.io >>>> >>> >> >> >> -- >> Regards, >> Raghava >> http://raghavam.github.io >> > -- Best Regards Jeff Zhang

Re: Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-27 Thread Jeff Zhang
w.index.stride"="1" ) > """ > HiveContext.sql(sqltext) > // > sqltext = """ > INSERT INTO TABLE test.dummy2 > SELECT > * > FROM tmp > """ > HiveContext.sql(sqltext) > > In Spark 1.6.1, it is throwing error as below > > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find > registered driver with class oracle.jdbc.OracleDriver > > Is this a new bug introduced in Spark 1.6.1? > > > Thanks > -- Best Regards Jeff Zhang

Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang
xt: > http://apache-spark-user-list.1001560.n3.nabble.com/pandas-dataframe-broadcasted-giving-errors-in-datanode-function-called-kernel-tp26953.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
kError: An error occurred while trying to connect > to the Java server > > Even though I start pyspark with these options: > ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g > --packages com.databricks:spark-csv_2.11:1.4.0 > --spark.deploy.recoveryMode=FILESYSTEM > > and this in my /conf/spark-env.sh file: > - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM > -Dspark.deploy.recoveryDirectory=/user/recovery" > > How can I get HA to work in Spark? > > thanks, > imran > > -- Best Regards Jeff Zhang

Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Jeff Zhang
plication main class fails with class not found exception. > Is there any workaround? > -- Best Regards Jeff Zhang

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
y commercial proposition or >> anything like that. As I seem to get involved with members troubleshooting >> issues and threads on this topic, I thought it is worthwhile writing a note >> about it to summarise the findings for the benefit of the community. >> >> >> Regards. >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> > > -- Best Regards Jeff Zhang

Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-18 Thread Jeff Zhang
I want to pass a custom Hadoop conf to the Spark Thrift Server so that both the driver and the executor side can see it. I also want this custom conf to be visible only to the jobs of the user who set it. Is this possible with the Spark Thrift Server today? Thanks -- Best Regards Jeff Zhang
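For context, a hedged sketch of one mechanism commonly used for the first half of this question: SparkConf entries prefixed with spark.hadoop. are copied into the Hadoop Configuration seen by the driver and executors. It does not address per-user isolation inside a shared Thrift Server, and the property name below is purely illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Set when the application (e.g. the Thrift Server) is launched; any
    // "spark.hadoop.*" entry is copied into sc.hadoopConfiguration.
    val conf = new SparkConf()
      .setAppName("custom-hadoop-conf")
      .set("spark.hadoop.my.custom.key", "my-value") // illustrative key, not a real Hadoop property
    val sc = new SparkContext(conf)
    println(sc.hadoopConfiguration.get("my.custom.key")) // "my-value"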

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
2-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a > INFO apache.spark.util.ShutdownHookManager - Deleting directory > /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061 > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > -- Best Regards Jeff Zhang

Bug of PolynomialExpansion ?

2016-05-29 Thread Jeff Zhang
2*x1,x2*x2, x2*x3, x3*x1, x3*x2,x3*x3) (3,[0,2],[1.0,1.0]) --> (9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])| -- Best Regards Jeff Zhang
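A minimal Scala sketch that reproduces the expansion quoted in this report, assuming a Spark 1.6-era spark-shell with sqlContext in scope:

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.mllib.linalg.Vectors

    // x1 = 1.0, x2 = 0.0, x3 = 1.0 as a sparse vector of size 3.
    val df = sqlContext.createDataFrame(
      Seq(Tuple1(Vectors.sparse(3, Array(0, 2), Array(1.0, 1.0))))
    ).toDF("features")

    val expander = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("expanded")
      .setDegree(2)

    expander.transform(df).select("features", "expanded").show(false)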

Re: java.io.FileNotFoundException

2016-06-03 Thread Jeff Zhang
c79c725/28/shuffle_8553_38_0.index >> (No such file or directory) >> >> any idea about this error ? >> -- >> Thanks, >> Kishore. >> > > > > -- > Thanks, > Kishore. > -- Best Regards Jeff Zhang

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang
spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: sqlcontext - not able to connect to database

2016-06-14 Thread Jeff Zhang
gt; at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > > > -- Best Regards Jeff Zhang

Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
;>>>> >>>>>> >>>>>> spark.python.worker.memory >>>>>> 512m >>>>>> >>>>>> Amount of memory to use per python worker process during >>>>>> aggregation, in the same >>>>>> format as JVM memory strings (e.g. 512m, >>>>>> 2g). If the memory >>>>>> used during aggregation goes above this amount, it will spill the >>>>>> data into disks. >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken < >>>>>> carli...@janelia.hhmi.org> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> We have an HPC cluster that we run Spark jobs on using standalone >>>>>>> mode and a number of scripts I’ve built up to dynamically schedule and >>>>>>> start spark clusters within the Grid Engine framework. Nodes in the >>>>>>> cluster >>>>>>> have 16 cores and 128GB of RAM. >>>>>>> >>>>>>> My users use pyspark heavily. We’ve been having a number of problems >>>>>>> with nodes going offline with extraordinarily high load. I was able to >>>>>>> look >>>>>>> at one of those nodes today before it went truly sideways, and I >>>>>>> discovered >>>>>>> that the user was running 50 pyspark.daemon threads (remember, this is >>>>>>> a 16 >>>>>>> core box), and the load was somewhere around 25 or so, with all CPUs >>>>>>> maxed >>>>>>> out at 100%. >>>>>>> >>>>>>> So while the spark worker is aware it’s only got 16 cores and >>>>>>> behaves accordingly, pyspark seems to be happy to overrun everything >>>>>>> like >>>>>>> crazy. Is there a global parameter I can use to limit pyspark threads >>>>>>> to a >>>>>>> sane number, say 15 or 16? It would also be interesting to set a memory >>>>>>> limit, which leads to another question. >>>>>>> >>>>>>> How is memory managed when pyspark is used? I have the spark worker >>>>>>> memory set to 90GB, and there is 8GB of system overhead (GPFS caching), >>>>>>> so >>>>>>> if pyspark operates outside of the JVM memory pool, that leaves it at >>>>>>> most >>>>>>> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB >>>>>>> heap (ha ha.) >>>>>>> >>>>>>> Thanks, >>>>>>> Ken Carlile >>>>>>> Sr. Unix Engineer >>>>>>> HHMI/Janelia Research Campus >>>>>>> 571-209-4363 >>>>>>> >>>>>>> >>>>>> >>>>>> Т�ХF� >>>>>> V�7V'67&�&R� R�� �â W6W"�V�7V'67&�&T 7 &�� 6�R��&pФf�" FF�F��� � 6��� >>>>>> �G2� >>>>>> R�� �â W6W"ֆV� 7 &�� 6�R��&pР >>>>>> >>>>>> >>>>>> - >>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For >>>>>> additional commands, e-mail: user-h...@spark.apache.org >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>>> >>>> >>>> >>> >>> >>> -- >>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>> >> >> > -- Best Regards Jeff Zhang

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
of ram > > Then my notebook dies and I get below error > > Py4JNetworkError: An error occurred while trying to connect to the Java server > > > Thank You > -- Best Regards Jeff Zhang

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
nsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command. On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: > Thank you. Can you pls tell How to increase the executor memory? > > > > On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote: > >> >>> Caused by: java.lang.Ou

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode; --executor-memory won't take effect in local mode. Please use another cluster mode. On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang wrote: > Specify --executor-memory in your spark-submit command. > > > > On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: >
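A hedged sketch of the point above: under --master local[*] the "executor" lives inside the driver JVM, so the driver heap is the limit that matters, and it has to be set before that JVM starts (spark-defaults.conf or the --driver-memory flag), not programmatically afterwards.

    import org.apache.spark.{SparkConf, SparkContext}

    // In spark-defaults.conf (or pass --driver-memory 8g to spark-submit / pyspark):
    //   spark.driver.memory   8g
    // spark.executor.memory only affects executors launched by a cluster manager,
    // which do not exist in local mode.
    val conf = new SparkConf().setMaster("local[4]").setAppName("local-mode-memory")
    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.driver.memory", "not set")) // what the driver JVM was launched with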

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
e to do this processing in local mode then? > > Regards, > Tejaswini > > On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang wrote: > >> You are using local mode, --executor-memory won't take effect for local >> mode, please use other cluster mode. >> >> On Th

Re: How to deal with tasks running too long?

2016-06-16 Thread Jeff Zhang
asks in my example can be killed (timed out) and the stage completes > successfully. > > -- > Thanks, > -Utkarsh > -- Best Regards Jeff Zhang

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Jeff Zhang
List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Does saveAsHadoopFile depend on master?

2016-06-21 Thread Jeff Zhang
gt; what is happening here? > > Thanks! > > Pierre. > -- Best Regards Jeff Zhang

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-21 Thread Jeff Zhang
ter yarn-client --driver-memory 512m --num-executors 2 >> --executor-memory 512m --executor-cores 210: >> >> >> >>- Error: Could not find or load main class >>org.apache.spark.deploy.yarn.ExecutorLauncher >> >> but i don't config that para ,there no error why???that para is only >> avoid Uploading resource file(jar package)?? >> > > -- Best Regards Jeff Zhang

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
that - although included in the > pom.xml build - are for some reason not found when processed within > Intellij. > -- Best Regards Jeff Zhang

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
sErrorBatch(batch: EventBatch): Boolean = { > ^ > > /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala > Error:(86, 51) not found: type SparkFlumeProtocol > val responder = new SpecificResponder(classOf[Spar

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang
:137) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at scala.Option.foreach(Option.scala:236) >> at org.apache.spark.SparkContext.(SparkContext.scala:481) >> at >> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) >> >> >> > -- Best Regards Jeff Zhang

Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
Dataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile', > 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair', > 'writeIteratorToStream', 'writeUTF'] > > The next thing I would run into is converting the JVM RDD[String] back to > a Python RDD, what is the easiest way to do this? > > Overall, is this a good approach to calling the same API in Scala and > Python? > > -- > Pedro Rodriguez > PhD Student in Distributed Machine Learning | CU Boulder > UC Berkeley AMPLab Alumni > > ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 > Github: github.com/EntilZha | LinkedIn: > https://www.linkedin.com/in/pedrorodriguezscience > > -- Best Regards Jeff Zhang

Re: Remote RPC client disassociated

2016-06-30 Thread Jeff Zhang
ction.Iterator$$anon$11.next(Iterator.scala:328) > > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914) > > at > scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929) > > at > scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:968) > > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972) > > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > at > scala.collection.Iterator$class.foreach(Iterator.scala:727) > > at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452) > > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280) > > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) > >at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239) > BR > > > > Joaquin > This email is confidential and may be subject to privilege. If you are not > the intended recipient, please do not copy or disclose its content but > contact the sender immediately upon receipt. > -- Best Regards Jeff Zhang

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
; server, so it makes some sense, but it's really inconvenient - I need a lot > of memory on my driver machine. Reasons for one instance per machine I do > not understand. > > -- > > > *Sincerely yoursEgor Pakhomov* > -- Best Regards Jeff Zhang

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
wrote: > I get > > "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as > process 28989. Stop it first." > > Is it a bug? > > 2016-07-01 10:10 GMT-07:00 Jeff Zhang : > >> I don't think the one instance per machine is true. As long as yo

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
'm trying some very rare case? > > 2016-07-01 10:54 GMT-07:00 Jeff Zhang : > >> This is not a bug, because these 2 processes use the same SPARK_PID_DIR >> which is /tmp by default. Although you can resolve this by using >> different SPARK_PID_DIR, I suspect you would

Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang
ocal-dir-to-HDFS-tp27291.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: spark run shell On yarn

2016-07-28 Thread Jeff Zhang
mit > export YARN_CONF_DIR=/etc/hadoop/conf > export HADOOP_CONF_DIR=/etc/hadoop/conf > export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6 > > > how I to update? > > > > > > === > Name: cen sujun > Mobile: 13067874572 > Mail: ce...@lotuseed.com > > -- Best Regards Jeff Zhang

Re: Spark assembly in Maven repo?

2015-12-10 Thread Jeff Zhang
& upload it to Maven central? > > > > Thanks! > > > > Xiaoyong > > > -- Best Regards Jeff Zhang

[SparkR] Any reason why saveDF's mode is append by default ?

2015-12-13 Thread Jeff Zhang
It is inconsistent with the Scala API, where the save mode is "error" by default. Any reason for that? Thanks -- Best Regards Jeff Zhang

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Jeff Zhang
[1]) but the Python API seems to have been > changed to match Scala / Java in > https://issues.apache.org/jira/browse/SPARK-6366 > > Feel free to open a JIRA / PR for this. > > Thanks > Shivaram > > [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files > >

[SparkR] Is rdd in SparkR deprecated ?

2015-12-14 Thread Jeff Zhang
From the source code of SparkR, it seems SparkR supports the RDD API, but there is no documentation on it ( http://spark.apache.org/docs/latest/sparkr.html ). So I guess it is deprecated; is that right? -- Best Regards Jeff Zhang

Re: [SparkR] Is rdd in SparkR deprecated ?

2015-12-14 Thread Jeff Zhang
I sufficient? > > > > > > On Mon, Dec 14, 2015 at 4:26 AM -0800, "Jeff Zhang" > wrote: > > From the source code of SparkR, seems SparkR support rdd api. But there's > no documentation on that. ( > http://spark.apache.org/docs/latest/sparkr.html )

Re: Database does not exist: (Spark-SQL ===> Hive)

2015-12-14 Thread Jeff Zhang
18:49:57 ERROR HiveContext: > == > HIVE FAILURE OUTPUT > == > > > > >OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Database does not exist: test_db > > == > END HIVE FAILURE OUTPUT > == > > > Process finished with exit code 0 > > Thanks & Regards, > Gokula Krishnan* (Gokul)* > -- Best Regards Jeff Zhang

Re: how to make a dataframe of Array[Doubles] ?

2015-12-14 Thread Jeff Zhang
- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Can't create UDF through thriftserver, no error reported

2015-12-15 Thread Jeff Zhang
roblem, > but all these feature I believe are present in Hive 0.11 and should have > made it into Spark. At the very least, I would like to see some message in > the logs and console so that I can find the error of my ways, repent and > fix my code. Any suggestions? Anything I should post to support > troubleshooting? Is this JIRA-worthy? Thanks > > Antonio > > > > -- Best Regards Jeff Zhang

Re: YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2015-12-15 Thread Jeff Zhang
ckContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > > at scala.concurrent.Await$.result(package.scala:107) > > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) > > > -- Best Regards Jeff Zhang

Re: hiveContext: storing lookup of partitions

2015-12-15 Thread Jeff Zhang
t; Currently it takes around 1.5 hours for me just to cache in the partition > information and after that I can see that the job gets queued in the SPARK > UI. > > Regards, > Gourav > -- Best Regards Jeff Zhang

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Jeff Zhang
on it (the one with > .count() or .show()) then it takes around 2 hours before the job starts in > SPARK. > > On the pyspark screen I can see that it is parsing the S3 locations for > these 2 hours. > > Regards, > Gourav > > On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang

Re: Access row column by field name

2015-12-16 Thread Jeff Zhang
cidents/unstructured/inc-0-500.txt") > val df = sqlContext.jsonRDD(rawIncRdd) > df.foreach(line => println(line.getString(*"field_name"*))) > > thanks for the advice > -- Best Regards Jeff Zhang

Re: Base ERROR

2015-12-17 Thread Jeff Zhang
2 INFO [Thread-6] regionserver.ShutdownHook: > Starting fs shutdown hook thread. > 2015-12-17 21:24:29,953 INFO [Thread-6] regionserver.ShutdownHook: > Shutdown hook finished. > > > -- Best Regards Jeff Zhang

Re: Dynamic jar loading

2015-12-19 Thread Jeff Zhang
- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > -- Best Regards Jeff Zhang

Re: Spark batch getting hung up

2015-12-19 Thread Jeff Zhang
e Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Spark batch getting hung up

2015-12-20 Thread Jeff Zhang
efore it gets cleared up. > > Would the driver not wait till all the stuff related to test1 is completed > before calling test2 as test2 is dependent on test1? > > val test1 =RDD1.mapPartitions.() > > val test2 = test1.mapPartititions() > > On Sat, Dec 19, 2015 at 12:24 AM,

Re: DataFrame operations

2015-12-20 Thread Jeff Zhang
ot the row data > What am I missing here? > -- Best Regards Jeff Zhang

Re: get parameters of spark-submit

2015-12-21 Thread Jeff Zhang
; To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: spark-submit for dependent jars

2015-12-21 Thread Jeff Zhang
ark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Regards, > Rajesh > -- Best Regards Jeff Zhang

Re: spark-submit for dependent jars

2015-12-21 Thread Jeff Zhang
com.cisco.ss.etl.Main.main(Main.scala) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>> at >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> at java.lang.reflect.Method.invoke(Method.java:606) >>> at >>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) >>> at >>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) >>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) >>> >>> Regards, >>> Rajesh >>> >> >> > -- Best Regards Jeff Zhang

Re: trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-21 Thread Jeff Zhang
super(f, dataType, inputTypes); > > ??? Why do I have to implement this constructor ??? > > ??? What are the arguments ??? > > } > > > > @Override > > public > > Column apply(scala.collection.Seq exprs) { > > What do you do with a scala seq? > > return ???; > > } > > } > > } > > > -- Best Regards Jeff Zhang

Re: should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-22 Thread Jeff Zhang
String call(String s) throws Exception { > > logger.info("AEDWIP s:{}", s); > > String ret = s.equalsIgnoreCase(category1) ? category1 : > category3; > > return ret; > > } > > } > > > public class Features impl

Re: Missing dependencies when submitting scala app

2015-12-22 Thread Jeff Zhang
an't find the parse method > > Any idea on how to solve this depdendency problem? > > thanks in advance > -- Best Regards Jeff Zhang

Re: Passing parameters to spark SQL

2015-12-27 Thread Jeff Zhang
pache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Can anyone explain Spark behavior for below? Kudos in Advance

2015-12-27 Thread Jeff Zhang
> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x > + y) > res143: String = 10 > > Scenario2: > val z = sc.parallelize(List("12","23","","345"),2) > z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x > + y) > res144: String = 11 > > why the result is different . I was expecting 10 for both. also for the > first Partition > -- Best Regards Jeff Zhang
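A hedged trace of Scenario 2 as quoted (Scenario 1's input list is truncated above), run in the spark-shell with its predefined sc: aggregate folds each partition with seqOp starting from the zero value "", then concatenates the per-partition results with combOp, so the answer depends on how the four strings are split across the two partitions.

    val z = sc.parallelize(List("12", "23", "", "345"), 2)
    // partition 0 holds ("12", "23"): "" -> min(0,2) = "0" -> min(1,2) = "1"
    // partition 1 holds ("", "345"):  "" -> min(0,0) = "0" -> min(1,3) = "1"
    // combOp concatenates the partition results: "" + "1" + "1" = "11"
    val res = z.aggregate("")(
      (x, y) => math.min(x.length, y.length).toString,
      (x, y) => x + y
    )
    println(res) // "11"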

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Jeff Zhang
menode on yarn ? > > > > Thanks a lot. > > > > Mars > > > -- Best Regards Jeff Zhang

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
he Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
gh YARN. Where will these properties be > logged?. I guess they wont be part of YARN logs > > 2015-12-28 13:22 GMT+01:00 Jeff Zhang : > >> set spark.logConf as true in spark-default.conf will log the property in >> driver side. But it would only log the property you set, not in
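A minimal sketch of the setting mentioned above: with spark.logConf set to true, the driver logs the explicitly set SparkConf entries at INFO level when the SparkContext starts; values that were never set are not printed.

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to putting "spark.logConf  true" in spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("log-conf-demo")
      .set("spark.logConf", "true")
    val sc = new SparkContext(conf) // the explicitly set conf entries appear in the driver log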

Re: Cannot get repartitioning to work

2016-01-01 Thread Jeff Zhang
-list.1001560.n3.nabble.com/Cannot-get-repartitioning-to-work-tp25852.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >

Re: sql:Exception in thread "main" scala.MatchError: StringType

2016-01-03 Thread Jeff Zhang
> Exception in thread "main" scala.MatchError: StringType (of class > org.apache.spark.sql.types.StringType$) > at > org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58) > at > > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139) > > ___ > why > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/sql-Exception-in-thread-main-scala-MatchError-StringType-tp25868.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
there any way to do that? Or am I missing anything here? -- Best Regards Jeff Zhang

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
heduling and don't set a pool, > the default pool will be used. > > On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote: > >> >> It seems currently spark.scheduler.pool must be set as localProperties >> (associate with thread). Any reason why spark.scheduler.pool can

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
n/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L90 > ). > > > On Tue, Jan 5, 2016 at 4:15 PM, Jeff Zhang wrote: > >> Sorry, I don't make it clearly. What I want is the default pool is fair >> scheduling. But seems if I want to use fair scheduling now, I h
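A hedged Scala sketch of the mechanism discussed in this thread (the pool name and allocation-file path are assumptions): spark.scheduler.mode and spark.scheduler.allocation.file are application-wide settings, while spark.scheduler.pool is read as a thread-local property each time a job is submitted, which is why it cannot currently be set once globally.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-pools")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // assumed path
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to "pool1"; other threads stay in the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "pool1")
    sc.parallelize(1 to 100).count()
    sc.setLocalProperty("spark.scheduler.pool", null) // back to the default pool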
