Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
ache/spark/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py -- Ruslan Dautkhanov On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin wrote: > We probably should have the alias. Is this still a problem on master > branch? > > On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov

Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Getting ValueError: Could not parse datatype: bigint Looks like pyspark.sql.types doesn't know anything about bigint. Should it be aliased to LongType in pyspark.sql.types? Thanks On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov wrote: > Hello, > > Looking at
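A minimal sketch of the aliasing this thread asks about: mapping the simple type strings that df.dtypes returns (such as "bigint") onto pyspark.sql.types class names. The alias table below is illustrative, not the actual pyspark source:

```python
# Hypothetical alias table: df.dtypes simple strings -> pyspark.sql.types class names.
# "bigint" is the string from this thread that the parser rejected with ValueError.
_DTYPE_ALIASES = {
    "bigint": "LongType",
    "int": "IntegerType",
    "integer": "IntegerType",
    "smallint": "ShortType",
    "tinyint": "ByteType",
    "string": "StringType",
    "double": "DoubleType",
    "float": "FloatType",
    "boolean": "BooleanType",
}

def type_class_name(dtype_string):
    """Return the pyspark.sql.types class name for a df.dtypes string."""
    try:
        return _DTYPE_ALIASES[dtype_string]
    except KeyError:
        raise ValueError("Could not parse datatype: %s" % dtype_string)
```

In real code you would look the returned name up in pyspark.sql.types; the table here only covers the common primitive types.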

df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
t;, IntegerType() for "integer" etc? If it doesn't exist it would be great to have such a mapping function. Thank you. ps. I have a data frame, and use its dtypes to loop through all columns to fix a few columns' data types as a workaround for SPARK-13866. -- Ruslan Dautkhanov
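The workaround described above — looping over df.dtypes and re-casting a few columns — can be sketched without a live SparkContext by generating the cast expressions; the column names and the fixes dict are made up for illustration:

```python
def build_cast_exprs(dtypes, fixes):
    """Given df.dtypes-style (name, type) pairs and a {name: new_type} dict of
    columns to fix, return SQL cast expressions suitable for a selectExpr() call."""
    exprs = []
    for name, dtype in dtypes:
        if name in fixes:
            exprs.append("CAST(%s AS %s) AS %s" % (name, fixes[name], name))
        else:
            exprs.append(name)
    return exprs

# In a real session (sketch): df.selectExpr(*build_cast_exprs(df.dtypes, {"id": "bigint"}))
```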

Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found

2016-03-11 Thread Ruslan Dautkhanov
Spark session dies out after ~40 hours when running against a secure Hadoop cluster. spark-submit has --principal and --keytab so Kerberos ticket renewal works fine according to logs. Something happens with the HDFS connection? These messages come up every second. See complete stack: http://pastebi

binary file deserialization

2016-03-09 Thread Ruslan Dautkhanov
known and well documented. -- Ruslan Dautkhanov

Re: Spark + Sentry + Kerberos don't add up?

2016-02-24 Thread Ruslan Dautkhanov
Turns to be it is a Spark issue https://issues.apache.org/jira/browse/SPARK-13478 -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as > https://i

spark.storage.memoryFraction for shuffle-only jobs

2016-02-04 Thread Ruslan Dautkhanov
For a Spark job that only does shuffling (e.g. Spark SQL with joins, group bys, analytical functions, order bys), but no explicit persistent RDDs nor dataframes (there are no .cache()es in the code), what would be the lowest recommended setting for spark.storage.memoryFraction? spark.storage.memor
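For reference, in Spark 1.x the storage and shuffle pools are carved out of executor memory by two independent fractions (defaults 0.6 and 0.2). A rough sketch of the arithmetic, ignoring the safety fractions, to show what lowering spark.storage.memoryFraction frees up for a shuffle-only job:

```python
def executor_memory_pools(executor_memory_mb,
                          storage_fraction=0.6,   # spark.storage.memoryFraction default
                          shuffle_fraction=0.2):  # spark.shuffle.memoryFraction default
    """Approximate split of executor memory in Spark 1.x (safety fractions ignored)."""
    storage = executor_memory_mb * storage_fraction
    shuffle = executor_memory_mb * shuffle_fraction
    return {"storage_mb": storage,
            "shuffle_mb": shuffle,
            "other_mb": executor_memory_mb - storage - shuffle}
```

With no .cache() calls the storage pool sits idle, which is why the question of how low storage_fraction can safely go comes up.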

Re: Hive on Spark knobs

2016-01-29 Thread Ruslan Dautkhanov
Yep, I tried that. It seems you're right. Got an error that the execution engine has to be set to mr: hive.execution.engine = mr. I did not keep the exact error message/stack. It's probably disabled explicitly. -- Ruslan Dautkhanov On Thu, Jan 28, 2016 at 7:03 AM, Todd wrote: > Did yo

Hive on Spark knobs

2016-01-27 Thread Ruslan Dautkhanov
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started There are quite a lot of knobs to tune for Hive on Spark. Above page recommends following settings: mapreduce.input.fileinputformat.split.maxsize=75000 > hive.vectorized.execution.enabled=true > hive.cbo.enable

Re: Spark + Sentry + Kerberos don't add up?

2016-01-20 Thread Ruslan Dautkhanov
I took the liberty of creating a JIRA https://github.com/cloudera/livy/issues/36 Feel free to close it if it doesn't belong to the Livy project. I really don't know if this is a Spark or a Livy/Sentry problem. Any ideas for possible workarounds? Thank you. -- Ruslan Dautkhanov On Mon, Jan 1

Re: Spark + Sentry + Kerberos don't add up?

2016-01-18 Thread Ruslan Dautkhanov
roupInformation.doAs() hence the error. So Sentry isn't compatible with Spark in kerberized clusters? Is any workaround for this problem? -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux wrote: > Livy does not support any Kerberos yet > https://issues.cloudera.org

Spark + Sentry + Kerberos don't add up?

2016-01-17 Thread Ruslan Dautkhanov
) is allowed to impersonate other users. So it is very convenient for Spark Notebooks. Any information to help solve this will be highly appreciated. -- Ruslan Dautkhanov

livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures

2016-01-14 Thread Ruslan Dautkhanov
Livy build test from master fails with the problem below. I can't track it down. YARN shows the Livy Spark yarn application as running, although an attempt to connect to the application master shows connection refused: HTTP ERROR 500 > Problem accessing /proxy/application_1448640910222_0046/. Reason: > Connection

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Spark > 1.3.1 does not provide integration with Phoenix for kerberized cluster. > > Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured > cluster or not? > > Thanks, > Akhilesh > > On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov > wrote: >

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
pired) kerberos ticket for authentication to pass. -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am running spark job on yarn in cluster mode in secured cluster. I am > trying to run Spark on Hbase using Phoen

Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
-table-in-hive/34059289#34059289 -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 11:27 AM, Test One wrote: > I'm using spark-avro with SparkSQL to process and output avro files. My > data has the following schema: > > root > |-- memberUuid: string (nullable = true) > |-

Re: question about combining small parquet files

2015-11-26 Thread Ruslan Dautkhanov
An interesting compaction approach of small files is discussed recently http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ AFAIK Spark supports views too. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi < nyig

Re: Spark REST Job server feedback?

2015-11-25 Thread Ruslan Dautkhanov
java#welcome-to-livy-the-rest-spark-server> " Although that post is from April 2015, not sure if it's still accurate. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar wrote: > Hi > > I had the same question. Anyone having used Livy and/or SparkJ

Re: Data in one partition after reduceByKey

2015-11-25 Thread Ruslan Dautkhanov
more even distribution you could use a hash function of that, not just a remainder. -- Ruslan Dautkhanov On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin wrote: > I will answer my own question, since I figured it out. Here is my answer > in case anyone else has the same issue.
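The suggestion above — hashing the key instead of taking a plain remainder — can be sketched as a custom partitioner function usable with RDD.partitionBy; md5 here is just one convenient stable hash, not something the thread prescribes:

```python
import hashlib

def hash_partition(key, num_partitions):
    """Map a key to a partition via a stable hash rather than a raw remainder,
    so sequential or skewed numeric keys spread more evenly across partitions."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# In a real job (sketch): rdd.partitionBy(16, lambda k: hash_partition(k, 16))
```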

Re: ISDATE Function

2015-11-18 Thread Ruslan Dautkhanov
You could write your own UDF isdate(). -- Ruslan Dautkhanov On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote: > Hi Ted Yu, > > Thanks for your response. Is any other way to achieve in Spark Query? > > > Regards, > Ravi > > On Tue, Nov 17, 2015 at 10:26 AM,
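A minimal sketch of the isdate() UDF suggested above, in plain Python; the date format is an assumption, and in PySpark you would register it before use:

```python
from datetime import datetime

def isdate(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the given format (format is assumed)."""
    if value is None:
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False

# In Spark 1.x (sketch): sqlContext.registerFunction("isdate", isdate)
```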

Re: kerberos question

2015-11-06 Thread Ruslan Dautkhanov
thought its primary use is for Hue and similar services, which use impersonation quite heavily in kerberized clusters. -- Ruslan Dautkhanov On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu wrote: > 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] > hdfs.KeyPr

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-8992 Should be in 1.6? -- Ruslan Dautkhanov On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss wrote: > Hi, > > I have data as follows: > > A, 2015, 4 > A, 2014, 12 > A, 2013, 1 > B, 2015, 24 > B, 2013 4 > > > I need
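Before the pivot support in SPARK-8992 landed, the reshaping in the question could be done by hand; a plain-Python sketch of the idea (group by the first column, spread years into a row):

```python
def pivot_rows(rows):
    """Pivot (key, year, value) triples into {key: {year: value}}."""
    result = {}
    for key, year, value in rows:
        result.setdefault(key, {})[year] = value
    return result

data = [("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
        ("B", 2015, 24), ("B", 2013, 4)]
```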

Re: save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
Thank you Richard and Matthew. DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned earlier, we're on CDH 5.4 / Spark 1.3. No options for this version? Best regards, Ruslan Dautkhanov On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas wrote: > Hi Ruslan, > >

save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
/apache/spark/sql/SQLContext.html and can't find anything relevant. Thanks! -- Ruslan Dautkhanov

Re: Spark Web UI + NGINX

2015-09-22 Thread Ruslan Dautkhanov
T2: max_fails=3; to "Machine B, where Spark is installed" 3) You may want to adjust "location /static/" so that it fits your Spark Web UI. 4) With a few more config lines you can set up SSL offloading too. -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 3:06 AM, Renato Perini wr

Re: Spark data type guesser UDAF

2015-09-21 Thread Ruslan Dautkhanov
). -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov wrote: > Wanted to take something like this > > https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java > and create a Hive UDAF to create an aggregate function that returns a data > typ

Spark data type guesser UDAF

2015-09-17 Thread Ruslan Dautkhanov
Wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF to create an aggregate function that returns a data type guess. Am I inventing a wheel? Does Spark have something like this already built-in? Would be very useful f
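A toy version of the data type guesser discussed here, in plain Python rather than a Hive UDAF; the three-type lattice (int < float < str) is a simplification of what a real guesser would handle:

```python
def guess_type(values):
    """Guess the narrowest type (int, float, or str) that fits every value."""
    guess = int
    for v in values:
        s = str(v)
        if guess is int:
            try:
                int(s)
                continue
            except ValueError:
                guess = float  # widen: not every value is an integer
        if guess is float:
            try:
                float(s)
                continue
            except ValueError:
                guess = str  # widen: not every value is numeric
        # once guess is str there is nothing further to check
    return guess
```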

Re: NGINX + Spark Web UI

2015-09-17 Thread Ruslan Dautkhanov
Similar setup for Hue http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ Might give you an idea. -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 wrote: > Hello! > I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. > I have 2 machines:

Re: Spark ANN

2015-09-15 Thread Ruslan Dautkhanov
Thank you Alexander. Sounds like quite a lot of good and exciting changes slated for Spark's ANN. Looking forward to it. -- Ruslan Dautkhanov On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander wrote: > Thank you, Feynman, this is helpful. The paper that I linked claims a big > s

Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Ruslan Dautkhanov
Sathish, Thanks for pointing to that. https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm That must be only part of Oracle's BDA codebase, not open-source Hive, right? -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu < vsathishkuma...@g

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Ruslan Dautkhanov
a_bind_variables.html This point is more relevant for OLTP-like queries, which Spark is probably not yet good at (e.g. return a few rows quickly/within a few ms). -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust wrote: > Either that or use the DataFrame API, which

Re: Best way to import data from Oracle to Spark?

2015-09-08 Thread Ruslan Dautkhanov
You can also sqoop oracle data in $ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/ -- Ruslan Dautkhanov On Tue

Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Found a dropout commit from avulanov: https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d It probably hasn't made its way to MLLib (yet?). -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang wrote: > Unfortunately, not yet... Deep

Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 should read B&C :) -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang wrote: > Backprop is used to compute the gradient here > <https://github.com/apache/spark/blob/master/mllib/s

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
Read the response from Cheng Lian on Aug 27th - it looks like the same problem. Workarounds: 1. write that parquet file in Spark; 2. upgrade to Spark 1.5. -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote: > No, it was created in Hive by CTAS, but any help is apprecia

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
That parquet table wasn't created in Spark, was it? There was a recent discussion on this list that complex data types in Spark prior to 1.5 are often incompatible with Hive, for example, if I remember correctly. On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: > I am trying to read an (array typed) p

Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html Implementation seems missing backpropagation? Was there is a good reason to omit BP? What are the drawbacks of a pure feedforward-only ANN? Thanks! -- Ruslan Dautkhanov

Re: Ranger-like Security on Spark

2015-09-03 Thread Ruslan Dautkhanov
est/topics/sg_hdfs_sentry_sync.html -- Ruslan Dautkhanov On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz wrote: > Hi Matei, > > Thanks for your answer. > > My question is regarding simple authenticated Spark-on-YARN only, without > Kerberos. So when I run Spark on YARN and H

Re: FAILED_TO_UNCOMPRESS error from Snappy

2015-08-20 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-7660 ? -- Ruslan Dautkhanov On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio wrote: > Right after upgraded to 1.4.1, we started seeing this exception and yes we > picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there > a

Re: Spark Master HA on YARN

2015-08-16 Thread Ruslan Dautkhanov
There is no Spark master in YARN mode. It's standalone mode terminology. In YARN cluster mode, Spark's Application Master (Spark Driver runs in it) will be restarted automatically by RM up to yarn.resourcemanager.am.max-retries times (default is 2). -- Ruslan Dautkhanov On Fri, Jul 17,

Re: Spark job workflow engine recommendations

2015-08-11 Thread Ruslan Dautkhanov
Spark? -- Ruslan Dautkhanov On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu wrote: > We are in the middle of figuring that out. At the high level, we want to > combine the best parts of existing workflow solutions. > > On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone wrote: > >>

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Ruslan Dautkhanov
2.6. -- Ruslan Dautkhanov On Mon, Aug 10, 2015 at 3:53 PM, YaoPau wrote: > I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via > iPython Notebook. I'm getting collect() to work just fine, but take() > errors. (I'm having issues with collect() on

Re: TCP/IP speedup

2015-08-01 Thread Ruslan Dautkhanov
is not network bandwidth-bound, I can see it'll be a few percent to no improvement. -- Ruslan Dautkhanov On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus wrote: > H > > 2% huh. > > > -- ttfn > Simon Edelhaus > California 2015 > > On Sat, Aug 1,

Re: Spark Number of Partitions Recommendations

2015-08-01 Thread Ruslan Dautkhanov
You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor; otherwise GC overhead will go up. Btw, why prime numbers? -- Ruslan Dautkhanov On Wed, Jul 29, 2015 at 3:31 AM, ponkin wrote: > Hi Rahul, > >

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Ruslan Dautkhanov
ks are with their Tez project, but Tez has components that resemble closer to "buffer cache" or "in-memory columnar storage" caching from traditional RDBMS systems, and may get better and/or more predictable performance on BI queries. -- Ruslan Dautkhanov On Mon, Jul 20, 2

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-23 Thread Ruslan Dautkhanov
Or Spark on HBase ) http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ -- Ruslan Dautkhanov On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu wrote: > bq. that is, key-value stores > > Please consider HBase for this purpose :-) > > On Tue, Jul 14, 2015 at 5:55 PM

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Ruslan Dautkhanov
>> the executor receives a SIGTERM (from whom???) From YARN Resource Manager. Check if yarn fair scheduler preemption and/or speculative execution are turned on; if so, it's quite possible and not a bug. -- Ruslan Dautkhanov On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim wrot

Re: Spark equivalent for Oracle's analytical functions

2015-07-12 Thread Ruslan Dautkhanov
Should be part of Spark 1.4 https://issues.apache.org/jira/browse/SPARK-1442 I don't see it in the documentation though https://spark.apache.org/docs/latest/sql-programming-guide.html -- Ruslan Dautkhanov On Mon, Jul 6, 2015 at 5:06 AM, gireeshp wrote: > Is there any equivalent of

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread Ruslan Dautkhanov
You can see what Spark SQL functions are supported in Spark by doing the following in a notebook: %sql show functions https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html I think Spark SQL support is currently around Hive ~0.11? -- Ruslan

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread Ruslan Dautkhanov
Good question. I'd like to know the same. Although I think you'll lose supportability. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt wrote: > > Hi, > I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. > I checked the docu

Re: Real-time data visualization with Zeppelin

2015-07-12 Thread Ruslan Dautkhanov
Don't think it is a Zeppelin problem.. RDDs are "immutable". Unless you integrate something like IndexedRDD http://spark-packages.org/package/amplab/spark-indexedrdd into Zeppelin I think it's not possible. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 3:24 PM, Brandon Whit

Re: Caching in spark

2015-07-12 Thread Ruslan Dautkhanov
Hi Akhil, It's interesting whether RDDs are stored internally in a columnar format as well, or if it is only when an RDD is cached in the SQL context that it is converted to columnar format. What about data frames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das wrote: >

Re: configuring max sum of cores and memory in cluster through command line

2015-07-05 Thread Ruslan Dautkhanov
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin wrote: > Hi, > > I'd like to specify the total sum of cores / mem

Re: .NET on Apache Spark?

2015-07-05 Thread Ruslan Dautkhanov
Scala used to run on .NET http://www.scala-lang.org/old/node/10299 -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 1:26 PM, pedro wrote: > You might try using .pipe() and installing your .NET program as a binary > across the cluster (or using addFile). Its not ideal to pipe things in/out &

Re: Problem after enabling Hadoop native libraries

2015-06-30 Thread Ruslan Dautkhanov
You can run hadoop checknative -a and see if bzip2 is detected correctly. -- Ruslan Dautkhanov On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin wrote: > What master are you using? If this is not a "local" master, you'll need to > set LD_LIBRARY_PATH on the

Re: flume sinks supported by spark streaming

2015-06-23 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid wrote: > Hi! > > > I want to integrate flume with spark streaming. I want to know which sink > ty

Re: [ERROR] Insufficient Space

2015-06-19 Thread Ruslan Dautkhanov
Vadim, You could edit /etc/fstab, then issue mount -o remount to give more shared memory online. Didn't know Spark uses shared memory. Hope this helps. On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy wrote: > Hello Spark Experts, > > I've been running a standalone Spark cluster on EC2 for a fe

Re: Does MLLib has attribute importance?

2015-06-18 Thread Ruslan Dautkhanov
Got it. Thanks! -- Ruslan Dautkhanov On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng wrote: > ChiSqSelector calls an RDD of labeled points, where the label is the > target. See > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mlli

Re: Does MLLib has attribute importance?

2015-06-17 Thread Ruslan Dautkhanov
Thank you Xiangrui. Oracle's attribute importance mining function have a target variable. "Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target." MLlib's ChiSqSelector does not have a target variable. -

Does MLLib has attribute importance?

2015-06-11 Thread Ruslan Dautkhanov
ce in predicting a target. Best regards, Ruslan Dautkhanov

k-means for text mining in a streaming context

2015-06-08 Thread Ruslan Dautkhanov
? Best regards, Ruslan Dautkhanov

Re: Spark Job always cause a node to reboot

2015-06-04 Thread Ruslan Dautkhanov
+ OS + etc >> amount of memory node has. -- Ruslan Dautkhanov On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen wrote: > Hi all, > I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and > Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory. > One is configured

Re: How to monitor Spark Streaming from Kafka?

2015-06-02 Thread Ruslan Dautkhanov
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf -- Ruslan Dautkhanov On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg wrote: > Thank you, Tathagata, Cody, O

Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Ruslan Dautkhanov
ction *spark.shuffle.safetyFraction)/ spark.executor.cores. Memory fraction and safety fraction default to 0.2 and 0.8 respectively. I'd test spark.executor.cores with 2,4,8 and 16 and see what makes your job run faster.. -- Ruslan Dautkhanov On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo wrote: >
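The per-core shuffle memory formula referenced above, written out; the defaults follow the spark.shuffle.memoryFraction (0.2) and spark.shuffle.safetyFraction (0.8) values mentioned in the thread:

```python
def shuffle_memory_per_core_mb(executor_memory_mb, executor_cores,
                               memory_fraction=0.2, safety_fraction=0.8):
    """(spark.executor.memory * spark.shuffle.memoryFraction *
    spark.shuffle.safetyFraction) / spark.executor.cores, in MB."""
    return (executor_memory_mb * memory_fraction * safety_fraction) / executor_cores
```

For an 8 GB executor with 4 cores this leaves roughly 328 MB of shuffle memory per core, which is the quantity to watch as you test 2, 4, 8, and 16 cores.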

Re: PySpark Logs location

2015-05-21 Thread Ruslan Dautkhanov
yarn application logs. -- Ruslan Dautkhanov On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets wrote: > Doesn't work for me so far , >using command but got such output. What should I check to fix the > issue? Any configuration parameters ... > > > [root@sdo-

Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
Oleg, You can see applicationId in your Spark History Server. Go to http://historyserver:18088/ Also check https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application It should be no different with PySpark. -- Ruslan Dautkhanov On Wed, May 20, 2015 at 2:12 PM, Oleg

Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
You could use yarn logs -applicationId application_1383601692319_0008 -- Ruslan Dautkhanov On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets wrote: > Hi , > > I am executing PySpark job on yarn ( hortonworks distribution). > > Could someone pointing me where is th

Re: Reading Nested Fields in DataFrames

2015-05-11 Thread Ruslan Dautkhanov
Had the same question on stackoverflow recently http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark Lomig Mégard had a detailed answer of how to do this without using LATERAL VIEW. On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh wrote: > Hi , > I am trying