Re: Does Python 2.7 have to be installed on every cluster node?

2015-05-19 Thread Davies Liu
PySpark works with CPython by default, and you can specify which version of Python to use by: PYSPARK_PYTHON=path/to/python bin/spark-submit xxx.py When you do the upgrade, you could install Python 2.7 on every machine in the cluster and test it with PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-19 Thread Davies Liu
It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes wrote: > Dear Experts, > > we have a spark cluster (standalone mode) in which master and workers are > started from root account. Everything runs

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread Davies Liu
t; > > > On Tuesday, 19 May 2015 2:43 AM, Davies Liu wrote: > > > SparkContext can be used in multiple threads (Spark streaming works > with multiple threads), for example: > > import threading > import time > > def show(x): > time.sleep(1) > print x

Re: Is this a good use case for Spark?

2015-05-20 Thread Davies Liu
Spark is a great framework for doing things in parallel across multiple machines, and it will be really helpful for your case. Once you can wrap your entire pipeline into a single Python function: def process_document(path, text): # you can call other tools or services here return xxx then you can
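
A hedged sketch of where that snippet is heading, assuming the documents live as files readable by Spark and `process_document` is the user's own function (the paths are placeholders):

```
# wholeTextFiles() yields (path, text) pairs, one per document
docs = sc.wholeTextFiles("hdfs:///data/docs/")
results = docs.map(lambda pt: process_document(pt[0], pt[1]))
results.saveAsTextFile("hdfs:///data/results/")
```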

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Davies Liu
ruboes). The temp directories created inside >> >> namesAndAges.parquet2/_temporary/0/ >> >> (e.g. task_201505200920_0009_r_01) are owned by root, again with >> drwxr-xr-x access rights >> >> Cheers, >>Tomasz >> >>

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread Davies Liu
The docs have been updated. You should convert the DataFrame to an RDD via `df.rdd`. On Mon, Apr 20, 2015 at 5:23 AM, ayan guha wrote: > Hi > Just upgraded to Spark 1.3.1. > > I am getting an warning > > Warning (from warnings module): > File > "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-had
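
A small illustration of the `df.rdd` conversion mentioned above (the column names are hypothetical):

```
rdd = df.rdd                                     # RDD of Row objects
pairs = rdd.map(lambda row: (row.name, row.age))
```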

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Davies Liu
Could you try specifying PYSPARK_PYTHON as the path to the Python in your virtualenv, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson wrote: > Hi all, > > I am running the Python process that communicates with Spark in a > virtua

Re: Bigints in pyspark

2015-05-22 Thread Davies Liu
Could you show the schema and confirm that the columns are LongType? df.printSchema() On Mon, Apr 27, 2015 at 5:44 AM, jamborta wrote: > hi all, > > I have just come across a problem where I have a table that has a few bigint > columns, it seems if I read that table into a dataframe then collect it

Re: PySpark Unknown Opcode Error

2015-05-26 Thread Davies Liu
This is likely the case that you run different versions of Python on the driver and the slaves; Spark 1.4 (which will be released soon) will double-check this. Also, SPARK_PYTHON should be PYSPARK_PYTHON. On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar wrote: > Hello, > I am trying to run a spark job (which runs

Re: PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Davies Liu
Could you try to comment out some lines in `extract_sift_features_opencv` to find which line causes the crash? If the bytes coming from sequenceFile() are broken, it's easy to crash a C library (such as OpenCV) from Python. On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga wrote: > Hi sparkers, > > I am working

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
There is another implementation of the RDD interface in Python, called DPark [1]. Could you say a few words comparing the two? [1] https://github.com/douban/dpark/ On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss wrote: > I wanted to share a Python implementation of RDDs: pysparkling. > > http://tri

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
http://"; whereas DPark needs a > Mesos cluster. > > On Fri, May 29, 2015 at 2:46 PM Davies Liu wrote: >> >> There is another implementation of RDD interface in Python, called >> DPark [1], Could you have a few words to compare these two? >> >> [1] https

Re: PySpark with OpenCV causes python worker to crash

2015-06-01 Thread Davies Liu
he same parameters. You're saying >> maybe the bytes from the sequencefile got somehow transformed and don't >> represent an image anymore causing OpenCV to crash the whole python >> executor. >> >> On Fri, May 29, 2015 at 2:06 AM, Davies Liu wrote: >>>

Re: Best strategy for Pandas -> Spark

2015-06-01 Thread Davies Liu
The second one sounds reasonable, I think. On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot wrote: > Hi everyone, > Let's assume I have a complex workflow of more than 10 datasources as input > - 20 computations (some creating intermediary datasets and some merging > everything for the final com

Re: deos randomSplit return a copy or a reference to the original rdd? [Python]

2015-06-01 Thread Davies Liu
No, all of the RDDs (including those returned from randomSplit()) are read-only. On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto wrote: > Suppose I have something like the code below > > > for idx in xrange(0, 10): > train_test_split = training.randomSplit(weights=[0.75, 0

Re: PySpark with OpenCV causes python worker to crash

2015-06-04 Thread Davies Liu
achine but when >> moving the exact same OpenCV code to spark it just crashes. >> >> On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu wrote: >>> >>> Could you run the single thread version in worker machine to make sure >>> that OpenCV is installed and configur

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
't cleaned up and tested it for other people) btw >> I study at Tsinghua also currently. >> >> On Fri, Jun 5, 2015 at 2:43 PM, Davies Liu wrote: >>> >>> Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ >>> >>> Could you a

Re: SparkSQL nested dictionaries

2015-06-08 Thread Davies Liu
I think it works in Python ``` >>> df = sqlContext.createDataFrame([(1, {'a': 1})]) >>> df.printSchema() root |-- _1: long (nullable = true) |-- _2: map (nullable = true) ||-- key: string ||-- value: long (valueContainsNull = true) >>> df.select(df._2.getField('a')).show() +-+ |_2

Re: Fully in-memory shuffles

2015-06-10 Thread Davies Liu
If you have enough memory, you can put the temporary work directory in tmpfs (an in-memory file system). On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet wrote: > Ok so it is the case that small shuffles can be done without hitting any > disk. Is this the same case for the aux shuffle service in yarn?
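
One way to do this, as a sketch: point spark.local.dir at a tmpfs mount (the /dev/shm path is an assumption about the cluster setup; the same setting can also be passed via --conf to spark-submit):

```
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.local.dir", "/dev/shm/spark-tmp")
sc = SparkContext(conf=conf)
```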

Re: BigDecimal problem in parquet file

2015-06-12 Thread Davies Liu
Maybe it's related to a bug which was recently fixed by https://github.com/apache/spark/pull/6558. On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag wrote: > Hi Cheng, > > Yes, some rows contain unit instead of decimal values. I believe some rows > from original source I had don't have any value i.e. it

Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it. On Mon, Jun 15, 2015 at 8:00 AM, mrm wrote: > Hi all, > > I was looking for an explanation on the number of partitions for a joined > rdd. > > The documentation of Spark 1.3.1. says that: > "For distributed shuffle operations like reduceByKey and join, the largest > num

Re: Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Davies Liu
This seems to be a bug, could you file a JIRA for it? RDDs should be serializable for streaming jobs. On Thu, Jun 18, 2015 at 4:25 AM, Groupme wrote: > Hi, > > > I am writing pyspark stream program. I have the training data set to compute > the regression model. I want to use the stream data set to t

Re: Cassandra - Spark 1.3 - reading data from cassandra table with PYSpark

2015-06-19 Thread Davies Liu
On Fri, Jun 19, 2015 at 7:33 AM, Koen Vantomme wrote: > Hello, > > I'm trying to read data from a table stored in cassandra with pyspark. > I found the scala code to loop through the table : > "cassandra_rdd.toArray.foreach(println)" > > How can this be translated into PySpark ? > > code snipplet

Re: ERROR in withColumn method

2015-06-19 Thread Davies Liu
This is a known issue: https://issues.apache.org/jira/browse/SPARK-8461?filter=-1 It will be fixed soon by https://github.com/apache/spark/pull/6898 On Fri, Jun 19, 2015 at 5:50 AM, Animesh Baranawal wrote: > I am trying to perform some insert column operations in dataframe. Following > is the cod

Re: SparkR - issue when starting the sparkR shell

2015-06-19 Thread Davies Liu
Yes, right now we have only tested SparkR with R 3.x. On Fri, Jun 19, 2015 at 5:53 AM, Kulkarni, Vikram wrote: > Hello, > > I am seeing this issue when starting the sparkR shell. Please note that I > have R version 2.14.1. > > > > [root@vertica4 bin]# sparkR > > > > R version 2.14.1 (2011-12-22) > >

Re: Java Constructor Issues

2015-06-21 Thread Davies Liu
The compiled jar is not consistent with the Python source; maybe you are using an older version of PySpark with the assembly jar of Spark Core 1.4? On Sun, Jun 21, 2015 at 7:24 AM, Shaanan Cohney wrote: > > Hi all, > > > I'm having an issue running some code that works on a build of spark I made > (and

Re: SQL vs. DataFrame API

2015-06-22 Thread Davies Liu
Right now, we cannot figure out which column you referenced in `select` if there are multiple columns with the same name in the joined DataFrame (for example, two `value` columns). A workaround could be: numbers2 = numbers.select(df.name, df.value.alias('other')) rows = numbers.join(numbers2,
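
A hedged completion of that workaround, renaming both columns so that nothing in the joined result is ambiguous (the exact join keys are an assumption):

```
numbers2 = numbers.select(numbers.name.alias('name2'),
                          numbers.value.alias('other'))
rows = numbers.join(numbers2, numbers.name == numbers2.name2)
rows.select('name', 'value', 'other').show()
```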

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
ist.github.com/dokipen/018a1deeab668efdf455 >> >> On Mon, Jun 22, 2015 at 4:33 PM Davies Liu wrote: >>> >>> Right now, we can not figure out which column you referenced in >>> `select`, if there are multiple row with the same name in the joined >>

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
968d2e4be68958df8 > > 2015-06-23 23:11 GMT+02:00 Davies Liu : >> >> I think it also happens in DataFrames API of all languages. >> >> On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco >> wrote: >> > That issue happens only in python dsl? >> > &g

Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl wrote: > In pyspark, when I convert from rdds to dataframes it looks like the rdd is > being materialized/collected/repartitioned before it's converted to a > dataframe. That's not the case. When converting an RDD to a DataFrame, it only takes a few rows to inf
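
If even that small sampling pass is a concern, passing an explicit schema avoids inference entirely; a sketch with made-up field names:

```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", IntegerType(), True),
])
# no rows need to be inspected when the schema is given explicitly
df = sqlContext.createDataFrame(rdd, schema)
```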

Re: User Defined Functions - Execution on Clusters

2015-07-06 Thread Davies Liu
Currently, Python UDFs run in separate Python processes and are MUCH slower than Scala ones (from 10 to 100x). There is a JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632. After that, they will still be much slower than Scala ones (because Python is slower and the overhead for c
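
To make the gap concrete, a hedged comparison; `df.text` is a hypothetical string column, and F.length is available in 1.5+:

```
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

py_len = F.udf(lambda s: len(s), IntegerType())   # crosses the JVM/Python boundary per row
df.select(py_len(df.text)).show()

df.select(F.length(df.text)).show()               # built-in expression, stays in the JVM
```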

Re: PySpark without PySpark

2015-07-08 Thread Davies Liu
Great post, thanks for sharing with us! On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal wrote: > Hi Julian, > > I recently built a Python+Spark application to do search relevance > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on > EC2 (so I don't use the PySpark shell, hopefu

Re: Language support for Spark libraries

2015-07-13 Thread Davies Liu
On Mon, Jul 13, 2015 at 11:06 AM, Lincoln Atkinson wrote: > I’m still getting acquainted with the Spark ecosystem, and wanted to make > sure my understanding of the different API layers is correct. > > > > Is this an accurate picture of the major API layers, and their associated > client support?

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Davies Liu
sc.union(rdds).saveAsTextFile() On Wed, Jul 15, 2015 at 10:37 PM, Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that do
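
Spelled out as a runnable sketch (the RDD contents and output path are placeholders):

```
rdds = [sc.parallelize(range(i * 10, (i + 1) * 10)) for i in range(4)]
sc.union(rdds).saveAsTextFile("hdfs:///tmp/union-output")
```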

Re: pyspark 1.4 udf change date values

2015-07-16 Thread Davies Liu
Thanks for reporting this, could you file a JIRA for it? On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra wrote: > Hi all, > > I am having some troubles when using a custom udf in dataframes with pyspark > 1.4. > > I have rewritten the udf to simplify the problem and it gets even weirder. > The udfs

Re: Spark and SQL Server

2015-07-18 Thread Davies Liu
I think there is a mistake in the call to jdbc(); it should be: jdbc(self, url, table, mode, properties) You had used properties as the third parameter. On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T wrote: > Hello, > > I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s >
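
A sketch of the corrected call with `properties` passed by keyword; the URL, table, and credentials are placeholders:

```
props = {"user": "spark_user", "password": "secret"}
df.write.jdbc(url="jdbc:sqlserver://dbhost:1433;databaseName=mydb",
              table="dbo.my_table",
              mode="append",
              properties=props)
```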

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Davies Liu
Spark 2.0 is dropping support for Python 2.6; it only works with Python 2.7 and 3.4+. On Thu, Mar 10, 2016 at 11:17 PM, Gayathri Murali wrote: > Hi all, > > I am trying to run python unit tests. > > I currently have Python 2.6 and 2.7 installed. I installed unittest2 against > both of them. >

Re: Parition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu
I think you could create a DataFrame with schema (mykey, value1, value2), then partition it by mykey when saving as parquet. r2 = rdd.map((k, v) => Row(k, v._1, v._2)) df = sqlContext.createDataFrame(r2, schema) df.write.partitionBy("myKey").parquet(path) On Tue, Mar 15, 2016 at 10:33 AM, Moham
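
The same idea written as PySpark; the (key, (value1, value2)) layout of the source RDD and the output path are assumptions:

```
from pyspark.sql import Row

rdd = sc.parallelize([("a", (1, 2)), ("b", (3, 4))])
rows = rdd.map(lambda kv: Row(mykey=kv[0], value1=kv[1][0], value2=kv[1][1]))
df = sqlContext.createDataFrame(rows)
df.write.partitionBy("mykey").parquet("/tmp/partitioned")
```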

Re: filter by dict() key in pySpark

2016-03-15 Thread Davies Liu
Another solution could be using a left-semi join: keys = sqlContext.createDataFrame(dict.keys()) DF2 = DF1.join(keys, DF1.a == keys.k, "leftsemi") On Wed, Feb 24, 2016 at 2:14 AM, Franc Carter wrote: > > A colleague found how to do this, the approach was to use a udf() > > cheers > > On 21 February
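
A self-contained sketch of that idea (DF1 stands in for the questioner's DataFrame with a column `a`; the dict is made up):

```
d = {"x": 1, "y": 2}
keys = sqlContext.createDataFrame([(k,) for k in d.keys()], ["k"])
# keeps only the rows of DF1 whose column `a` appears among the dict keys
DF2 = DF1.join(keys, DF1.a == keys.k, "leftsemi")
```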

Re: sql timestamp timezone bug

2016-03-19 Thread Davies Liu
In Spark SQL, a timestamp is the number of microseconds since the epoch, so it has nothing to do with the time zone. When you compare it against unix_timestamp or a string, it's better to convert these into timestamps and then compare them. In your case, the where clause should be: where (created > cast('{0}' as timesta

Re: sql timestamp timezone bug

2016-03-19 Thread Davies Liu
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson wrote: > I am using pyspark 1.6.0 and > datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series > data > > The data is originally captured by a spark streaming app and written to > Cassandra. The value of the timestamp comes from > >

Re: unix_timestamp() time zone problem

2016-03-19 Thread Davies Liu
Could you try to cast the timestamp to long? Internally, timestamps are stored as microseconds in UTC, so you will get seconds in UTC if you cast one to long. On Thu, Mar 17, 2016 at 1:28 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using python spark 1.6 and the --packages > data
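
For example (the column name `ts` is hypothetical):

```
df.select(df.ts.cast("long").alias("epoch_seconds")).show()
```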

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
The broadcast hint does not work as expected in this case; could you also show the logical plan via 'explain(true)'? On Wed, Mar 23, 2016 at 8:39 AM, Yong Zhang wrote: > > So I am testing this code to understand "broadcast" feature of DF on Spark > 1.6.1. > This time I am not disable "tungsten". E

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang wrote: > Here is the output: > > == Parsed Logical Plan == > Project [400+ columns] > +- Project [400+ columns] >+- Project [400+ columns] > +- Project [400+ columns] > +- Join Inner, Somevisid_high#460L = visid_high#948L) && > (v

Re: strange behavior of pyspark RDD zip

2016-04-11 Thread Davies Liu
It seems like a bug, could you file a JIRA for this? (also post a way to reproduce it) On Fri, Apr 1, 2016 at 11:08 AM, Sergey wrote: > Hi! > > I'm on Spark 1.6.1 in local mode on Windows. > > And have issue with zip of zip'pping of two RDDs of __equal__ size and > __equal__ partitions number (I

Re: How to estimate the size of dataframe using pyspark?

2016-04-11 Thread Davies Liu
That's weird, DataFrame.count() should not require lots of memory on the driver. Could you provide a way to reproduce it (you could generate a fake dataset)? On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev wrote: > I've allocated about 4g for the driver. For the count stage, I notice the > Shuffle Write to be 13

Re: pyspark EOFError after calling map

2016-04-22 Thread Davies Liu
This exception is already handled; it is just noisy and should be muted. On Wed, Apr 13, 2016 at 4:52 PM, Pete Werner wrote: > Hi > > I am new to spark & pyspark. > > I am reading a small csv file (~40k rows) into a dataframe. > > from pyspark.sql import functions as F > df = > sqlContext.read.forma

Re: EOFException while reading from HDFS

2016-04-26 Thread Davies Liu
The Spark package you are using is built with Hadoop 2.6, but the HDFS is Hadoop 1.0.4; they are not compatible. On Tue, Apr 26, 2016 at 11:18 AM, Bibudh Lahiri wrote: > Hi, > I am trying to load a CSV file which is on HDFS. I have two machines: > IMPETUS-1466 (172.26.49.156) and IMPETUS-132

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Davies Liu
hdfs://192.168.10.130:9000/dev/output/test already exists, so you need to remove it first. On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph wrote: > Hi, all: > Below is my code: > > from pyspark import * > import re > > def getDateByLine(input_str): > str_pattern = '^\d{4}-\d{2}-\d{2}' > patt

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Davies Liu
As @Gourav said, all the joins with different join types show the same results, which means that every row from the left matches at least one row from the right, and every row from the right matches at least one row from the left, even though the number of rows on the left does not equal that on the right. This is corr

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = >>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count() >>> res12: Long = 23809 >>> >>> >>> >>> From my results above,

Re: pyspark dataframe sort issue

2016-05-08 Thread Davies Liu
When you have multiple parquet files, the order of all the rows in them is not defined. On Sat, May 7, 2016 at 11:48 PM, Buntu Dev wrote: > I'm using pyspark dataframe api to sort by specific column and then saving > the dataframe as parquet file. But the resulting parquet file doesn't seem > to

Re: broadcast variable not picked up

2016-05-16 Thread Davies Liu
broadcast_var is only defined in foo(), I think you should have `global` for it. def foo(): global broadcast_var broadcast_var = sc.broadcast(var) On Fri, May 13, 2016 at 3:53 PM, abi wrote: > def kernel(arg): > input = broadcast_var.value + 1 > #some processing with input > > def
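
A minimal sketch of the fix, keeping the names from the question (`var` and the processing step are placeholders):

```
broadcast_var = None

def foo(sc, var):
    global broadcast_var
    broadcast_var = sc.broadcast(var)

def kernel(arg):
    inp = broadcast_var.value + 1
    # ... some processing with inp ...
    return inp
```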

Re: 2 tables join happens at Hive but not in spark

2016-05-18 Thread Davies Liu
What do the schemas of the two tables look like? Could you also show the explain output of the query? On Sat, Feb 27, 2016 at 2:10 AM, Sandeep Khurana wrote: > Hello > > We have 2 tables (tab1, tab2) exposed using hive. The data is in different > hdfs folders. We are trying to join these 2 tables on certa

Re: pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

2016-06-09 Thread Davies Liu
This one works as expected: ``` >>> spark.range(10).selectExpr("id", "id as k").groupBy("k").agg({"k": "count", >>> "id": "sum"}).show() +---++---+ | k|count(k)|sum(id)| +---++---+ | 0| 1| 0| | 7| 1| 7| | 6| 1| 6| | 9| 1| 9|

Re: converting timestamp from UTC to many time zones

2016-06-17 Thread Davies Liu
The DataFrame API does not support this use case, but you can still use SQL to do that: df.selectExpr("from_utc_timestamp(start, tz) as testthis") On Thu, Jun 16, 2016 at 9:16 AM, ericjhilton wrote: > This is using python with Spark 1.6.1 and dataframes. > > I have timestamps in UTC that I want to
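
For completeness, a hedged sketch of both routes; from_utc_timestamp in the function API (1.5+) accepts only a literal zone string, so a per-row `tz` column needs the SQL expression form:

```
from pyspark.sql import functions as F

# literal time zone: the function API works
df.select(F.from_utc_timestamp(df.start, "America/New_York").alias("local_time"))

# time zone stored in a column: fall back to a SQL expression
df.selectExpr("from_utc_timestamp(start, tz) AS local_time")
```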

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Davies Liu
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor wrote: > Hi all, > > I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 > using pyspark. > > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300 > executors's memory in SparkSQL, on which we would do some ca

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
you have UDFs then > somehow the memory usage depends on the amount of data in that record (the > whole row), which includes other fields too, which are actually not used by > the UDF. Maybe the UDF serialization to Python serializes the whole row > instead of just the attributes of the

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Davies Liu
Can you get all the fields back using Scala or SQL (bin/spark-sql)? On Tue, Aug 9, 2016 at 2:32 PM, cdecleene wrote: > Some details of an example table hive table that spark 2.0 could not read... > > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org

Re: DataFrame equivalent to RDD.partionByKey

2016-08-09 Thread Davies Liu
I think you are looking for `def repartition(numPartitions: Int, partitionExprs: Column*)` On Tue, Aug 9, 2016 at 9:36 AM, Stephen Fletcher wrote: > Is there a DataFrameReader equivalent to the RDD's partitionByKey for RDD? > I'm reading data from a file data source and I want to partition this d

Re: Content based window operation on Time-series data

2015-12-17 Thread Davies Liu
Could you try this? df.groupBy(cast((col("timeStamp") - start) / bucketLengthSec, IntegerType)).agg(max("timestamp"), max("value")).collect() On Wed, Dec 9, 2015 at 8:54 AM, Arun Verma wrote: > Hi all, > > We have RDD(main) of sorted time-series data. We want to split it into > different RDDs acc
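
In the PySpark DataFrame API this is usually written with Column.cast; a sketch with made-up constants, assuming `timeStamp` is numeric:

```
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

start, bucketLengthSec = 0, 3600
bucket = ((F.col("timeStamp") - start) / bucketLengthSec).cast(IntegerType())
df.groupBy(bucket.alias("bucket")) \
  .agg(F.max("timeStamp"), F.max("value")) \
  .collect()
```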

Re: trouble understanding data frame memory usage “java.io.IOException: Unable to acquire memory”

2015-12-29 Thread Davies Liu
Hi Andy, Could you change the logging level to INFO and post some of the logs here? There will be some logging about the memory usage of a task when it OOMs. In 1.6, the memory for a task is: (HeapSize - 300M) * 0.75 / number of tasks. Is it possible that the heap is too small? Davies

Re: Does Spark SQL support rollup like HQL

2015-12-29 Thread Davies Liu
Just sent out a PR [1] to support cube/rollup as functions; it works with both SQLContext and HiveContext. [1] https://github.com/apache/spark/pull/10522/files On Tue, Dec 29, 2015 at 9:35 PM, Yi Zhang wrote: > Hi Hao, > > Thanks. I'll take a look at it. > > > On Wednesday, December 30, 2015 12:47 PM,

Re: Problem with WINDOW functions?

2015-12-30 Thread Davies Liu
Window functions are improved in the 1.6 release; could you try 1.6-RC4 (or wait until next week for the final release)? Even in 1.6, the buffer of rows for window functions does not support spilling (and also does not use memory efficiently); there is a JIRA for it: https://issues.apache.org/jira/browse/S

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in 2013. REHL 5 is not a good enough reason to continue support for Pytho

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us to ship with our sof

Re: Pyspark - how to use UDFs with dataframe groupby

2016-02-10 Thread Davies Liu
short answer: PySpark does not support UDAF (user defined aggregate function) for now. On Tue, Feb 9, 2016 at 11:44 PM, Viktor ARDELEAN wrote: > Hello, > > I am using following transformations on RDD: > > rddAgg = df.map(lambda l: (Row(a = l.a, b= l.b, c = l.c), l))\ >.aggregateByKey

Re: Spark Job Hanging on Join

2016-02-22 Thread Davies Liu
This link may help: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html Spark 1.6 improved the CartesianProduct; you should turn off auto broadcast and go with CartesianProduct in 1.6. On Mon, Feb 22, 2016 at 1:45 AM, Mohannad Ali wrote: > Hello e

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-05 Thread Davies Liu
Could you create a JIRA to track this bug? On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan wrote: > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. > > I'm trying to read in a large quantity of json data in a couple of files and > I receive a scala.MatchError when I do so. Json, Pyth

Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet? On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov wrote: > Hi, > > We're building our own framework on top of spark and we give users pretty > complex schema to work with. That requires from us to build dataframes by >

Re: weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-09 Thread Davies Liu
Is it possible that you have a very old version of pandas that does not have DataFrame (or has it in a different submodule)? Could you try this: ``` >>> import pandas >>> pandas.__version__ '0.14.0' ``` On Thu, Oct 8, 2015 at 10:28 PM, ping yan wrote: > I really cannot figure out what this is about.. >

Re: Handling expirying state in UDF

2015-10-12 Thread Davies Liu
Could you try this? my_token = None def my_udf(a): global my_token if my_token is None: # create token # do something In this way, a new token will be created for each pyspark task On Sun, Oct 11, 2015 at 5:14 PM, brightsparc wrote: > Hi, > > I have created a python UDF to

Re: pyspark: results differ based on whether persist() has been called

2015-10-19 Thread Davies Liu
This should be fixed by https://github.com/apache/spark/commit/a367840834b97cd6a9ecda568bb21ee6dc35fcde Will be released as 1.5.2 soon. On Mon, Oct 19, 2015 at 9:04 AM, peay2 wrote: > Hi, > > I am getting some very strange results, where I get different results based > on whether or not I call p

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread Davies Liu
What's the issue with groupByKey()? On Mon, Oct 19, 2015 at 1:11 AM, fahad shah wrote: > Hi > > I wanted to ask whats the best way to achieve per key auto increment > numerals after sorting, for eg. : > > raw file: > > 1,a,b,c,1,1 > 1,a,b,d,0,0 > 1,a,b,e,1,0 > 2,a,e,c,0,0 > 2,a,f,d,1,0 > > post-o

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread Davies Liu
Could you simplify the code a little bit so we can reproduce the failure? (may also have some sample dataset if it depends on them) On Sun, Oct 18, 2015 at 10:42 PM, fahad shah wrote: > Hi > > I am trying to do pair rdd's, group by the key assign id based on key. > I am using Pyspark with spark

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread Davies Liu
The thread-local things do not work well with PySpark, because the thread used by PySpark in the JVM can change over time, so the SessionState could be lost. This should be fixed in master by https://github.com/apache/spark/pull/8909 On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote: > I've connected Spar

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Did you use any partitioned columns when writing as json or parquet? On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar wrote: > yes I was expecting that too because of all the metadata generation and > compression. But I have not seen performance this bad for other parquet > files I’ve written and was wo

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Do you have partitioned columns? On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a > parquet file on HDFS. I've got a few hundred nodes in the cluster, so for > the size of file this is way over-provisioned (I've tried

Re: Distributing Python code packaged as tar balls

2015-11-13 Thread Davies Liu
Python does not support libraries as tarballs, so PySpark may not support that either. On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi wrote: > Hi, > > Pyspark/spark-submit offers a --py-files handle to distribute python code > for execution. Currently(version 1.5) only zip files seem to be supported

Re: bin/pyspark SparkContext is missing?

2015-11-13 Thread Davies Liu
You forgot to create a SparkContext instance: sc = SparkContext() On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson wrote: > I am having a heck of a time getting Ipython notebooks to work on my 1.5.1 > AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 > > I have read the instructio

Re: how to us DataFrame.na.fill based on condition

2015-11-23 Thread Davies Liu
DataFrame.replace(to_replace, value, subset=None) http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace On Mon, Nov 23, 2015 at 11:05 AM, Vishnu Viswanath wrote: > Hi > > Can someone tell me if there is a way I can use the fill method in > DataFrameNaFunct

Re: how to us DataFrame.na.fill based on condition

2015-11-23 Thread Davies Liu
ull value of a column.( I don't have a to_replace here ) > > Regards, > Vishnu > > On Mon, Nov 23, 2015 at 1:37 PM, Davies Liu wrote: >> >> DataFrame.replace(to_replace, value, subset=None) >> >> >> http://spark.apache.org/docs/latest/api/python/py

Re: Spark SQL Save CSV with JSON Column

2015-11-24 Thread Davies Liu
I think you could have a Python UDF to turn the properties into a JSON string: import simplejson def to_json(row): return simplejson.dumps(row.asDict(recursive=True)) to_json_udf = pyspark.sql.functions.udf(to_json) df.select("col_1", "col_2", to_json_udf(df.properties)).write.format("com.dat

Re: UDF with 2 arguments

2015-11-25 Thread Davies Liu
It works in master (1.6), what's the version of Spark you have? >>> from pyspark.sql.functions import udf >>> def f(a, b): pass ... >>> my_udf = udf(f) >>> from pyspark.sql.types import * >>> my_udf = udf(f, IntegerType()) On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes wrote: > Hallo, > > supos

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Davies Liu
Could you reproduce this problem in 1.5 or 1.6? On Sun, Dec 6, 2015 at 12:29 AM, YaoPau wrote: > If anyone runs into the same issue, I found a workaround: > df.where('state_code = "NY"') > > works for me. > df.where(df.state_code == "NY").collect() > > fails with the error from the firs

Re: PySpark Nested Json Parsing

2015-07-20 Thread Davies Liu
Before using the json file as text file, can you make sure that each json string can fit in one line? Because textFile() will split the file by '\n' On Mon, Jul 20, 2015 at 3:26 AM, Ajay wrote: > Hi, > > I am new to Apache Spark. I am trying to parse nested json using pyspark. > Here is the code

Re: PySpark Nested Json Parsing

2015-07-20 Thread Davies Liu
Could you try SQLContext.read.json()? On Mon, Jul 20, 2015 at 9:06 AM, Davies Liu wrote: > Before using the json file as text file, can you make sure that each > json string can fit in one line? Because textFile() will split the > file by '\n' > > On Mon, Jul 20, 201

Re: Spark and SQL Server

2015-07-20 Thread Davies Liu
osoft to release an SQL Server connector for > Spark to resolve the other issues. > > Cheers, > > -- Matthew Young > > > From: Davies Liu [dav...@databricks.com] > Sent: Saturday, July 18, 2015 12:45 AM > To: Young, Matthew T >

Re: large scheduler delay in pyspark

2015-08-04 Thread Davies Liu
On Mon, Aug 3, 2015 at 9:00 AM, gen tang wrote: > Hi, > > Recently, I met some problems about scheduler delay in pyspark. I worked > several days on this problem, but not success. Therefore, I come to here to > ask for help. > > I have a key_value pair rdd like rdd[(key, list[dict])] and I tried t

Re: SparkR Supported Types - Please add "bigint"

2015-08-07 Thread Davies Liu
They are actually the same thing, LongType. `long` is friendly for developers; `bigint` is friendly for database people, and maybe data scientists. On Thu, Jul 23, 2015 at 11:33 PM, Sun, Rui wrote: > printSchema calls StructField. buildFormattedString() to output schema > information. buildFormattedStri

Re: Problem with take vs. takeSample in PySpark

2015-08-10 Thread Davies Liu
I tested this in master (1.5 release), it worked as expected (changed spark.driver.maxResultSize to 10m), >>> len(sc.range(10).map(lambda i: '*' * (1<<23) ).take(1)) 1 >>> len(sc.range(10).map(lambda i: '*' * (1<<24) ).take(1)) 15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized resul

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6 on the workers? PySpark requires that you have the same minor version of Python on both the driver and the workers. In PySpark 1.4+, it will do this check before running any tasks. On Mon, Aug 10, 2015 at 2:53 PM, YaoPau wrote: > I'm runn

Re: PySpark order-only window function issue

2015-08-12 Thread Davies Liu
This should be a bug; go ahead and open a JIRA for it, thanks! On Tue, Aug 11, 2015 at 6:41 AM, Maciej Szymkiewicz wrote: > Hello everyone, > > I am trying to use PySpark API with window functions without specifying > partition clause. I mean something equivalent to this > > SELECT v, row_number()

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Davies Liu
As Aram said, there are two options in Spark 1.4: 1) Use the HiveContext, then you get datediff from Hive: df.selectExpr("datediff(d2, d1)") 2) Use a Python UDF: ``` >>> from datetime import date >>> df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))], >>> ['d1', 'd2']) >>> from py
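
Both options spelled out as sketches; the column names follow the example above:

```
# 1) HiveContext expression
df.selectExpr("datediff(d2, d1) AS duration_days")

# 2) plain Python UDF over the two date columns
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
day_diff = udf(lambda d1, d2: (d2 - d1).days, IntegerType())
df.select(day_diff(df.d1, df.d2).alias("duration_days"))
```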

Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Davies Liu
It would be good to support this; could you create a JIRA for it and target it for 1.6? On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise wrote: > > Hello All, > > PySpark currently has two ways of performing a join: specifying a join > condition or column names. > > I would like to perform a join using

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-08-31 Thread Davies Liu
I had sent out a PR [1] to fix 2), could you help to test that? [1] https://github.com/apache/spark/pull/8543 On Mon, Aug 31, 2015 at 12:34 PM, Anders Arpteg wrote: > Was trying out 1.5 rc2 and noticed some issues with the Tungsten shuffle > manager. One problem was when using the com.databrick

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-01 Thread Davies Liu
fix submitted less than one hour after my mail, very impressive Davies! > I've compiled your PR and tested it with the large job that failed before, > and it seems to work fine now without any exceptions. Awesome, thanks! > > Best, > Anders > > On Tue, Sep 1, 2015 at 1:38 AM D

Re: Custom Partitioner

2015-09-01 Thread Davies Liu
You can take the sortByKey as example: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642 On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker wrote: > something like... > > class RangePartitioner(Partitioner): > def __init__(self, numParts): > self.numPartitions = numParts > self.parti
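
In PySpark the usual route is not a Partitioner subclass but a partition function passed to partitionBy (this is what sortByKey does under the hood); a range-style sketch with made-up bucket bounds:

```
num_parts = 4
pairs = sc.parallelize([(i, str(i)) for i in range(100)])
# keys 0-24 go to partition 0, 25-49 to partition 1, and so on
ranged = pairs.partitionBy(num_parts, partitionFunc=lambda k: k // 25)
```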

Re: large number of import-related function calls in PySpark profile

2015-09-02 Thread Davies Liu
Could you have a short script to reproduce this? On Wed, Sep 2, 2015 at 2:10 PM, Priedhorsky, Reid wrote: > Hello, > > I have a PySpark computation that relies on Pandas and NumPy. Currently, my > inner loop iterates 2,000 times. I’m seeing the following show up in my > profiling: > > 74804/29102

Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Davies Liu
This should be a bug, could you create a JIRA for it? On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl wrote: > in my spark-defaults.conf I have: > spark.files file1.zip, file2.py > spark.master spark://master.domain.com:7077 > > If I execute: > bin/pyspark > > I can see it addin
