Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
eclipse, pyDev, virtualenv? syntaxError: yield from walk( > FYI, there is a PR and JIRA for virtualEnv support in PySpark > > https://issues.apache.org/jira/browse/SPARK-13587 > https://github.com/apache/spark/pull/13599 > > > 2018-04-06 7:48 GMT+08:00 Andy Davidson : >&g

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
FYI http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext From: Andrew Davidson Date: Wednesday, April 4, 2018 at 5:36 PM To: "user @spark" Subject: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk( > I am having a heck of a time setting
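A minimal sketch of the unit-test scaffolding the learn4master link above describes, assuming a local-mode SparkContext (class and test names here are illustrative):

    import unittest
    from pyspark import SparkConf, SparkContext

    class DataFrameSmokeTest(unittest.TestCase):
        def setUp(self):
            # local[2] keeps the test self-contained; no cluster needed
            conf = SparkConf().setMaster("local[2]").setAppName("unit-test")
            self.sc = SparkContext(conf=conf)

        def tearDown(self):
            # stop the context so the next test can create a fresh one
            self.sc.stop()

        def test_count(self):
            rdd = self.sc.parallelize([(1, "a"), (2, "b")])
            self.assertEqual(2, rdd.count())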

Re: Union of multiple data frames

2018-04-05 Thread Andy Davidson
Hi Ceasar I have used Brandson approach in the past with out any problem Andy From: Brandon Geise Date: Thursday, April 5, 2018 at 11:23 AM To: Cesar , "user @spark" Subject: Re: Union of multiple data frames > Maybe something like > > var finalDF = spark.sqlContext.emptyDataFrame > for
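A loop-free variant of Brandon's approach, sketched in PySpark and assuming a list of DataFrames with identical schemas (listOfDataFrames is illustrative; on Spark 1.x substitute unionAll for union):

    from functools import reduce
    from pyspark.sql import DataFrame

    # fold the list into a single frame; union() requires matching schemas
    finalDF = reduce(DataFrame.union, listOfDataFrames)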

how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-04 Thread Andy Davidson
I am having a heck of a time setting up my development environment. I used pip to install pyspark. I also downloaded spark from apache. My eclipse pyDev interpreter is configured as a python3 virtualenv. I have a simple unit test that loads a small dataframe. df.show() generates the following err

trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Andy Davidson
I am having trouble setting up my python3 virtualenv. I created a virtualenv 'spark-2.3.0' and installed pyspark using pip, however I am not able to import pyspark.sql.functions. I get "unresolved import" when I try to import col() and lit(). from pyspark.sql.functions import * I found if I download

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
|john| red| > |bill| blue| > |bill| red| > | sam|green| > ++-+ > > > distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), > "fullouter").select("tbl1.b", "tbl2.b").distinct.show() > > +-+-+ >

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I was a little sloppy when I created the sample output. It's missing a few pairs. Assume for a given row I have [a, b, c]; I want to create something like the cartesian join From: Andrew Davidson Date: Friday, March 30, 2018 at 5:54 PM To: "user @spark" Subject: how to create all possible comb

how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")). This produces a column of type array. Now for each row I want to create multiple pairs/tuples from the array so that I can create a contingency table. Any idea how I can transform my data so that I can call crosstab()? The j
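One way to get every pair per group is to explode the collected array twice and feed the two resulting columns to crosstab(); a sketch, reusing the column names from the post:

    from pyspark.sql.functions import collect_list, explode

    grouped = df.groupBy("xyzy").agg(collect_list("abc").alias("items"))

    # exploding the same array twice yields the cartesian product of
    # each row's items, including both (x, y) and (y, x)
    pairs = (grouped
             .select("xyzy", explode("items").alias("left"), "items")
             .select("xyzy", "left", explode("items").alias("right")))

    pairs.crosstab("left", "right").show()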

newbie: how to partition data on file system. What are best practices?

2017-11-22 Thread Andy Davidson
I am working on a deep learning project. Currently we do everything on a single machine. I am trying to figure out how we might be able to move to a clustered spark environment. Clearly it's possible a machine or job on the cluster might fail, so I assume that the data needs to be replicated to some

does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Andy Davidson
I am starting a new deep learning project. Currently we do all of our work on a single machine using a combination of Keras and TensorFlow. https://databricks.github.io/spark-deep-learning/site/index.html looks very promising. Any idea how performance is likely to improve as I add machines to my

anyone know what the status of spark-ec2 is?

2016-09-06 Thread Andy Davidson
Spark-ec2 used to be part of the spark distribution. It now seems to be split into a separate repo https://github.com/amplab/spark-ec2 It does not seem to be listed on https://spark-packages.org/ Does anyone know what the status is? There is a readme.md, however I am unable to find any release no

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
h is not a directory: /tmp tmp > Do you have a file called tmp at / on HDFS? > > > > > > On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" > wrote: > > For unknown reason I can not create UDF when I run the attached notebook on my > cluster. I

pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
For some unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-18 Thread Andy Davidson
Hi I am using python3, Java8 and spark-1.6.1. I am running my code in Jupyter notebook The following code runs fine on my mac udfRetType = ArrayType(StringType(), True) findEmojiUDF = udf(lambda s : re.findall(emojiPattern2, s), udfRetType) retDF = (emojiSpecialDF # convert into a li

py4j.Py4JException: Method lower([class java.lang.String]) does not exist

2016-08-18 Thread Andy Davidson
I am still working on spark-1.6.1. I am using java8. Any idea why (df.select("*", functions.lower("rawTag").alias("tag")) would raise py4j.Py4JException: Method lower([class java.lang.String]) does not exist Thanks in advance Andy https://spark.apache.org/docs/1.6.0/api/pytho
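The py4j message suggests the JVM-side functions.lower(Column) was handed a bare String; wrapping the column name in col() is one plausible fix, sketched here in PySpark:

    from pyspark.sql import functions as F

    # lower() expects a Column; a bare string reaches the JVM as
    # java.lang.String, which has no matching lower() overload
    df2 = df.select("*", F.lower(F.col("rawTag")).alias("tag"))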

FW: [jupyter] newbie. apache spark python3 'Jupyter' data frame problem with auto completion and accessing documentation

2016-08-02 Thread Andy Davidson
FYI From: on behalf of Thomas Kluyver Reply-To: Date: Tuesday, August 2, 2016 at 3:26 AM To: Project Jupyter Subject: Re: [jupyter] newbie. apache spark python3 'Jupyter' data frame problem with auto completion and accessing documentation > Hi Andy, > > On 1 August

Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread Andy Davidson
Hi Freedafeng I have been reading and writing to s3 using spark-1.6.x without any problems. Can you post a little code example and any error messages? Andy From: freedafeng Date: Tuesday, August 2, 2016 at 9:26 AM To: "user @spark" Subject: Re: spark 1.6.0 read s3 files error. > Any one,

python 'Jupyter' data frame problem with autocompletion

2016-08-01 Thread Andy Davidson
I started using python3 and jupyter in a chrome browser. I seem to be having trouble with data frame code completion. Regular python functions seem to work correctly. I wonder if I need to import something so the notebook knows about data frames? Kind regards Andy

Re: how to copy local files to hdfs quickly?

2016-07-30 Thread Andy Davidson
For lack of a better solution I am using 'AWS s3 copy' to copy my files locally and 'hadoop fs -put ./tmp/*' to transfer them. In general put works much better with a smaller number of big files compared to a large number of small files. Your mileage may vary. Andy From: Andrew Davidson Date: W

Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
>> >>> This takes that idea further by: >>> 1. Rather than sc.parallelize, implement the RDD interface where each >>> partition is defined by the files it needs to read (haven't gotten to >>> DataFrames yet) >>> 2. At the driver node, use t

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
s to jobs dealing with many >> files especially many small files and to jobs whose input is unbalanced to >> start with. Jobs perform better because: 1) there isn't a long stall at the >> driver when hadoop decides how to split S3 files 2) the partitions end up >> nearl

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread Andy Davidson
Hi Freedafeng Can you tell us a little more? I.e. can you paste your code and error message? Andy From: freedafeng Date: Thursday, July 28, 2016 at 9:21 AM To: "user @spark" Subject: Re: spark 1.6.0 read s3 files error. > The question is, what is the cause of the problem? and how to fix it?

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
s3.S3Context > > I am completing the sonatype process for publishing artifacts on maven central > (this should be done by tomorrow so referencing > "io.entilzha:spark-s3_2.10:0.0.0" should work very soon). I would love to hear > if this library solution works, otherwise I hope

performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
I have a relatively small data set however it is split into many small JSON files. Each file is between maybe 4K and 400K. This is probably a very common issue for anyone using spark streaming. My streaming app works fine, however my batch application takes several hours to run. All I am doing is

Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Andy Davidson
Hi Ascot When you run in cluster mode it means your cluster manager will cause your driver to execute on one of the workers in your cluster. The advantage of this is you can log on to a machine in your cluster and submit your application and then log out. The application will continue to run. Here

how to copy local files to hdfs quickly?

2016-07-27 Thread Andy Davidson
I have a spark streaming app that saves JSON files to s3://. It works fine. Now I need to calculate some basic summary stats and am running into horrible performance problems. I want to run a test to see if reading from hdfs instead of s3 makes a difference. I am able to quickly copy the data fr

spark-2.x what is the default version of java ?

2016-07-27 Thread Andy Davidson
I currently have to configure spark-1.x to use Java 8 and python 3.x. I noticed that http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions java 7 is deprecated. Is the default now Java 8? Thanks Andy Deprecations The following features have been deprecated in Spark 2.0,

Re: spark 1.6.0 read s3 files error.

2016-07-27 Thread Andy Davidson
Hi Freedafeng The following works for me. df will be a data frame. fullPath is a list of various part files stored in s3. fullPath = ['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r-0-a2121800-fa5b-44b1-a994-67795' ] from pyspark.sql import SQLContext sqlCont
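A sketch of one way to finish that snippet, reading each part file and unioning the results (the bucket name stays elided as in the post; on Spark 2.x read.json() also accepts the list directly):

    from functools import reduce
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # one frame per s3 path, folded into a single DataFrame
    frames = [sqlContext.read.json(p) for p in fullPath]
    df = reduce(lambda a, b: a.unionAll(b), frames)
    df.show()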

spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Andy Davidson
Congratulations on releasing 2.0! spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However http://spark.apache.org/docs/latest/index.html has a link to the spark-ec2 github repo https://github.com/amplab/spark-ec2 Is this the right group to discuss spark-ec2? Any idea how

getting more concurrency best practices

2016-07-26 Thread Andy Davidson
Below is a very simple application. It runs very slowly. It does not look like I am getting a lot of parallel execution. I imagine this is a very common workflow. Periodically I want to run some standard summary statistics across several different data sets. Any suggestions would be greatly appre
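One common way to get more concurrency for a workflow like this is to launch the per-dataset jobs from separate driver threads so the scheduler can interleave them; a sketch, with listOfPaths and sqlContext assumed:

    from concurrent.futures import ThreadPoolExecutor

    def summarize(path):
        df = sqlContext.read.json(path)
        return path, df.describe().collect()

    # Spark's scheduler is thread-safe; submitting actions from several
    # threads lets independent jobs share the cluster instead of
    # running one after another
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(summarize, listOfPaths))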

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Andy Davidson
Yup in cluster mode you need to figure out what machine the driver is running on. That is the machine the UI will run on https://issues.apache.org/jira/browse/SPARK-15829 You may also have firewall issues. Here are some notes I made about how to figure out what machine the driver is running on w

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Andy Davidson
Hi Kevin Just a heads up: at the recent spark summit in S.F. there was a presentation about streaming in 2.0. They said that streaming was not going to be production ready in 2.0. I am not sure if the older 1.6.x version will be supported. My project will not be able to upgrade with streaming support

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
heap memory do you give the driver ? > > On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson > wrote: >> Given I get a stack trace in my python notebook I am guessing the driver is >> running out of memory? >> >> My app is simple it creates a list of dataFrames from s3:

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Given I get a stack trace in my python notebook I am guessing the driver is running out of memory? My app is simple: it creates a list of dataFrames from s3://, and counts each one. I would not think this would take a lot of driver memory. I am not running my code locally. It's using 12 cores. Each

running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Andy Davidson
u have any solution. > > Thanks > > 2016-07-21 18:44 GMT+02:00 Andy Davidson : >> Hi Pseudo >> >> Plotting, graphing, data visualization, report generation are common needs in >> scientific and enterprise computing. >> >> Can you tell me more abo

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
n, while 7077 is the legacy way. From user's aspect, it should be > transparent and no need to worry about the difference. > > * URL: spark://hw12100.local:7077 > * REST URL: spark://hw12100.local:6066 (cluster mode) > > Thanks > Saisai > > On Fri, Jul 22, 2016 at

How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
I have some very long lived streaming apps. They have been running for several months. I wonder if something has changed recently? I first started working with spark-1.3. I am using the standalone cluster manager. The way I would submit my app to run in cluster mode was port 6066 Looking at the

Re: spark and plot data

2016-07-21 Thread Andy Davidson
Hi Pseudo Plotting, graphing, data visualization, report generation are common needs in scientific and enterprise computing. Can you tell me more about your use case? What is it about the current process / workflow do you think could be improved by pushing plotting (I assume you mean plotting and

Re: write and call UDF in spark dataframe

2016-07-20 Thread Andy Davidson
Hi Divya In general you will get better performance if you can minimize your use of UDFs. Spark 2.0/tungsten does a lot of code generation. It will have to treat your UDF as a black box. Andy From: Rishabh Bhardwaj Date: Wednesday, July 20, 2016 at 4:22 AM To: Rabin Banerjee Cc: Divya Geh

Re: Role-based S3 access outside of EMR

2016-07-19 Thread Andy Davidson
Hi Everett I always do my initial data exploration and all our product development in my local dev env. I typically select a small data set and copy it to my local machine. My main() has an optional command line argument '--runLocal'. Normally I load data from either hdfs:/// or s3n://. If the a
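A minimal sketch of the --runLocal switch described above (paths are illustrative):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--runLocal", action="store_true")
    args = parser.parse_args()

    # small local sample during development, full data set otherwise
    if args.runLocal:
        dataURL = "file:///home/me/sample/data.json"
    else:
        dataURL = "hdfs:///prod/data.json"

    df = sqlContext.read.json(dataURL)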

Re: Trouble while running spark at ec2 cluster

2016-07-18 Thread Andy Davidson
Hi Hassan Typically I log on to my master to submit my app. [ec2-user@ip-172-31-11-222 bin]$ echo $SPARK_ROOT /root/spark [ec2-user@ip-172-31-11-222 bin]$ echo $MASTER_URL spark://ec2-54-215-11-222.us-west-1.compute.amazonaws.com:7077 [ec2-user@ip-172-31-11-222 bin]$ $SPARK_ROOT/bin/spark-

/spark-ec2 script: trouble using ganglia web ui spark 1.6.1

2016-07-11 Thread Andy Davidson
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script. The script output shows ganglia started, however I am not able to access http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I have tried using the private ip from within my data center. I do not see anything listening

Re: Spark Streaming - Direct Approach

2016-07-11 Thread Andy Davidson
Hi Pradeep I can not comment about experimental or production, however I recently started a POC using the direct approach. It's been running off and on for about 2 weeks. In general it seems to work really well. One thing that is not clear to me is how the cursor is managed. E.g. I have my topic set t

trouble accessing driver log files using rest-api

2016-07-11 Thread Andy Davidson
I am running spark-1.6.1 and the standalone cluster manager. I am running into performance problems with spark streaming and added some extra metrics to my log files. I submit my app in cluster mode. (I.e. the driver runs on a slave not master) I am not able to get the driver log files while the

WARN FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_201607111453_128606_m_000000_0 - s3n://

2016-07-11 Thread Andy Davidson
I am running into serious performance problems with my spark 1.6 streaming app. As it runs it gets slower and slower. My app is simple. * It receives fairly large and complex JSON files. (twitter data) * Converts the RDD to DataFrame * Splits the data frame into maybe 20 different data sets * W

spark UI what does storage memory x/y mean

2016-07-11 Thread Andy Davidson
My stream app is running into problems. It seems to slow down over time. How can I interpret the storage memory column? I wonder if I have a GC problem? Any idea how I can get GC stats? Thanks Andy Executors (3) * Memory: 9.4 GB Used (1533.4 MB Total) * Disk: 0.0 B Used Executor ID Address RDD Bloc

can I use ExectorService in my driver? was: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Andy Davidson
tion = createNewConnection() > partitionOfRecords.foreach(record => connection.send(record)) > connection.close() > } > > Hope this helps, > Ewan > > -Original Message- > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent: 08 July 2016 15:31 >

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Andy Davidson
Kafka has an interesting model that might be applicable. You can think of kafka as enabling a queue system. Writes are called producers, and readers are called consumers. The server is called a broker. A "topic" is like a named queue. Producers are independent. They can write to a "topic" at will.

is dataframe.write() async? Streaming performance problem

2016-07-07 Thread Andy Davidson
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the kafka direct stream approach. I am running into performance problems. My processing time is greater than my window size. Changing window sizes, adding cores and executor memory does not change performance. I am having a lot of trouble

Re: strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
: "user @spark" Subject: Re: strange behavior when I chain data frame transformations > In the structure shown, tag is under element. > > I wonder if that was a factor. > > On Fri, May 13, 2016 at 11:49 AM, Andy Davidson > wrote: >> I am using spark-1.6.1. >&g

strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
I am using spark-1.6.1. I create a data frame from a very complicated JSON file. I would assume that query planer would treat both version of my transformation chains the same way. // org.apache.spark.sql.AnalysisException: Cannot resolve column name "tag" among (actor, body, generator, pip, id,

How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Andy Davidson
I have a streaming app that receives very complicated JSON (twitter status). I would like to work with it as a hash map. It would be very difficult to define a POJO for this JSON. (I can not use twitter4j) // map json string to map JavaRDD<Map<String, Object>> jsonMapRDD = powerTrackRDD.map(new Function<String, Map<String, Object>>(){ private
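For comparison, in PySpark the same transformation is a one-liner and sidesteps the Java serialization issue entirely (powerTrackRDD as in the post):

    import json

    # each JSON status string becomes a plain Python dict, the analogue
    # of the Java Map<String, Object> above
    jsonMapRDD = powerTrackRDD.map(json.loads)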

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Andy Davidson
Hi Tobias I am very interested implemented rest based api on top of spark. My rest based system would make predictions from data provided in the request using models trained in batch. My SLA is 250 ms. Would you mind sharing how you implemented your rest server? I am using spark-1.6.1. I have se

java.io.NotSerializableException: org.apache.spark.sql.types.LongType

2016-04-21 Thread Andy Davidson
I started using http://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth in python. It was really easy to get the frequent item sets. Unfortunately associations is not implemented in python. Here is my python code. It works great rawJsonRDD = jsonToPythonDictionaries(sc,
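For reference, the frequent-itemset half that is available in Python looks roughly like this (transactionsRDD, an RDD of item lists, is assumed; as the post says, association rules had no Python API at the time):

    from pyspark.mllib.fpm import FPGrowth

    # each RDD element is one transaction, i.e. a list of items
    model = FPGrowth.train(transactionsRDD, minSupport=0.2, numPartitions=10)
    for fi in model.freqItemsets().collect():
        print(fi.items, fi.freq)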

custom transformer pipeline sample code

2016-04-20 Thread Andy Davidson
Someone recently asked me for a code example of how to write a custom pipeline transformer in Java. Enjoy, Share Andy https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spark/SparseVectorToLo

Re: Spark replacing Hadoop

2016-04-14 Thread Andy Davidson
Hi Ashok In general if I was starting a new project and had not invested heavily in hadoop (i.e. had a large staff that was trained on hadoop, had a lot of existing projects implemented on hadoop, …) I would probably start using spark. It's faster and easier to use. Your mileage may vary. Andy Fro

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread Andy Davidson
+1 From: Matei Zaharia Date: Tuesday, April 5, 2016 at 4:58 PM To: Xiangrui Meng Cc: Shivaram Venkataraman , Sean Owen , Xiangrui Meng , dev , "user @spark" , DB Tsai Subject: Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0 > This sounds good to me as well. The one thing

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
ich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wo

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
n > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com >

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
Hi Mich Yup I was surprised to find empty files. It's easy to work around. Note I should probably use coalesce() and not repartition(). In general I found I almost always need to repartition. I was getting thousands of empty partitions. It was really slowing my system down. private static void s
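A sketch of the coalesce() variant, which shrinks the partition count before writing (the target count and path are illustrative):

    # collapse thousands of mostly-empty partitions before writing;
    # unlike repartition(), coalesce() avoids a full shuffle
    df.coalesce(16).write.json("hdfs:///path/to/output")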

Re: Can spark somehow help with this usecase?

2016-04-05 Thread Andy Davidson
Hi Marco You might consider setting up some sort of ELT pipeline. One of your stages might be to create a file of all the FTP URLs. You could then write a spark app that just fetches the urls and stores the data in some sort of database or on the file system (hdfs?) My guess would be to maybe u

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-04-04 Thread Andy Davidson
... 1 more From: Jeff Zhang Date: Tuesday, March 29, 2016 at 10:34 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > Ac

data frame problem preserving sort order with repartition() and coalesce()

2016-03-29 Thread Andy Davidson
I have a requirement to write my results out into a series of CSV files. No file may have more than 100 rows of data. In the past my data was not sorted, and I was able to use repartition() or coalesce() to ensure the file length requirement. I realize that repartition() causes the data to be shuffle

Vectors.sparse exception: TypeError: indices array must be sorted

2016-03-29 Thread Andy Davidson
I am using pyspark 1.6.1 and python3. Any idea what my bug is? The indices appear to be sorted. Could it be that numDimensions = 713912692155621377 and my indices are longs, not ints? import numpy as np from pyspark.mllib.linalg import Vectors from pyspark.mllib.linalg import VectorUDT #sv1
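A 64-bit value like the numDimensions above cannot serve as a vector size or index directly; one workaround is to re-map the raw ids to a small, dense index space first. A sketch with made-up ids:

    from pyspark.mllib.linalg import Vectors

    # raw feature ids are 64-bit values, far too large to use as indices
    raw = {713912692155621377: 1.0, 98765432109876543: 2.0}

    # Vectors.sparse() requires sorted int indices smaller than the size,
    # so map each raw id to a dense index and sort the pairs
    idToIndex = {fid: i for i, fid in enumerate(sorted(raw))}
    pairs = sorted((idToIndex[fid], float(v)) for fid, v in raw.items())
    sv = Vectors.sparse(len(idToIndex), pairs)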

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
. >> >> from pyspark.sql.functions import max >> >> rows = idDF2.select(max("col[id]")).collect() >> firstRow = rows[0] >> >> # by index >> max = firstRow[0] >> >> # by column name >> max = firstRow["max(col[id])"

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
ot; Subject: Re: looking for an easy to to find the max value of a column in a data frame > e.g. select max value for column "foo": > > from pyspark.sql.functions import max, col > df.select(max(col("foo"))).show() > > On Tue, Mar 29, 2016 at 2:15

Re: Sending events to Kafka from spark job

2016-03-29 Thread Andy Davidson
Hi Fanoos I would be careful about using collect(). You need to make sure your local computer has enough memory to hold your entire data set. Eventually I will need to do something similar. I have not written the code yet. My plan is to load the data into a data frame and then write a UDF that actu

pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-28 Thread Andy Davidson
I am using pyspark spark-1.6.1-bin-hadoop2.6 and python3. I have a data frame with a column I need to convert to a sparse vector. I get an exception. Any idea what my bug is? Kind regards Andy Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang

looking for an easy to to find the max value of a column in a data frame

2016-03-28 Thread Andy Davidson
I am using pyspark 1.6.1 and python3. Given: idDF2 = idDF.select(idDF.id, idDF.col.id) idDF2.printSchema() idDF2.show() root |-- id: string (nullable = true) |-- col[id]: long (nullable = true) +--+--+ |id| col[id]| +--+--+ |1008930924| 534494917| |1

Re: --packages configuration equivalent item name?

2016-03-28 Thread Andy Davidson
Hi Russell I use Jupyter python notebooks a lot. Here is how I start the server set -x # turn debugging on #set +x # turn debugging off # https://github.com/databricks/spark-csv # http://spark-packages.org/package/datastax/spark-cassandra-connector #https://github.com/datastax/spark-cassand
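For reference, the configuration-file equivalent of the --packages flag is the spark.jars.packages property, e.g. in conf/spark-defaults.conf (coordinates copied from the threads above):

    spark.jars.packages  datastax:spark-cassandra-connector:1.6.0-M1-s_2.10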

Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Andy Davidson
> Have a look at the from_unixtime() functions. > https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime > > Thanks > Best Regards > > On Tue, Mar 22, 2016 at 4:49 AM, Andy Davidson > wrote: >> Any idea how I have a col in a data

pyspark sql convert long to timestamp?

2016-03-21 Thread Andy Davidson
I have a col in a data frame that is of type long. Any idea how I create a column whose type is timestamp? The long is unix epoch in ms. Thanks Andy
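A sketch of the from_unixtime() approach the reply above points at, assuming the long column is named ms:

    from pyspark.sql.functions import from_unixtime

    # from_unixtime() expects seconds, so scale the epoch-ms column down
    df2 = df.withColumn("ts", from_unixtime(df.ms / 1000).cast("timestamp"))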

bug spark should not use java.sql.timestamp was: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
Here is a nice analysis of the issue from the Cassandra mail list. (Datastax is the Databricks for Cassandra) Should I file a bug? Kind regards Andy http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date and this one On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer wrote:

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
Hi Davies > > What's the type of `created`? TimestampType? The 'created' column in cassandra is a timestamp https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html In the spark data frame it is a timestamp > > If yes, when created is compared to a string, it will be c

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
For completeness. Clearly spark sql returned a different data set In [4]: rawDF.selectExpr("count(row_key) as num_samples", "sum(count) as total", "max(count) as max ").show() +---++-+ |num_samples|total|max| +---

sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data. The data is originally captured by a spark streaming app and written to Cassandra. The value of the timestamp comes from Rdd.foreachRDD(new VoidFunction2, Time>() …}); I am

best practices: running multi user jupyter notebook server

2016-03-19 Thread Andy Davidson
We are considering deploying a notebook server for use by two kinds of users 1. interactive dashboard. > 1. I.e. Forms allow users to select data sets and visualizations > 2. Review real time graphs of data captured by our spark streams 2. General notebooks for Data Scientists My concern is inter

unix_timestamp() time zone problem

2016-03-19 Thread Andy Davidson
I am using python spark 1.6 and the --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 I need to convert a time stamp string into a unix epoch time stamp. The unix_timestamp() function assumes the current time zone. However my string data is UTC and encodes the time zone as zero. I
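One time-zone-independent workaround is a plain-Python UDF built on calendar.timegm(); a sketch, with the format string assumed:

    import calendar
    import time
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    def utcToEpoch(s):
        # timegm() treats the parsed struct_time as UTC, so the JVM's
        # default time zone never enters the picture
        return calendar.timegm(time.strptime(s, "%Y-%m-%d %H:%M:%S"))

    utcToEpochUDF = udf(utcToEpoch, LongType())
    df2 = df.withColumn("epoch", utcToEpochUDF(df.ts))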

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Davidson Cc: "user @spark" Subject: Re: what is the pyspark inverse of registerTempTable()? >>>> >>> sqlContext.registerDataFrameAsTable(df, "table1") >>>> >>> sqlContext.dropTempTable("table1") > > &g

what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks Andy

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
340-0466 > >> On Mar 15, 2016, at 11:45 AM, Andy Davidson >> wrote: >> >> We use the spark-ec2 script to create AWS clusters as needed (we do not use >> AWS EMR) >> 1. will we get better performance if we copy data to HDFS before we run >> instead o

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR) 1. will we get better performance if we copy data to HDFS before we run instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
bject: Re: trouble with NUMPY constructor in UDF > >> bq. epoch2numUDF = udf(foo, FloatType()) >> >> Is it possible that return value from foo is not FloatType ? >> >> On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson >> wrote: >>> I need to convert t

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
tor in UDF > bq. epoch2numUDF = udf(foo, FloatType()) > > Is it possible that return value from foo is not FloatType ? > > On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson > wrote: >> I need to convert time stamps into a format I can use with matplotlib >> plot_date().

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Andy Davidson
In my experience I would try the following. I use the standalone cluster manager. Each app gets its own performance web page. The streaming tab is really helpful. If processing time is greater than your mini-batch length you are going to run into performance problems. Use the "stages" tab to

trouble with NUMPY constructor in UDF

2016-03-09 Thread Andy Davidson
I need to convert time stamps into a format I can use with matplotlib plot_date(). epoch2num() works fine if I use it in my driver, however I get a numpy constructor error if I use it in a UDF. Any idea what the problem is? Thanks Andy P.S. I am using python3 and spark-1.6 from pyspark.sql.functio
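One plausible cause is that epoch2num() returns a numpy.float64, which Spark's row serializer rejects; coercing the result to a plain Python float is a minimal fix:

    from matplotlib.dates import epoch2num
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    # float() converts the numpy.float64 into a plain Python float
    # that Spark can map onto FloatType
    epoch2numUDF = udf(lambda ts: float(epoch2num(ts)), FloatType())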

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-09 Thread Andy Davidson
ark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters > > Looking at the logs, it seems your port config is not being set and it's > falling back to default. > Let me know if that helps. > > Saurabh Bajaj > > On Tue, Mar 8, 2

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
.126}:9042 > Have you contacted spark-cassandra-connector related mailing list ? > > I wonder where the port 9042 came from. > > Cheers > > On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson > wrote: >> >> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a p

pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook that reads a data frame from Cassandra. I connect to cassandra using an ssh tunnel running on port 9043. CQLSH works, however I can not figure out how to configure my notebook. I have tried various hacks; any idea what I a
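A sketch of pointing the connector at the ssh tunnel via the connection options the reply above references (host and port values are the ones from this thread):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("cassandra-notebook")
            # route the connector through the local ssh tunnel
            .set("spark.cassandra.connection.host", "127.0.0.1")
            .set("spark.cassandra.connection.port", "9043"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)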

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-07 Thread Andy Davidson
Hi Vinti I use the standalone cluster. The mgmt console provides a link to an app UI. It has all sorts of performance info. There should be a 'stages' tab. You can use it to find bottlenecks. Note the links in the mgmt console do not seem to work. The app UI runs on the same machine as the driv

streaming will I loose data if spark.streaming.backpressure.enabled=true

2016-03-07 Thread Andy Davidson
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications gives a brief discussion about max rate and back pressure. It's not clear to me what will happen. I use an unreliable receiver. Let's say my app is running and process time is less than window length. Happy

how to implement and deploy robust streaming apps

2016-03-07 Thread Andy Davidson
One of the challenges we need to prepare for with streaming apps is bursty data. Typically we need to estimate our worst case data load and make sure we have enough capacity. It is not obvious what best practices are with spark streaming. * we have implemented check pointing as described in the prog

streaming app performance: when would increasing executor size or adding more cores help?

2016-03-07 Thread Andy Davidson
We just deployed our first streaming apps. The next step is running them so they run reliably. We have spent a lot of time reading the various prog guides and looking at the standalone cluster manager app performance web pages. Looking at the streaming tab and the stages tab has been the most helpful

Re: newbie unable to write to S3 403 forbidden error

2016-02-24 Thread Andy Davidson
ng the s3 cmd line client. > > Also, if you are using instance profiles you don't need to specify access and > secret keys. No harm in specifying though. > > Regards > Sab > On 12-Feb-2016 2:46 am, "Andy Davidson" wrote: >> I am using spark 1.6.0

streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Andy Davidson
Currently our stream apps write results to hdfs. We are running into problems with HDFS becoming corrupted and running out of space. It seems like a better solution might be to write directly to S3. Is this a good idea? We plan to continue to write our checkpoints to hdfs. Are there any issues to

spark-1.6.0-bin-hadoop2.6/ec2/spark-ec2 uses old version of hadoop

2016-02-23 Thread Andy Davidson
I do not have any hadoop legacy code. My goal is to run spark on top of HDFS. Recently I have been having HDFS corruption problems. I was also never able to access S3 even though I used --copy-aws-credentials. I noticed that by default the spark-ec2 script uses hadoop 1.0.4. I ran help and discovered

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Andy Davidson
Hi Michael From: Michael Armbrust Date: Saturday, February 13, 2016 at 9:31 PM To: Koert Kuipers Cc: "user @spark" Subject: Re: GroupedDataset needs a mapValues > Instead of grouping with a lambda function, you can do it with a column > expression to avoid materializing an unnecessary tup

org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-12 Thread Andy Davidson
I am trying to add a column with a constant value to my data frame. Any idea what I am doing wrong? Kind regards Andy DataFrame result = … String exprStr = "lit(" + time.milliseconds() + ") as ms"; logger.warn("AEDWIP expr: {}", exprStr); result.selectExpr("*", exprStr).show(false); WAR
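lit() is a DataFrame-API function rather than a registered SQL function, so selectExpr() cannot resolve it here; the PySpark equivalent of the intended operation is a withColumn() call (timeMs stands in for time.milliseconds()):

    from pyspark.sql.functions import lit

    # add the constant through the DataFrame API instead of selectExpr()
    result = df.withColumn("ms", lit(timeMs))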

Re: Question on Spark architecture and DAG

2016-02-12 Thread Andy Davidson
From: Mich Talebzadeh Date: Thursday, February 11, 2016 at 2:30 PM To: "user @spark" Subject: Question on Spark architecture and DAG > Hi, > > I have used Hive on Spark engine and of course Hive tables and its pretty > impressive comparing Hive using MR engine. > > > > Let us assume t

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Andy Davidson
spark or hadoop. > > BR, > > Arkadiusz Bicz > https://www.linkedin.com/in/arkadiuszbicz > > On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson > wrote: >> We recently started a Spark/Spark Streaming POC. We wrote a simple streaming >> app in java to collect tweets.
