Thanks everyone for the help!
On Sat, Aug 29, 2015 at 2:55 AM, Alexey Grishchenko <programme...@gmail.com> wrote:

> If the data is already in an RDD, the easiest way to calculate min/max for
> each column is the aggregate() function. It takes two functions as
> arguments: the first aggregates RDD values into your "accumulator", the
> second merges two accumulators. This way both min and max for all the
> columns in your RDD are calculated in a single pass over it. Here's an
> example in Python:
>
>     import random
>
>     def agg1(x, y):
>         if len(x) == 0: x = [y, y]
>         return [map(min, zip(x[0], y)), map(max, zip(x[1], y))]
>
>     def agg2(x, y):
>         if len(x) == 0: x = y
>         return [map(min, zip(x[0], y[0])), map(max, zip(x[1], y[1]))]
>
>     rdd = sc.parallelize(xrange(100000), 5)
>     rdd2 = rdd.map(lambda x: [random.randint(1, 100) for _ in xrange(15)])
>     rdd2.aggregate([], agg1, agg2)
>
> What I would personally do in your case depends on what else you want to
> do with the data. If you plan to run some more business logic on top of it
> and you're more comfortable with SQL, it might be worth registering this
> DataFrame as a table and generating a SQL query against it (a string with
> a series of min-max calls). But to solve your specific problem I'd load
> the file with textFile(), use a map() transformation to split each line by
> comma and convert it to an array of doubles, and call aggregate() on top
> of it just as shown in the example above.
>
> On Fri, Aug 28, 2015 at 6:15 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Or you can just call describe() on the DataFrame. In addition to min and
>> max, you'll also get the mean and the count of non-null and non-NA
>> elements.
>>
>> Burak
>>
>> On Fri, Aug 28, 2015 at 10:09 AM, java8964 <java8...@hotmail.com> wrote:
>>
>>> Or won't RDD.max() and RDD.min() work for you?
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Subject: Re: Calculating Min and Max Values using Spark Transformations?
>>> To: as...@wso2.com
>>> CC: user@spark.apache.org
>>> From: jfc...@us.ibm.com
>>> Date: Fri, 28 Aug 2015 09:28:43 -0700
>>>
>>> If you already loaded the csv data into a DataFrame, why not register it
>>> as a table and use Spark SQL to find the max/min or any other aggregates?
>>> SELECT MAX(column_name) FROM dftable_name ... seems natural.
>>>
>>> *JESSE CHEN*
>>> Big Data Performance | IBM Analytics
>>>
>>> Office: 408 463 2296
>>> Mobile: 408 828 9068
>>> Email: jfc...@us.ibm.com
>>>
>>> From: ashensw <as...@wso2.com>
>>> To: user@spark.apache.org
>>> Date: 08/28/2015 05:40 AM
>>> Subject: Calculating Min and Max Values using Spark Transformations?
>>>
>>> ------------------------------
>>>
>>> Hi all,
>>>
>>> I have a dataset which consists of a large number of features (columns).
>>> It is in csv format, so I loaded it into a Spark DataFrame. Then I
>>> converted it into a JavaRDD<Row>, then using a Spark transformation into
>>> a JavaRDD<String[]>, and then again into a JavaRDD<double[]>. So now I
>>> have a JavaRDD<double[]>. Is there any method to calculate the max and
>>> min values of each column in this JavaRDD<double[]>?
>>>
>>> Or is there any way to access the array if I store the max and min
>>> values in an array inside the Spark transformation class?
>>>
>>> Thanks.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
> --
> Best regards, Alexey Grishchenko
>
> phone: +353 (87) 262-2154
> email: programme...@gmail.com
> web: http://0x0fff.com

--
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: as...@wso2.com
Mobile: +94 716042995
LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
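For the archive: the seqOp/combOp pattern from Alexey's reply can also be written for Python 3, where map() returns an iterator rather than a list, so explicit list comprehensions are clearer. This is only a sketch (the names seq_op/comb_op and the sample rows are mine, not from the thread); the same two functions could be passed to rdd.aggregate([], seq_op, comb_op) unchanged, and the local reduce below just demonstrates the same single-pass fold without a Spark cluster:

```python
def seq_op(acc, row):
    # Fold one row into the accumulator. acc is [mins, maxs],
    # or the empty zero value [] before the first row of a partition.
    if not acc:
        return [list(row), list(row)]
    return [[min(a, b) for a, b in zip(acc[0], row)],
            [max(a, b) for a, b in zip(acc[1], row)]]

def comb_op(acc1, acc2):
    # Merge two partition accumulators element-wise; either one may
    # still be the empty zero value if its partition had no rows.
    if not acc1:
        return acc2
    if not acc2:
        return acc1
    return [[min(a, b) for a, b in zip(acc1[0], acc2[0])],
            [max(a, b) for a, b in zip(acc1[1], acc2[1])]]

# Local demonstration: two "partitions" of 2-column rows, folded
# exactly the way aggregate() would fold them.
from functools import reduce
rows = [[3.0, 9.0], [1.0, 7.0], [5.0, 2.0]]
partitions = (rows[:2], rows[2:])
mins, maxs = reduce(comb_op,
                    [reduce(seq_op, part, []) for part in partitions],
                    [])
print(mins, maxs)  # [1.0, 2.0] [5.0, 9.0]
```

As the other replies point out, once the data is in a DataFrame the same numbers fall out of df.describe() or of a registered table queried with SELECT MIN(c), MAX(c) per column; the hand-written fold only pays off when you are already down at the RDD<double[]> level.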