How to use Spark in a map-reduce flow to filter N columns and take the top M rows of all CSV files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited CSV files with the following attribute format (each CSV file is about 10GB):

id  name  address  city  ...
1   Matt  add1     LA    ...
2   Will  add2     LA    ...
3   Lucy  add3     SF    ...
...

And we have a lo

[Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-12 Thread Rex X
Hi, I want to use Spark to select N columns and the top M rows of all CSV files under a folder. To be concrete, say we have a folder with thousands of tab-delimited CSV files with the following attribute format (each CSV file is about 10GB):

id  name  address  city  ...
1   Matt  add1     LA
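A minimal sketch of one way to do this in PySpark, assuming the spark-csv package (the usual CSV reader in the Spark 1.x era); the folder path and column names below are illustrative, not from the original mail:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="select-n-top-m")
    sqlContext = SQLContext(sc)

    # One DataFrame over every tab-delimited file in the folder
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", delimiter="\t")
          .load("/path/to/folder/*.csv"))

    # N columns, top M rows (here N=3, M=100)
    df.select("id", "name", "city").limit(100).show()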

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Rex X
> Once you have that as a DataFrame, SQL can do the rest.
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> -Don
>
> On Fri, Jun 12, 2015 at 8:46 PM, Rex X wrote:
>
>> Hi,
>>
>> I want to use spark to select N columns, top M rows of
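A sketch of the SQL route Don suggests, continuing from the DataFrame df and sqlContext in the previous sketch; the table name "records" is just an example:

    # Register the DataFrame as a temp table, then let SQL do the rest
    df.registerTempTable("records")
    sqlContext.sql("SELECT id, name, city FROM records LIMIT 100").show()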

What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-14 Thread Rex X
For clustering analysis, we need a way to measure distances, but the data contains different levels of measurement: *binary / categorical (nominal), counts (ordinal), and ratio (scale)*. To be concrete, consider working with attributes of *city, zip, satisfaction_level, price*. In the meanwh

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated! -Rex

On Sun, Jun 14, 2015 at 10:05 AM, Rex X wrote:

> For clustering analysis, we need a way to measure distances.
>
> When the data contains different levels of measurement -

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
> …the integer representation.
>
> -sujit
>
> On Tue, Jun 16, 2015 at 1:17 PM, Rex X wrote:
>
>> Is it necessary to convert categorical data into integers?
>>
>> Any tips would be greatly appreciated!
>>
>> -Rex
>>
>> On Sun, Jun 14, 2015 at 10:05
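On the categorical-to-integer question: Spark ML's distance-based algorithms need numeric feature vectors, and a common route is index-then-one-hot so nominal categories don't pick up a false ordering. A minimal sketch, assuming a DataFrame df with the attributes from the original question:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    # city and zip are nominal: index them, then one-hot encode so the
    # integer codes don't imply an artificial ordering
    stages = []
    for col in ["city", "zip"]:
        stages.append(StringIndexer(inputCol=col, outputCol=col + "_idx"))
        stages.append(OneHotEncoder(inputCol=col + "_idx", outputCol=col + "_vec"))

    # satisfaction_level (ordinal) and price (ratio) can stay numeric
    stages.append(VectorAssembler(
        inputCols=["city_vec", "zip_vec", "satisfaction_level", "price"],
        outputCol="features"))

    features = Pipeline(stages=stages).fit(df).transform(df)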

How to concatenate two CSV files into one RDD?

2015-06-26 Thread Rex X
With Python Pandas, it is easy to concatenate dataframes by combining pandas.concat and pandas.read_csv:

pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in csvfiles])

where "csvfiles" is the list o
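A minimal sketch of the RDD-level equivalent: sc.textFile accepts a comma-separated list of paths (or a glob), so two files become one RDD directly. File names here are examples:

    from pyspark import SparkContext

    sc = SparkContext(appName="concat-csv")

    # Two files, one RDD of lines
    rdd = sc.textFile("/data/a.csv,/data/b.csv")

    # DataFrame equivalent: load each file (e.g. via spark-csv) and union them
    # df = df_a.unionAll(df_b)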

How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to choose between two tables based on a logical condition:

select * from if(A>B, tableA, tableB)

But the "if" function in Hive cannot be used within FROM as above. Any idea how?
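Since FROM must name a concrete table, one workaround is to evaluate the condition first and pick the table name on the driver. A sketch via PySpark's HiveContext; the one-row "metrics" table holding A and B is hypothetical:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="conditional-table")
    hc = HiveContext(sc)

    # Hypothetical: A and B live in a one-row metrics table
    row = hc.sql("SELECT A, B FROM metrics").first()
    table = "tableA" if row.A > row.B else "tableB"

    df = hc.sql("SELECT * FROM {0}".format(table))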

What's the best way to find the nearest-neighbor row of a matrix with 10 billion rows x 300 columns?

2016-05-17 Thread Rex X
Each row of the given matrix is a Vector[Double]. We want to find the nearest-neighbor row to each row using cosine similarity. The problem here is the complexity: a brute-force all-pairs comparison over 10^10 rows is O((10^10)^2) = O(10^20). We need to do *blocking*, and do the row-wise comparison within each block. Any tips for best practice? In Spark, we have
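A sketch of the blocking idea using random-hyperplane signatures (a crude locality-sensitive hash), so the exact cosine comparison only runs within a block. The toy rows RDD stands in for the 10-billion-row matrix; at real scale you would use many more hyperplanes (smaller blocks) and several hash tables to limit missed neighbors:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="blocked-cosine-nn")

    # Toy stand-in for the real matrix: RDD[(row_id, 300-dim vector)]
    rows = sc.parallelize([(i, np.random.randn(300)) for i in range(1000)])

    np.random.seed(42)
    planes = np.random.randn(8, 300)  # 8 random hyperplanes -> up to 256 blocks

    def signature(vec):
        # Which side of each hyperplane; similar vectors tend to share signatures
        return tuple((planes.dot(vec) > 0).astype(int))

    def nearest_within(block):
        items = list(block)
        for i, (id_i, v_i) in enumerate(items):
            best_id, best_sim = None, -1.0
            for j, (id_j, v_j) in enumerate(items):
                if i != j:
                    sim = v_i.dot(v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
                    if sim > best_sim:
                        best_id, best_sim = id_j, sim
            if best_id is not None:
                yield (id_i, (best_id, float(best_sim)))

    neighbors = (rows.map(lambda kv: (signature(kv[1]), kv))
                     .groupByKey()
                     .flatMap(nearest_within))
    print(neighbors.take(5))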

What is the best way to JOIN two 10TB CSV files and three 100KB files on Spark?

2016-02-05 Thread Rex X
Dear all, the new DataFrame API of Spark is extremely fast. But our cluster has limited RAM (~500GB). What is the best way to do such a big table join? Any sample code is greatly welcome! Best, Rex
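A hedged sketch: the three ~100KB tables are small enough to broadcast, which avoids shuffling the 10TB sides for those joins, while the big-vs-big join has to shuffle regardless. DataFrame names and join keys below are illustrative (broadcast() is in pyspark.sql.functions as of Spark 1.6):

    from pyspark.sql.functions import broadcast

    # big_a, big_b: the two 10TB DataFrames; tiny1, tiny2: the ~100KB lookups
    # (all assumed already loaded, e.g. via spark-csv)

    # Map-side joins: each tiny table is shipped whole to every executor
    enriched = (big_a.join(broadcast(tiny1), "key1")
                     .join(broadcast(tiny2), "key2"))

    # The 10TB-vs-10TB join must shuffle; join on the common key column
    result = enriched.join(big_b, "id")
    result.write.parquet("/output/joined")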

What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Hi everyone, what is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we can bring together the full power of both scikit-learn and Spark to do scalable machine learning. (I know we have MLlib. But the existing code base is big, and some functions are not fully supporte
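One common pattern until MLlib covers everything: keep scikit-learn for the modeling and let Spark parallelize the embarrassingly parallel part, e.g. a hyperparameter grid over a broadcast training set. A minimal sketch with made-up toy data:

    import numpy as np
    from pyspark import SparkContext
    from sklearn.linear_model import LogisticRegression

    sc = SparkContext(appName="sklearn-grid")

    # Toy training set standing in for the real one; broadcast it once
    X = np.random.randn(1000, 10)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    bX, by = sc.broadcast(X), sc.broadcast(y)

    def fit_and_score(C):
        # Each task fits one scikit-learn model on the broadcast data
        model = LogisticRegression(C=C).fit(bX.value, by.value)
        return (C, model.score(bX.value, by.value))

    grid = [0.01, 0.1, 1.0, 10.0]
    print(sc.parallelize(grid).map(fit_and_score).collect())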

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
>> …have to do the plumbing all yourself. This is the same for all
>> commercial and non-commercial libraries/analytics packages. It often also
>> depends on the functional requirements on how you distribute.
>>
>> On Sat, Sep 12, 2015 at 20:18, Rex X wrote:
>>

Can we do dataframe.query like Pandas dataframe in Spark?

2015-09-17 Thread Rex X
With a Pandas dataframe, we can do a query:

>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')

This SQL-select-like query

Re: Can we do dataframe.query like Pandas dataframe in Spark?

2015-09-17 Thread Rex X
> df.where("a > b").show()
>
> +------------------+-------------------+
> |                 a|                  b|
> +------------------+-------------------+
> |0.6697439215581628|0.23420961030968923|
> |0.9248996796756386| 0.4146647917936366|
> +------------------+-------------------+
>
> On Thu, Sep 17, 2015 at 9:32 AM
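A runnable version of the exchange above, assuming a 1.5-era SQLContext; both the SQL-string predicate and the equivalent column-expression form are shown:

    import random
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="query-like-pandas")
    sqlContext = SQLContext(sc)

    rows = [Row(a=random.gauss(0, 1), b=random.gauss(0, 1)) for _ in range(10)]
    df = sqlContext.createDataFrame(rows)

    df.where("a > b").show()       # SQL-string predicate, like df.query('a > b')
    df.filter(df.a > df.b).show()  # equivalent column-expression form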

How to do this pairing in Spark?

2016-08-25 Thread Rex X
1. Given following CSV file

> $cat data.csv
>
> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 3,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 6,C,95124,0
> 7,C,95127,1
> 8,C,95127,0
> 9,C,95127,1

(a) where "ID" above is a primary key (unique), (b) for each

Re: How to do this pairing in Spark?

2016-08-26 Thread Rex X
…:46 AM, ayan guha wrote:

> Why should 3 and 9 be deleted? 3 can be paired with 1, and 9 can be paired
> with 8.
>
> On 26 Aug 2016 11:00, "Rex X" wrote:
>
>> 1. Given following CSV file
>>
>> > $cat data.csv
>> >
>> > ID,City,Zi
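The original question is truncated, but from this reply the rule appears to be: within each (City, Zip) block, pair a Flag=0 row with a Flag=1 row and drop whatever cannot be paired. A sketch under that assumption, using the data from the question:

    from pyspark import SparkContext

    sc = SparkContext(appName="pairing")

    data = [(1, "A", 95126, 0), (2, "A", 95126, 1), (3, "A", 95126, 1),
            (4, "B", 95124, 0), (5, "B", 95124, 1), (6, "C", 95124, 0),
            (7, "C", 95127, 1), (8, "C", 95127, 0), (9, "C", 95127, 1)]

    def pair_group(rows):
        zeros = sorted(r[0] for r in rows if r[3] == 0)
        ones = sorted(r[0] for r in rows if r[3] == 1)
        return list(zip(zeros, ones))   # leftovers drop out unpaired

    pairs = (sc.parallelize(data)
               .groupBy(lambda r: (r[1], r[2]))     # block on (City, Zip)
               .flatMap(lambda kv: pair_group(kv[1]))
               .collect())
    print(pairs)   # e.g. [(1, 2), (4, 5), (8, 7)]; IDs 3, 6, 9 stay unpaired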

How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
1. Given following CSV file

$cat data.csv

ID,City,Zip,Price,Rating
1,A,95123,100,0
1,B,95124,102,1
1,A,95126,100,1
2,B,95123,200,0
2,B,95124,201,1
2,C,95124,203,0
3,A,95126,300,1
3,C,95124,280,0
4,C,95124,400,1

We want to group by ID, and make new composite columns of Price and Rating based on the value

Re: How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
The data.csv needs to be corrected:

1. Given following CSV file

$cat data.csv

ID,City,Zip,Price,Rating
1,A,95123,100,1
1,B,95124,102,2
1,A,95126,100,2
2,B,95123,200,1
2,B,95124,201,2
2,C,95124,203,1
3,A,95126,300,2
3,C,95124,280,1
4,C,95124,400,2

On Fri, Aug 26, 2016 at 4:54 AM, Rex X wrote
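The question text is cut off, so the exact composite-column rule is a guess; assuming one Price/Rating pair per City value and ID, the DataFrame pivot added in Spark 1.6 gets close. A sketch, with df loaded from the corrected data.csv above:

    from pyspark.sql import functions as F

    # df: columns ID, City, Zip, Price, Rating (from the corrected data.csv)
    wide = (df.groupBy("ID")
              .pivot("City")                    # one column set per City
              .agg(F.first("Price"), F.first("Rating")))
    wide.show()   # F.first picks one value when an (ID, City) pair repeats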

Is a Spark 2.0 master node compatible with Spark 1.5 worker nodes?

2016-09-04 Thread Rex X
I wish to use the pivot-table feature of DataFrames, which has been available since Spark 1.6. But the Spark version on the current cluster is 1.5. Can we install Spark 2.0 on the master node to work around this? Thanks!

Re: Is a Spark 2.0 master node compatible with Spark 1.5 worker nodes?

2016-09-26 Thread Rex X
>>> On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <hol...@pigscanfly.ca> wrote:
>>>
>>> You really shouldn't mix different versions of Spark between the master
>>> and worker nodes, if your