Re: Parsing a large XML file using Spark

2015-11-04 Thread Jin
I recently worked a bit on datasources and Parquet in Spark, and someone asked me to make an XML datasource plugin, so I did: https://github.com/HyukjinKwon/spark-xml It tries to get rid of the in-line format, just like the JSON datasource in Spark. Although I didn't add a CI tool for this yet,
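
A hedged usage sketch (not from the original thread): assuming the plugin is registered under the "xml" short name and supports a rowTag option, as later spark-xml releases do, reading a file could look roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Treat each <book> element in books.xml as one row; the "xml" format name
# and the rowTag option follow later spark-xml releases and may differ in the
# original plugin.
df = (spark.read
      .format("xml")
      .option("rowTag", "book")
      .load("books.xml"))
df.printSchema()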

dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys, I have a problem with a Spark DataFrame. My Spark version is 1.6.1. Basically, I used a udf and df.withColumn to create a "new" column, and then I filtered the values on this new column and called show() (an action). I see the udf function (which is used by withColumn to create the new column) is
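
A minimal sketch of the situation being described, written against the Spark 2.x Python API rather than the 1.6.1 sqlContext API of the thread (column and function names are made up):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def tag(v):
    print("udf called with", v)  # the side effect makes repeated evaluation visible in the logs
    return "tag_" + v

tag_udf = F.udf(tag, StringType())

df = spark.createDataFrame([("a",), ("b",)], ["value"])
out = df.withColumn("new_col", tag_udf("value")).filter(F.col("new_col") != "tag_a")
out.show()  # the udf may run once to evaluate the filter and again when the column is projected

One common workaround is to persist() the intermediate dataframe before filtering, so the column is computed only once.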

Lots of spark-assembly jars localized to /usercache/username/filecache directory

2016-09-30 Thread Lantao Jin
Hi, Our Spark is deployed on YARN, and I found there were lots of spark-assembly jars in the filecache directories of heavy Spark users (aka /usercache/username/filecache), and as you know the assembly jar was bigger than 100 MB before Spark v2. So all of them take 26GB (1/4 of the reserved space) on most of the Datanod

SparkR execution hangs when handling an RDD converted from a DataFrame

2016-10-13 Thread Lantao Jin
sqlContext <- sparkRHive.init(sc)
sqlString <- "SELECT key_id, rtl_week_beg_dt rawdate, gmv_plan_rate_amt value FROM metrics_moveing_detection_cube "
df <- sql(sqlString)
rdd <- SparkR:::toRDD(df)
# hang on case one: take from rdd
# take(rdd, 3)
# hang on case two: convert back to dataframe
# df1 <- create

Re: SparkR execution hangs when handling an RDD converted from a DataFrame

2016-10-14 Thread Lantao Jin
40GB

2016-10-14 14:20 GMT+08:00 Felix Cheung:
> How big is the metrics_moveing_detection_cube table?
>
> On Thu, Oct 13, 2016 at 8:51 PM -0700, "Lantao Jin" wrote:
>
> sqlContext <- sparkRHive.init(sc)
> sqlString <- "SELECT

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-12 Thread Li Jin
Yeoul, I think one way you could run a microbenchmark for pyspark serialization/deserialization would be to run a withColumn + a Python udf that returns a constant, and compare that with similar code in Scala. I am not sure if there is a way to measure just the serialization code, because the pyspark API only allo
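
A rough sketch of that microbenchmark as I understand the suggestion (the spark.range input, row count, and timing approach are my own assumptions):

import time
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# A Python udf that ignores its input and returns a constant:
# most of the remaining cost is shipping rows to the Python worker and back.
const_udf = F.udf(lambda x: 1, IntegerType())

df = spark.range(10 * 1000 * 1000)

start = time.time()
df.withColumn("c", const_udf("id")).agg(F.count("c")).collect()
print("python udf:", time.time() - start)

# Baseline that stays in the JVM, for comparison.
start = time.time()
df.withColumn("c", F.lit(1)).agg(F.count("c")).collect()
print("builtin lit:", time.time() - start)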

Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin
I am not an expert on this, but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your dataframe is the result of an order by, Catalyst will skip the sorting when it does a sort merge join. If your dataframe is created from storage, for instance ParquetRelation,
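
One way to check this for a concrete query is to look at the physical plan and see whether a Sort node is inserted on each side of the SortMergeJoin; a small sketch (column names are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1000).withColumnRenamed("id", "k").orderBy("k")
right = spark.range(1000).withColumnRenamed("id", "k")

# explain() prints the physical plan; look for Sort nodes feeding the join.
left.join(right, "k").explain()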

Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Hi, I am wondering, does pyspark standalone (local) mode support multiple cores/executors? Thanks, Li
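
For local mode, the number of worker threads is controlled by the master URL; a small sketch (the app name and core count are arbitrary):

from pyspark.sql import SparkSession

# local[4] runs the driver and executor work as threads in a single JVM using 4 cores;
# local[*] uses every core the machine has.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("multi-core-local")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)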

Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
/sbin/start-master.sh script) and at least one worker node (can
> be started using SPARK_HOME/sbin/start-slave.sh script). SparkConf should
> use the master node address to create it (spark://host:port)
>
> Thanks!
>
> Gangadhar
>
> From: Li Jin <ice.xell...@gmail.com>
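
As a hedged sketch of the last step described above, once the master and at least one worker are running, the application points at the standalone master URL (the host and port here are placeholders; 7077 is only the default):

from pyspark.sql import SparkSession

# spark://<master-host>:<port> is the URL printed by sbin/start-master.sh
spark = (SparkSession.builder
         .master("spark://master-host:7077")
         .appName("standalone-example")
         .getOrCreate())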

Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Hi All, This is Li Jin. We (my colleagues at Two Sigma and I) have been using Spark for time series analysis for the past two years, and it has been a success in scaling up our time series analysis. Recently, we started a conversation with Reynold about potential opportunities to collaborate

Re: spark.sql.autoBroadcastJoinThreshold not taking effect

2019-05-13 Thread Lantao Jin
Maybe you could try "--conf spark.sql.statistics.fallBackToHdfs=true"

On 2019/05/11 01:54:27, V0lleyBallJunki3 wrote:
> Hello,
> I have set spark.sql.autoBroadcastJoinThreshold=1GB and I am running the
> spark job. However, my application is failing with:
> at sun.reflect.NativeMetho
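
A sketch of how those two settings fit together when building the session (the threshold value and setting them in code rather than via --conf are assumptions on my part):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Fall back to HDFS file sizes when catalog statistics are missing,
         # so the broadcast threshold below has something to compare against.
         .config("spark.sql.statistics.fallBackToHdfs", "true")
         # Threshold is in bytes; -1 disables automatic broadcast joins.
         .config("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
         .getOrCreate())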

akka.FrameSize

2014-06-16 Thread Chen Jin
Hi all, I have run into a very interesting bug which is not exactly the same as SPARK-1112. Here is how to reproduce the bug: I have one input csv file and use the partitionBy function to create an RDD, say repartitionedRDD. The partitionBy function takes the number of partitions as a parameter such t
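
A minimal sketch of the setup being described, as I read it (the file name, key choice, and partition count are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# partitionBy applies to key-value RDDs, so key each CSV line by its first field.
pairs = sc.textFile("input.csv").map(lambda line: (line.split(",")[0], line))

# The number of partitions is passed as a parameter, as in the report above.
repartitionedRDD = pairs.partitionBy(16)
print(repartitionedRDD.getNumPartitions())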

query classification using Apache Spark MLlib

2014-12-08 Thread Huang,Jin
I have a question, as the title says; the question link is http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib , thanks. Jin

implement query to sparse vector representation in spark

2014-12-09 Thread Huang,Jin
I know quite a lot about machine learning, but I am new to Scala and Spark. I got stuck due to the Spark API, so please advise. I have a txt file with each line in this format:
#label \t #query (a string of words, delimited by space)
1 wireless amazon kindle
2 apple iPhone 5
1 kindle fire 8G
2 ap
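
One possible way to turn lines like these into a sparse vector representation with MLlib is the hashing trick; a sketch assuming the file is tab-separated exactly as shown (the feature count and file name are placeholders):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# Hash each query's words into a fixed-size sparse vector.
tf = HashingTF(numFeatures=1 << 18)

def parse(line):
    label, query = line.split("\t")
    return LabeledPoint(float(label), tf.transform(query.split(" ")))

data = sc.textFile("queries.txt").map(parse)
print(data.first())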

unsubscribe

2025-02-07 Thread Lantao Jin

[jira] Lantao Jin shared "SPARK-20680: Spark-sql do not support for void column datatype of view" with you

2017-05-09 Thread Lantao Jin (JIRA)
Lantao Jin shared an issue with you

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
> Key: SPARK-20680
> URL: https://issues.

[jira] Lantao Jin shared "SPARK-21023: Ignore to load default properties file is not a good choice from the perspective of system" with you

2017-06-10 Thread Lantao Jin (JIRA)
Lantao Jin shared an issue with you

Hi all, do you think this is a bug? Should we still keep the current behavior?

> Ignore to load default properties file is not a good choice from the
> perspective of