date:20180822

Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread t4

https://issues.apache.org/jira/browse/SPARK-23576 ?

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Matei Zaharia

Hi Steffen, Thanks for sharing your results about MLlib — this sounds like a useful tool. However, I wanted to point out that some of the results may be expected for certain machine learning algorithms, so it might be good to design those tests with that in mind. For example: > - The classific

Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew

Hello Friends, I’ve encountered a bug where spark silently corrupts data when reading from a parquet hive table where the table schema does not match the file schema. I’d like to give a shot at adding some extra validations to the code to handle this corner case and I was wondering if anyone h

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Ankur Gupta

Thanks for your responses Saisai and Marco. I agree that "rename" operation can be time-consuming on object storage, which can potentially delay the shutdown. I also agree that customers/users have a way to use log appenders to write log files and then send them along with Yarn application logs b

Spark github sync works now

2018-08-22 Thread Xiao Li

FYI. The Spark github sync was 10 hours behind this morning. You might have gotten failed merges because of this. I just triggered a re-sync. It should work now. Thanks, Xiao

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Sean Owen

Certainly if your tests have found a problem, open a JIRA and/or pull request with the fix and relevant tests. More tests generally can't hurt, though I guess we should maybe have a look at them first. If they're a lot of boilerplate and covering basic functions already covered by other tests, the

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz

Given the popularity of related SO questions: - https://stackoverflow.com/q/41670103/1560062 - https://stackoverflow.com/q/42465568/1560062 - https://stackoverflow.com/q/41670103/1560062 it is probably more "nobody thought about asking" than "it is not used often". On Wed, 22 Aug 2018 at

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Mike Hynes

Hi Reynold/Ivan, People familiar with pandas and R dataframes will likely have used the dataframe "melt" idiom, which is the functionality I believe you are referring to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html I have had to write this function myself in my own wor
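For readers unfamiliar with the "melt" idiom mentioned above, here is a minimal sketch in plain Python of what an unpivot does. It is not the pandas or Spark implementation; the function name `melt_rows`, its parameters, and the sample data are all hypothetical and chosen only for illustration. Rows are emitted row by row, which differs from the column-by-column order pandas uses.

```python
from typing import Any, Dict, List, Sequence


def melt_rows(
    rows: Sequence[Dict[str, Any]],
    id_vars: Sequence[str],
    value_vars: Sequence[str],
    var_name: str = "variable",
    value_name: str = "value",
) -> List[Dict[str, Any]]:
    """Unpivot wide rows into long rows: one output row per (input row, value column)."""
    long_rows: List[Dict[str, Any]] = []
    for row in rows:
        for column in value_vars:
            # Carry the identifier columns over unchanged.
            out = {key: row[key] for key in id_vars}
            # Record which wide column this value came from, and the value itself.
            out[var_name] = column
            out[value_name] = row[column]
            long_rows.append(out)
    return long_rows


# Hypothetical sample data: two wide rows with two measurement columns.
wide_rows = [
    {"id": 1, "q1": 10, "q2": 20},
    {"id": 2, "q1": 30, "q2": 40},
]
long_rows = melt_rows(wide_rows, id_vars=["id"], value_vars=["q1", "q2"])
for r in long_rows:
    print(r)
```

Each wide row with two value columns becomes two long rows, so the id column is repeated and the column names move into the data.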

[MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Steffen Herbold

Dear developers, I am writing to you because I applied an approach for the automated testing of classification algorithms to Spark MLlib and would like to forward the results to you. The approach is a combination of smoke testing and metamorphic testing. The smoke tests try to find problems by
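As a rough illustration of what a metamorphic test can look like, the sketch below checks one relation: reordering the training data should not change the predictions of a deterministic classifier. The excerpt above does not say which relations were actually used, so this relation is an assumption, and the toy nearest-centroid classifier and all function names are hypothetical stand-ins rather than MLlib code.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def train_centroids(points: Sequence[Point], labels: Sequence[str]) -> Dict[str, Point]:
    """Toy classifier: compute the mean point (centroid) of each class."""
    sums: Dict[str, Tuple[float, float, int]] = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids: Dict[str, Point], point: Point) -> str:
    """Return the label of the nearest centroid; ties go to the smallest label."""
    px, py = point
    return min(
        sorted(centroids),
        key=lambda label: (centroids[label][0] - px) ** 2 + (centroids[label][1] - py) ** 2,
    )


def permutation_relation_holds(
    points: Sequence[Point], labels: Sequence[str], queries: Sequence[Point], seed: int = 0
) -> bool:
    """Metamorphic relation: shuffling the training set must not change predictions."""
    baseline = [predict(train_centroids(points, labels), q) for q in queries]
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    shuffled_points = [points[i] for i in order]
    shuffled_labels = [labels[i] for i in order]
    followup = [predict(train_centroids(shuffled_points, shuffled_labels), q) for q in queries]
    return baseline == followup


# Hypothetical data with small integer-valued coordinates, so sums are exact in floating point.
points: List[Point] = [(0.0, 0.0), (1.0, 1.0), (8.0, 8.0), (9.0, 9.0)]
labels = ["a", "a", "b", "b"]
queries: List[Point] = [(0.5, 0.5), (7.0, 9.0), (4.0, 4.0)]
relation_ok = permutation_relation_holds(points, labels, queries)
print("permutation relation holds:", relation_ok)
```

The point of such a test is that no ground-truth labels for the queries are needed: only the agreement between the original run and the transformed run is checked. With real-valued data, floating-point summation order can legitimately cause small differences, which is one reason some violations may be expected rather than bugs.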

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun

Manu, thank you very much for your response. 1. Your post helps to further optimize the spark jobs for wide data. (https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015) The suggested change of code: df.select(df.columns.map { col => df(col).isNotNull }: _*) provides

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Marco Gaido

I agree with Saisai. You can also configure log4j to append anywhere else other than the console. Many companies have their system for collecting and monitoring logs and they just customize the log4j configuration. I am not sure how needed this change would be. Thanks, Marco Il giorno mer 22 ago

11 matches
