Re: [Spark SQL] Does Spark group small files

2018-11-13 Thread Silvio Fiorito
Yes, it does bin-packing for small files, which is a good thing: you avoid having many small partitions, especially if you're writing this data back out (e.g. it's compacting as you read). The default partition size is 128 MB, with a 4 MB "cost" for opening files. You can configure this using the s…
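The split-size math behind this answer can be sketched numerically. The model below is an assumption-laden reconstruction of Spark 2.x's file-source partitioning (the function names `max_split_bytes` and `pack_partitions` are mine, not Spark API): the split size is roughly `min(maxPartitionBytes, max(openCostInBytes, totalBytes / defaultParallelism))`, after which files are greedily packed into partitions up to that size.

```python
def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,   # 128 MB default
                    open_cost_in_bytes=4 * 1024 * 1024):     # 4 MB default
    # Sketch of Spark's file-source split-size formula (assumption: Spark 2.x).
    # Each file is charged its size plus a fixed cost for opening it.
    total = sum(s + open_cost_in_bytes for s in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

def pack_partitions(file_sizes, split_bytes,
                    open_cost_in_bytes=4 * 1024 * 1024):
    # Greedy bin-packing: largest files first; close a partition once adding
    # the next file would exceed the split size.
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost_in_bytes
        if current and current_size + cost > split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += cost
    if current:
        partitions.append(current)
    return partitions

# 5000 files of 1 MB each, 200 cores: many files per task, far fewer than
# 5000 partitions -- the same compaction effect described in the thread.
split = max_split_bytes([1024 * 1024] * 5000, default_parallelism=200)
parts = pack_partitions([1024 * 1024] * 5000, split)
```

This is a didactic model, not Spark's exact implementation, but it shows why a read of thousands of tiny files yields only a few hundred tasks.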

[ANNOUNCE] Apache Toree 0.3.0-incubating Released

2018-11-13 Thread Luciano Resende
Apache Toree is a kernel for the Jupyter Notebook platform providing interactive and remote access to Apache Spark. The Apache Toree community is pleased to announce the release of Apache Toree 0.3.0-incubating, which provides various bug fixes and the following enhancements: * Fix JupyterLab s…

[ANNOUNCE] Apache Bahir 2.2.2 Released

2018-11-13 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.2.2, which provides the following extensions for Apache S…

[ANNOUNCE] Apache Bahir 2.1.3 Released

2018-11-13 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.1.3, which provides the following extensions for Apache S…

inferred schemas for spark streaming from a Kafka source

2018-11-13 Thread Colin Williams
Does anybody know how to use inferred schemas with structured streaming? https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets I have some code like: object StreamingApp { def launch(config: Config, spa…
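For context on the question: the Kafka source delivers opaque `value` bytes, so (as I understand it) Spark cannot infer a schema from the stream itself; the usual workaround is to infer the schema from a static sample (e.g. `spark.read.json` over a few messages) and then apply it on the stream with `from_json`. The stdlib sketch below models that two-step pattern only, it is not Spark code; `infer_schema` and `parse_with_schema` are hypothetical stand-ins:

```python
import json

def infer_schema(sample_record):
    # Stand-in for schema inference over a static sample
    # (spark.read.json): map each field name to its value's type.
    return {k: type(v).__name__ for k, v in sample_record.items()}

def parse_with_schema(raw_message, schema):
    # Stand-in for from_json(col("value").cast("string"), schema):
    # parse the message and keep only the fields the schema knows about.
    record = json.loads(raw_message)
    return {k: record.get(k) for k in schema}

sample = {"user": "alice", "count": 3}
schema = infer_schema(sample)
row = parse_with_schema('{"user": "bob", "count": 7, "extra": true}', schema)
```

The design point is that inference happens once, against batch data, and the resulting fixed schema is what the streaming query runs with.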

[Spark SQL] Does Spark group small files

2018-11-13 Thread Yann Moisan
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small Parquet files from S3. When I do a mapPartitions followed by a collect, only *278* tasks are used (I would have expected 5000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any l…
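On the "is it configurable" part: yes, the grouping is driven by two file-source settings. The fragment below shows them `spark-defaults.conf`-style with what I believe are the Spark 2.3 defaults (the same 128 MB / 4 MB figures quoted earlier in this thread):

```
# spark-defaults.conf (values shown are believed to be the Spark 2.3 defaults)
spark.sql.files.maxPartitionBytes  134217728   # 128 MB: max data packed into one partition
spark.sql.files.openCostInBytes    4194304     # 4 MB: estimated cost of opening each file
```

Lowering `maxPartitionBytes` yields more, smaller read partitions; raising `openCostInBytes` makes Spark more reluctant to pack many files into one partition.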

Failed to convert java.sql.Date to String

2018-11-13 Thread luby
Hi, all, I'm new to Spark SQL and just starting to use it in our project. We are using Spark 2. When importing data from a Hive table, I got the following error: if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8St…
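Without the full stack trace this is only a guess, but generated-code errors mentioning `UTF8String` typically mean the column's actual type (here `java.sql.Date`) does not match the string type the expression expects, and the usual remedy is an explicit cast before the conversion (e.g. `CAST(date_col AS STRING)` in SQL). The Python sketch below models the mismatch only; `to_utf8` is a hypothetical stand-in, not a Spark API:

```python
from datetime import date

def to_utf8(value):
    # Stand-in for a string-only conversion like UTF8String.fromString:
    # it refuses anything that is not already a string.
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    return value.encode("utf-8")

d = date(2018, 11, 13)
# to_utf8(d) would raise TypeError -- the analogue of the reported error.
# Casting the date to a string first makes the conversion succeed:
encoded = to_utf8(d.isoformat())
```

The takeaway is to make the cast explicit in the query or DataFrame rather than relying on an implicit date-to-string conversion.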