Hi all,
I understood from previous threads that the Data Source V2 API will see some
changes in Spark 2.4.0; however, I can't seem to find what these changes
are.
Is there some documentation which summarizes the changes?
The only mention I seem to find is this pull request:
https://github.com/apa
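From what I could piece together myself (a guess from the code, not from any documentation), the main change is on the read side: the 2.3 DataReaderFactory was renamed to InputPartition, and readers now plan InputPartition[InternalRow]s. A minimal sketch of that 2.4 shape, as I understand it (a one-partition source that returns no rows):

import java.util.{Collections, List => JList}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{IntegerType, StructType}

// A 2.4-style read path: one partition, no rows emitted.
class EmptySource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      override def readSchema(): StructType = new StructType().add("value", IntegerType)
      override def planInputPartitions(): JList[InputPartition[InternalRow]] =
        Collections.singletonList(new InputPartition[InternalRow] {
          override def createPartitionReader(): InputPartitionReader[InternalRow] =
            new InputPartitionReader[InternalRow] {
              override def next(): Boolean = false // no rows to emit
              override def get(): InternalRow = throw new NoSuchElementException
              override def close(): Unit = ()
            }
        })
    }
}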
Hi,
I am doing a large number of aggregations on a dataframe (without groupBy) to
get some statistics. As part of this I am doing an approx_count_distinct(c,
0.01)
Everything works fine, but when I do the same aggregation a second time (for
each column) I get the following error:
[Stage 2:>
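In case it helps to reproduce, this is roughly the pattern (df is any dataframe; column names are made up):

import org.apache.spark.sql.functions.approx_count_distinct

// One approx_count_distinct aggregation per column of df.
val aggs = df.columns.map(c => approx_count_distinct(df(c), 0.01).alias(c + "_distinct"))
val firstPass = df.agg(aggs.head, aggs.tail: _*)   // fine
val secondPass = df.agg(aggs.head, aggs.tail: _*)  // repeating it is where the error shows up
firstPass.show()
secondPass.show()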
Hi,
I have seen that Databricks has higher-order functions
(https://docs.databricks.com/_static/notebooks/higher-order-functions.html,
https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html)
which basically allow doing generic ope
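For reference, Spark 2.4 exposes the same idea as built-in SQL higher-order functions (transform, filter, aggregate, exists, ...), so a toy example there would be (assuming a SparkSession named spark):

import org.apache.spark.sql.functions.expr

val df = spark.createDataFrame(Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5)))).toDF("id", "values")
// Apply a lambda to every element of the array column.
df.select(expr("transform(values, x -> x + 1)").alias("plus_one")).show()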
Hi all,
I have recently started assessing Structured Streaming and ran into a little
snag right at the beginning.
Basically, I wanted to read some data, do some basic aggregation, and write the
result to a file:
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.streaming.Processing
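A self-contained sketch of the pattern I mean (paths and schema are placeholders; assumes a SparkSession named spark):

import org.apache.spark.sql.functions.{avg, window}
import org.apache.spark.sql.types.{DoubleType, StructType, TimestampType}
import spark.implicits._

val schema = new StructType().add("ts", TimestampType).add("value", DoubleType)
val input = spark.readStream.schema(schema).json("/tmp/in")

// The file sink only supports append mode, so the aggregation needs a watermark.
val result = input
  .withWatermark("ts", "10 minutes")
  .groupBy(window($"ts", "5 minutes"))
  .agg(avg($"value"))

result.writeStream
  .format("parquet")
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/chk")
  .outputMode("append")
  .start()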
Hi,
I am building a data source so I can convert a custom source to a dataframe.
I have been going over examples such as JDBC and noticed that JDBC does the
following:
val inputMetrics = context.taskMetrics().inputMetrics
and whenever a new record is added:
inputMetrics.incRecordsRead(1)
howeve
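To make the pattern concrete, here is a sketch of how a custom source could bump the same counters (TaskMetrics is internal API, so treat this as an assumption rather than a stable interface):

import org.apache.spark.TaskContext

// Wrap the underlying record iterator and count each row read, as JDBC does.
def withReadMetrics[T](context: TaskContext, underlying: Iterator[T]): Iterator[T] = {
  val inputMetrics = context.taskMetrics().inputMetrics
  underlying.map { record =>
    inputMetrics.incRecordsRead(1)
    record
  }
}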
Hi,
I am trying to compile the latest branch of Spark in order to try out some code
I wanted to contribute.
I was looking at the instructions to build from
http://spark.apache.org/docs/latest/building-spark.html
So at first I did:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTest
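For comparison, the baseline command that page documents is just:

./build/mvn -DskipTests clean package

(note the plural -DskipTests; which -P profiles are valid depends on the branch being built).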
I imagine this is a simplified example to explain a bigger concern.
In general, when you do a sort by key, it will implicitly shuffle the data by
the key. Since you have one key (0) holding almost all the records and the other
with just 1 record, it will simply shuffle them into two very skewed partitions.
One way you can solve this
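For example, salting the hot key spreads its records across several partitions (a sketch; rdd is a placeholder RDD[(Int, String)] and the constants are arbitrary):

import scala.util.Random

// Sorting by (key, salt) keeps the key order but splits one hot key
// across up to 10 of the 20 output partitions.
val salted = rdd.map { case (k, v) => ((k, Random.nextInt(10)), v) }
val sorted = salted.sortByKey(ascending = true, numPartitions = 20)
// Strip the salt once the shuffle is done.
val unsalted = sorted.map { case ((k, salt), v) => (k, v) }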
Hi,
I want to write a UDF/UDAF which provides native processing performance.
Currently, when creating a UDF/UDAF in the normal manner, performance takes a
hit because it breaks Catalyst optimizations.
I tried something like this:
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.ca
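For context, the gap I mean is between a black-box UDF and the equivalent built-in Column expression (toy example; df and column x are made up):

import org.apache.spark.sql.functions.{col, udf}

// Black-box UDF: Catalyst cannot optimize through it.
val plusOne = udf((x: Int) => x + 1)
val viaUdf = df.withColumn("y", plusOne(col("x")))

// Same logic as a native expression: stays inside codegen and the optimizer.
val viaNative = df.withColumn("y", col("x") + 1)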
Assaf
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Sunday, September 04, 2016 11:00 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: Scala Vs Python
Hi
This one is quite interesting. Is it possible to share a few toy examples?
On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson
mailt
wondering why Spark would have it then? Well, probably because of its ease of
use for ML (that would be my best guess).
On Wed, Aug 31, 2016 11:45 PM, AssafMendelson
assaf.mendel...@rsa.com<mailto:assaf.mendel...
Hi,
I have a machine with lots of memory. Since I understand all executors in a
single worker run on the same JVM, I do not want to use just one worker for the
whole memory. Instead, I want to define multiple workers, each with less than
30GB of memory.
Looking at the documentation I see this would
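The standalone-mode settings I believe are relevant live in conf/spark-env.sh on each machine (values below are placeholders):

SPARK_WORKER_INSTANCES=4   # number of worker processes per machine
SPARK_WORKER_MEMORY=25g    # memory each worker can hand out to its executors
SPARK_WORKER_CORES=8       # cores each worker can hand out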
I believe this would greatly depend on your use case and your familiarity with
the languages.
In general, Scala has much better performance than Python, and not all
interfaces are available in Python.
That said, if you are planning to use dataframes without any UDFs, then the
performance
Hi,
I am seeing a broadcast failure when doing a join as follows:
Assume I have a dataframe df with ~80 million records
I do:
df2 = df.filter(cond) # reduces to ~50 million records
grouped = broadcast(df.groupby(df2.colA).count())
total = df2.join(grouped, df2.colA == grouped.colA, "inner")
tot
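Two things I plan to sanity-check (a sketch in Scala; whether either applies is an assumption): grouping df2 rather than df so the broadcast side stays as small as intended, and disabling auto-broadcast so Spark falls back to a sort-merge join:

import org.apache.spark.sql.functions.col

// Disable auto-broadcast; without the hint Spark then picks a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Group the filtered frame, not the original, to keep the grouped table small.
val grouped = df2.groupBy(col("colA")).count()
val total = df2.join(grouped, Seq("colA"), "inner")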
I am trying to do high-performance calculations which require custom
functions.
As a first stage I am trying to profile the effect of using a UDF, and I am
getting weird results.
I created a simple test (in
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f
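The shape of the test, roughly (a self-contained sketch, not the exact notebook code; assumes a SparkSession named spark):

import org.apache.spark.sql.functions.{col, sum, udf}

val df = spark.range(10000000L).toDF("x")
val plusOne = udf((x: Long) => x + 1)

// Crude wall-clock timing helper.
def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
  result
}

time("udf")(df.withColumn("y", plusOne(col("x"))).agg(sum("y")).collect())
time("native")(df.withColumn("y", col("x") + 1).agg(sum("y")).collect())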