Why can window functions only have fixed window sizes?

2020-07-15 Thread Daniel Stojanov
Hi, my understanding of window functions is that they can only operate on fixed window sizes. For example, I can create a window like the following: Window.partitionBy("group_identifier").orderBy("sequencial_counter").rowsBetween(-4, 5) or even: Window.partitionBy("group_identifier").o
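
For reference, Spark window frames are not limited to fixed offsets. A minimal sketch of an unbounded (cumulative) frame, assuming a DataFrame df with a numeric value column alongside the columns named above:

```python
from pyspark.sql import Window, functions as F

# Cumulative frame: everything from the start of the partition up to the
# current row, rather than a fixed (-4, 5) span.
w = (Window.partitionBy("group_identifier")
           .orderBy("sequencial_counter")
           .rangeBetween(Window.unboundedPreceding, Window.currentRow))

df.withColumn("running_total", F.sum("value").over(w))
```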

S3 read/write from PySpark

2020-08-05 Thread Daniel Stojanov
Hi, I am trying to read/write files to S3 from PySpark. The procedure I have used is to download Spark and start PySpark with the hadoop-aws, guava, and aws-java-sdk-bundle packages. The versions are explicitly specified by looking up the exact dependency versions on Maven. Allowing dependencies to b
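
A minimal sketch of that setup; the package coordinates below are illustrative and must match the Hadoop build that ships with the chosen Spark download:

```python
from pyspark.sql import SparkSession

# Versions are illustrative: hadoop-aws must match Spark's bundled Hadoop,
# and aws-java-sdk-bundle must match that hadoop-aws release.
spark = (SparkSession.builder
         .appName("s3a-example")
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:3.2.0,"
                 "com.amazonaws:aws-java-sdk-bundle:1.11.375")
         .getOrCreate())

# "my-bucket" is a placeholder bucket name.
df = spark.read.parquet("s3a://my-bucket/input/")
df.write.parquet("s3a://my-bucket/output/")
```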

Understanding Spark execution plans

2020-08-05 Thread Daniel Stojanov
Hi, when an execution plan is printed it lists the tree of operations that will be completed when the job is run. The tasks have somewhat cryptic names such as BroadcastHashJoin, Project, Filter, etc. These do not appear to map directly to functions that are performed on an RDD. 1) Is there
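
Those operator names come from the Catalyst physical plan rather than from RDD methods. A quick way to inspect all of the plans for a query (exact output varies by Spark version):

```python
# explain(True) prints the parsed, analyzed, optimized logical and
# physical plans for the query below.
df = spark.range(100).filter("id > 50").selectExpr("id * 2 AS doubled")
df.explain(True)
```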

Re: S3 read/write from PySpark

2020-08-06 Thread Daniel Stojanov
…become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way that the username and password are provided appears to have changed, so you will probably need to look into that. Cheers, Steve C. On 6 Aug 2020, at 11:15 am, Daniel Stojanov
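
A sketch of wiring those S3A options up explicitly; the key names are the standard hadoop-aws ones, and the access/secret values are placeholders:

```python
# Hadoop configuration is reached through the underlying JVM context.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "<access-key>")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "<secret-key>")  # placeholder
```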

MongoDB plugin to Spark - too many open cursors

2020-10-25 Thread Daniel Stojanov
Hi, I receive an error message from the MongoDB server if there are too many Spark applications trying to access the database at the same time (about 3 or 4): "Cannot open a new cursor since too many cursors are already opened." I am not sure how to remedy this. I am not sure how the
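
One possible mitigation, sketched under the assumption that the MongoDB Spark connector is reading via a standard connection string: cap the client pool with the maxPoolSize URI option so each application holds fewer server resources (host, db, and coll are placeholders):

```python
from pyspark.sql import SparkSession

# maxPoolSize is a standard MongoDB connection-string option; whether it
# resolves the cursor limit depends on how the server maps cursors to
# connections, so treat this as a starting point only.
spark = (SparkSession.builder
         .config("spark.mongodb.input.uri",
                 "mongodb://host:27017/db.coll?maxPoolSize=10")
         .getOrCreate())
```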

Re: MongoDB plugin to Spark - too many open cursors

2020-10-26 Thread Daniel Stojanov
…eds of the hardware itself. Regards, On 26/10/20 1:52 pm, lec ssmi wrote: Is the connection pool configured by mongodb full? Daniel Stojanov wrote on Monday, 26 October 2020 at 10:28 am: Hi, I receive an error message from the MongoDB server if there are

How does order work in Row objects when .toDF() is called?

2020-11-05 Thread Daniel Stojanov
>>> row_1 = psq.Row(first=1, second=2)
>>> row_2 = psq.Row(second=22, first=11)
>>> spark.sparkContext.parallelize([row_1, row_2]).toDF().collect()
[Row(first=1, second=2), Row(first=22, second=11)]
(Spark 3.0.1) What is happening in the above? When .toDF() is called it appears that order is m
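
The output is consistent with .toDF() taking its field names from the first Row and consuming later rows positionally. One way to remove the ambiguity, sketched here, is to supply an explicit schema:

```python
from pyspark.sql.types import StructType, StructField, LongType

# With an explicit schema, values are matched by position against a known
# column order, regardless of keyword order at Row-creation time.
schema = StructType([StructField("first", LongType()),
                     StructField("second", LongType())])
df = spark.createDataFrame([(1, 2), (11, 22)], schema)
df.collect()  # [Row(first=1, second=2), Row(first=11, second=22)]
```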

Re: Confusion about the Spark to_date function

2020-11-05 Thread Daniel Stojanov
On 5/11/20 2:48 pm, 杨仲鲍 wrote: Code
```scala
object Suit {
  case class Data(node: String, root: String)
  def apply[A](xs: A*): List[A] = xs.toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("MoneyBackTest")
      .getOrCreate()
    import spark.imp
```
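
The thread's example is Scala; for illustration, a minimal PySpark sketch of to_date with an explicit pattern (Spark 3.x parses patterns with the Java DateTimeFormatter rules):

```python
from pyspark.sql import functions as F

# Parse a string column into a DateType column using an explicit pattern.
df = spark.createDataFrame([("2020-11-05",)], ["d"])
df.select(F.to_date("d", "yyyy-MM-dd").alias("as_date")).show()
```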

Pyspark application hangs (no error messages) on Python RDD .map

2020-11-10 Thread Daniel Stojanov
Hi, this code will hang indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (removing the .write step), it executes as expected. Otherwise, the code appears further along in my application, which is where it hangs. The debugging messag

Purpose of type in pandas_udf

2020-11-12 Thread Daniel Stojanov
Hi, Note "double" in the function decorator. Is this specifying the type of the data that goes into pandas_mean, or the type returned by that function? Regards, @pandas_udf("double", PandasUDFType.GROUPED_AGG) def pandas_mean(v):     return v.sum() --

[Pyspark] - Spark uses all available memory; unrelated to size of dataframe

2020-04-08 Thread Daniel Stojanov
My setup: PySpark; MongoDB to retrieve and store final results; Spark in standalone cluster mode on a single desktop; Spark v2.4.4; OpenJDK 8. My Spark application uses all available system memory. This seems to be unrelated to the data being processed. I tested with 32G
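
A sketch of capping executor memory explicitly (the 4g value is illustrative). Note that driver memory generally must be set before the driver JVM launches, e.g. with spark-submit --driver-memory, rather than from inside the script:

```python
from pyspark.sql import SparkSession

# Executor memory can be set at session build time in standalone mode.
spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```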

Spark Mongodb connector hangs indefinitely, not working on Amazon EMR

2020-04-21 Thread Daniel Stojanov
When running a PySpark application on my local machine I am able to save to and retrieve from the MongoDB server using the MongoDB Spark connector. All works properly. When submitting the exact same application to my Amazon EMR cluster I can see that the package for the Spark driver is being properly
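
A sketch of the kind of session setup involved; the connector coordinates are illustrative and must match the cluster's Spark and Scala versions, and the URIs are placeholders:

```python
from pyspark.sql import SparkSession

# Connector version 2.4.1 (Scala 2.11) is illustrative, matching a
# Spark 2.4.x cluster; EMR deployments may need different coordinates.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
         .config("spark.mongodb.input.uri", "mongodb://host:27017/db.coll")
         .config("spark.mongodb.output.uri", "mongodb://host:27017/db.coll")
         .getOrCreate())
```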