RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
The properties provided earlier will work for standalone mode. For cluster mode, the below properties need to be added to the spark-submit: --files "/log4j.properties" (to make the log4j property file available to both the driver and executor/s) (to enable the extra java options for driver and
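
A minimal sketch of such a cluster-mode submit is shown below; the master, application class, jar name, and the local path to log4j.properties are placeholders to adapt:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files /path/to/log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --class com.example.StreamingApp \
      streaming-app.jar

With --files, the properties file is shipped to the working directory of the driver and each executor (in cluster mode), so the -Dlog4j.configuration value can refer to it by bare file name.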

RE: Spark streaming filling the disk with logs

2019-02-14 Thread email
I have a quick question about this configuration, particularly this line: log4j.appender.rolling.file=/var/log/spark/ Where does that path live? On the driver, or on each executor individually? Thank you From: Jain, Abhishek 3. (Nokia - IN/Bangalore) Sent: Thursday, February

spark structured streaming handles pre-existing files

2019-02-14 Thread Lian Jiang
Hi, We have a Spark structured streaming job monitoring a folder and converting jsonl files into parquet. However, if some jsonl files already exist before the first run of the streaming job (no checkpoint yet), those files are not processed by the job when it runs.
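
For context, a minimal Scala sketch of the kind of job being described; the schema, folder paths, and checkpoint location are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("JsonlToParquet").getOrCreate()

    // Streaming file sources require the schema up front
    val schema = new StructType()
      .add("id", StringType)
      .add("payload", StringType)

    val stream = spark.readStream
      .schema(schema)
      .json("s3://bucket/incoming-jsonl/")   // the monitored folder

    stream.writeStream
      .format("parquet")
      .option("path", "s3://bucket/output-parquet/")
      .option("checkpointLocation", "s3://bucket/checkpoints/jsonl-to-parquet/")
      .start()
      .awaitTermination()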

Cross Join in Spark

2019-02-14 Thread Ankur Srivastava
Hello, We have a use case where we need to do a Cartesian join, and for some reason we are not able to get it to work with the Dataset API. We have a similar use case implemented and working with RDDs. We have two datasets: one dataset with 2 string columns, say c1 and c2. It is a small dataset with ~1 mil
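
For reference, the Dataset API does expose an explicit Cartesian product via crossJoin; a small Scala sketch with made-up columns and data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("CrossJoinSketch").getOrCreate()
    import spark.implicits._

    val small = Seq(("a1", "b1"), ("a2", "b2")).toDF("c1", "c2")  // the small side
    val large = Seq(("x", 1L), ("y", 2L)).toDF("k", "v")

    // Explicit Cartesian product; no join condition required
    val product = small.crossJoin(large)
    product.show()

On versions that reject an implicit cross join, spark.sql.crossJoin.enabled=true may also need to be set.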

StackOverflow question regarding DataSets and mapGroups

2019-02-14 Thread Nathan Ronsse
Hello, I am experiencing an issue that I also posted here. Maybe I should be using an Aggregator instead of mapGroups? I have not found anything that would lead me to believe I am using mapGroups incorrectly
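
For reference, the usual shape of a mapGroups call on a typed Dataset looks roughly like this; the Record class and the per-group reduction are illustrative only:

    import org.apache.spark.sql.SparkSession

    case class Record(key: String, amount: Double)

    val spark = SparkSession.builder.appName("MapGroupsSketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(Record("a", 1.0), Record("a", 2.0), Record("b", 5.0)).toDS()

    // mapGroups hands each whole group to the function as an iterator;
    // an Aggregator would instead fold rows incrementally.
    val totals = ds.groupByKey(_.key)
      .mapGroups((key, rows) => (key, rows.map(_.amount).sum))

    totals.show()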

Spark lists paths after `write` - how to avoid refreshing the file index?

2019-02-14 Thread peay
Hello, I have a piece of code that looks roughly like this: df = spark.read.parquet("s3://bucket/data.parquet/name=A", "s3://bucket/data.parquet/name=B") df_out = df. # Do stuff to transform df df_out.write.partitionBy("name").parquet("s3://bucket/data.parquet") I specify explicit path

RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
++ If you can afford losing a few old logs, then you can make use of a rolling file appender as well. log4j.rootLogger=INFO, rolling log4j.appender.rolling=org.apache.log4j.RollingFileAppender log4j.appender.rolling.layout=org.apache.log4j.PatternLayout log4j.appender.rolling.maxFileSize=50MB log4j.
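
A complete rolling-appender configuration along those lines might look as follows; the log file path, file size, and backup count are assumptions to adapt:

    log4j.rootLogger=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.appender.rolling.maxFileSize=50MB
    log4j.appender.rolling.maxBackupIndex=5
    log4j.appender.rolling.file=/var/log/spark/spark.log

Once maxBackupIndex files of maxFileSize each have been written, the oldest file is discarded, which bounds the total disk used by the logs.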

RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
Hi Deepak, Spark logging can be configured for different purposes. For example, if you want to control the spark-submit log, “log4j.logger.org.apache.spark.repl.Main=WARN/INFO/ERROR” can be set. Similarly, to control third-party logs: log4j.logger.org.spark_project.jetty=, log4j.logger.org.apa
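
Concretely, such per-logger overrides are plain entries in log4j.properties; the WARN levels below are only examples:

    # Control the spark-submit / REPL logging
    log4j.logger.org.apache.spark.repl.Main=WARN
    # Control third-party (shaded Jetty) logging
    log4j.logger.org.spark_project.jetty=WARN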

Re: SparkR + binary type + how to get value

2019-02-14 Thread Thijs Haarhuis
Hi Felix, Sure. I have the following code:

    printSchema(results)
    cat("\n\n\n")
    firstRow <- first(results)
    value <- firstRow$value
    cat(paste0("Value Type: '",typeof(value),"'\n\n\n"))
    cat(paste0("Value: '",value,"'\n\n\n"))

results is a Spark DataFrame here.