Cluster mode with HDFS, or local mode?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28737.html
--
What version of Spark are you using?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28732.html
---
All you need to do is -
spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.orc.filterPushdown", True)
...etc
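Hive-side session settings can usually be set the same way, or through spark.sql("SET ...") (a hedged sketch; the specific Hive keys below are illustrative, not from the original thread):

# Sketch, assuming a SparkSession built with enableHiveSupport().
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
# equivalent SQL form:
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")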
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-hive-configs-in-Spark-2-1-tp28429p28431.html
It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK) so
that if the RDD can't fit into memory it will spill parts of it to disk.
cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
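The original snippet was cut off after the join; a minimal sketch of the persist step it was presumably heading toward (the StorageLevel import and the count() call are my additions):

from pyspark import StorageLevel

# Persist the joined result so partitions that don't fit in memory spill to
# disk instead of causing out-of-memory failures.
joined_df.persist(StorageLevel.MEMORY_AND_DISK)
joined_df.count()  # an action to materialize the persisted data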
You should look into AWS EMR instead, adding pip install steps to the
launch process. They have a pretty nice script that sets up Jupyter and
lets you choose which packages you want to install -
https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-
You can use
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles(java.lang.String)
but note that it returns an RDD of (filename, content) pairs.
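A small hedged PySpark sketch of that (the path and the mapping are placeholders):

# wholeTextFiles gives an RDD of (filename, content) tuples.
rdd = sc.wholeTextFiles("hdfs:///data/texts")
names_and_lengths = rdd.map(lambda kv: (kv[0], len(kv[1])))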
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-text-ma
Here is how you would read from Google Cloud Storage (note you need to create
a service account key) ->
os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars
/home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
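The snippet was cut off here; a hedged continuation might look like the following (the gs:// path and the GCS-connector configuration key are assumptions on my part, and the exact keys depend on the connector version):

# Sketch only: point the connector at the service account key, then read.
spark = SparkSession.builder.appName("gcs-example").getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile",
    "/path/to/service-account-key.json")

df = spark.read.json("gs://your-bucket/some/path/*.json")  # placeholder bucket/path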
This blog post (not mine) has some nice examples -
https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/
From the blog -
df.write.mode("overwrite").format("parquet").option("compression",
"none").save("/tmp/file_no_compression_parq")
Can you be more specific about what you would want to change at the DataFrame level?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Spark-Properties-on-Dataframes-tp28266p28275.html
---
Assuming you don't have your environment variables set up in your
.bash_profile, you would do it like this -
import os
import sys
spark_home = '/usr/local/spark'
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home,
'python/lib/py4j-0.10.1-src.zip'))
#os.environ['P
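The last line above was cut off; after the path setup, the usual next step (a sketch, not part of the original message) is just to point SPARK_HOME at the same install and start a context:

# Sketch: the py4j gateway launches the JVM from SPARK_HOME, so make sure it
# is set before creating the context.
os.environ.setdefault('SPARK_HOME', spark_home)

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("local-setup-check").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)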
Don't the jars need to be comma-separated when you pass them?
i.e. --jars
"hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar,/opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar"
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-on-yarn-can-t-load-kafka-depe
Yes it would.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Would-spark-dataframe-rdd-read-from-external-source-on-every-action-tp28157p28158.html
---
What version of Spark are you using? I believe this was fixed in 2.0
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-join-and-subquery-tp28093p28097.html
All you need to do is load all the files into one dataframe at once. Then
save the dataframe using partitionBy -
df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path")
Then if you look at the new folder it should look the way you want it, e.g. -
hdfs://path/dir=dir1/part-r-xxx.
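A fuller sketch of that approach, assuming the partition value can be recovered from the input path (input_file_name(), regexp_extract(), and the column/path names are my additions, not from the original reply):

from pyspark.sql import functions as F

# Load every CSV at once, then derive the partition column from the file path.
df = spark.read.csv("hdfs://path/input/*/*.csv", header=True)
df = df.withColumn("directoryCol",
                   F.regexp_extract(F.input_file_name(), "input/([^/]+)/", 1))

# partitionBy writes one folder per distinct value, e.g. directoryCol=dir1/
df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path/output")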
Is there anything in the files to let you know which directory they should be
in?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078p28083.html
You can have a list of all the columns and pass it to a recursive
function to fit and apply the transformation.
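For illustration, one way that loop could look with StringIndexer (the transformer choice and the column names are my assumptions, not from the original reply):

from pyspark.ml.feature import StringIndexer

# Sketch: index each categorical column in turn, roughly in the spirit of
# pandas.get_dummies. "color" and "size" are placeholder column names.
categorical_cols = ["color", "size"]
for c in categorical_cols:
    df = StringIndexer(inputCol=c, outputCol=c + "_idx").fit(df).transform(df)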
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-a-Spark-Equivalent-for-Pandas-get-dummies-tp28064p28079.html
No. Spark SQL is part of Spark, which is a processing engine.
Apache Hive is a data warehouse on top of Hadoop.
Apache Impala is both a data warehouse (while utilizing the Hive metastore) and a
processing engine.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-S
Are you using Windows? Switching over to a Linux environment made that error go
away for me.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spyder-and-SPARK-combination-problem-Please-help-tp27882p27884.html
Try leveraging YARN or Mesos; they have more scheduling options than the
standalone cluster manager.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-a-scheduling-algorithm-in-Spark-tp27848p27856.html
Assuming you're using Spark's standalone cluster manager, it uses FIFO by default.
From the docs - "By default, applications submitted to the standalone mode
cluster will run in FIFO (first-in-first-out) order, and each application
will try to use all available nodes." Link -
http://spark.apache.org/d
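As a hedged aside (the property comes from the standalone-mode docs, not from the original reply), you can at least let several applications run at once under FIFO by capping how many cores each one takes:

from pyspark import SparkConf, SparkContext

# Sketch: cap this application's resources so other FIFO-queued apps can run
# concurrently on the standalone cluster. Values are illustrative.
conf = (SparkConf()
        .setAppName("capped-app")
        .set("spark.cores.max", "4")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)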
You need to use 2.0.0-M2-s_2.11 since Spark 2.0 is compiled with Scala 2.11
by default.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Write-to-Cassandra-table-from-pyspark-fails-with-scala-reflect-error-tp27723p27729.html
I'm assuming the dataset you're dealing with is big, hence why you wanted to
allocate your full 16 GB of RAM to it.
I suggest running the Python Spark shell as such: "pyspark --driver-memory 16g".
Also, if you cache your data and it doesn't fully fit in memory, you can do
df.persist(StorageLevel.MEMORY_AND_DISK)
Double-check the driver memory in your Spark web UI and make sure the driver
memory is close to half of the available 16 GB.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Java-Heap-Error-tp27669p27704.html
Sent from the Apache Spark User List mailing list ar
If you're in local mode, just allocate all the memory you want to use to your
driver (which acts as the executor in local mode); don't even bother changing
the executor memory. So your new settings should look like this...
spark.driver.memory 16g
spark.driver.maxResultSize 2g
spark
You need to pass --deploy-mode cluster to spark-submit; this will run the driver
on the cluster rather than locally on your computer.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-submit-Spark-Application-remotely-tp27640p27668.html
From the Spark
documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence),
yes, you can use persist on a dataframe instead of cache. cache is just
shorthand for the default persist storage level, "MEMORY_ONLY". If you want
to persist the dataframe to disk you should call persist with an explicit
storage level such as MEMORY_AND_DISK.
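In code terms (a sketch; "df" is a placeholder and the import is mine):

from pyspark import StorageLevel

# cache() is just the default persist level; pass an explicit StorageLevel to
# let partitions that don't fit in memory spill to disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
# ...use df...
df.unpersist()  # release the cached data when you're done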
Why not just create a partition for the key you want to group by and save it
there? Appending to a file already written to HDFS isn't the best idea IMO.
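A minimal sketch of that idea (the column name "key" and the output path are placeholders, not from the original thread):

# One sub-directory per key value; each new run just adds files under the
# matching key=... directory instead of appending to an existing file.
df.write.partitionBy("key").mode("append").format("parquet").save(
    "hdfs:///output/grouped_by_key")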
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-all-values-for-same-key-to-one-file-tp27455p2