So I think caching large data is not a best practice.
At 2019-01-16 12:24:34, "大啊" wrote:
Hi, Tomas.
Thanks for your question, it gave me something to think about. But best practice
is usually to cache smaller data.
I think caching large data will consume too much memory or disk space.
Spilling the cached data in Parquet format may be a good improvement.
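A small sketch of the two options being weighed here: caching with a disk-spillable, serialized storage level versus materializing the hot data as Parquet. Paths and column names below are assumptions, not from this thread.
'''
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/events")    // hypothetical source

// Option 1: cache with a serialized storage level that can spill to disk.
events.persist(StorageLevel.MEMORY_AND_DISK_SER)
events.count()                                      // materialize the cache

// Option 2: write only the hot slice out as Parquet and read it back,
// keeping executor memory free.
events.filter($"event_date" === "2019-01-01")
  .write.mode("overwrite").parquet("/tmp/hot_events")
val hot = spark.read.parquet("/tmp/hot_events")
'''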
At 2019-01-16 02:20:56, "Tomas Bartalos" wrote:
Hi Mohit,
I’m not sure that there is a “correct” answer here, but I tend to use classes
whenever the input or output data represents something meaningful (such as a
domain model object). I would recommend against creating many temporary classes
for each and every transformation step as that
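A minimal sketch of that guideline: one case class for the domain object, with plain tuples for intermediate results. The Event fields here are assumptions, just to illustrate the pattern.
'''
import org.apache.spark.sql.SparkSession

// One class for data that represents something meaningful.
case class Event(id: Long, userId: Long, amount: Double)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/events").as[Event]   // hypothetical path

// Intermediate steps can stay as tuples/columns; no throwaway classes needed.
val totalPerUser = events
  .groupByKey(_.userId)
  .mapValues(_.amount)
  .reduceGroups(_ + _)        // Dataset[(Long, Double)]
'''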
Glad to hear this.
Congrats, great work Dongjoon.
Dongjoon Hyun wrote on Tuesday, January 15, 2019, at 3:47 PM:
> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend all 2.2.x users to
> upgrade to this st
You should check the active threads in your app. Since your pool uses
non-daemon threads, that will prevent the app from exiting.
spark.stop() should have stopped the Spark jobs in other threads, at
least. But if something is blocking one of those threads, or if
something is creating a non-daemon
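A short sketch of both checks (names are assumptions): dump the live non-daemon threads to see what is still running, and/or build the pool from daemon threads so it cannot keep the JVM alive.
'''
import java.util.concurrent.{Executors, ThreadFactory}
import scala.collection.JavaConverters._

// 1) List the non-daemon threads still alive when the app appears to hang.
Thread.getAllStackTraces.keySet.asScala
  .filter(t => t.isAlive && !t.isDaemon)
  .foreach(t => println(s"non-daemon thread still running: ${t.getName}"))

// 2) Or create the pool with daemon threads (and still call pool.shutdown()
//    once the submitted work is finished).
val daemonFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r)
    t.setDaemon(true)
    t
  }
}
val pool = Executors.newFixedThreadPool(3, daemonFactory)
'''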
I submitted a Spark job through the ./spark-submit command. The code executed
successfully; however, the application got stuck when trying to quit Spark.
My code snippet:
'''
{
val spark = SparkSession.builder.master(...).getOrCreate
val pool = Executors.newFixedThreadPool(3)
implicit val xc = E
Fellow Spark Coders,
I am trying to move from using DataFrames to Datasets for a reasonably
large code base. Today the code looks like this:
df_a = read_csv
df_b = df_a.withColumn(some_transform_that_adds_more_columns)
// repeat the above several times
With Datasets, this will require defining
ca
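One common middle ground, sketched below with made-up column and class names: keep the withColumn pipeline untyped and define a single case class only for the final result, converting with .as[...] at the end rather than introducing a class per step.
'''
import org.apache.spark.sql.SparkSession

// Hypothetical final shape of the pipeline's output.
case class Enriched(id: Long, amount: Double, amountWithTax: Double)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical input path and columns.
val df_a = spark.read.option("header", "true").csv("/data/input.csv")

val df_b = df_a
  .withColumn("id", $"id".cast("long"))
  .withColumn("amount", $"amount".cast("double"))
  .withColumn("amountWithTax", $"amount".cast("double") * 1.2)

// Only the final step becomes typed; no case class per intermediate step.
val enriched = df_b.select($"id", $"amount", $"amountWithTax").as[Enriched]
'''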
Hello,
I'm using the Spark Thrift Server and I'm looking for the best-performing
solution to query a hot set of data. I'm processing records with a nested
structure, containing subtypes and arrays; one record takes up several KB.
I tried to make some improvement with CACHE TABLE:
cache table event_jan_01 as
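For reference, a sketch of the statement's general form; the SELECT body, table, and column names here are placeholders, not the original query.
'''
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Cache a hot slice under a name the Thrift Server can query; the predicate
// and table name are assumptions for illustration.
spark.sql("""
  CACHE TABLE event_jan_01 AS
  SELECT * FROM events WHERE event_date = '2019-01-01'
""")

// Release the memory once the hot set changes.
spark.sql("UNCACHE TABLE event_jan_01")
'''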
Hi,
Considering a directed graph with 15,000 vertices and 14,000 edges, I wonder
why GraphX (Pregel) takes much more time than a Java implementation of the
graph to get all the vertices from a given vertex down to the leaves?
By the nature of the graph, we can almost consider it a tree.
The Java implementa
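For context, a reachability computation of this kind in GraphX usually looks like the sketch below (the toy edges are assumptions); every Pregel superstep involves distributed joins and shuffles over RDDs, which is much heavier than an in-memory Java traversal on a graph of only ~15,000 vertices.
'''
import org.apache.spark.graphx._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val sc = spark.sparkContext

// Toy graph standing in for the real one.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(2L, 4L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = false)

val sourceId: VertexId = 1L

// Flood a "visited" flag from the source along out-edges until it stops spreading.
val reachable = graph
  .mapVertices((id, _) => id == sourceId)
  .pregel(false, activeDirection = EdgeDirection.Out)(
    (_, visited, msg) => visited || msg,                      // vertex program
    t => if (t.srcAttr && !t.dstAttr) Iterator((t.dstId, true))
         else Iterator.empty,                                 // send along out-edges
    (a, b) => a || b                                          // merge messages
  )

reachable.vertices.filter(_._2).collect()                     // vertices reachable from sourceId
'''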
Hi all,
I want to re-send the previous SPIP on introducing a DataFrame-based graph
component to collect more feedback. It supports property graphs, Cypher
graph queries, and graph algorithms built on top of the DataFrame API. If
you are a GraphX user or your workload is essentially graph queries,
Hi,
In my problem, data is stored both in a database and on HDFS. I created an
application in which, according to the query, Spark loads the data, processes
the query, and returns the answer.
I'm looking for a service that accepts SQL queries and returns the answers
(like a database's command line). Is there a way that my
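A rough sketch of the setup being described (connection details, table names, and paths are all assumptions): register both sources as temporary views, then serve SQL against them via spark.sql or the Spark Thrift Server.
'''
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Hypothetical database table read over JDBC.
val dbOrders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.orders")
  .option("user", "spark").option("password", "***")
  .load()

// Hypothetical historical data on HDFS.
val hdfsOrders = spark.read.parquet("hdfs:///data/orders_history")

dbOrders.createOrReplaceTempView("orders_db")
hdfsOrders.createOrReplaceTempView("orders_hdfs")

// Incoming SQL can now be answered from either source.
spark.sql("SELECT count(*) FROM orders_db").show()
spark.sql("SELECT count(*) FROM orders_hdfs").show()
'''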