Re: Accessing Hive Tables in Spark

2018-04-10 Thread Dr. Kent Yao
Applying this fix https://github.com/apache/spark/pull/19663 
Or 
Using --files or --jars with /local/path/to/hive-site.xml may work.
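For reference, a minimal sketch of what that spark-submit invocation could look like, built here as an argv list; the master and the application script are hypothetical placeholders:

```python
# Hypothetical spark-submit invocation built as an argv list; the master and
# application script are placeholders. --files ships the local hive-site.xml
# to the driver/executor working directories so Spark can pick up the Hive
# metastore configuration.
spark_submit_argv = [
    "spark-submit",
    "--master", "yarn",
    "--files", "/local/path/to/hive-site.xml",
    "my_app.py",  # hypothetical application
]
print(" ".join(spark_submit_argv))
```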

Thanks,
Kent



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



cache OS memory and spark usage of it

2018-04-10 Thread José Raúl Pérez Rodríguez

Hi,

When I issue a "free -m" command on a host, I see a lot of memory used 
for the OS cache. However, Spark Streaming is not able to request that 
memory for its own usage, and execution fails because it cannot launch 
executors.


My understanding of the OS memory cache (the one shown in the "free -m" 
output) is that it is, in practice, free memory: programs can request it 
when needed, and the OS "gives" the requested amount to the program. Is 
that right? If not, how does the OS cache behave, and what can Spark do 
to use this memory?


Thanks a lot,

Raúl






How to use disk instead of just InMemoryRelation when using the JDBC datasource in Spark SQL?

2018-04-10 Thread Louis Hust
We want to extract data from MySQL and do the calculations in Spark SQL.
The SQL explain output is below.

== Parsed Logical Plan ==
> 'Sort ['revenue DESC NULLS LAST], true
> +- 'Aggregate ['n_name], ['n_name, 'SUM(('l_extendedprice * (1 -
> 'l_discount))) AS revenue#329]
>+- 'Filter ('c_custkey = 'o_custkey) && ('l_orderkey =
> 'o_orderkey)) && ('l_suppkey = 's_suppkey)) && (('c_nationkey =
> 's_nationkey) && ('s_nationkey = 'n_nationkey))) && ((('n_regionkey =
> 'r_regionkey) && ('r_name = AFRICA)) && (('o_orderdate >= 1993-01-01) &&
> ('o_orderdate < 1994-01-01
>   +- 'Join Inner
>  :- 'Join Inner
>  :  :- 'Join Inner
>  :  :  :- 'Join Inner
>  :  :  :  :- 'Join Inner
>  :  :  :  :  :- 'UnresolvedRelation `customer`
>  :  :  :  :  +- 'UnresolvedRelation `orders`
>  :  :  :  +- 'UnresolvedRelation `lineitem`
>  :  :  +- 'UnresolvedRelation `supplier`
>  :  +- 'UnresolvedRelation `nation`
>  +- 'UnresolvedRelation `region`
> == Analyzed Logical Plan ==
> n_name: string, revenue: decimal(38,4)
> Sort [revenue#329 DESC NULLS LAST], true
> +- Aggregate [n_name#176], [n_name#176,
> sum(CheckOverflow((promote_precision(cast(l_extendedprice#68 as
> decimal(16,2))) *
> promote_precision(cast(CheckOverflow((promote_precision(cast(cast(1 as
> decimal(1,0)) as decimal(16,2))) - promote_precision(cast(l_discount#69 as
> decimal(16,2, DecimalType(16,2)) as decimal(16,2,
> DecimalType(32,4))) AS revenue#329]
>+- Filter (c_custkey#273 = o_custkey#1) && (l_orderkey#63 =
> o_orderkey#0)) && (l_suppkey#65 = s_suppkey#224)) && ((c_nationkey#276 =
> s_nationkey#227) && (s_nationkey#227 = n_nationkey#175))) &&
> (((n_regionkey#177 = r_regionkey#203) && (r_name#204 = AFRICA)) &&
> ((cast(o_orderdate#4 as string) >= 1993-01-01) && (cast(o_orderdate#4 as
> string) < 1994-01-01
>   +- Join Inner
>  :- Join Inner
>  :  :- Join Inner
>  :  :  :- Join Inner
>  :  :  :  :- Join Inner
>  :  :  :  :  :- SubqueryAlias customer
>  :  :  :  :  :  +-
> Relation[C_CUSTKEY#273,C_NAME#274,C_ADDRESS#275,C_NATIONKEY#276,C_PHONE#277,C_ACCTBAL#278,C_MKTSEGMENT#279,C_COMMENT#280]
> JDBCRelation(customer) [numPartitions=1]
>  :  :  :  :  +- SubqueryAlias orders
>  :  :  :  : +-
> Relation[O_ORDERKEY#0,O_CUSTKEY#1,O_ORDERSTATUS#2,O_TOTALPRICE#3,O_ORDERDATE#4,O_ORDERPRIORITY#5,O_CLERK#6,O_SHIPPRIORITY#7,O_COMMENT#8]
> JDBCRelation(orders) [numPartitions=1]
>  :  :  :  +- SubqueryAlias lineitem
>  :  :  : +-
> Relation[L_ORDERKEY#63,L_PARTKEY#64,L_SUPPKEY#65,L_LINENUMBER#66,L_QUANTITY#67,L_EXTENDEDPRICE#68,L_DISCOUNT#69,L_TAX#70,L_RETURNFLAG#71,L_LINESTATUS#72,L_SHIPDATE#73,L_COMMITDATE#74,L_RECEIPTDATE#75,L_SHIPINSTRUCT#76,L_SHIPMODE#77,L_COMMENT#78]
> JDBCRelation(lineitem) [numPartitions=1]
>  :  :  +- SubqueryAlias supplier
>  :  : +-
> Relation[S_SUPPKEY#224,S_NAME#225,S_ADDRESS#226,S_NATIONKEY#227,S_PHONE#228,S_ACCTBAL#229,S_COMMENT#230]
> JDBCRelation(supplier) [numPartitions=1]
>  :  +- SubqueryAlias nation
>  : +-
> Relation[N_NATIONKEY#175,N_NAME#176,N_REGIONKEY#177,N_COMMENT#178]
> JDBCRelation(nation) [numPartitions=1]
>  +- SubqueryAlias region
> +- Relation[R_REGIONKEY#203,R_NAME#204,R_COMMENT#205]
> JDBCRelation(region) [numPartitions=1]
> == Optimized Logical Plan ==
> Sort [revenue#329 DESC NULLS LAST], true
> +- Aggregate [n_name#176], [n_name#176,
> sum(CheckOverflow((promote_precision(cast(l_extendedprice#68 as
> decimal(16,2))) * promote_precision(CheckOverflow((1.00 -
> promote_precision(cast(l_discount#69 as decimal(16,2,
> DecimalType(16,2, DecimalType(32,4))) AS revenue#329]
>+- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69, N_NAME#176]
>   +- Join Inner, (n_regionkey#177 = r_regionkey#203)
>  :- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69, N_NAME#176,
> N_REGIONKEY#177]
>  :  +- Join Inner, (s_nationkey#227 = n_nationkey#175)
>  : :- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69,
> S_NATIONKEY#227]
>  : :  +- Join Inner, ((l_suppkey#65 = s_suppkey#224) &&
> (c_nationkey#276 = s_nationkey#227))
>  : : :- Project [C_NATIONKEY#276, L_SUPPKEY#65,
> L_EXTENDEDPRICE#68, L_DISCOUNT#69]
>  : : :  +- Join Inner, (l_orderkey#63 = o_orderkey#0)
>  : : : :- Project [C_NATIONKEY#276, O_ORDERKEY#0]
>  : : : :  +- Join Inner, (c_custkey#273 = o_custkey#1)
>  : : : : :- Project [C_CUSTKEY#273,
> C_NATIONKEY#276]
>  : : : : :  +- Filter (isnotnull(c_custkey#273) &&
> isnotnull(c_nationkey#276))
>  : : : : : +- InMemoryRelation [C_CUSTKEY#273,
> C_NAME#274, C_ADDRESS#275, C_NATIONKEY#276, C_PHONE#277, C_ACCTBAL#278,
> C_MKTSEGMENT#279, C_COMMENT#280], true, 1, StorageLevel(disk, memory, 1
> repl
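To make cached JDBC data spill to (or live entirely on) local disk rather than only in memory, one option is an explicit persist with a disk-backed storage level. A minimal PySpark sketch, assuming a running Spark with the MySQL JDBC driver on the classpath; the URL, table, and credentials are placeholders, and the function is only defined here, not executed:

```python
def cache_jdbc_table_with_spill(jdbc_url, table, user, password):
    """Sketch: read a table over JDBC and persist it with a disk-backed
    storage level. MEMORY_AND_DISK spills partitions that do not fit in
    memory to local disk; StorageLevel.DISK_ONLY avoids caching in memory
    entirely. (pyspark imports are kept local to the function so this file
    also loads where Spark is not installed.)"""
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-cache-demo").getOrCreate()
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)      # e.g. jdbc:mysql://host:3306/db (placeholder)
          .option("dbtable", table)
          .option("user", user)
          .option("password", password)
          .load())
    return df.persist(StorageLevel.MEMORY_AND_DISK)
```

Note that in Spark 2.x Dataset.cache() already defaults to MEMORY_AND_DISK, which is what the plan above shows as StorageLevel(disk, memory, ...); the explicit persist just makes the intent visible, and DISK_ONLY forces the cached partitions onto disk alone.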

Re: cache OS memory and spark usage of it

2018-04-10 Thread Jose Raul Perez Rodriguez

That was helpful, thanks.

So the OS needs to feel some pressure from applications requesting 
memory before it frees some of the memory cache?


Under exactly which circumstances does the OS free that memory and give 
it to the applications requesting it?


I mean, if the total memory is 16 GB and 10 GB are used for the OS 
cache, how can the JVM obtain memory from that?


Thanks,


On 11/04/18 01:36, yncxcw wrote:

hi, Raúl

First, most of the OS memory cache is the page cache, which the OS uses 
to cache recently read/written I/O pages.

I think the OS memory cache should be understood from two different
perspectives. From the perspective of user space (e.g., a Spark
application), it is not used, since Spark does not allocate memory from
this part of memory. However, from the perspective of the OS, it is
actually used, because those memory pages have already been allocated to
cache I/O pages. For each I/O request, the OS allocates memory pages to
cache it, expecting the cached pages to be reused in the near future.
Recall what happens when you use vim/emacs to open a large file: it is
quite slow the first time, but much faster if you close it and
immediately open it again, because the file was cached in the page cache
the first time you opened it.
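The vim/emacs effect can be reproduced with a small stdlib-only Python timing sketch; the file size and temp path are arbitrary, and since the file is written just before the first read it may already be partly cached, so the timings are illustrative only:

```python
import os
import tempfile
import time

# Write a scratch file, then read it twice and time both reads. The second
# read is served from the OS page cache and is typically at least as fast
# as the first; absolute numbers vary wildly by machine and disk.
path = os.path.join(tempfile.gettempdir(), "pagecache_demo.bin")
with open(path, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))  # 16 MiB of data

def timed_read(p):
    start = time.perf_counter()
    with open(p, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, len(data)

cold, size = timed_read(path)  # possibly from disk (or already cached)
warm, _ = timed_read(path)     # from the page cache
print(f"read {size} bytes: first {cold:.4f}s, second {warm:.4f}s")
os.remove(path)
```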

It is hard for Spark to use this part of memory, because it is managed
by the OS and is transparent to applications. The only thing you can do
is keep allocating memory from the OS (e.g., with malloc()) until the OS
senses memory pressure, at which point it will voluntarily release
page-cache pages to satisfy your allocation. Another thing to note is
that Spark's memory limit is bounded by the maximum JVM heap size, so
memory requests from your Spark application are actually handled by the
JVM, not the OS.
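The "cached memory is reclaimable" point can be made concrete from the "free -m" output itself: modern versions of free report an "available" column that already accounts for reclaimable cache. A stdlib-only sketch with made-up sample numbers:

```python
# Parse a (made-up) `free -m` sample: the "buff/cache" column is reclaimable,
# so the memory a new allocation can realistically obtain is roughly
# free + buff/cache -- which is what the "available" column estimates.
sample = (
    "              total        used        free      shared  buff/cache   available\n"
    "Mem:          16000        4000        2000         200       10000       11500\n"
)
fields = sample.splitlines()[1].split()
total, used, free, shared, buff_cache, available = (int(x) for x in fields[1:7])
print(f"free={free} MiB, but about {free + buff_cache} MiB is reclaimable")
```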


Hope this answer can help you!


Wei




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



