saveAsOrcFile on partitioned ORC

2015-05-20 Thread patcharee
Hi, I followed the information on https://www.mail-archive.com/reviews@spark.apache.org/msg141113.html to save an ORC file with Spark 1.2.1. I can save data to a new ORC file. How can I save data to an existing, partitioned ORC file? Any suggestions? BR, Patcharee
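
A minimal sketch of appending to partitioned ORC data with the Spark 1.4+ DataFrameWriter API (saveAsOrcFile predates it); the source table, partition columns, and target path are illustrative assumptions:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)        // sc: the shell's SparkContext
    val df = hiveContext.table("source_table")   // hypothetical source data
    df.write
      .format("orc")
      .mode(SaveMode.Append)                     // append rather than overwrite
      .partitionBy("year", "month")              // must match the existing layout
      .save("/warehouse/existing_orc")           // hypothetical target path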

Insert overwrite to hive - ArrayIndexOutOfBoundsException

2015-06-02 Thread patcharee
…at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Best, Patcharee

ERROR cluster.YarnScheduler: Lost executor

2015-06-02 Thread patcharee
Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee

MetaException(message:java.security.AccessControlException: Permission denied

2015-06-03 Thread patcharee
…at com.sun.proxy.$Proxy37.alter_partition(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:469) ... 26 more BR, Patcharee

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
…943, chunkIndex=1}, buffer=FileSegmentManagedBuffer{file=/hdisk3/hadoop/yarn/local/usercache/patcharee/appcache/application_1432633634512_0213/blockmgr-12d59e6b-0895-4a0e-9d06-152d2f7ee855/09/shuffle_0_56_0.data, offset=896, length=1132499356}} to /10.10.255.238:35430; closing connection…

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
…1.3.1. Is the problem from https://issues.apache.org/jira/browse/SPARK-4516? Best, Patcharee On 3 June 2015 10:11, Akhil Das wrote: Which version of Spark? Looks like you are hitting this one: https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Jun 3, 2015 at 1…

NullPointerException SQLConf.setConf

2015-06-04 Thread patcharee
…at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Best, Patcharee

Re: FetchFailed Exception

2015-06-05 Thread patcharee
Hi, I had this problem before; in my case the executor/container was killed by YARN when it used more memory than allocated. You can check whether your case is the same in the YARN node manager log. Best, Patcharee On 5 June 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I see this…
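
If that is the cause, one common mitigation (not stated in this thread, so treat it as an assumption) is to raise the off-heap allowance YARN accounts for; the values are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")  // MB; Spark 1.x name
      .set("spark.executor.memory", "4g")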

write multiple outputs by key

2015-06-06 Thread patcharee
…# partitions). At foreach there are > 1000 tasks as well, but only 50 tasks (the same as the number of key combinations) get datasets. How can I fix this problem? Any suggestions are appreciated. BR, Patcharee
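
A sketch of one way to get one output per key with the Spark 1.4+ writer; the column names, data, and path are placeholders, not the thread's actual code:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Each distinct (key1, key2) pair becomes its own output directory, so the
    // number of populated outputs matches the number of key combinations.
    val df = sc.parallelize(Seq((1, "a", 3.0), (2, "b", 4.0)))
               .toDF("key1", "key2", "value")
    df.write.partitionBy("key1", "key2").format("orc").save("/output/by_key")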

hiveContext.sql NullPointerException

2015-06-06 Thread patcharee
Hi, I am trying to insert data into a partitioned Hive table. The groupByKey is to combine the dataset into a partition of the Hive table. After the groupByKey, I converted the Iterable[X] to a DataFrame by X.toList.toDF(). But hiveContext.sql throws a NullPointerException, see below. Any suggestions? What c…

Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee
Hi, How can I work with HiveContext on the executor? If only the driver can see the HiveContext, does that mean I have to collect all datasets (very large) to the driver and use the HiveContext there? That would overload the driver's memory and fail. BR, Patcharee On 7 June 2015 11:51…

Re: hiveContext.sql NullPointerException

2015-06-08 Thread patcharee
Hi, Thanks for your guidelines, I will try it out. By the way, how do you know that HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on the driver side? Where is this documented? BR, Patcharee On 7 June 2015 16:40, Cheng Lian wrote: Spark SQL supports Hive dynamic…
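
A sketch of the driver-side pattern this thread converges on: the grouping runs on the executors as ordinary RDD operations, while every HiveContext call stays on the driver. The case class, data, and table names are illustrative assumptions:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    case class X(a: Int, b: String)                  // stand-in for the thread's X
    val rdd = sc.parallelize(Seq((1, X(1, "p")), (1, X(2, "q"))))
    val df = rdd.groupByKey()                        // runs on executors
                .flatMap { case (_, xs) => xs }      // still a plain RDD operation
                .toDF()                              // declared on the driver
    df.registerTempTable("staging")                  // driver-side call
    hiveContext.sql("INSERT INTO TABLE target SELECT * FROM staging")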

Re: hiveContext.sql NullPointerException

2015-06-11 Thread patcharee
…"nullable":true,"metadata":{}},{"name":"v","type":"float","nullable":true,"metadata":{}},{"name":"zone","type":"integer","nullable":true,"metadata":{}}…

sql.catalyst.ScalaReflection scala.reflect.internal.MissingRequirementError

2015-06-15 Thread patcharee
…schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder…

HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
") .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark") --- The table contains 23 columns (longer than Tuple maximum length), so I use Row Object to store raw data, not Tupl

Re: HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
I found that if I move the partition columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16 June 2015 11:14, patcharee wrote: Hi, I am using Spark 1.4 and HiveContext to append data into a partitioned Hive table. I found that the data inserted into the…
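
A sketch of the workaround described above: partition columns go last in both the schema and each Row. The partition columns and table name are from the thread; the data columns and values are placeholders:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{Row, SaveMode}

    val hiveContext = new HiveContext(sc)

    val schema = StructType(
      Seq(StructField("u", FloatType, nullable = true),   // data columns first
          StructField("v", FloatType, nullable = true)) ++
      Seq("zone", "z", "year", "month")                   // partition columns last
        .map(name => StructField(name, IntegerType, nullable = true)))

    val rows = sc.parallelize(Seq(Row(1.0f, 2.0f, 2, 1, 2015, 6)))
    hiveContext.createDataFrame(rows, schema)
      .write.format("orc").mode(SaveMode.Append)
      .partitionBy("zone", "z", "year", "month")
      .saveAsTable("test4DimBySpark")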

Re: Kryo serialization of classes in additional jars

2015-06-26 Thread patcharee
Hi, I am having this problem on Spark 1.4. Do you have any ideas how to solve it? I tried to use spark.executor.extraClassPath, but it did not help. BR, Patcharee On 4 May 2015 23:47, Imran Rashid wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open…

Re: pyspark split pair rdd to multiple

2016-04-20 Thread patcharee
I can also use a dataframe. Any suggestions? Best, Patcharee On 20 April 2016 10:43, Gourav Sengupta wrote: Is there any reason why you are not using data frames? Regards, Gourav On Tue, Apr 19, 2016 at 8:51 PM, pth001 (patcharee.thong...@uni.no) wrote: Hi, How ca…

what contribute to Task Deserialization Time

2016-07-21 Thread patcharee
…advance! Patcharee

visualize data from spark streaming

2016-01-20 Thread patcharee
Hi, How can I visualize real-time data (in graphs/charts) from Spark Streaming? Any tools? Best, Patcharee

spark streaming input rate strange

2016-01-22 Thread patcharee
…rises up to 10,000, stays at 10,000 for a while, and drops to about 7000-8000. When clients = 20,000 the event rate rises up to 20,000, stays at 20,000 for a while, and drops to about 15000-17000: the same pattern. Processing time is just about 400 ms. Any ideas/suggestions? Thanks, Patcharee

streaming textFileStream problem - got only ONE line

2016-01-25 Thread patcharee
…().print() The problem is that sometimes the data received from ssc.textFileStream is ONLY ONE line, even though there are in fact multiple lines in the new file found in that interval. See the log below, which shows three intervals. In the 2nd interval, the new file is: hdfs://helmhdfs/user/patcharee/cerdata…

Re: streaming textFileStream problem - got only ONE line

2016-01-29 Thread patcharee
I moved them every interval into the monitored directory. Patcharee On 25 January 2016 22:30, Shixiong(Ryan) Zhu wrote: Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored…
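
A sketch of the atomic-move approach this requirement implies: write the file elsewhere first, then rename it into the monitored directory (a same-filesystem rename is atomic, so textFileStream only ever sees complete files). The paths are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.rename(new Path("/user/patcharee/staging/data-001.txt"),
              new Path("/user/patcharee/cerdata/data-001.txt"))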

Pyspark filter not empty

2016-01-29 Thread patcharee
Hi, In PySpark, how can I filter rows where a DataFrame column is not empty? I tried: dfNotEmpty = df.filter(df['msg']!='') but it did not work. Thanks, Patcharee
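
A guess at the cause, sketched in Scala like the other examples here: if the column contains nulls, no equality test matches them, so the null check has to be explicit. The sample data is illustrative:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq(Tuple1("hello"), Tuple1(""), Tuple1(null: String)))
               .toDF("msg")
    // nulls fail every comparison, including != '', so filter them explicitly
    val notEmpty = df.filter(df("msg").isNotNull && (df("msg") !== ""))
    notEmpty.show()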

kafka streaming topic partitions vs executors

2016-02-26 Thread patcharee
…(the topic's partitions). However, some executors are given more than one task and work on those tasks sequentially. Why does Spark not distribute these 10 tasks to 10 executors? How can I make it do that? Thanks, Patcharee
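
A sketch of one way to fan the work out: repartition the stream after receiving, at the cost of a shuffle. The broker, topic, and batch interval are placeholder assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "group",
                                         Map("topic" -> 1))
    stream.repartition(10)   // spread records over ~10 tasks per batch
          .count()
          .print()
    ssc.start()
    ssc.awaitTermination()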

hiveContext sql number of tasks

2015-10-07 Thread patcharee
…to force Spark SQL to use fewer tasks? BR, Patcharee
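
If the tasks in question are post-shuffle tasks, their count is governed by spark.sql.shuffle.partitions (default 200 in this era), not by the input splits. A sketch with an illustrative table and query:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("spark.sql.shuffle.partitions", "50")
    val result = hiveContext.sql(
      "SELECT zone, COUNT(*) FROM mytable GROUP BY zone")  // hypothetical query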

sql query orc slow

2015-10-08 Thread patcharee
Hi, I am using Spark SQL 1.5 to query a Hive table stored as partitioned ORC files. There are about 6000 files in total, each about 245 MB. What is the difference between the two query methods below? 1. Using a query on the Hive table directly: hiveContext.sql("select col1,…
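
A sketch of the two paths being compared; the warehouse path and predicate are illustrative, and ORC filter pushdown only applies once spark.sql.orc.filterPushdown is enabled:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

    // 1. through the Hive metastore table
    val viaTable = hiveContext.sql("SELECT col1 FROM orc_table WHERE zone = 320")

    // 2. directly on the ORC files
    val viaFiles = hiveContext.read.format("orc")
      .load("/apps/hive/warehouse/orc_table")
      .filter("zone = 320")
      .select("col1")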

Re: sql query orc slow

2015-10-08 Thread patcharee
Yes, the predicate pushdown is enabled, but it still takes longer than the first method. BR, Patcharee On 8 October 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee wrote: Hi…

Re: sql query orc slow

2015-10-09 Thread patcharee
…this time the pushdown predicate was generated in the log, but the results were wrong (no results at all): 15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320) expr = leaf-0 Any ideas what is wrong with this? Why is the ORC pushdown predicate not applied by the system? BR,…

Re: sql query orc slow

2015-10-09 Thread patcharee
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but the log shows no ORC pushdown predicate for my query with a WHERE clause: 15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate I do not understand what is wrong here. BR, Patcharee On…

execute native system commands in Spark

2015-11-02 Thread patcharee
Hi, Is it possible to execute native system commands (in parallel) in Spark, e.g. via scala.sys.process? Best, Patcharee
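
A minimal sketch using scala.sys.process inside mapPartitions, so the command runs once per partition on the executors; the command itself is illustrative:

    import scala.sys.process._

    val out = sc.parallelize(1 to 8, 8)                          // 8 partitions
                .mapPartitions(_ => Iterator("hostname".!!.trim)) // !! captures stdout
    out.collect().foreach(println)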

How to run parallel on each DataFrame group

2015-11-05 Thread patcharee
…problem is that each group, after being filtered, is handled by an executor one by one. How can I change the code so that each group runs in parallel? I looked at groupBy, but it seems to be only for aggregation. Thanks, Patcharee
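
A sketch of one alternative: group at the RDD level so each group is handled inside a task rather than filtered one by one on the driver. The column names, data, and per-group handler are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"))).toDF("zone", "v")
    df.rdd
      .groupBy(row => row.getAs[Int]("zone"))     // one group per key
      .foreach { case (zone, rows) =>             // runs in parallel tasks
        println(s"group $zone: ${rows.size} rows")
      }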

spark streaming count msg in batch

2015-12-01 Thread patcharee
Hi, In Spark Streaming, how can I count the total number of messages (from a socket) in one batch? Thanks, Patcharee
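
A minimal sketch: DStream.count() yields a one-element stream per batch holding that batch's record count. The host, port, and batch interval are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()          // prints the per-batch message count
    ssc.start()
    ssc.awaitTermination()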

Spark Streaming - History UI

2015-12-01 Thread patcharee
Hi, On my history server UI, I cannot see the "streaming" tab for any streaming jobs. I am using version 1.5.1. Any ideas? Thanks, Patcharee

Re: Spark Streaming - History UI

2015-12-02 Thread patcharee
I meant there is no streaming tab at all. It looks like I need version 1.6. Patcharee On 2 December 2015 11:34, Steve Loughran wrote: The history UI doesn't update itself for live apps (SPARK-7889), though I'm working on it. Are you trying to view a running streaming job? On 2 Dec 2…

Spark applications metrics

2015-12-04 Thread patcharee
Hi, How can I see the summary of data read/write, shuffle read/write, etc. for a whole application, not per stage? Thanks, Patcharee

Spark UI - Streaming Tab

2015-12-04 Thread patcharee
…need to configure the history UI somehow to get such an interface? Thanks, Patcharee

Re: Spark UI - Streaming Tab

2015-12-04 Thread patcharee
I ran streaming jobs, but no streaming tab appeared for those jobs. Patcharee On 4 December 2015 18:12, PhuDuc Nguyen wrote: I believe the "Streaming" tab is dynamic: it appears once you have a streaming job running, not when the cluster is simply up. It does not depend on 1.6 and h…

bad performance on PySpark - big text file

2015-12-08 Thread patcharee
…log of these two input splits (check python.PythonRunner: Times: total ...) 15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728 15/12/08 08:49:30 INFO python.PythonRunner…

spark 1.5 sort slow

2015-09-01 Thread patcharee
…configuration explicitly? Any suggestions? BR, Patcharee

spark performance - executor computing time

2015-09-15 Thread patcharee
…and low GC time like the others. What can impact the executor computing time? Any suggestions on which parameters I should monitor/configure? BR, Patcharee

Idle time between jobs

2015-09-16 Thread patcharee
…GenerateHistogram.scala:143 15/09/16 11:21:08 INFO DAGScheduler: Got job 2 (saveAsTextFile at GenerateHistogram.scala:143) with 1 output partitions 15/09/16 11:21:08 INFO DAGScheduler: Final stage: ResultStage 2 (saveAsTextFile at GenerateHistogram.scala:143) BR,…

sparkR 3rd library

2017-09-03 Thread patcharee
…could not find function "rbga" at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala… Any ideas…

RDD String foreach println

2015-02-24 Thread patcharee
…differently on job submit and in the shell? Best, Patcharee
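
A sketch of the usual explanation, assuming that is what this thread hit: foreach(println) runs on the executors, so under spark-submit the output lands in executor stdout logs, while in the local shell it appears in the console. To print on the driver instead:

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.collect().foreach(println)   // brings everything to the driver; small RDDs only
    rdd.take(2).foreach(println)     // safer for large RDDs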

method newAPIHadoopFile

2015-02-25 Thread patcharee
…type arguments [no.uni.computing.io.WRFIndex,no.uni.computing.io.WRFVariable,no.uni.computing.io.input.NetCDFFileInputFormat] do not conform to method newAPIHadoopFile's type parameter bounds [K,V,F <: org.apache.hadoop.mapreduce.InputFormat[K,V]] What is the correct syntax for the Scala API? Best…
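
For reference, a sketch of the call shape newAPIHadoopFile expects, shown with a stock input format: F must extend org.apache.hadoop.mapreduce.InputFormat[K, V] with K and V matching the first two type arguments. The thread's NetCDF classes slot in the same way once that bound holds; the path is a placeholder:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///path/to/input")
    rdd.map { case (_, line) => line.toString }.take(5).foreach(println)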

Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
This is the declaration of my custom InputFormat: public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat Best, Patcharee On 25 February 2015 10:15, patcharee wrote: Hi…

Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
…complain. Please let me know if this solution is not good enough. Patcharee On 25 February 2015 10:57, Sean Owen wrote: OK, from the declaration you sent me separately: public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends…

custom inputformat serializable problem

2015-02-26 Thread patcharee
…Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: no.uni.computing.io.WRFVariableText Any ideas? Best, Patcharee
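
A sketch of the usual fix, assuming the standard cause: Hadoop record types are generally not Serializable (and get reused between records), so convert them to plain JVM types before anything that serializes results. Shown with stock types; the thread's WRFVariableText would be converted the same way:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///in")
    val safe = raw.map { case (k, v) => (k.get, v.toString) }  // Long and String
    safe.take(1).foreach(println)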

NoSuchElementException: None.get

2015-02-27 Thread patcharee
…belongs to a method of a case class, it should be executed sequentially? Any ideas? Best, Patcharee --- java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None…

insert Hive table with RDD

2015-03-03 Thread patcharee
Hi, How can I insert an RDD containing my data into an existing Hive table? Any examples? Best, Patcharee

Re: insert Hive table with RDD

2015-03-04 Thread patcharee
Hi, I guess the toDF() API is in Spark 1.3, which requires a build from source code? Patcharee On 3 March 2015 13:42, Cheng, Hao wrote: Use the SchemaRDD / DataFrame API via HiveContext. Assuming you're using the latest code, something probably like: val hc = new HiveContext(sc) i…
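
A sketch along the lines Cheng Hao describes, for Spark 1.3+; the case class and table name are illustrative assumptions:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    case class Record(key: Int, value: String)   // fields must match the table
    val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
    rdd.toDF().registerTempTable("tmp_records")
    hiveContext.sql("INSERT INTO TABLE existing_table SELECT * FROM tmp_records")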

insert hive partitioned table

2015-03-16 Thread patcharee
…month, zone) is from user input. If I would instead like to get the value of the partition column from the temporary table, how can I do that? BR, Patcharee

Re: insert hive partitioned table

2015-03-16 Thread patcharee
I would like to insert into the table, and the value of the partition column to be inserted must come from the temporary registered table/DataFrame. Patcharee On 16 March 2015 15:26, Cheng Lian wrote: Not quite sure whether I understand your question properly. But if you just want to read the…
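
A sketch of Hive dynamic partitioning, which takes the partition values from the SELECT itself rather than from user input; the table and column names are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // partition values (year, month, zone) come from the temp table's rows
    hiveContext.sql(
      """INSERT INTO TABLE target PARTITION (year, month, zone)
        |SELECT col1, col2, year, month, zone FROM temp_table""".stripMargin)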

Spark Job History Server

2015-03-18 Thread patcharee
…spark.yarn.historyServer.address sandbox.hortonworks.com:19888 But I got: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider What class is really needed? How do I fix it? Br,…

Re: Spark Job History Server

2015-03-18 Thread patcharee
…(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183) at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) Patcharee On 18 March 2015 11:35, Akhil Das wrote: You can simply…

Re: Spark Job History Server

2015-03-18 Thread patcharee
Hi, My Spark was compiled with the yarn profile, and I can run Spark on YARN without problems. For the Spark job history server problem, I checked spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package org.apache.spark.deploy.yarn.history is missing. I don't know why. BR, Patcharee

override log4j.properties

2015-04-09 Thread patcharee
Hello, How can I override log4j.properties for a specific Spark job? BR, Patcharee

AccessControlException hive table created from spark shell

2015-05-18 Thread patcharee
…table(key INT, value STRING) stored as orc") hiveContext.hql("INSERT INTO table orc_table select * from testtable") --> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=patcharee, access=WRITE, inode…

executor running time vs getting result from jupyter notebook

2016-04-14 Thread Patcharee Thongtra
…factor of the time spent on these steps? BR, Patcharee

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra
…not sorted / indexed - the split strategy hive.exec.orc.split.strategy BR, Patcharee On 10/09/2015 08:01 PM, Zhan Zhang wrote: That is weird. Unfortunately, there is no debug info available on this part. Can you please open a JIRA to add some debug information on the driver side? Thanks. Zhan…

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra
Hi Zhan Zhang, Here is the issue: https://issues.apache.org/jira/browse/SPARK-11087 BR, Patcharee On 10/13/2015 06:47 PM, Zhan Zhang wrote: Hi Patcharee, I am not sure which side is wrong, driver or executor. If it is the executor side, the reason you mentioned may be possible. But if the…

locality level counter

2015-11-25 Thread Patcharee Thongtra
…? Thanks, Patcharee

data local read counter

2015-11-25 Thread Patcharee Thongtra
Hi, Is there a counter for data-local reads? I thought it would be the locality level counter, but it seems not. Thanks, Patcharee

custom inputformat recordreader

2015-11-26 Thread Patcharee Thongtra
Hi, In Python, how can I use an InputFormat / custom RecordReader? Thanks, Patcharee

java.lang.RuntimeException: Couldn't find function Some

2015-03-09 Thread Patcharee Thongtra
…) Any ideas? I tested the same code in the Spark shell and it worked. Best, Patcharee

bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Patcharee Thongtra
…baseDirectory.value / "lib" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" libraryDependencies += "org.apache.spark" %% "spark-sql"…
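
The error in the subject line usually means the Hadoop classes that spark-core's signatures reference are missing from the build. A sketch of the usual fix, assuming a Hadoop 2.4.0 cluster (match your own version):

    // build.sbt: supplies org.apache.hadoop.conf.Configuration and friends
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"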

No assemblies found in assembly/target/scala-2.10

2015-03-13 Thread Patcharee Thongtra
…der.java:177) at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:102) at org.apache.spark.launcher.Main.main(Main.java:74) Any ideas? Patcharee