Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
hi luo, thanks for your reply. In fact I can use Hive from Spark on my Spark master machine, but when I copy my Spark files to another machine and try to access Hive through Spark I get the error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied hiv

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Tristan Blakers
Hi Shahab, I’ve seen exceptions very similar to this (it also manifests as a negative array size exception), and I believe it’s really a bug in Kryo. See this thread: http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail

Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi, I am trying to use sparkSQL to join tables in different data sources - hive and teradata. I can access the tables individually, but when I run a join query I get a query exception. The same query runs if all the tables exist in teradata. Any help would be appreciated. I am running the foll

RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread Wang, Daoyuan
How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply in fact I can use hive

Re: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
my metastore is like this javax.jdo.option.ConnectionURL jdbc:mysql://192.168.1.40:3306/hive javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver Driver class name for a JDBC metastore javax.jdo.option.ConnectionUs

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread jeanlyn
Have you configured SPARK_CLASSPATH with the MySQL jar in spark-env.sh? For example (export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar) > On 2015-05-05 at 3:32, 鹰 <980548...@qq.com> wrote: > > my metastore is like this > > javax.jdo.option.ConnectionURL >

Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
The error shows d_year column does not exist. You may need to modify the query. On 5 May 2015 17:20, "Ishwardeep Singh" wrote: > Hi , > > I am trying to use sparkSQL to join tables in different data sources - hive > and teradata. I can access the tables individually but when I run join > query >

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread shahab
Thanks Tristan for sharing this. Actually this happens when I am reading a csv file of 3.5 GB. best, /Shahab On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers wrote: > Hi Shahab, > > I’ve seen exceptions very similar to this (it also manifests as negative > array size exception), and I believe

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu wrote: > Hi Imran, > > Yes that's what MyPartitioner does. I do see (using traces from > MyPartitioner) that the key is partitioned on partition 0 but the

RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi , I am using Spark 1.3.0. I was able to join a JSON file on HDFS registered as a TempTable with a table in MySQL. On the same lines I tried to join a table in Hive with another table in Teradata but I get a query parse exception. Regards, Ishwardeep From: ankitjindal [via Apache Spark Use

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
thanks jeanlyn, it works -- Original message -- From: "jeanlyn"; Date: 2015-05-05 (Tue) 3:40 PM; To: "鹰" <980548...@qq.com>; Cc: "Wang, Daoyuan"; "user"; Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Have

Re: Event generator for SPARK-Streaming from csv

2015-05-05 Thread anshu shukla
I know these methods, but I need to create events using the timestamps in the data tuples, meaning every new tuple is generated using the timestamp in a CSV file. This will be useful to simulate the data rate over time, just like real sensor data. On Fri, May 1, 2015 at 2:52 PM, Juan Rodrí

setting spark configuration properties problem

2015-05-05 Thread Hafiz Mujadid
Hi all, I have declared the Spark context at the start of my program and then I want to change its configuration at some later stage in my code, as written below: val conf = new SparkConf().setAppName("Cassandra Demo") var sc: SparkContext = new SparkContext(conf) sc.getConf.set("spark.cassandra.connection.
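A note on why this usually fails: SparkConf values are read when the SparkContext is constructed, so mutating the conf afterwards has no effect on the running application. A minimal sketch (the Cassandra host value is just a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // All properties must be on the SparkConf before the context is created; once the
  // SparkContext exists, set() calls on sc.getConf are not picked up by the running job.
  val conf = new SparkConf()
    .setAppName("Cassandra Demo")
    .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
  val sc = new SparkContext(conf)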

JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav

example code for current date in spark sql

2015-05-05 Thread kiran mavatoor
Hi, In Hive, I am using unix_timestamp() as 'update_on' to insert the current date into the 'update_on' column of the table. Now I am converting it into Spark SQL. Please suggest example code to insert the current date and time into a column of the table using Spark SQL. Cheers, Kiran.
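One approach that should carry over, assuming the query runs through a HiveContext (so the Hive UDFs unix_timestamp/from_unixtime are available) — a sketch with hypothetical source/target table names:

  // hiveCtx is an existing HiveContext; table and column names are placeholders.
  val withTs = hiveCtx.sql(
    "SELECT col1, col2, from_unixtime(unix_timestamp()) AS update_on FROM source_table")
  withTs.insertInto("target_table")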

Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Wilhelm
Hey there, 1.) I'm loading 2 avro files that have slightly different schemas df1 = sqlc.load(file1, "com.databricks.spark.avro") df2 = sqlc.load(file2, "com.databricks.spark.avro") 2.) I want to unionAll them nfd = df1.unionAll(df2) 3.) Getting the following error --

Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
I too have a similar question. My understanding is that since Spark is written in Scala, having done it in Scala will be OK for certification. Can someone who has done the certification confirm? Thanks, Kartik On May 5, 2015 5:57 AM, "Gourav Sengupta" wrote: > Hi, > > how important is JAVA for Spark certif

Re: JAVA for SPARK certification

2015-05-05 Thread Stephen Boesch
There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta : > I too have similar question. > > My understanding is since Spark written in scala, having done in Scala > will be ok for certification. > > If someone who has done certification can confirm. > > Thanks, > > Kar

spark sql, creating literal columns in java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, What is the recommended way to create literal columns in java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from java as well? Cheers jan - To unsubscribe, e-mail: user-unsubscr...

Re: RDD coalesce or repartition by #records or #bytes?

2015-05-05 Thread Du Li
Hi, Spark experts: I did rdd.coalesce(numPartitions).saveAsSequenceFile("dir") in my code, which generates the rdd's in streamed batches. It generates numPartitions of files as expected with names dir/part-x. However, the first couple of files (e.g., part-0, part-1) have many times o
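For what it's worth, that skew is expected with coalesce: without a shuffle it just merges existing partitions, so the merged ones (and their output files) can be much larger than the rest. A sketch of the two options, using the same rdd and numPartitions as in the message:

  // coalesce without a shuffle glues existing partitions together, so the first
  // few output files can be several times larger than the others.
  rdd.coalesce(numPartitions).saveAsSequenceFile("dir")

  // repartition (coalesce with shuffle = true) redistributes records roughly evenly,
  // at the cost of a full shuffle.
  rdd.repartition(numPartitions).saveAsSequenceFile("dir")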

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
And how important is it to have a production environment? On 5 May 2015 20:51, "Stephen Boesch" wrote: > There are questions in all three languages. > > 2015-05-05 3:49 GMT-07:00 Kartik Mehta : > >> I too have similar question. >> >> My understanding is since Spark written in scala, having done in Sca

Spark + Kakfa with directStream

2015-05-05 Thread Guillermo Ortiz
I'm trying to execute the "Hello World" example with Spark + Kafka ( https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala) with createDirectStream and I get this error. java.lang.NoSuchMethodError: kafka.mes

RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi Ankit, printSchema() works fine for all the tables. hiveStoreSalesDF.printSchema() root |-- store_sales.ss_sold_date_sk: integer (nullable = true) |-- store_sales.ss_sold_time_sk: integer (nullable = true) |-- store_sales.ss_item_sk: integer (nullable = true) |-- store_sales.ss_customer_sk: in

RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-05 Thread Cheng, Hao
56mb / 26mb is very small size, do you observe data skew? More precisely, many records with the same chrname / name? And can you also double check the jvm settings for the executor process? From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Tuesday, May 5, 2015 7:50 PM To: Cheng, H

Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
I suggest you to try with date_dim.d_year in the query On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh < ishwardeep.si...@impetus.co.in> wrote: > Hi Ankit, > > > > printSchema() works fine for all the tables. > > > > hiveStoreSalesDF.printSchema() > > root > > |-- store_sales.ss_sold_date_sk:

multiple hdfs folder & files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi, we are using pyspark 1.3 and the input is text files located on HDFS. File structure: file1.txt file2.txt file1.txt file2.txt ... Question: 1) What is the way to provide as an input for a PySpark job multiple files

Re: Spark + Kakfa with directStream

2015-05-05 Thread Guillermo Ortiz
Sorry, I had a duplicated kafka dependency with another older version in another pom.xml 2015-05-05 14:46 GMT+02:00 Guillermo Ortiz : > I'm tryting to execute the "Hello World" example with Spark + Kafka ( > https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache

How to separate messages of different topics.

2015-05-05 Thread Guillermo Ortiz
I want to read from many topics in Kafka and know from where each message is coming (topic1, topic2 and so on). val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092") val topics = Set("EntryLog", "presOpManager") val directKafkaStream = KafkaUtils.createDirectStream[Str

Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
Production - not a whole lot of companies have implemented Spark in production, so though it is good to have, it's not a must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification; learning in a group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, "

Re: JAVA for SPARK certification

2015-05-05 Thread Zoltán Zvara
I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta wrote: > Production - not whole lot of companies have implemented Spark in > production an

Re: How to separate messages of different topics.

2015-05-05 Thread Cody Koeninger
Make sure to read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md The directStream / KafkaRDD has a 1 : 1 relationship between kafka topic/partition and spark partition. So a given spark partition only has messages from 1 kafka topic. You can tell what topic that is using
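A sketch of the pattern described in the linked post, assuming the ssc, kafkaParams and topics values from the original message: each spark partition of a direct-stream RDD maps to exactly one Kafka topic/partition, so the topic can be read off the offset ranges.

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka._

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  val withTopic = stream.transform { rdd =>
    // Only works on the direct stream: the underlying KafkaRDD carries its offset ranges.
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val topic = offsets(i).topic          // partition i comes from exactly one topic
      iter.map { case (_, value) => (topic, value) }
    }
  }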

Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it some way related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count f

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara wrote: > I might join in to this conversation with an ask. Would someone point me > to a decent exercise that would approximate the level of this exam (from > above)? Thanks! > >

Re: Spark job concurrency problem

2015-05-05 Thread Imran Rashid
can you give your entire spark submit command? Are you missing "--executor-cores "? Also, if you intend to use all 6 nodes, you also need "--num-executors 6" On Mon, May 4, 2015 at 2:07 AM, Xi Shen wrote: > Hi, > > I have two small RDD, each has about 600 records. In my code, I did > > val rdd

Re: JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi, I think all the required materials for reference are mentioned here: http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification My question was regarding the proficiency level required for Java. There are detailed examples and code mentioned for JAVA, Python

Re: How to skip corrupted avro files

2015-05-05 Thread Imran Rashid
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but you could copy the code for HadoopReliableRDD in the PR into your own code and use it, without having to wait for the

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read parquet files, I've observed that Spark generates as many partitions as there are parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x => number of partitions. Depending on your case, repartition could be a heavy task. Regards. Miguel
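A small addition, hedged on the 1.3 API: the DataFrame itself should be repartitionable without dropping to df.rdd. A sketch:

  val df = sqlContext.parquetFile(path)
  println(df.rdd.partitions.length)      // partition count as read from the input splits

  // Full shuffle down (or up) to the requested number of partitions.
  val df200 = df.repartition(200)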

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-05 Thread Imran Rashid
Gerard is totally correct -- to expand a little more, I think what you want to do is a solrInputDocumentJavaRDD.foreachPartition, instead of solrInputDocumentJavaRDD.foreach: solrInputDocumentJavaRDD.foreachPartition( new VoidFunction<Iterator<SolrInputDocument>>() { @Override public void call(Iterator<SolrInputDocument> docItr) {

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Imran Rashid
Are you setting a really large max buffer size for kryo? Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ? If not, we should open up another issue to get a better warning in these cases. On Tue, May 5, 2015 at 2:47 AM, shahab wrote: > Thanks Tristan for sharing this. Actuall

Where does Spark persist RDDs on disk?

2015-05-05 Thread Haoliang Quan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed? Thank y
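Not a full answer, but the on-disk blocks for persisted RDDs end up under the directory configured by spark.local.dir (default /tmp) on each executor, in the block manager's subdirectories. A sketch for checking that DISK_ONLY is actually taking effect (input path is a placeholder):

  import org.apache.spark.storage.StorageLevel

  val cached = sc.textFile("hdfs:///some/input").persist(StorageLevel.DISK_ONLY)
  cached.count()                       // forces the blocks to be written out

  // The Storage tab of the web UI and this call confirm which level is in use.
  println(cached.getStorageLevel)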

Where does Spark persist RDDs on disk?

2015-05-05 Thread hquan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed? Thank

Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to genera

Possible to disable Spark HTTP server ?

2015-05-05 Thread roy
Hi, When we start a spark job it starts a new HTTP server for each new job. Is it possible to disable the HTTP server for each job ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html Sent from the Apache Sp

Inserting Nulls

2015-05-05 Thread Masf
Hi. I have a spark application where I store the results into a table (with HiveContext). Some of these columns allow nulls. In Scala, these columns are represented through Option[Int] or Option[Double], depending on the data type. For example: val hc = new HiveContext(sc) var col1: Option[Integer

Re: Possible to disable Spark HTTP server ?

2015-05-05 Thread Ted Yu
SPARK-3490 introduced "spark.ui.enabled" FYI On Tue, May 5, 2015 at 8:41 AM, roy wrote: > Hi, > > When we start spark job it start new HTTP server for each new job. > Is it possible to disable HTTP server for each job ? > > > Thanks > > > > -- > View this message in context: > http://apache-s
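For reference, a minimal sketch of turning the per-job web UI off via that property:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("no-ui-job")
    .set("spark.ui.enabled", "false")   // skips starting the per-application web UI
  val sc = new SparkContext(conf)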

Re: How to skip corrupted avro files

2015-05-05 Thread Shing Hing Man
Thanks for the info ! Shing On Tuesday, 5 May 2015, 15:11, Imran Rashid wrote: You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but you could copy the cod

Spark applications Web UI at 4040 doesn't exist

2015-05-05 Thread marco.doncel
Hi all, I'm not able to access to the Spark Streaming running applications that I'm submitting to the EC2 standalone cluster (spark 1.3.1) via port 4040. The problem is that I don't even see running applications in the master's web UI (I do see running drivers). This is the command I use to submit

Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Let say I am storing my data in HDFS with folder structure and file partitioning as per below: /analytics/2015/05/02/partition-2015-05-02-13-50- Note that new file is created every 5 minutes. As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minut

Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
"As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minutes." This depends on the file format. Many file formats are splittable (like parquet), meaning that you can seek into various points of the file. 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior

Re: Inserting Nulls

2015-05-05 Thread Michael Armbrust
Option only works when you are going from case classes. Just put null into the Row, when you want the value to be null. On Tue, May 5, 2015 at 9:00 AM, Masf wrote: > Hi. > > I have a spark application where I store the results into table (with > HiveContext). Some of these columns allow nulls.
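To make that concrete, a sketch of building Rows from Option values by unwrapping them to null (column values are hypothetical):

  import org.apache.spark.sql.Row

  val col1: Option[Int] = None
  val col2: Option[Double] = Some(1.5)

  // Box the primitives so orNull can hand a real null to the Row.
  val row = Row(col1.map(Int.box).orNull, col2.map(Double.box).orNull)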

Re: spark sql, creating literal columns in java.

2015-05-05 Thread Michael Armbrust
This should work from java too: http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$ On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann wrote: > Hey, > What is the recommended way to create literal columns in java? > Scala has the `lit` function from `org.apache
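Since the thread is about Java, a hedged note: lit is a static method on org.apache.spark.sql.functions, so from Java it should be callable as functions.lit(...). The Scala shape, for comparison (df is an existing DataFrame):

  import org.apache.spark.sql.functions.lit

  // Adds a constant column; from Java the equivalent is the static call functions.lit(1).
  val withConst = df.withColumn("one", lit(1))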

saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan
I have searched all replies to this question & not found an answer. I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on the same machine, and trying to write the output of a wordcount program into HDFS (it works fine writing to a local file, /tmp/wordcount). The only line I added to t

Parquet Partition Strategy - how to partition data correctly

2015-05-05 Thread Todd Nist
Hi, I have a DataFrame that represents my data looks like this: +------------+-----------+ | col_name | data_type | +------------+-----------+ | obj_id | string | | type | string | | name

Maximum Core Utilization

2015-05-05 Thread Manu Kaul
Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large number of cores available (like 100 cores), the Spark system seems to stop after using just 4 cores and after that the runtime is pretty much a straight line n

Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi, do you have information on how many partitions/tasks the stage/job is running? By default there is 1 core per task, and your number of concurrent tasks may be limiting core utilization. There are a few settings you could play with, assuming your issue is related to the above: spark.default.pa
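A sketch of the two knobs mentioned (values are illustrative; inputRdd stands in for the job's input):

  import org.apache.spark.SparkConf

  // More input partitions => more tasks that can run concurrently.
  val repartitioned = inputRdd.repartition(100)

  // Or raise the default parallelism used for shuffles and parallelize().
  val conf = new SparkConf().setAppName("many-cores").set("spark.default.parallelism", "100")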

Multilabel Classification in spark

2015-05-05 Thread peterg
Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter -- View this message in context: http://apache-spar

RE: Remoting warning when submitting to cluster

2015-05-05 Thread Javier Delgadillo
I downloaded the 1.3.1 source distribution and built on Windows (laptop 8.0 and desktop 8.1) Here’s what I’m running: Desktop: Spark Master (%SPARK_HOME%\bin\spark-class2.cmd org.apache.spark.deploy.master.Master -h desktop --port 7077) Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd org.apache

Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg wrote: > Hi all, > > I'm looking to implement a Multilabel classification algor

Help with datetime comparison in SparkSQL statement ...

2015-05-05 Thread subscripti...@prismalytics.io
Hello Friends: Here's sample output from a SparkSQL query that works, just so you can see the underlying data structure; followed by one that fails. >>> # Just so you can see the DataFrame structure ... >>> >>> resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE trip_time_in_secs = 3780") >>

Map one RDD into two RDD

2015-05-05 Thread Bill Q
Hi all, I have a large RDD that I map a function over. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save each type into its own RDD. But I can't seem to find an efficient way to do it. Any suggestions? Many thanks. Bill -- Many thank

Re: Map one RDD into two RDD

2015-05-05 Thread Ted Yu
Have you looked at RDD#randomSplit() (as example) ? Cheers On Tue, May 5, 2015 at 2:42 PM, Bill Q wrote: > Hi all, > I have a large RDD that I map a function to it. Based on the nature of > each record in the input RDD, I will generate two types of data. I would > like to save each type into it
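randomSplit splits by random fraction rather than by record type, so if the split really depends on each record's content, a common alternative is to cache the mapped RDD and filter it twice. A sketch; classify/transform and the output paths are hypothetical stand-ins for the real logic:

  // Tag each record with its type once, and keep the result around for the two passes.
  val tagged = input.map(r => (classify(r), transform(r))).cache()

  val typeA = tagged.filter(_._1 == "A").values
  val typeB = tagged.filter(_._1 == "B").values

  typeA.saveAsTextFile("out/typeA")
  typeB.saveAsTextFile("out/typeB")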

Configuring Number of Nodes with Standalone Scheduler

2015-05-05 Thread Nastooh Avessta (navesta)
Hi, I have a 1.0.0 cluster with multiple worker nodes that deploy a number of external tasks, through getRuntime().exec. Currently I have no control over how many nodes get deployed for a given task. At times the scheduler evenly distributes the executors among all nodes and at other times it only us

Spark SQL Standalone mode missing parquet?

2015-05-05 Thread Manu Mukerji
Hi All, When I try and run Spark SQL in standalone mode it appears to be missing the parquet jar; I have to pass it with --jars and that works.. sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar --driver-memory 28g --master local[10] Any ideas on why? I downloaded the one pre buil

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your hdfs from the local filesystem? Looks like it's an HDFS issue rather than a Spark thing. On 6 May 2015 05:04, "Sudarshan" wrote: > > I have searched all replies to this question & not found an answer. > > I am running standalone Spark 1.3.1 and Hortonwork's

Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
If you mean "multilabel" (predicting multiple label values), then MLlib does not yet support that. You would need to predict each label separately. If you mean "multiclass" (1 label taking >2 categorical values), then MLlib supports it via LogisticRegression (as DB said), as well as DecisionTree

Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data to 50 partitions On 6 May 2015 05:56, "Manu Kaul" wrote: > Hi All, > For a job I am running on Spark with a dataset of say 350,000 lines (not > big), I am finding that even though my cluster has a large number of cores > avail

AvroFiles

2015-05-05 Thread Pankaj Deshpande
Hi, I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/ I am getting a nullPointerException on Schema read. It could be an avro version mismatch. Has anybody had a simil

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
You are most probably right. I assumed others may have run into this. When I try to put the files in there, it creates a directory structure with the part-0 and part-1 files but these files are of size 0 - no content. The client error and the server logs have the error message shown - which

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Another thing - could it be a permission problem ? It creates all the directory structure (in red) /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 so I am guessing not. On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty wrote: > You are most probably right

Re: Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Thanks, I'm not aware of splittable file formats. If that is the case, does the number of files affect Spark performance? Maybe because of the overhead when opening files? And is that problem solved by having big files in a splittable file format? Any suggestion from your experience on how to organize data

Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
You should check out parquet. If you can't avoid 5-minute log files, you can have an hourly (or daily!) MR job that compacts these. Another nice thing about parquet is it has filter push down, so if you want a smaller range of time you can avoid deserializing most of the other data On Tuesday, May 5

Re: Join between Streaming data vs Historical Data in spark

2015-05-05 Thread Rendy Bambang Junior
Thanks. Since the join will be done on a regular basis over a short period of time (let's say 20s), do you have any suggestions on how to make it faster? I am thinking of partitioning the data set and caching it. Rendy On Apr 30, 2015 6:31 AM, "Tathagata Das" wrote: > Have you taken a look at the join section in t
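A sketch of the pre-partition-and-cache idea, assuming a pair DStream (streamPairs) and a pair RDD of historical data (historicalRdd) keyed the same way — both names are placeholders:

  import org.apache.spark.HashPartitioner

  // Partition the historical data once and keep it cached, so each micro-batch
  // join only shuffles the (small) streaming side.
  val partitioner = new HashPartitioner(100)                // partition count is illustrative
  val historical = historicalRdd.partitionBy(partitioner).cache()

  val joined = streamPairs.transform(batch => batch.join(historical, partitioner))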

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
Try to add one more data node or make minreplication to 0. Hdfs is trying to replicate at least one more copy and not able to find another DN to do thay On 6 May 2015 09:37, "Sudarshan Murty" wrote: > Another thing - could it be a permission problem ? > It creates all the directory structure (in

Re: Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Michael Armbrust
You need to add a select clause to at least one dataframe to give them the same schema before you can union them (much like in SQL). On Tue, May 5, 2015 at 3:24 AM, Wilhelm wrote: > Hey there, > > 1.) I'm loading 2 avro files with that have slightly different schema > > df1 = sqlc.load(file1, "c
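A sketch of what that looks like in Scala (the same methods exist in the Python API); the shared column names are hypothetical:

  // Project both sides onto the same columns, in the same order, before the union.
  val common = Seq("id", "name", "value")                   // hypothetical shared columns
  val unioned = df1.select(common.map(df1(_)): _*)
    .unionAll(df2.select(common.map(df2(_)): _*))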

Re: Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Michael Armbrust
I'll add that simple type promotion is done automatically when the types are compatible (i.e. Int -> Long). On Tue, May 5, 2015 at 5:55 PM, Michael Armbrust wrote: > You need to add a select clause to at least one dataframe to give them the > same schema before you can union them (much like in S

MLlib libsvm isssues with data

2015-05-05 Thread doyere
hi all: I’ve met an issue with MLlib. I posted it to the community before but it seems I put it in the wrong place :( Then I put it on Stack Overflow; for a well-formatted description of the details please see http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data. Hope someone could help 😢 I guess it’s due to my data, bu

Re: AvroFiles

2015-05-05 Thread Todd Nist
Are you using Kryo or Java serialization? I found this post useful: http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist If using kryo, you need to register the classes with kryo, something like this: sc.registerKryoClasses(Array( cla

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried but it does not work. Just want a confirmation. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-use-hive-confi

RE: Multilabel Classification in spark

2015-05-05 Thread Ulanov, Alexander
If you are interested in multilabel (not multiclass), you might want to take a look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is supposed to perform one-versus-all transformation on classes, which is usually how multilabel classifiers are built. Alexander From: Joseph B

overloaded method constructor Strategy with alternatives

2015-05-05 Thread xweb
I am getting the following error on this code: Error:(164, 25) overloaded method constructor Strategy with alternatives: (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,categoricalFeaturesInfo: java.ut

Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ted Yu
Can you give us a bit more information ? Such as release of Spark you're using, version of Scala, etc. Thanks On Tue, May 5, 2015 at 6:37 PM, xweb wrote: > I am getting on following code > Error:(164, 25) *overloaded method constructor Strategy with alternatives:* > (algo: org.apache.spark.ml

what does "Container exited with a non-zero exit code 10" means?

2015-05-05 Thread felicia
Hi all, We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, and we get this error either using java or python; Application application_1430482716098_0607 failed 2 times due to AM Container for appattempt_1430482716098_0607_02 exited with exitCode: 10 due to: Exception from co

Re: what does "Container exited with a non-zero exit code 10" means?

2015-05-05 Thread Marcelo Vanzin
What Spark tarball are you using? You may want to try the one for hadoop 2.6 (the one for hadoop 2.4 may cause that issue, IIRC). On Tue, May 5, 2015 at 6:54 PM, felicia wrote: > Hi all, > > We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, > and we get this error either using ja

Re: AvroFiles

2015-05-05 Thread Pankaj Deshpande
I am not using Kryo. I was using the regular sqlcontext.avrofiles to open it. The file loads properly with the schema. The exception happens when I try to read it. Will try the Kryo serializer and see if that helps. On May 5, 2015 9:02 PM, "Todd Nist" wrote: > Are you using Kryo or Java serialization? I f

Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ash G
I am using Spark 1.3.0 and Scala 2.10. Thanks On Tue, May 5, 2015 at 6:48 PM, Ted Yu wrote: > Can you give us a bit more information ? > Such as release of Spark you're using, version of Scala, etc. > > Thanks > > On Tue, May 5, 2015 at 6:37 PM, xweb wrote: > >> I am getting on following code

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Thanks much for your help. Here's what was happening ... The HDP VM was running in VirtualBox and host was connected to the guest VM in NAT mode. When I connected this in "Bridged Adapter" mode it worked ! On Tue, May 5, 2015 at 8:54 PM, ayan guha wrote: > Try to add one more data node or make

Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread Rendy Bambang Junior
Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use spark streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data to HDFS from Kafka for storage and for further Spark processing Rendy
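It is a fairly common pattern; a minimal sketch of what such a streaming job would do (conf, kafkaParams and topics are assumed to exist, and the path and batch interval are placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val ssc = new StreamingContext(conf, Seconds(60))         // batch interval is a placeholder
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  // Writes one directory of text files per batch under the given prefix on HDFS.
  stream.map(_._2).saveAsTextFiles("hdfs:///data/kafka/raw")

  ssc.start()
  ssc.awaitTermination()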

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread MrAsanjar .
why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior < rendy.b.jun...@gmail.com> wrote: > Hi all, > > I am planning to load data from Kafka to HDFS. Is it normal to use spark > streaming to load data from Kafka to HD

Re: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-05 Thread luohui20001
update status after I did some tests. I modified some other parameters and found 2 parameters that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions. Before today I used the default settings of spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and 200. At that time, my
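For reference, the shuffle-partition count can be changed per SQLContext; a sketch (the value is illustrative, not a recommendation for this data size):

  // spark.sql.shuffle.partitions (default 200) controls the reduce-side partition count
  // used by Spark SQL joins and aggregations.
  sqlContext.setConf("spark.sql.shuffle.partitions", "64")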

Re: multiple hdfs folder & files input to PySpark

2015-05-05 Thread Ai He
Hi Oleg, For 1, RDD#union will help. You can iterate over the folders and union the obtained RDDs along the way. For 2, seems like it won’t work in a deterministic way according to this discussion(http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex

Re: Remoting warning when submitting to cluster

2015-05-05 Thread Akhil Das
Here's one of the settings that i used for a closed environment: .set("spark.blockManager.port","40010") .set("spark.broadcast.port","40020") .set("spark.driver.port","40030") .set("spark.executor.port","40040") .set("spark.fileserver.port","40050") .set("spark.replClassSer

Re: OOM error with GMMs on 4GB dataset

2015-05-05 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni wrote: > Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). > The spark (1.3.1) job is allocated 120 executors with 6GB each and the > driver also has 6GB. > Spark C