We use this, but we are not sure how the schema is stored:
Job job = Job.getInstance();
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
AvroParquetOutputFormat.setSchema(job, schema);
LazyOutputFormat.setOutputFormatClass(job, new ParquetOutputFormat().getClass());
job.getConfigurat
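As an aside, AvroWriteSupport serializes the Avro schema into the job configuration and it ends up in each output file's footer metadata. A rough sketch of inspecting that footer, assuming parquet-hadoop is on the classpath (the path and the metadata key are illustrative, and older releases use the parquet.hadoop package instead of org.apache.parquet.hadoop):
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/out/part-r-00000.parquet"))
// the Avro schema usually appears under a key such as "parquet.avro.schema" (older releases used "avro.schema")
footer.getFileMetaData.getKeyValueMetaData.asScala.foreach { case (k, v) => println(s"$k -> $v") }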
You can simply execute the sbin/start-slaves.sh script to start up all the slave
processes. Just make sure you have Spark installed at the same path on all
the machines.
Thanks
Best Regards
On Sat, Mar 19, 2016 at 4:01 AM, Ashok Kumar
wrote:
> Experts.
>
> Please your valued advice.
>
> I have spark
Hi,
In Spark 1.5.2
Do we have any utility which converts a constant value as shown below,
or can we declare a date variable like val start_date: Date = "2015-03-02"?
val start_date = "2015-03-02".toDate
like how we convert with toInt, toString.
I searched for it but couldn't find it.
Thanks,
Divya
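For what it's worth, a minimal sketch of two options, assuming Spark 1.5.x (df and the column name are illustrative; there is no built-in .toDate on String):
import org.apache.spark.sql.functions.{lit, to_date}
val start_date: java.sql.Date = java.sql.Date.valueOf("2015-03-02")    // plain JVM parsing of yyyy-MM-dd
val withDate = df.withColumn("start_date", to_date(lit("2015-03-02"))) // or as a DataFrame column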
Looks like a jar conflict. Could you paste the piece of code, and show what your
dependency file looks like?
Thanks
Best Regards
On Sat, Mar 19, 2016 at 7:49 AM, vasu20 wrote:
> Hi,
>
> I have some code that parses a snappy thrift file for objects. This code
> works fine when run standalone (outside
Hello,
I'm trying to run a simple linear regression in Spark ML. Below is my DataFrame
along with some sample code and output, run via Spyder on a local Spark
cluster.
##
# Begin Code
##
regressionDF.show(5)
+------+--------+
| label|features|
+------+--------+
Have a look at the IntelliJ setup:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
Once you have the setup ready, you don't have to recompile everything
every time.
Thanks
Best Regards
On Mon, Mar 21, 2016 at 8:14 AM, Tenghuan He wrote:
What you should be doing is a join, something like this:
// Create a key-value pair, the key being column1
val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(",")))
// Create a key-value pair, the key being column2
val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))
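A minimal sketch of how the join could then be completed (the column positions and the replacement rule are assumptions for illustration):
val joined = rdd1.join(rdd2) // RDD[(String, (Array[String], Array[String]))]
val replaced = joined.map { case (key, (row1, row2)) =>
  row1.updated(0, row2(0)).mkString(",") // e.g. swap in file2's value where the keys match
}
replaced.take(5).foreach(println)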
Hi,
I have been working on a POC on some time-series related stuff. I'm using
Python since I need Spark Streaming and SparkR does not yet have a Spark
Streaming front end. A couple of algorithms I want to use are not yet
present in the Spark-TS package, so I'm thinking of invoking an external R
script for
Hi All,
In my current project there is a requirement to store Avro data
(JSON format) as Parquet files.
I was able to use AvroParquetWriter separately to create the Parquet
files. The Parquet files, along with the data, also had the Avro schema
stored in them as part of their footer
Hi Andy, I think you can do that with some open-source packages/libs
built for IPython and Spark.
Here is one: https://github.com/litaotao/IPython-Dashboard
On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> We are considering deploying a notebook serve
I have stored the contents of two csv files in separate RDDs.
file1.csv format: (column1, column2, column3)
file2.csv format: (column1, column2)
column1 of file1 and column2 of file2 contain similar data. I want to
compare the two columns and if a match is found:
- Replace the data at c
+1
On Mar 21, 2016 09:52, "Hiroyuki Yamada" wrote:
> Could anyone give me some advices or recommendations or usual ways to do
> this ?
>
> I am trying to get all (probably top 100) product recommendations for each
> user from a model (MatrixFactorizationModel),
> but I haven't figured out yet to
Any specific reason you would like to use collectAsMap only? You could probably
move to a normal RDD instead of a Pair.
On Monday, March 21, 2016, Mark Hamstra wrote:
> You're not getting what Ted is telling you. Your `dict` is an RDD[String]
> -- i.e. it is a collection of a single value type, Strin
You're not getting what Ted is telling you. Your `dict` is an RDD[String]
-- i.e. it is a collection of a single value type, String. But
`collectAsMap` is only defined for PairRDDs that have key-value pairs for
their data elements. Both a key and a value are needed to collect into a
Map[K, V].
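A minimal sketch of the distinction, assuming a spark-shell session (names and the delimiter are illustrative):
import org.apache.spark.rdd.RDD
val dict: RDD[String] = sc.parallelize(Seq("a:1", "b:2"))
// dict.collectAsMap() // will not compile: collectAsMap needs an RDD of (K, V) pairs
val pairs: RDD[(String, String)] = dict.map { s =>
  val Array(k, v) = s.split(":", 2)
  (k, v)
}
val asMap = pairs.collectAsMap() // scala.collection.Map[String, String]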
Could anyone give me some advice, recommendations, or usual ways to do
this?
I am trying to get all (probably top 100) product recommendations for each
user from a model (MatrixFactorizationModel),
but I haven't figured out how to do it efficiently yet.
So far,
calling predict (predictAll in pyspa
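If the MLlib version is 1.4 or later, MatrixFactorizationModel has a batch API that returns the top-N products per user, which avoids predicting every (user, product) pair; a rough sketch:
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
def topRecommendations(model: MatrixFactorizationModel) = {
  val perUser = model.recommendProductsForUsers(100) // RDD[(userId, Array[Rating])]
  perUser.mapValues(_.map(_.product))                // keep just the product ids per user
}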
I’m not entirely sure if this is what you’re asking, but you could just use the
datediff function:
val df2 = df.withColumn("ID", datediff($"end", $"start"))
If you want it formatted as {n}D then:
val df2 = df.withColumn("ID", concat(datediff($"end", $"start"),lit("D")))
Thanks,
Silvio
From: D
I have a time stamping table which has data like:
No of Days  ID
1           1D
2           2D
and so on till 30 days.
Have another Dataframe with start date and end date.
I need to get the difference between these two dates and get the ID from the
time stamping table and do withColumn.
The
Hi all,
Has anyone used ORC indexes in Spark SQL? Does Spark SQL support ORC indexes
completely?
I use the shell script "${SPARK_HOME}/bin/spark-sql" to run the spark-sql REPL and
execute my query statement.
The following is my test in the spark-sql REPL:
spark-sql>set spark.sql.orc.filterPushdown=true;
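For reference, a hedged sketch of the same check from a spark-shell with Hive support, assuming Spark 1.5+ (the path and column name are illustrative):
import sqlContext.implicits._
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df = sqlContext.read.orc("/path/to/orc_data")
df.filter($"id" > 1000).count() // pushdown lets ORC skip stripes/row groups using its indexes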
To speed up the build process, take a look at install_zinc() in build/mvn,
around line 83.
And the following around line 137:
# Now that zinc is ensured to be installed, check its status and, if its
# not running or just installed, start it
FYI
On Sun, Mar 20, 2016 at 7:44 PM, Tenghuan He wrot
Hi everyone,
I am trying to add a new method to spark RDD. After changing the code
of RDD.scala and running the following command
mvn -pl :spark-core_2.10 -DskipTests clean install
It BUILD SUCCESS, however, when starting the bin\spark-shell, my method
cannot be found.
Do I have to
Hi,
I created a dataset of 100 points, ranging from X=1.0 to X=100.0. I let
the y variable be 0.0 if X < 51.0 and 1.0 otherwise. I then fit an
SVMWithSGD. When I predict the y values for the same values of X as in the
sample, I get back 1.0 for each predicted y!
Incidentally, I don't get perfe
Apologies. Good point
def convertColumn(df: org.apache.spark.sql.DataFrame, name: String, newType: String) = {
  val df_1 = df.withColumnRenamed(name, "ConvertColumn")
  df_1.withColumn(name, df_1.col("ConvertColumn").cast(newType)).drop("ConvertColumn")
}
val df_3 = convertColumn(d
Mich:
Looks like convertColumn() is a method of your own - I don't see it in the Spark
code base.
On Sun, Mar 20, 2016 at 3:38 PM, Mich Talebzadeh
wrote:
> Pretty straight forward as pointed out by Ted.
>
> --read csv file into a df
> val df =
> sqlContext.read.format("com.databricks.spark.csv").optio
Pretty straight forward as pointed out by Ted.
--read csv file into a df
val df =
sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header", "true").load("/data/stg/table2")
scala> df.printSchema
root
|-- Invoice Number: string (nullable = true)
|-- Paymen
Please refer to the following methods of DataFrame:
def withColumn(colName: String, col: Column): DataFrame = {
def drop(colName: String): DataFrame = {
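A minimal sketch combining them with the spark-csv reader used elsewhere in this thread (the path and column names are illustrative):
import org.apache.spark.sql.types.IntegerType
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("/data/stg/table2")
val renamed = df.withColumnRenamed("Invoice Number", "invoice_number")
val typed = renamed.withColumn("invoice_number", renamed("invoice_number").cast(IntegerType))
typed.drop("Payment Date").write.parquet("/data/stg/table2_parquet") // drop a column before writing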
On Sun, Mar 20, 2016 at 2:47 PM, Ashok Kumar
wrote:
> Gurus,
>
> I would like to read a csv file into a Data Frame but able to rename the
Gurus,
I would like to read a csv file into a Data Frame, but be able to rename a
column, change a column type from String to Integer, or drop a column from
further analysis before saving the data as a parquet file?
Thanks
You should use it as described in the documentation, passing it as a
package:
./bin/spark-submit --packages
org.apache.spark:spark-streaming-flume_2.10:1.6.1 ...
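Once the package is on the classpath, a hedged sketch of the polling-based sink from the shell (host and port are illustrative):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
val ssc = new StreamingContext(sc, Seconds(10))
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)
flumeStream.map(e => new String(e.event.getBody.array())).print() // decode event bodies as strings
ssc.start()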
On Sun, Mar 20, 2016 at 9:22 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I'm trying to use the Spark Sin
$ jar tvf
./external/flume-sink/target/spark-streaming-flume-sink_2.10-1.6.1.jar |
grep SparkFlumeProtocol
841 Thu Mar 03 11:09:36 PST 2016
org/apache/spark/streaming/flume/sink/SparkFlumeProtocol$Callback.class
2363 Thu Mar 03 11:09:36 PST 2016
org/apache/spark/streaming/flume/sink/SparkFlume
Hi,
I'm trying to use the Spark Sink with Flume but it seems I'm missing some
of the dependencies.
I'm running the following code:
./bin/spark-shell --master yarn --jars
/home/impact/flumeStreaming/spark-streaming-flume_2.10-1.6.1.jar,/home/impact/flumeStreaming/flume-ng-core-1.6.0.jar,/home/impac
Hi
I will try the same settings as you tomorrow to see if I can reproduce the same issues.
Will report back once done
Thanks
On 20 Mar 2016 3:50 pm, "Vincent Ohprecio" wrote:
> Thanks Mich and Marco for your help. I have created a ticket to look into
> it on dev channel.
> Here is the issue https://issues.ap
Hi,
Can you check Kafka topic replication ? And leader information?
Regards,
Surendra M
-- Surendra Manchikanti
On Thu, Mar 17, 2016 at 7:28 PM, Ascot Moss wrote:
> Hi,
>
> I have a Spark Streaming (with Kafka) job; after running for several days, it
> failed with the following errors:
> ERROR DirectKa
Thanks Mich and Marco for your help. I have created a ticket to look into
it on dev channel.
Here is the issue https://issues.apache.org/jira/browse/SPARK-14031
On Sun, Mar 20, 2016 at 2:57 AM, Mich Talebzadeh
wrote:
> Hi Vincent,
>
> I downloaded the CSV file and did the test.
>
> Spark version
I took a look at docs/configuration.md.
Though I didn't find an answer to your first question, I think the following
pertains to your second question:
spark.python.worker.memory (default: 512m)
Amount of memory to use per Python worker process during aggregation,
in the same format as JVM memory
Hi Guys,
I built an ML pipeline that includes a multilayer perceptron
classifier. I got the following error message when I tried to save the
pipeline model. It seems like the MLPC model cannot be saved, which means I have
no way to save the trained model. Is there any way to save the model t
SOLVED:
Unfortunately, the stderr log in Hadoop's Resource Manager UI was not useful
since it just reported "... Lost executor XX on workerYYY...". Therefore, I
dumped the whole app-related logs locally: yarn logs -applicationId
application_1458320004153_0343 > ~/application_1458320004153_0343.txt
Not that I know of.
Can you be a little more specific on which JVM(s) you want restarted
(assuming spark-submit is used to start a second job) ?
Thanks
On Sun, Mar 20, 2016 at 6:20 AM, Udo Fholl wrote:
> Hi all,
>
> Is there a way for spark-submit to restart the JVM in the worker machines?
>
>
Hello,
I found a strange behavior after executing a prediction with MLlib.
My code returns an RDD[(Any, Double)] where Any is the id of my dataset,
which is a BigDecimal, and Double is the prediction for that line.
When I run myRdd.take(10) it returns OK:
res16: Array[_ >: (Double, Double) <: (Any, Doubl
Hi all,
Is there a way for spark-submit to restart the JVM in the worker machines?
Thanks.
Udo.
Can you share a snippet that reproduces the error? What was
spark.sql.autoBroadcastJoinThreshold before your last change?
On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový wrote:
> Hi,
>
> any idea what could be causing this issue? It started appearing after
> changing parameter
>
> spark.sql.aut
If you encode the data in something like Parquet, we usually have more
information and will try to broadcast.
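For reference, a minimal sketch of forcing the broadcast explicitly, assuming Spark 1.5+ where functions.broadcast is available (table and column names are illustrative):
import org.apache.spark.sql.functions.broadcast
val big = sqlContext.table("big_table")
val small = sqlContext.table("small_dim")
big.join(broadcast(small), "key").count() // hints the planner to broadcast the small side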
On Thu, Mar 17, 2016 at 7:27 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:
> Anyways to cache the subquery or force a broadcast join without persisting
> it?
>
>
>
> y
>
>
>
Also, this is the command I use to submit the Spark application:
where recommendation_engine-0.1-py2.7.egg is a Python egg of my own
library I've written for this application, and 'file' and
'/home/spark/enigma_analytics/tests/msg-epims0730_small.json' are input
arguments for the applica
Hi Vincent,
I downloaded the CSV file and did the test.
Spark version 1.5.2
The full code as follows. Minor changes to delete yearAndCancelled.parquet
and output.csv files if they are already created
//$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
val HiveContext = n
You might want to try assigning more cores to the driver.
Sent from my iPhone
> On 20 Mar 2016, at 07:34, Jialin Liu wrote:
>
> Hi,
> I have set the partitions as 6000, and requested 100 nodes, with 32
> cores each node,
> and the number of executors is 32 per node
>
> spark-submit --master $SP
The error is very strange indeed; however, without code that reproduces
it, we can't really provide much help beyond speculation.
One thing that stood out to me immediately is that you say you have an
RDD of Any where every Any should be a BigDecimal, so why not specify
that type information?
When
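A minimal sketch of pinning down the element type up front, assuming the ids are java.math.BigDecimal or Scala BigDecimal (myRdd and the conversion rule are illustrative):
import org.apache.spark.rdd.RDD
val typed: RDD[(BigDecimal, Double)] = myRdd.map {
  case (id: java.math.BigDecimal, pred: Double) => (BigDecimal(id), pred) // wrap the Java type
  case (id: BigDecimal, pred: Double)           => (id, pred)
}
typed.take(10).foreach(println)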
In the meantime there is also deeplearning4j which integrates with Spark
(for both Java and Scala): http://deeplearning4j.org/
Regards,
James
On 17 March 2016 at 02:32, Ulanov, Alexander
wrote:
> Hi Charles,
>
>
>
> There is an implementation of multilayer perceptron in Spark (since 1.5):
>
>
We probably should have the alias. Is this still a problem on master
branch?
On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov
wrote:
> Running following:
>
> #fix schema for gaid which should not be Double
>> from pyspark.sql.types import *
>> customSchema = StructType()
>> for (col,typ) in ts
Hi guys,
I'm having a problem where respawning a failed executor during a job that
reads/writes parquet on S3 causes subsequent tasks to fail because of
missing AWS keys.
Setup:
I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple
standalone cluster:
1 master
2 workers
My