Hi Manoj,
Yes, you've already hit the point. I think the timestamp type support in the
in-memory columnar storage can be a good reference for you. Also, you may
want to enable compression support for the decimal type by adding the DECIMAL
column type to RunLengthEncoding.supports and DictionaryEncoding.supports.
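For concreteness, here is a hedged, approximate fragment of what those supports predicates look like in the Spark 1.2-era org.apache.spark.sql.columnar.compression code (not standalone code, and the exact list of cases may differ in your checkout):

override def supports(columnType: ColumnType[_, _]): Boolean = columnType match {
  // adding DECIMAL to this case, in both RunLengthEncoding and DictionaryEncoding,
  // is the kind of change being described above
  case INT | LONG | SHORT | BYTE | STRING | BOOLEAN => true
  case _ => false
}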
Hi,
I tried some quick and simple tests, and ISTM the vertices below were
cached correctly.
Could you point out the differences between my code and yours?
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
object Prog {
def processInt(d: Int) = d * 2
}
val g = GraphLoader.edge
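For comparison, here is a hedged, self-contained variant of that kind of test for the spark-shell; the edge-list path is an assumption, not the original poster's. GraphLoader.edgeListFile gives Int vertex attributes, so Prog.processInt above applies directly:

val g = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val mapped = g.mapVertices((id, attr) => Prog.processInt(attr))
mapped.vertices.cache()
mapped.vertices.count()  // forces evaluation; the cached vertices should then appear in the web UI's Storage tab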
Applying a schema is a pretty low-level operation, and I would expect most
users to use the type-safe interfaces. If you are unsure you can always
run:
import org.apache.spark.sql.execution.debug._
schemaRDD.typeCheck()
and it will tell you if you have made any mistakes.
Michael
On Sat, Feb 1
I think Xuefeng Wu's suggestion is likely correct. This difference is more
likely explained by the compression library changing versions than by sort vs.
hash shuffle (which should not affect output size significantly). Others
have reported that switching to lz4 fixed their issue.
We should document th
Hi guys,
I deployed BlinkDB (built atop Shark) and am running Spark 0.9.
I tried to run several TPCDS shark queries taken from
https://github.com/cloudera/impala-tpcds-kit/tree/master/queries-sql92-modified/queries/shark.
However, the following exceptions were encountered. Do you have any idea why
t
I have seen the same behavior! I would love to hear an update on this...
Thanks,
Ami
On Thu, Feb 5, 2015 at 8:26 AM, Anubhav Srivastav <
anubhav.srivas...@gmail.com> wrote:
> Hi Kevin,
> We seem to be facing the same problem as well. Were you able to find
> anything after that? The ticket does not
I'm using Spark 1.1.0 and found that *ImmutableBytesWritable* can be
serialized by Kryo, but *Array[ImmutableBytesWritable]* can't be serialized
even when I register both of them with Kryo.
The code is as follows:
val conf = new SparkConf()
.setAppName("Hello Spark")
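For what it's worth, here is a hedged sketch of a typical Spark 1.1-style Kryo setup registering both the class and its Array form; MyRegistrator is a made-up name and the HBase import is assumed:

import com.esotericsoftware.kryo.Kryo
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  // Register the element class and the array class explicitly.
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[ImmutableBytesWritable])
    kryo.register(classOf[Array[ImmutableBytesWritable]])
  }
}

val conf = new SparkConf()
  .setAppName("Hello Spark")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)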
That sounds right to me. Cheng could elaborate if you are missing something.
On Fri, Feb 13, 2015 at 11:36 AM, Manoj Samel
wrote:
> Thanks Michael for the pointer & Sorry for the delayed reply.
>
> Taking a quick inventory of scope of change - Is the column type for
> Decimal caching needed only
Hi there,
Is there a way to specify an AWS AMI with a particular OS (say Ubuntu) when
launching Spark on the Amazon cloud with the provided scripts?
What are the default AMI and operating system launched by the EC2 script?
Thanks
Are you using the SQLContext? I think the HiveContext is recommended.
Cheng Hao
From: Wush Wu [mailto:w...@bridgewell.com]
Sent: Thursday, February 12, 2015 2:24 PM
To: u...@spark.incubator.apache.org
Subject: Extract hour from Timestamp in Spark SQL
Dear all,
I am new to Spark SQL and have no
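A hedged example of Cheng Hao's suggestion above: with a HiveContext you get Hive's built-in UDFs, including hour(). The table and column names below are made up:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
hiveCtx.sql("SELECT hour(event_time) AS h, count(*) FROM events GROUP BY hour(event_time)")
  .collect()
  .foreach(println)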
Hi,
You can use -a or --ami to launch the cluster with a specific
AMI.
If I remember correctly, the default system is Amazon Linux.
Hope it helps
Cheers
Gen
On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh wrote:
> Hi there,
>
> Is there a way to specify the AWS AMI with particular OS (say Ubun
I was looking at https://github.com/twitter/chill
It seems this would achieve what you want:
chill-scala/src/main/scala/com/twitter/chill/WrappedArraySerializer.scala
Cheers
On Sat, Feb 14, 2015 at 6:36 PM, Tao Xiao wrote:
> I'm using Spark 1.1.0 and find that *ImmutableBytesWritable* can be
>
Hi,
My Spark cluster contains a mix of machines: Pentium 4, dual-core, and quad-core.
I am trying to run a character frequency count application. The
application contains several threads, each submitting a job (action) that
counts the frequency of a single character. But my problem is that I get
dif
Hello,
I am a newbie to Spark and trying to figure out how to compute a percentile over a
big data set. I googled this topic but did not find any very useful code
example or explanation. It seems that I can use the sortByKey transformation to get my
data set in order, but I am not quite sure how I can ge
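Not from the original thread, but one hedged way to do this with the plain RDD API is to sort, index, and look up the position of the desired percentile; for very large data you may prefer sampling or a histogram-based approximation:

import org.apache.spark.SparkContext._  // pair-RDD functions such as lookup (pre-1.3)

val data = sc.parallelize(Seq(5.0, 1.0, 9.0, 3.0, 7.0))  // toy data
val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val n = indexed.count()
val p = 0.95                                              // 95th percentile, for example
val pos = math.min(n - 1, math.ceil(p * (n - 1)).toLong)
val percentile = indexed.lookup(pos).head
println(s"~${p * 100}th percentile: $percentile")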
Dear Spark User List,
I'm fairly new to Spark, trying to use it for multi-dimensional clustering
(using the k-means clustering from MLlib). However, based on the examples,
the clustering seems to work only for a single dimension (KMeans.train()
accepts an RDD[Vector], which is a vector of doubles -
Clustering operates on a large number of n-dimensional vectors. That
seems to be what you are describing, and that is what the MLlib API
accepts. What are you expecting that you don't find?
Did you have a look at the KMeansModel that this method returns? It
has a "clusterCenters" method that gives
Hi,
HCatalog allows you to specify the path pattern for partitions, which
is used by dynamic partition loading.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables
Can we have a similar feature in Spark SQL?
Jira is here: h
I'd suggest updating your Spark to the latest version and trying Spark SQL
instead of Shark.
Thanks
Best Regards
On Sun, Feb 15, 2015 at 7:36 AM, Grandl Robert
wrote:
> Hi guys,
>
> I deployed BlinkDB(built atop Shark) and running Spark 0.9.
>
> I tried to run several TPCDS shark queries taken
Thanks Enno, let me have a look at the streaming parser version of Jackson.
Thanks
Best Regards
On Sat, Feb 14, 2015 at 9:30 PM, Enno Shioji wrote:
> Huh, that would come to 6.5ms per one JSON. That does feel like a lot but
> if your JSON file is big enough, I guess you could get that sort of
> proces
Hi,
If I have a table in the Hive metastore saved as Parquet and I want to use it
in Spark, it seems Spark will use Hive's Parquet SerDe to load the actual
data.
So is there any difference here? Will predicate pushdown, pruning and
future Parquet optimizations in Spark SQL work when going through the Hive SerDe?
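I have not verified this for every release, but in the Spark 1.2 era the relevant knobs looked roughly like the following; treat the config names as something to check against your version's docs, and the table name is made up:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")  // use Spark's native Parquet path instead of the Hive SerDe
hc.setConf("spark.sql.parquet.filterPushdown", "true")        // enable Parquet predicate pushdown
hc.sql("SELECT * FROM my_parquet_table WHERE id > 100").collect()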
Hi,
I am running a biggish ALS job on Spark 1.2.0 on YARN (CDH 5.3.0). ALS is using about
3 billion ratings, and I am doing several trainImplicit() runs in a loop
within one Spark session. I have a four-node cluster with 3 TB of disk space on each.
Before starting the job, less than 8% of the disk
Hi Sean,
Thanks for the quick answer. I had not realized that I can make an
RDD[Vector] with e.g.
val dataSet = sparkContext.makeRDD(List(Vectors.dense(10.0,20.0),
Vectors.dense(20.0,30.0)))
Using this KMeans.train works as it should.
So my bad. Thanks again!
Attila
2015-02-15 17:29 GMT+01:00
What does your hive-site.xml look like? Do you actually have a directory
at the location shown in the error, i.e., does "/user/hive/warehouse/src"
exist? You should be able to override this by specifying the following:
--hiveconf
hive.metastore.warehouse.dir=/location/where/your/warehouse/exists
Thanks for the reply, Akhil. I cannot update the Spark version and run Spark SQL due
to some old dependencies and a specific project I want to run.
I was wondering if you have any clue why that exception might be triggered, or
if you have seen it before.
Thanks, Robert
On Sunday, February 15, 20
spark.cleaner.ttl ?
On Sunday, 15 February 2015, 18:23, Antony Mayi
wrote:
Hi,
I am running bigger ALS on spark 1.2.0 on yarn (cdh 5.3.0) - ALS is using about
3 billions of ratings and I am doing several trainImplicit() runs in loop
within one spark session. I have four node clus
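A hedged sketch of the spark.cleaner.ttl suggestion above: the setting (in seconds) periodically cleans out old metadata and shuffle data in long-running sessions. The value below is only an example, and note that anything older than the TTL, including persisted RDDs you still need, can be cleaned away:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ALS loop")
  .set("spark.cleaner.ttl", "3600")  // example value; make it longer than any single run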
We want to monitor the Spark master and Spark slaves using monit, but we want to
use the sbin scripts to do so. The scripts create the Spark master and
slave processes independently of themselves, so monit would not know the
PID of the started process to watch. Is this correct? Should we watch the ports?
How
Hi,
I am new to Spark and planning on writing a machine learning application
with Spark MLlib. My dataset is in JSON format. Is it possible to load the data
into Spark without using any external JSON libraries? I have explored the
option of Spark SQL, but I believe that is only for interactive use or
lo
Hi,
In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one
JSON object per line as a SchemaRDD.
Or you can use sc.textFile() to load the text file into an RDD and then use
sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a
SchemaRDD.
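For example (the path is made up):

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Option 1: one JSON object per line, loaded directly as a SchemaRDD
val people = sqlCtx.jsonFile("hdfs:///data/people.json")

// Option 2: load the text yourself, then parse it as JSON
val people2 = sqlCtx.jsonRDD(sc.textFile("hdfs:///data/people.json"))

people.printSchema()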
Hope it helps.
Cheers
It works now using 1.2.1. Thanks for all the help. Spark rocks !!
-
Thanks,
Roy
I used the latest assembly jar and the below as suggested by Akhil to fix
this problem...
temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class,
    (Class) TextOutputFormat.class);
Thanks All for the help !
On Wed, Feb 11, 2015 at 1:38 PM, Sean Owen wrote:
> That kinda dodges the
Hi,
I am sometimes getting this WARN while running a similarity calculation:
15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms
exceeds 45000ms
Do I need to increase the default 45 s to larger values for cases wh
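If memory serves, the 45000 ms in that warning came from a setting along the lines of spark.storage.blockManagerSlaveTimeoutMs in Spark of that vintage; treat the name as an assumption and verify it against your version before relying on it. Also note that raising the timeout only hides the symptom if executors are genuinely stuck in long GC pauses:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")  // assumed config name; example value of 2 minutes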