Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Michael
path. I've also tried with mode("overwrite") but looks that each batch gets written on the same file every time. Any help would be greatly appreciated. Thanks, Michael -- def save_rdd(batch_time, rdd): sqlContext = SQLContext(rdd.context) df = sqlContext.create

Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
so basically writing them into a temporary directory named with the batch time and then move the files to their destination on success ? I wished there was a way to skip moving files around and be able to set the output filenames. Thanks Burak :) -Michael On Mon, Nov 23, 2015, at 09:19 PM

Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, I think you’re asking the right question, however you’re making an assumption that he’s on the cloud and he never talked about the size of the file. It could be that he’s got a lot of small-ish data sets. 1GB is kinda small in relative terms. Again YMMV. Personally if you’re going t

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-24 Thread Michael Mior
If you want to ensure the persisted RDD has been calculated first, just run foreach with a dummy function first to force evaluation. -- Michael Mior michael.m...@gmail.com Le jeu. 24 sept. 2020 à 00:38, Arya Ketan a écrit : > > Thanks, we were able to validate the same behaviour. > >
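
A minimal sketch of the trick described above (input path and names are hypothetical): persist, force materialization with a no-op foreach, then run the later actions against the cached RDD.
```
val data = sc.textFile("hdfs:///path/to/input").map(_.length).persist()
data.foreach(_ => ())   // dummy function: forces evaluation so the RDD is actually cached
// subsequent actions (possibly run in parallel) now read the persisted partitions
val total = data.sum()
val longest = data.max()
```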

[Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Moser, Michael
From a pythonic point of view, I could also imagine a single read/write method, with the format as an arg and kwargs related to the different file format options. Best, Michael

[Spark Streaming] support of non-timebased windows and lag function

2020-12-22 Thread Moser, Michael
the above functionality, while being more general, would be to use joins, but for this a streaming-ready, monotonically increasing and (!) concurrent uid would be needed. Thanks a lot & best, Michael

Handling skew in window functions

2021-04-27 Thread Michael Doo
there are 5 entries with a count >= 10,000, 100 records with a count between 1,000 and 10,000, and 150,000 entries with a count less than 1,000 (usually count = 1). Thanks in advance! Michael Code: ``` import pyspark.sql.functions as f from pyspark.sql.window import Window empty_col_a_cond =

Re: Handling skew in window functions

2021-04-28 Thread Michael Doo
e it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from >

Issues getting Apache Spark

2022-05-26 Thread Martin, Michael
cannot get Spark to work on my laptop. Michael Martin

[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe

2022-11-25 Thread Michael Bílý
) # 10.0.1 It works on AWS Glue session with these versions. It prints: +--+-+ |day |value| +--+-+ |2017-08-17|1| +--+-+ as expected. Thank you, Michael

Re: Naming an DF aggregated column

2015-05-19 Thread Michael Armbrust
customerDF.groupBy("state").agg(max($"discount").alias("newName")) (or .as(...), both functions can take a String or a Symbol) On Tue, May 19, 2015 at 2:11 PM, Cesar Flores wrote: > > I would like to ask if there is a way of specifying the column name of a > data frame aggregation. For example

Re: DataFrame groupBy vs RDD groupBy

2015-05-22 Thread Michael Armbrust
DataFrames have a lot more information about the data, so there is a whole class of optimizations that are possible there that we cannot do in RDDs. This is why we are focusing a lot of effort on this part of the project. In Spark 1.4 you can accomplish what you want using the new window function f

Powered by Spark listing

2015-05-24 Thread Michael Roberts
Information Innovators, Inc. http://www.iiinfo.com/ Spark, Spark Streaming, Spark SQL, MLLib Developing data analytics systems for federal healthcare, national defense and other programs using Spark on YARN. -- This page tracks the users of Spark. To add yourself to the list, please email user@spa

debug jsonRDD problem?

2015-05-27 Thread Michael Stone
Can anyone provide some suggestions on how to debug this? Using spark 1.3.1. The json itself seems to be valid (other programs can parse it) and the problem seems to lie in jsonRDD trying to describe & use a schema. scala> sqlContext.jsonRDD(rdd).count() java.util.NoSuchElementException: None.

Re: debug jsonRDD problem?

2015-05-27 Thread Michael Stone
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote: Can you tell us a bit more about (schema of) your JSON ? It's fairly simple, consisting of 22 fields with values that are mostly strings or integers, except that some of the fields are objects with http header/value pairs. I'd guess it's

Re: debug jsonRDD problem?

2015-05-28 Thread Michael Stone
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote: Looks like the exception was caused by resolved.get(prefix ++ a) returning None :         a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true) There are three occurrences of resolved.get() in createSchema() - None should

Re: RDD staleness

2015-05-31 Thread Michael Armbrust
Each time you run a Spark SQL query we will create new RDDs that load the data and thus you should see the newest results. There is one caveat: formats that use the native Data Source API (parquet, ORC (in Spark 1.4), JSON (in Spark 1.5)) cache file metadata to speed up interactive querying. To cl
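
A hedged sketch of clearing that cached file metadata in the Spark 1.x era, assuming a HiveContext named sqlContext and a hypothetical table name:
```
// drops the cached metadata for a Data Source table so newly written files are picked up
sqlContext.refreshTable("events")
```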

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Michael Armbrust
This sounds like a problem that was fixed in Spark 1.3.1. https://issues.apache.org/jira/browse/SPARK-6351 On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das wrote: > This thread > > has

Re: Wired Problem: Task not serializable[Spark Streaming]

2015-06-08 Thread Michael Albert
Note that in scala, "return" is a non-local return:  https://tpolecat.github.io/2014/05/09/return.htmlSo that "return" is *NOT* returning from the anonymous function, but attempting to return from the enclosing method, i.e., "main".Which is running on the driver, not on the workers.So on the wor
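
A small sketch of the pitfall (method and RDD names are made up): the commented-out `return` would be a non-local return from the enclosing method on the driver, which is why the closure breaks; yielding the value directly avoids it.
```
def countEvens(rdd: org.apache.spark.rdd.RDD[Int]): Long = {
  rdd.filter { x =>
    // if (x % 2 != 0) return 0L   // non-local return from countEvens: don't do this in a task
    x % 2 == 0                     // just produce the predicate's value instead
  }.count()
}
```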

Re: DataFrames for non-SQL computation?

2015-06-11 Thread Michael Armbrust
Yes, DataFrames are for much more than SQL and I would recommend using them where ever possible. It is much easier for us to do optimizations when we have more information about the schema of your data, and as such, most of our on going optimization effort will focus on making DataFrames faster.

Re: Spark SQL and Skewed Joins

2015-06-12 Thread Michael Armbrust
> > 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use > 1.3.1 but SPARK-6967 prohibits me from doing so.Now that 1.4 is > available, would any of the JOIN enhancements help this situation? > I would try Spark 1.4 after running "SET spark.sql.planner.sortMergeJoin=true"

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Michael Armbrust
Yes, its all just RDDs under the covers. DataFrames/SQL is just a more concise way to express your parallel programs. On Sat, Jun 13, 2015 at 5:25 PM, Rex X wrote: > Thanks, Don! Does SQL implementation of spark do parallel processing on > records by default? > > -Rex > > > > On Sat, Jun 13, 20

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Michael Armbrust
Sounds like SPARK-5456 . Which is fixed in Spark 1.4. On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello Everyone, > > I pulled 2 different tables from the JDBC source and then joined them > usi

Re: DataFrame and JDBC regression?

2015-06-14 Thread Michael Armbrust
Can you please file a JIRA? On Sun, Jun 14, 2015 at 2:20 PM, Peter Haumer wrote: > Hello. > I have an ETL app that appends to a JDBC table new results found at each > run. In 1.3.1 I did this: > > testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + > ";password=" + PASSWORD, TABLE_

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Michael Armbrust
> > this would be a great addition to spark, and ideally it belongs in spark > core not sql. > I agree with the fact that this would be a great addition, but we would likely want a specialized SQL implementation for performance reasons.

Re: cassandra with jdbcRDD

2015-06-16 Thread Michael Armbrust
I would suggest looking at https://github.com/datastax/spark-cassandra-connector On Tue, Jun 16, 2015 at 4:01 AM, Hafiz Mujadid wrote: > hi all! > > > is there a way to connect cassandra with jdbcRDD ? > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabb

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse. Spark Streaming is really a micro batch system where the smallest window is 1/2 a second (500ms). So for CEP, its not really a good idea. So in terms of options…. spark streaming, storm, samza, akka and others… Storm is probably the easiest to pick up, spark streaming

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Michael Armbrust
I would also love to see a more recent version of Spark SQL. There have been a lot of performance improvements between 1.2 and 1.4 :) On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez wrote: > Interesting. What where the Hive settings? Specifically it would be > useful to know if this was Hive on

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
How are you adding com.rr.data.Visit to spark? With --jars? It is possible we are using the wrong classloader. Could you open a JIRA? On Thu, Jun 18, 2015 at 2:56 PM, Chad Urso McDaniel wrote: > We are seeing class exceptions when converting to a DataFrame. > Anyone out there with some sugges

Re: [SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Michael Armbrust
Thanks for reporting. Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 5:35 PM, Adam Lewandowski < adam.lewandow...@gmail.com> wrote: > Since upgrading to Spark 1.4, I'm getting a > scala.reflect.internal.MissingRequirementError when creating a DataFrame > from

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
t.conf --class > com.rr.data.visits.VisitSequencerRunner > ./mvt-master-SNAPSHOT-jar-with-dependencies.jar > --- > > Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can > see in the stack trace) and the unfound com.rr.data.Visit. > > I'll open a Jira ticket

Re: SparkSQL: leftOuterJoin is VERY slow!

2015-06-19 Thread Michael Armbrust
Broadcast outer joins are on my short list for 1.5. On Fri, Jun 19, 2015 at 10:48 AM, Piero Cinquegrana < pcinquegr...@marketshare.com> wrote: > Hello, > > I have two RDDs: tv and sessions. I need to convert these DataFrames into > RDDs because I need to use the groupByKey function. The reduceByK

Re: Spark group by sub coulumn

2015-06-19 Thread Michael Armbrust
You are probably looking to do .select(explode($"to"), ...) first, which will produce a new row for each value in the input array. On Fri, Jun 19, 2015 at 12:02 AM, Suraj Shetiya wrote: > > Hi, > > I wanted to obtain a grouped by frame from a dataframe. > > A snippet of the column on which I nee
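
A minimal sketch of that pattern, with assumed column names (`to` is the array column) and `sqlContext` assumed in scope:
```
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._   // enables the $"..." column syntax

// one output row per element of the array column "to"
val perRecipient = df.select(explode($"to").as("recipient"), $"subject")
perRecipient.groupBy("recipient").count().show()
```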

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Michael Armbrust
You can also do this using a sequence of case classes (in the example stored in a tuple, though the outer container could also be a case class): case class MyRecord(name: String, location: String) val df = Seq((1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Michael Armbrust
Starting in Spark 1.4 there is also an explode that you can use directly from the select clause (much like in HiveQL): import org.apache.spark.sql.functions._ df.select(explode($"entities.user_mentions").as("mention")) Unlike standard HiveQL, you can also include other attributes in the select or
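
A sketch of combining the exploded array with other attributes in the same select; the tweet-like field names (`id`, `text`, `screen_name`) are assumptions, not from the thread.
```
import org.apache.spark.sql.functions.{col, explode}

df.select(col("id"), col("text"), explode(col("entities.user_mentions")).as("mention"))
  .select(col("id"), col("text"), col("mention.screen_name"))
```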

Re: [sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Michael Armbrust
Have you considered instead using the mllib SparseVector type (which is supported in Spark SQL?) On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov wrote: > When my 22M Parquet test file ended up taking 3G when cached in-memory I > looked closer at how column compression works in 1.4.0. My test data

Re: sql dataframe internal representation

2015-06-25 Thread Michael Armbrust
In many cases we use more efficient mutable implementations internally (i.e. mutable undecoded utf8 instead of java.lang.String, or a BigDecimal implementation that uses a Long when the number is small enough). On Thu, Jun 25, 2015 at 1:56 PM, Koert Kuipers wrote: > i noticed in DataFrame that t

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data locality theoretic standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not. From my Android phone on T-Mobile. The first nationwide 4G network. Original message From: Ashic Mahtab Date: 06/28/2015 10:5

Re: Subsecond queries possible?

2015-06-30 Thread Michael Armbrust
> > This brings up another question/issue - there doesn't seem to be a way to > partition cached tables in the same way you can partition, say a Hive > table. For example, we would like to partition the overall dataset (233m > rows, 9.2Gb) by (product, coupon) so when we run one of these queries >

Re: Custom order by in Spark SQL

2015-07-01 Thread Michael Armbrust
Easiest way to do this today is to define a UDF that maps from string to a number. On Wed, Jul 1, 2015 at 10:25 AM, Mick Davies wrote: > Hi, > > Is there a way to specify a custom order by (Ordering) on a column in Spark > SQL > > In particular I would like to have the order by applied to a curr
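
A hedged sketch of that UDF approach, mapping a hypothetical currency column to a numeric rank and ordering on it:
```
import org.apache.spark.sql.functions.{col, udf}

// the rank table is made up; any string-to-number mapping works the same way
val currencyRank = udf { c: String =>
  c match { case "USD" => 0; case "EUR" => 1; case "GBP" => 2; case _ => 99 }
}
df.orderBy(currencyRank(col("currency")), col("amount").desc)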

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Michael Armbrust
I would still look at your executor logs. A count() is rewritten by the optimizer to be much more efficient because you don't actually need any of the columns. Also, writing parquet allocates quite a few large buffers. On Wed, Jul 1, 2015 at 5:42 AM, Pooja Jain wrote: > Join is happening succe

Re: BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Michael Armbrust
We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though ( https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L581 ) On Wed, Jul 1, 2015 at 8:30 AM, Srikanth wrote: > Hello, > > > > I ha
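
The hint being linked is the `broadcast` function added in Spark 1.5; a sketch with assumed DataFrame and key names:
```
import org.apache.spark.sql.functions.broadcast

// tells the planner to broadcast smallDF even though it has not been cached
val joined = largeDF.join(broadcast(smallDF), "customer_id")
```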

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Michael Armbrust
There is an isNotNull function on any column. df._1.isNotNull or from pyspark.sql.functions import * col("myColumn").isNotNull On Wed, Jul 1, 2015 at 3:07 AM, Olivier Girardot wrote: > I must admit I've been using the same "back to SQL" strategy for now :p > So I'd be glad to have insights in
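
The same check, sketched in Scala on a hypothetical DataFrame, in both forms mentioned above:
```
df.filter(df("myColumn").isNotNull)   // column API
df.filter("myColumn IS NOT NULL")     // SQL expression string
```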

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-01 Thread Michael Armbrust
You should probably write a UDF that uses regular expression or other string munging to canonicalize the subject and then group on that derived column. On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya wrote: > Thanks Salih. :) > > > The output of the groupby is as below. > > 2015-01-14 "SEC
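
A hedged sketch of that idea using Spark 1.5+ built-ins (a UDF works the same way on earlier versions); the canonicalization rule here, stripping Re:/Fwd: prefixes and lower-casing, is an assumption:
```
import org.apache.spark.sql.functions.{col, lower, regexp_replace, trim}

val subjectKey = lower(trim(regexp_replace(col("subject"), "^(?i)((re|fwd):\\s*)+", "")))
df.withColumn("subject_key", subjectKey).groupBy("subject_key").count()
```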

Re: Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Michael Armbrust
Use the built in JDBC data source: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Jul 6, 2015 at 6:42 AM, Hafiz Mujadid wrote: > Hi all! > > what is the most efficient way to convert jdbcRDD to DataFrame. > > any example? > > Thanks > > > > -- >

Re: DataFrame question

2015-07-07 Thread Michael Armbrust
You probably want to explode the array to produce one row per element: df.select(explode(df("links")).alias("link")) On Tue, Jul 7, 2015 at 10:29 AM, Naveen Madhire wrote: > Hi All, > > I am working with dataframes and have been struggling with this thing, any > pointers would be helpful. > > I

Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819 Typically you can work around this by making sure that the classes are shared across the isolation boundary, as discussed in the comments. On Tue, Jul 7, 2015 at 3:29 AM, Sea

Re: [Spark Hive SQL] Set the hive connection in hive context is broken in spark 1.4.1-rc1?

2015-07-10 Thread Michael Armbrust
Metastore configuration should be set in hive-site.xml. On Thu, Jul 9, 2015 at 8:59 PM, Terry Hole wrote: > Hi, > > I am trying to set the hive metadata destination to a mysql database in > hive context, it works fine in spark 1.3.1, but it seems broken in spark > 1.4.1-rc1, where it always conn

Re: Spark performance

2015-07-12 Thread Michael Segel
Not necessarily. It depends on the use case and what you intend to do with the data. 4-6 TB will easily fit on an SMP box and can be efficiently searched by an RDBMS. Again it depends on what you want to do and how you want to do it. Informix’s IDS engine with its extensibility could still o

Re: Basic Spark SQL question

2015-07-13 Thread Michael Armbrust
I'd look at the JDBC server (a long running yarn job you can submit queries too) https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang wrote: > Well for adhoc queries you can use the CLI > > On Mon, Jul

Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… When thinking about a PhD thesis… do you want to tie it to a specific technology or do you want to investigate an idea but then use a specific technology. Or is this an outdated way of thinking? "I am doing my PHD thesis on large scale machine learning e.g Online learning,

Re: Research ideas using spark

2015-07-16 Thread Michael Segel
015, at 12:40 PM, vaquar khan wrote: > > I would suggest study spark, flink, storm and based on your understanding and > finding prepare your research paper. > > May be you will invented new spark ☺ > > Regards, > Vaquar khan > > On 16 Jul 2015 00:47,

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… Lets set aside spark, and look at the overall ingestion pattern. Its really an ingestion pattern where your input in to the system is from a queue. Are the events discrete or continuous? (This is kinda important.) If the events are continuous then more than

Re: PairRDDFunctions and DataFrames

2015-07-16 Thread Michael Armbrust
Instead of using that RDD operation just use the native DataFrame function approxCountDistinct https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Jul 16, 2015 at 6:58 AM, Yana Kadiyska wrote: > Hi, could someone point me to the recommended way of u
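
The DataFrame-native aggregate mentioned above, sketched with an assumed column name:
```
import org.apache.spark.sql.functions.{approxCountDistinct, col}

df.agg(approxCountDistinct(col("user_id")).as("approx_unique_users"))
```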

Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Michael Armbrust
Using a hive-site.xml file on the classpath. On Fri, Jul 17, 2015 at 8:37 AM, spark user wrote: > Hi Roberto > > I have question regarding HiveContext . > > when you create HiveContext where you define Hive connection properties ? > Suppose Hive is not in local machine i need to connect , how Hi

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
The difference between a map and a struct here is that in a struct all possible keys are defined as part of the schema and can each can have a different type (and we don't support union types). JSON doesn't have differentiated data structures so we go with the one that gives you more information w

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
I'll add there is a JIRA to override the default past some threshold of # of unique keys: https://issues.apache.org/jira/browse/SPARK-4476 <https://issues.apache.org/jira/browse/SPARK-4476> On Fri, Jul 17, 2015 at 1:32 PM, Michael Armbrust wrote: > The difference between a map and

Re: Data frames select and where clause dependency

2015-07-17 Thread Michael Armbrust
Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on. On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis wrote: > I'd like t

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
-dev +user StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), StructField(name,StringType,true)),true),true), StructField(othertype, ArrayType(StructType(StructField(company,String

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith > wrote: > >> Thanks Michael, it's not a great example really, as the data I'm working >> with has some source files that do fit the schema, and some that don't (out >> of millions that do work, perhaps 10 might not).

Re: Spark SQL Json Parse

2016-03-03 Thread Michael Segel
Why do you want to write out NULL if the column has no data? Just insert the fields that you have. > On Mar 3, 2016, at 9:10 AM, barisak wrote: > > Hi, > > I have a problem with Json Parser. I am using spark streaming with > hiveContext for keeping json format tweets. The flume collects twee

Re: Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Michael Armbrust
Read the docs at the link that you pasted: http://spark.apache.org/docs/latest/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore Spark will always compile against the same version of Hive (1.2.1), but it can dynamically load jars to speak to other versions. On Fri,

Re: Spark SQL - udf with entire row as parameter

2016-03-04 Thread Michael Armbrust
You have to use SQL to call it (but you will be able to do it with dataframes in Spark 2.0 due to a better parser). You need to construct a struct(*) and then pass that to your function since a function must have a fixed number of arguments. Here is an example
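
A hedged sketch of the struct-of-all-columns trick from the DataFrame side (in 1.x the SQL route described above is the supported path); the struct arrives in the Scala function as a Row, and the column/UDF names are made up:
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

val nonNullFields = udf { r: Row => r.toSeq.count(_ != null) }
df.withColumn("non_null_fields", nonNullFields(struct(df.columns.map(col): _*)))
```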

Re: Spark 1.5.2 : change datatype in programaticallly generated schema

2016-03-04 Thread Michael Armbrust
Change the type of a subset of the columns using withColumn, after you have loaded the DataFrame. Here is an example. On Thu, Mar 3,
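
A minimal sketch of that approach, with hypothetical column names and target types:
```
val typed = df
  .withColumn("price", df("price").cast("double"))              // string -> double
  .withColumn("event_time", df("event_time").cast("timestamp")) // string -> timestamp
```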

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
of Data Source API file formats (text, json, csv, orc, parquet). On Tue, Mar 8, 2016 at 7:38 AM, Jacek Laskowski wrote: > Hi Praveen, > > I don't really know. I think TD or Michael should know as they > personally involved in the task (as far as I could figure it out from &

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once, we’ve

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546 <

Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius wrote: > Hmm. I think my problem is a little more complex. I'm using > https://github.com/databricks/spark-redshift and when I read from JSON > file I got this schema. > > root > > |-- app: string (nullable = true) > > |

Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of the columns in each

Re: Can someone fix this download URL?

2016-03-14 Thread Michael Armbrust
Yeah, sorry. I'll make sure this gets fixed. On Mon, Mar 14, 2016 at 12:48 AM, Sean Owen wrote: > Yeah I can't seem to download any of the artifacts via the direct download > / cloudfront URL. The Apache mirrors are fine, so use those for the moment. > @marmbrus were you maybe the last to deal

Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
> > Each json file is of a single object and has the potential to have > variance in the schema. > How much variance are we talking? JSON->Parquet is going to do well with 100s of different columns, but at 10,000s many things will probably start breaking.

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
+1 to upgrading Spark. 1.2.1 has none of the memory management improvements that were added in 1.4-1.6. On Mon, Mar 14, 2016 at 2:03 AM, Prabhu Joseph wrote: > The issue is the query hits OOM on a Stage when reading Shuffle Output > from previous stage. How come increasing shuffle memory helps to

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
On Mon, Mar 14, 2016 at 1:30 PM, Prabhu Joseph wrote: > > Thanks for the recommendation. But can you share what are the > improvements made above Spark-1.2.1 and how which specifically handle the > issue that is observed here. > Memory used for query execution is now explicitly accounted for:

Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both version of the query. Likely when you cache the subquery we know that its going to be small so use a broadcast join instead of a shuffling the data. On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > > > I’m running

Re: Subquery performance

2016-03-20 Thread Michael Armbrust
t? > > > > y > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* March-17-16 8:59 PM > *To:* Younes Naguib > *Cc:* user@spark.apache.org > *Subject:* Re: Subquery performance > > > > Try running EXPLAIN on both version of the query.

Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or df.explain(true) whenever asking about query performance. On Mon, Mar 21, 2016 at 6:27 AM, gtinside wrote: > Hi , > > I am trying to execute a simple query with join on 3 tables. When I look at > the execution plan , it varies wit

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
> > But when I tried using Spark Streaming I could not find a way to store the > data with the avro schema information. The closest that I got was to create > a Dataframe using the json RDDs and store them as parquet. Here the parquet > files had a spark specific schema in their footer. > Does this c

Re: Spark schema evolution

2016-03-22 Thread Michael Armbrust
Which version of Spark? This sounds like a bug (that might be fixed). On Tue, Mar 22, 2016 at 6:34 AM, gtinside wrote: > Hi , > > I have a table sourced from* 2 parquet files* with few extra columns in one > of the parquet file. Simple * queries works fine but queries with predicate > on extra

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases? From: Raymond Honderdors To: "yuzhih...@gmail.com" Cc: "user@spark.apache.org" Sent: Wednesday, March 23, 2016 8:43 AM Subject: Re: Spark with Druid I saw these but i fail to understand how to direct th

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You probably need to use `backticks` to escape `_1` since I don't think that its a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: > Gurus, > > If I register a temporary table as below > > r.toDF > res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: >
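
A short sketch of the backtick escaping, assuming `r` is the RDD of string tuples from the question:
```
r.toDF.registerTempTable("items")
sqlContext.sql("SELECT `_1`, `_2` FROM items WHERE `_3` > '0'").show()
```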

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
csv column names using databricks when > mapping > > val r = df.filter(col("paid") > "").map(x => > (x.getString(0),x.getString(1).) > > can I call example x.getString(0).as.(firstcolumn) in above when mapping > if possible so columns will have lab

Re: calling individual columns from spark temporary table

2016-03-24 Thread Michael Armbrust
tString(0),x.getString(1).) > > Can you give an example of column expression please > like > > df.filter(col("paid") > "").col("firstcolumn").getString ? > > > > > On Thursday, 24 March 2016, 0:45, Michael Armbrust > wrote:

Re: Column explode a map

2016-03-24 Thread Michael Armbrust
If you know the map keys ahead of time then you can just extract them directly. Here are a few examples . On Thu, Mar 24, 2016 at 12:01

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery? Starting from Spark 1.6.0, partition discovery only finds partitions under > the given paths by default. For the above example, if users pass > path/to/table/gender=male to either SQLContext.read.parquet or > SQLContext.read.load, gender
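
A sketch of setting that base path (Spark 1.6+), with a hypothetical partitioned layout under `/path/to/table/gender=...`:
```
val males = sqlContext.read
  .option("basePath", "/path/to/table")
  .parquet("/path/to/table/gender=male")
// "gender" survives as a partition column because discovery starts from basePath
```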

Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Michael Armbrust
I would not recommend using the direct output committer with HDFS. Its intended only as an optimization for S3. On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar wrote: > Hi, > > We are doing the following to save a dataframe in parquet (using > DirectParquetOutputCommitter) as follows. > > dfWri

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") > > If so, it returns the same error: > > java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths: hdfs://user/hdfs/analytics/app1/PAGEVIEW > hdfs://user/hdfs/an

Fwd: Spark and N-tier architecture

2016-03-29 Thread Michael Segel
> Begin forwarded message: > > From: Michael Segel > Subject: Re: Spark and N-tier architecture > Date: March 29, 2016 at 4:16:44 PM MST > To: Alexander Pivovarov > Cc: Mich Talebzadeh , Ashok Kumar > , User > > So… > > Is spark-jobserver an offi

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Michael Segel
It sounds like when you start up spark, its using 0.0.0.0 which means it will listen on all interfaces. You should be able to limit which interface to use. The weird thing is that if you are specifying the IP Address and Port, Spark shouldn’t be listening on all of the interfaces for that po

Re: pyspark read json file with high dimensional sparse data

2016-03-30 Thread Michael Armbrust
You can force the data to be loaded as a sparse map assuming the key/value types are consistent. Here is an example . On Wed, Mar 30, 20
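
A hedged sketch (in Scala) of forcing a sparse map load by supplying the schema up front; it assumes the sparse keys sit under a single top-level field (here called `features`) and that all values share one type:
```
import org.apache.spark.sql.types.{DoubleType, MapType, StringType, StructField, StructType}

val schema = StructType(Seq(StructField("features", MapType(StringType, DoubleType))))
val sparse = sqlContext.read.schema(schema).json("/path/to/sparse.json")
```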

Re: Relation between number of partitions and cores.

2016-04-01 Thread Michael Segel
There’s a mix of terms here. CPU is the physical chip which most likely contains more than 1 physical core. If you’re on Intel, there are physical cores and virtual cores. So 1 physical core is seen by the OS as two virtual cores. Then there are ‘cores per executor’ (spark terminology). So

Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Hello; I’m working on spark with very large memory systems (2TB+) and notice that Spark spills to disk in shuffle. Is there a way to force spark to stay in memory when doing shuffle operations? The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and never

Re: Support for time column type?

2016-04-01 Thread Michael Armbrust
There is also CalendarIntervalType. Is that what you are looking for? On Fri, Apr 1, 2016 at 1:11 PM, Philip Weaver wrote: > Hi, I don't see any mention of a time type in the documentation (there is > DateType and TimestampType, but not TimeType), and have been unable to find > any documentatio

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
number of beefy > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into improving > performance for those. Meantime, you can setup local ramdisks on each node > for shuffle writes. > > > > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <mailto:slavi...@g

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
oncerns with taking that approach to test ? (I dont see > any, but I am not sure if I missed something). > > > Regards, > Mridul > > > > > On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch > wrote: > > I totally disagree that it’s not a problem. > >

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Shuffling a 1tb set of keys and values (aka sort by key) results in about 500gb of io to disk if compression is enabled. Is there any way to eliminate shuffling causing io? On Fri, Apr 1, 2016, 6:32 PM Reynold Xin wrote: > Michael - I'm not sure if you actually read my email, but s

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
As I mentioned earlier this flag is now ignored. On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch wrote: > Shuffling a 1tb set of keys and values (aka sort by key) results in about > 500gb of io to disk if compression is enabled. Is there any way to > eliminate shuffling causing io?

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Yes we see it on final write. Our preference is to eliminate this. On Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote: > Hi Michael, shuffle data (mapper output) have to be materialized into disk > finally, no matter how large memory you have, it is the design purpose of > Spark. In you

Re: [Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Michael Armbrust
What error are you getting? Here is an example . External types are documented here: http://spark.apache.org/docs/latest/sql-programming

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Michael Slavitch
Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly propagated to all nodes? Are they identical? > On Apr 4, 2016, at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote: > > [ CC'ing dev list since nearly identical questions have occurred in > user list recently w/o resolution;
