Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it? > On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote: > Hi, > I'm trying to run queries with many values in the IN operator. > The result is that for more than 10K values IN op
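The intuition behind the temp-table suggestion can be sketched in plain Python (not PySpark; the row shape and sizes here are invented for illustration): a long IN list is scanned linearly for every row, while a join hashes the lookup side once and probes it in constant time, which is roughly what a hash join against a temp table buys you.

```python
# Plain-Python sketch of IN-list filtering vs. a hashed "temp table" join.
# Sizes are scaled down from the 10K-value case in the thread.

rows = [{"id": i, "payload": f"row-{i}"} for i in range(10_000)]
wanted = list(range(0, 10_000, 10))      # 1,000 lookup values

# IN-style filter: a linear scan of `wanted` for every row -> O(rows * values).
slow = [r for r in rows if r["id"] in wanted]

# Join-style filter: hash the values once, then probe -> O(rows + values).
wanted_set = set(wanted)                 # the "temp table" side, hashed
fast = [r for r in rows if r["id"] in wanted_set]

assert slow == fast                      # same result, very different cost
```

In PySpark itself the analogous move would be `spark.createDataFrame` over the values followed by an inner (possibly broadcast) join on the key column.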

Re: Handling questions in the mailing lists

2016-11-08 Thread Michael Segel
Guys… please take what I say with a grain of salt… The issue is that the input is a stream of messages where they are addressed in a LIFO manner. This means that messages may be ignored. The stream of data (user@spark for example) is semi-structured in that the stream contains a lot of message

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi, Apologies if I’ve asked this question before, but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake… Posting this to both user and dev because I think the question / topic jumps into both camps. A

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question? When you talk about ‘user specified schema’ do you mean for the user to supply an additional schema, or that you’re using the schema that’s described by the JSON string? (or both? [either/or]) Thx On Sep 28, 2016, at 12:52 PM, Michael Armbrust <mich...@databricks.com>
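The two readings of “user specified schema” can be illustrated with a toy, plain-Python example (the field names and the `apply_schema` helper are invented; this is not the Spark API):

```python
import json

records = ['{"id": "1", "score": "9.5"}', '{"id": "2", "score": "7.0"}']

# Reading 1: take the schema the JSON string itself describes.
# Here every value comes back as a str, because that's what the JSON says.
inferred = [json.loads(r) for r in records]
assert all(isinstance(row["id"], str) for row in inferred)

# Reading 2: apply an additional, user-supplied schema on top of the JSON.
user_schema = {"id": int, "score": float}

def apply_schema(raw, schema):
    parsed = json.loads(raw)
    return {k: schema[k](v) for k, v in parsed.items()}

typed = [apply_schema(r, user_schema) for r in records]
assert typed[0] == {"id": 1, "score": 9.5}
```

If memory serves, later Spark releases exposed the additional-schema flavor as `from_json(column, schema)`.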

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, There are a lot of moving parts and a lot of unknowns from your description. Besides the version stuff. How many executors, how many cores? How much memory? Are you persisting (memory and disk) or just caching (memory) During the execution… same tables… are you seeing a lot of shufflin

Re: Secondary Indexing?

2016-05-30 Thread Michael Segel
… http://talebzadehmich.wordpress.com > On 30 May 2016 at 17:08, Michael Segel <msegel_had...@hotmail.com> wrote: > I’m not sure where to post this since it’s a bit of a philosophical qu

Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in terms of design and vision for Spark. If we look at SparkSQL and performance… where does secondary indexing fit in? The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary, i

Indexing of RDDs and DF in 2.0?

2016-05-17 Thread Michael Segel
Hi, I saw a replay of a talk about what’s coming in Spark 2.0 and the speed performances… I am curious about indexing of data sets. In HBase/MapRDB you can create ordered sets of indexes through an inverted table. Here, you can take the intersection of the indexes to find the result set of
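The inverted-table idea described above can be sketched in a few lines of plain Python (row keys and column names are made up, and a real HBase/MapR-DB index table involves far more): each index maps a column value to the set of row keys holding it, and a multi-predicate query is just the intersection of the matching sets.

```python
# Toy inverted index: (column, value) -> {row keys}.
rows = {
    1: {"city": "NYC", "lang": "scala"},
    2: {"city": "NYC", "lang": "python"},
    3: {"city": "SF",  "lang": "scala"},
}

index = {}
for key, row in rows.items():
    for col, val in row.items():
        index.setdefault((col, val), set()).add(key)

# Query: city = 'NYC' AND lang = 'scala' == intersect the two index entries.
result = index[("city", "NYC")] & index[("lang", "scala")]
assert result == {1}
```

The appeal is that each predicate is answered from its own index and only the intersection is ever fetched from the base table.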

Re: inter spark application communication

2016-04-18 Thread Michael Segel
Have you thought about Akka? What are you trying to send? Why do you want them to talk to one another? > On Apr 18, 2016, at 12:04 PM, Soumitra Johri wrote: > Hi, > I have two applications: App1 and App2. > On a single cluster I have to spawn 5 instances of App1 and 1 instance of

Re: Any documentation on Spark's security model beyond YARN?

2016-04-01 Thread Michael Segel
> On 29 Mar 2016, at 22:19, Michael Segel wrote: > Hi, > So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its security from the underlying YARN job.

Any documentation on Spark's security model beyond YARN?

2016-03-29 Thread Michael Segel
Hi, So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its security from the underlying YARN job. However… that’s not really saying much when you think about some use cases. Like using the thrift service … I’m wondering what else is new and what people have been thinki

Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Hi, I’m looking at the online docs for building Spark 1.4.1… http://spark.apache.org/docs/latest/building-spark.html I was interested in building Spark for Scala 2.11 (the latest Scala) and also for Hive and JDBC support. The docs say
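For what it’s worth, the 1.4-era build docs boiled down to roughly the following commands (flags reproduced from memory — verify the exact profiles against building-spark.html for your version):

```shell
# Hive and JDBC (thrift server) support:
build/mvn -Phive -Phive-thriftserver -DskipTests clean package

# Scala 2.11: switch the build to 2.11 first, then pass -Dscala-2.11.
dev/change-version-to-2.11.sh
build/mvn -Dscala-2.11 -DskipTests clean package
```

One caveat I recall from that same page: the Scala 2.11 build at the time did not yet support the JDBC/thrift-server component, so those two goals may have been mutually exclusive in 1.4.x.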