Re: Append to an existing Delta Lake using structured streaming

2021-07-21 Thread Denny Lee
Including the Delta Lake Users and Developers DL to help out.

That said, could you clarify how the data is not being added?  By any chance
do you have any code samples to recreate this?
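
In the meantime, here is a minimal sketch of the append pattern I would expect
to work (the streaming source, table path, and checkpoint location below are
placeholders, not taken from your job):

    # placeholder source; in practice this is your real stream with a schema
    # matching the existing Delta table
    streaming_df = spark.readStream.format("rate").load()

    # append to the existing Delta table created from the backup
    (streaming_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/delta/events/_checkpoints/backfill-append")
        .start("/delta/events"))

One thing worth double-checking is the checkpoint location - reusing a
checkpoint directory from an earlier job can lead to the stream skipping data
it believes it has already processed, so a fresh checkpoint path for the new
job is the safer assumption.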

Sent via Superhuman 


On Wed, Jul 21, 2021 at 2:49 AM,  wrote:

> Hi all,
>   I stumbled upon an interesting problem. I have an existing Delta Lake
> table with data recovered from a backup and would like to append to this
> table using Spark Structured Streaming. This does not work: although
> the streaming job is running, no data is appended.
> If I create the original table with Structured Streaming, then appending to
> it with a streaming job (at least with the same job) works
> flawlessly.  Did I misunderstand something here?
>
> best regards
>Eugen Wintersberger
>


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan,

You may want to ping Databricks Help  or
Forums  as this is a Databricks
specific question.  I'm a little surprised that a Databricks cluster would
take a long time to create so it may be best to utilize these forums to
grok the cause.

HTH!
Denny


Sent via Superhuman 


On Mon, Aug 16, 2021 at 11:10 PM, karan alang  wrote:

> Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark
> development), and recently have had issues wherein the cluster
> takes a long time to get created, often timing out.
>
> Any ideas on how to resolve this?
> Any other alternatives to the Databricks notebook?
>


Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja,

A slightly atypical way to respond to your question - please check out the
most recent Spark AMA where we discuss this:
https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share&utm_medium=member_ios

HTH!
Denny



On Tue, Oct 25, 2022 at 09:16 Raja bhupati 
wrote:

> We have a use case where we would like to process Prometheus metrics data
> with Spark.
>
> On Tue, Oct 25, 2022, 19:49 Jacek Laskowski  wrote:
>
>> Hi Raj,
>>
>> Do you want to do the following?
>>
>> spark.read.format("prometheus").load...
>>
>> I haven't heard of such a data source / format before.
>>
>> What would you like it for?
>>
>> Regards,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, Oct 21, 2022 at 6:12 PM Raj ks  wrote:
>>
>>> Hi Team,
>>>
>>>
>>> We want to query Prometheus data with Spark. Any suggestions would
>>> be appreciated.
>>>
>>> We searched for documentation but did not find a helpful one.
>>>
>>


Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group, but
honestly the turnout was pretty low.  We dove into various features.
If there are particular topics that you would like to discuss during a
live session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:

> +1
>
> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai  wrote:
>
>> +1, any webinar on a Spark-related topic is appreciated 👍
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> On Wed, 8 Mar 2023 at 21:32, karan alang  wrote:
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regards
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in, so to speak.
>>
>> I am sure there are many, but off the top of my head I can think of Holden
>> Karau for k8s and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else 🤔
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arrange online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts could share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences shared valuable.
>>
>> Respectfully,
>>
>> AK
>>
>>
>>
>>
>


Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up
the infrastructure to broadcast if it helps?

On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani 
wrote:

> I am happy to be a part of this discussion as well.
>
> Regards,
> Neeraj
>
> On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:
>
>> +1, any webinar on a Spark-related topic is appreciated 👍
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> On Wed, 8 Mar 2023 at 21:32, karan alang  wrote:
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regards
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in, so to speak.
>>
>> I am sure there are many, but off the top of my head I can think of Holden
>> Karau for k8s and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else 🤔
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arrange online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts could share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences shared valuable.
>>
>> Respectfully,
>>
>> AK
>>
>>
>>
>>
>


Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page
<https://www.linkedin.com/company/apachespark/> and group to broadcast
these types of events - if you're cool with this?  Or we could go through
the process of submitting and updating the current https://spark.apache.org
or request to leverage the original Spark confluence page
<https://cwiki.apache.org/confluence/display/SPARK>.  WDYT?

On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
wrote:

> Well that needs to be created first for this purpose. The appropriate name
> etc. to be decided. Maybe @Denny Lee   can
> facilitate this as he offered his help.
>
>
> cheers
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>
>> Hello Mich,
>>
>> Can you please provide the link for the confluence page?
>>
>> Many thanks
>> Asma
>> Ph.D. in Big Data - Applied Machine Learning
>>
>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh  wrote:
>>
>>> Apologies I missed the list.
>>>
>>> To move forward I selected these topics from the thread "Online classes
>>> for spark topics".
>>>
>>> To take this further I propose that a Confluence page be set up.
>>>
>>>
>>>1. Spark UI
>>>2. Dynamic allocation
>>>3. Tuning of jobs
>>>4. Collecting spark metrics for monitoring and alerting
>>>5. For those who prefer to use the Pandas API on Spark since the
>>>release of Spark 3.2, what are some important notes for those users? For
>>>example, what are the additional factors affecting Spark performance when
>>>using the Pandas API on Spark, and how to tune them in addition to the
>>>conventional Spark tuning methods applied to Spark SQL users?
>>>6. Spark internals and/or comparing spark 3 and 2
>>>7. Spark Streaming & Spark Structured Streaming
>>>8. Spark on notebooks
>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>10. Spark on k8s
>>>
>>> Opinions and how-tos are welcome.
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> To move forward I selected these topics from the thread "Online classes
>>>> for spark topics".
>>>>
>>>> To take this further I propose that a Confluence page be set up.
>>>>
>>>> Opinions and how-tos are welcome.
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>
>>
>>


Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this!  I encourage everyone to add to the list so
we can have a comprehensive list of topics, eh?!

On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
wrote:

> Hi all,
>
> Thanks to @Denny Lee for giving access to
>
> https://www.linkedin.com/company/apachespark/
>
> and contribution from @asma zgolli 
>
> You will see my post at the bottom. Please add anything else on topics to
> the list as a comment.
>
> We will then put them together in an article perhaps. Comments and
> contributions are welcome.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
> wrote:
>
>> Hi Denny,
>>
>> That Apache Spark Linkedin page
>> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
>> a wider audience to benefit from it.
>>
>> +1 for me
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>>
>>> In the past, we've been using the Apache Spark LinkedIn page
>>> <https://www.linkedin.com/company/apachespark/> and group to broadcast
>>> these types of events - if you're cool with this?  Or we could go through
>>> the process of submitting and updating the current
>>> https://spark.apache.org or request to leverage the original Spark
>>> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>>>  WDYT?
>>>
>>> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well that needs to be created first for this purpose. The appropriate
>>>> name etc. to be decided. Maybe @Denny Lee   can
>>>> facilitate this as he offered his help.
>>>>
>>>>
>>>> cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>>>
>>>>> Hello Mich,
>>>>>
>>>>> Can you please provide the link for the confluence page?
>>>>>
>>>>> Many thanks
>>>>> Asma
>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>
>>>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Apologies I missed the list.
>>>>>>
>>>>>> To move forward I selected these topics from the thread "Online
>>>>>> classes for spark topics".
>>>>>>
>>>>>> To take this further I propose that a Confluence page be set up.
>>>>>>
>>>>>>
>>>>>>1. Spark UI
>>>

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
What we can do is get into the habit of compiling the list on LinkedIn but
making sure this list is shared and broadcast here, eh?!

As well, when we broadcast the videos, we can do this using zoom/jitsi/
riverside.fm as well as simulcasting this on LinkedIn. This way you can
view directly on the former without ever logging in with a user ID.

HTH!!

On Wed, Mar 15, 2023 at 4:30 PM Mich Talebzadeh 
wrote:

> Understood, Nitin. It would be wrong to act against one's convictions. I am
> sure we can find a way around providing the content.
>
> Regards
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Mar 2023 at 22:34, Nitin Bhansali 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for your prompt response ... much appreciated. I know how to, and
>> can, create login IDs on such sites, but I took a conscious decision some
>> 20 years ago (and I would be going against my principles) not to be on such
>> sites. Hence I had asked whether there is any other way I can join/view a
>> recording of the webinar.
>>
>> Anyways not to worry.
>>
>> Thanks & Regards
>>
>> Nitin.
>>
>>
>> On Wednesday, 15 March 2023 at 20:37:55 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi Nitin,
>>
>> LinkedIn is more of a professional medium.  FYI, I am only a member of
>> LinkedIn, no Facebook, etc. There is no reason for you NOT to create a
>> profile for yourself on LinkedIn :)
>>
>>
>> https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en
>>
>> see you there as well.
>>
>> Best of luck.
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali 
>> wrote:
>>
>> Hello Mich,
>>
>> My apologies ... but I am not on any such social/professional sites.
>> Is there any other way to access such webinars/classes?
>>
>> Thanks & Regards
>> Nitin.
>>
>> On Wednesday, 15 March 2023 at 18:26:51 GMT, Denny Lee <
>> denny.g@gmail.com> wrote:
>>
>>
>> Thanks Mich for tackling this!  I encourage everyone to add to the list
>> so we can have a comprehensive list of topics, eh?!
>>
>> On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
>> wrote:
>>
>> Hi all,
>>
>> Thanks to @Denny Lee for giving access to
>>
>> https://www.linkedin.com/company/apachespark/
>>
>> and contribution from @asma zgolli 
>>
>> You will see my post at the bottom. Please add anything else on topics to
>> the list as a comment.
>>
>> We will then put them together in an article perhaps. Comments and
>> contributions are welcome.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>

Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea!

On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon  wrote:

> Yeah, actually I think we should better have a slack channel so we can
> easily discuss with users and developers.
>
> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>
>> Hi all,
>> I really like *Slack *as communication channel for a tech community.
>> There is a Slack workspace for *delta lake users* (
>> https://go.delta.io/slack) that I enjoy a lot.
>> I was wondering if there is something similar for PySpark users.
>>
>> If not, would there be anything wrong with creating a new Slack workspace
>> for PySpark users? (when explicitly mentioning that this is *not*
>> officially part of Apache Spark)?
>>
>> Cheers
>> Martin
>>
>


Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
't refer to any specific mailing
>>>> list because we didn't set up any rule here yet.
>>>>
>>>> To Xiao. I understand what you mean. That's the reason why I added
>>>> Matei from your side.
>>>> > I did not see an objection from the ASF board.
>>>>
>>>> There is on-going discussion about the communication channels outside
>>>> ASF email which is specifically concerning Slack.
>>>> Please hold on any official action for this topic. We will know how to
>>>> support it seamlessly.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Mar 30, 2023 at 9:21 AM Xiao Li  wrote:
>>>>
>>>>> Hi, Dongjoon,
>>>>>
>>>>> The other communities (e.g., Pinot, Druid, Flink) created their own
>>>>> Slack workspaces last year. I did not see an objection from the ASF board.
>>>>> At the same time, Slack workspaces are very popular and useful in most
>>>>> non-ASF open source communities. TBH, we are kind of late. I think we can
>>>>> do the same in our community?
>>>>>
>>>>> We can follow the guide when the ASF has an official process for ASF
>>>>> archiving. Since our PMC are the owner of the slack workspace, we can make
>>>>> a change based on the policy. WDYT?
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> Dongjoon Hyun  于2023年3月30日周四 09:03写道:
>>>>>
>>>>>> Hi, Xiao and all.
>>>>>>
>>>>>> (cc Matei)
>>>>>>
>>>>>> Please hold on the vote.
>>>>>>
>>>>>> There is a concern expressed by ASF board because recent Slack
>>>>>> activities created an isolated silo outside of ASF mailing list archive.
>>>>>>
>>>>>> We need to establish a way to embrace it back to ASF archive before
>>>>>> starting anything official.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> + @d...@spark.apache.org 
>>>>>>>
>>>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>>>>>> Flink) have created their own dedicated Slack workspaces for faster
>>>>>>> communication. We can do the same in Apache Spark. The Slack workspace 
>>>>>>> will
>>>>>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for 
>>>>>>> the
>>>>>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>>>>>>
>>>>>>>> I created one at slack called pyspark
>>>>>>>>
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>>> Palantir Technologies Limited
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 good idea, I d like to join as well.
>>>>>>>>>
>>>>>>>>> On Tue, 28 Mar 2023 at 04:09, Winston Lai  wrote:
>>>>>>>>>
>>>>>>>>>> Please let us know when the channel is created. I'd like to join
>>>>>>>>>> :)
>>>>>>>>>>
>>>>>>>>>> Thank You & Best Regards
>>>>>>>>>> Winston Lai
>>>>>>>>>> --
>>>>>>>>>> *From:* Denny Lee 
>>>>>>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>>>>>>> *To:* Hyukjin Kwon 
>>>>>>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>>>>>>> user@spark.apache.org>
>>>>>>>>>> *Subject:* Re: Slack for PySpark users
>>>>>>>>>>
>>>>>>>>>> +1 I think this is a great idea!
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Yeah, actually I think we should better have a slack channel so
>>>>>>>>>> we can easily discuss with users and developers.
>>>>>>>>>>
>>>>>>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>> I really like *Slack *as communication channel for a tech
>>>>>>>>>> community.
>>>>>>>>>> There is a Slack workspace for *delta lake users* (
>>>>>>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>>>>>>> I was wondering if there is something similar for PySpark users.
>>>>>>>>>>
>>>>>>>>>> If not, would there be anything wrong with creating a new
>>>>>>>>>> Slack workspace for PySpark users? (when explicitly mentioning that 
>>>>>>>>>> this is
>>>>>>>>>> *not* officially part of Apache Spark)?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Asma ZGOLLI
>>>>>>>>>
>>>>>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4
>> <https://www.google.com/maps/search/Vestre+Aspehaug+4?entry=gmail&source=g>,
>> 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
>>>>>
>>>>>
>>>>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>>>>> section of "Open Communications", restriction of communication inside ASF
>>>>>> INFRA (mailing list) is more about code and decision-making.
>>>>>>
>>>>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>>>>
>>>>>> It's unavoidable if "users" prefer to use an alternative
>>>>>> communication mechanism rather than the user mailing list. Before Stack
>>>>>> Overflow days, there had been a meaningful number of questions around 
>>>>>> user@.
>>>>>> It's just impossible to let them go back and post to the user mailing 
>>>>>> list.
>>>>>>
>>>>>> We just need to make sure it is not the purpose of employing Slack to
>>>>>> move all discussions about developments, direction of the project, etc
>>>>>> which must happen in dev@/private@. The purpose of Slack thread here
>>>>>> does not seem to aim to serve the purpose.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Good discussions and proposals.all around.
>>>>>>>
>>>>>>> I have used slack in anger on a customer site before. For small and
>>>>>>> medium size groups it is good and affordable. Alternatives have been
>>>>>>> suggested as well so those who like investigative search can agree and 
>>>>>>> come
>>>>>>> up with a freebie one.
>>>>>>> I am inclined to agree with Bjorn that this slack has more social
>>>>>>> dimensions than the mailing list. It is akin to a sports club using
>>>>>>> WhatsApp groups for communication. Remember we were originally looking 
>>>>>>> for
>>>>>>> space for webinars, including Spark on LinkedIn, that Denny Lee
>>>>>>> suggested.
>>>>>>> I think Slack and mailing groups can coexist happily. On a more serious
>>>>>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>>>>>> traffic. Currently we hardly get many mails daily, less than 5. So
>>>>>>> having
>>>>>>> a Slack-type medium may improve member participation.
>>>>>>>
>>>>>>> so +1 for me as well.
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>> Palantir Technologies Limited
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1.
>>>>>>>>
>>>>>>>> To Shani’s point, there are multiple OSS projects that use the free
>>>>>>>> Slack version - top of mind include Delta, Presto, Flink, Trino, 
>>>>>>>> Datahub,
>>>>>>>> MLflow, etc.
>>>>>>>>
>>>>>>>> On 

Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram,

We have some good guidance at
https://spark.apache.org/contributing.html

HTH!
Denny


On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:

>
>
>
> Hello All,
> Recently, joined this community and would like to contribute. Is there a
> guideline or recommendation on tasks that can be picked up by a first timer
> or a started task?.
>
> Tried looking at stack overflow tag: apache-spark
> , couldn't find
> any information for first time contributors.
>
> Looking forward to learning and contributing.
>
> Thanks
> Ram
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem where everything was working swimmingly
with Spark 1.3.1.  When I switched to Spark 1.4, either upgrading to
Java 8 (from Java 7) or bumping up the PermGen size solved my issue.
HTH!



On Mon, Jul 6, 2015 at 8:31 AM Andy Huang  wrote:

> We have hit the same issue in spark shell when registering a temp table.
> We observed it happening with those who had JDK 6. The problem went away
> after installing jdk 8. This was only for the tutorial materials which was
> about loading a parquet file.
>
> Regards
> Andy
>
> On Sat, Jul 4, 2015 at 2:54 AM, sim  wrote:
>
>> @bipin, in my case the error happens immediately in a fresh shell in
>> 1.4.0.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
> f: 02 9376 0730| m: 0433221979
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
I went ahead and tested your file and the results from the tests can be
seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.

Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
the query ran without any issues.  I was able to recreate the issue with
{Java 7, default}.  I included the commands I used to start the spark-shell,
but basically I just used all defaults (no alteration to driver or executor
memory), with the only additional option being --driver-class-path to connect
to the MySQL Hive metastore.  This is on an OSX MacBook Pro.

One thing I did notice is that your version of Java 7 is update 51 while
my version of Java 7 is update 79.  Could you see if updating to Java 7
update 79 perhaps allows you to use the MaxPermSize option?




On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov  wrote:

>  The file is at
> https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1
>
>  The command was included in the gist
>
>  SPARK_REPL_OPTS="-XX:MaxPermSize=256m"
> spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
> com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
>
>  /Sim
>
>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>
>
>   From: Yin Huai 
> Date: Monday, July 6, 2015 at 12:59 AM
> To: Simeon Simeonov 
> Cc: Denny Lee , Andy Huang <
> andy.hu...@servian.com.au>, user 
>
> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>
>   I have never seen issue like this. Setting PermGen size to 256m should
> solve the problem. Can you send me your test file and the command used to
> launch the spark shell or your application?
>
>  Thanks,
>
>  Yin
>
> On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov  wrote:
>
>>   Yin,
>>
>>  With 512Mb PermGen, the process still hung and had to be kill -9ed.
>>
>>  At 1Gb the spark shell & associated processes stopped hanging and
>> started exiting with
>>
>>  scala> println(dfCount.first.getLong(0))
>> 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
>> called with curMem=0, maxMem=2223023063
>> 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
>> values in memory (estimated size 229.5 KB, free 2.1 GB)
>> 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184) called
>> with curMem=235040, maxMem=2223023063
>> 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
>> stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
>> 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
>> in memory on localhost:65464 (size: 19.7 KB, free: 2.1 GB)
>> 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from first
>> at :30
>> java.lang.OutOfMemoryError: PermGen space
>> Stopping spark context.
>> Exception in thread "main"
>> Exception: java.lang.OutOfMemoryError thrown from the
>> UncaughtExceptionHandler in thread "main"
>> 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
>> broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
>> GB)
>>
>>  That did not change up until 4Gb of PermGen space and 8Gb for driver &
>> executor each.
>>
>>  I stopped at this point because the exercise started looking silly. It
>> is clear that 1.4.0 is using memory in a substantially different manner.
>>
>>  I'd be happy to share the test file so you can reproduce this in your
>> own environment.
>>
>>  /Sim
>>
>>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
>> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>>
>>
>>   From: Yin Huai 
>> Date: Sunday, July 5, 2015 at 11:04 PM
>> To: Denny Lee 
>> Cc: Andy Huang , Simeon Simeonov <
>> s...@swoop.com>, user 
>> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>>
>>   Sim,
>>
>>  Can you increase the PermGen size? Please let me know what is your
>> setting when the problem disappears.
>>
>>  Thanks,
>>
>>  Yin
>>
>> On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee  wrote:
>>
>>>  I had run into the same problem where everything was working
>>> swimmingly with Spark 1.3.1.  When I switched to Spark 1.4, either by
>>> upgrading to Java8 (from Java7) or by knocking up the PermGenSize had
>>> solved my issue.  HTH!
>>>
>>>
>>>
>>>  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang 

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context
is primarily about very fast queries.  If you want to use real-time
queries, I would utilize Spark Streaming.  A couple of great resources on
this topic include Guest Lecture on Spark Streaming in Stanford CME 323:
Distributed Algorithms and Optimization

and Recipes for Running Spark Streaming Applications in Production

(from the recent Spark Summit 2015)
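
To make the distinction concrete, here is a minimal Spark 1.x-era PySpark
sketch (the Hive table name and the socket source are placeholders, not taken
from this thread):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="hive-vs-streaming-demo")

    # Fast interactive/batch queries over existing Hive tables via the Hive context
    sqlContext = HiveContext(sc)
    sqlContext.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show()

    # Near-real-time processing of a live feed via Spark Streaming (5-second micro-batches)
    ssc = StreamingContext(sc, 5)
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()
    ssc.start()
    ssc.awaitTermination()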

HTH!


On Mon, Jul 6, 2015 at 3:23 PM spierki 
wrote:

> Hello,
>
> I'm currently asking myself about the performance of using Spark SQL with
> Hive to do real-time analytics.
> I know that Hive was created for batch processing, and Spark is used to
> do fast queries.
>
> But will using Spark SQL with Hive allow me to do real-time queries? Or will
> it just make queries faster, but not real time?
> Should I use another data warehouse, like HBase?
>
> Thanks in advance for your time and consideration,
> Florian
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean,
Sure, will take care of this.
HTH,
Denny

On Tue, Jul 7, 2015 at 10:07 Dean Wampler  wrote:

> Here's our home page: http://www.meetup.com/Chicago-Spark-Users/
>
> Thanks,
> Dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>


Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya,

Here's a blog post concerning On-Time Flight Performance with GraphFrames:
https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html

It also includes a Databricks notebook that has the code in it.

HTH!
Denny


On Tue, Aug 2, 2016 at 1:16 AM Kazuaki Ishizaki  wrote:

> Sorry
> Please ignore this mail. Sorry for the misinterpretation of GraphFrames in
> Spark. I thought it was the frame graph from a profiling tool.
>
> Kazuaki Ishizaki,
>
>
>
> From:Kazuaki Ishizaki/Japan/IBM@IBMJP
> To:Divya Gehlot 
> Cc:"user @spark" 
> Date:2016/08/02 17:06
> Subject:Re: Spark GraphFrames
> --
>
>
>
> Hi,
> Kay wrote a procedure to use GraphFrames with Spark.
> *https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4*
> 
>
> Kazuaki Ishizaki
>
>
>
> From:Divya Gehlot 
> To:"user @spark" 
> Date:2016/08/02 14:52
> Subject:Spark GraphFrames
> --
>
>
>
> Hi,
>
> Has anybody worked with GraphFrames?
> Please let me know, as I need to know the real-world scenarios where it can
> be used.
>
>
> Thanks,
> Divya
>
>
>


Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using
model = LinearRegressionWithSGD.train(parseddata, iterations=100,
step=0.01, intercept=True)

then to get the intercept, you would use
model.intercept

More information can be found at:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression
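
For completeness, a minimal end-to-end sketch (assumes the pyspark shell's sc;
the input path and line format are placeholders for your own data):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # each input line is assumed to look like: label,feature1 feature2 ...
    def parse_point(line):
        label, features = line.split(",")
        return LabeledPoint(float(label), [float(x) for x in features.split(" ")])

    parseddata = sc.textFile("data/lr_data.txt").map(parse_point)
    model = LinearRegressionWithSGD.train(parseddata, iterations=100,
                                          step=0.01, intercept=True)

    print(model.intercept)  # the fitted intercept term
    print(model.weights)    # the fitted coefficients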

HTH!


On Tue, Dec 15, 2015 at 11:06 PM Arunkumar Pillai 
wrote:

>
> How do I get the intercept in a linear regression model?
>
> LinearRegressionWithSGD.train(parsedData, numIterations)
>
> --
> Thanks and Regards
> Arun
>


Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join
the mailing list.


On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele 
wrote:

>
>


Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html:

For Python, you can use the --py-files argument of spark-submit to add .py,
.zip or .egg files to be distributed with your application. If you depend
on multiple Python files we recommend packaging them into a .zip or .egg.
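
As a concrete illustration, a minimal layout (the module and path names are
hypothetical, just to show the shape of a --py-files submission):

    # Package the helper modules and submit, e.g.:
    #   zip -r deps.zip mypkg/
    #   spark-submit --py-files deps.zip main.py
    #
    # main.py - a small driver that imports from the zipped package
    from pyspark import SparkContext
    from mypkg.transforms import parse_line  # hypothetical module shipped in deps.zip

    sc = SparkContext(appName="py-files-demo")
    count = sc.textFile("hdfs:///data/input.txt").map(parse_line).count()
    print(count)
    sc.stop()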



On Fri, Jan 8, 2016 at 6:44 PM Ascot Moss  wrote:

> Hi,
>
> Instead of using Spark-shell, does anyone know how to build .zip (or .egg)
> for Python and use Spark-submit to run?
>
> Regards
>


Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico,

Glad to hear that you love Spark and would like to organize a meetup in
Rome. We created a Meetup-in-a-box to help with that - check out the post
https://databricks.com/blog/2015/11/19/meetup-in-a-box.html.

HTH!
Denny



On Fri, Feb 19, 2016 at 02:38 Domenico Pontari 
wrote:

>
> Hi guys,
> I spent until September 2015 in the Bay Area working with Spark and I love
> it. Now I'm back in Rome and I'd like to organize a meetup about it and Big
> Data in general. Any ideas/suggestions? Could you possibly sponsor beers
> and pizza for it?
> Best,
> Domenico
>


Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey
2015.  Because of the survey, we have a better picture of who’s using
Spark, how they’re using it, and what they’re using it to build–insights
that will guide major updates to the Spark platform as we move into Spark’s
next phase of growth. The results are summarized in an infographic
available here: Spark Survey Results 2015 are now available.
Thank you to everyone who participated in Spark Survey 2015 and for your
help in shaping Spark’s future!


Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark
(https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/)

On Mon, Nov 2, 2015 at 05:27 Stefano Baghino 
wrote:

> There is this interesting book from Databricks:
> https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
>
> What do you think? Does it contain the info you're looking for? :)
>
> On Mon, Nov 2, 2015 at 2:18 PM, satish chandra j  > wrote:
>
>> HI All,
>> Yes, any such doc will be a great help!!!
>>
>>
>>
>> On Fri, Oct 30, 2015 at 4:35 PM, huangzheng <1106944...@qq.com> wrote:
>>
>>> I have the same question. Can anyone help us?
>>>
>>>
>>> -- Original message --
>>> *From:* "Deepak Sharma";
>>> *Sent:* Friday, 30 October 2015, 7:23 PM
>>> *To:* "user";
>>> *Subject:* Best practises
>>>
>>> Hi
>>> I am looking for any blog/doc on developers' best practices when
>>> using Spark. I have already looked at the tuning guide on
>>> spark.apache.org.
>>> Please do let me know if anyone is aware of any such resource.
>>>
>>> Thanks
>>> Deepak
>>>
>>
>>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>


Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on
the topic at
http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded

If you have an archival type strategy, you could do daily BCP extracts out
to load the data into HDFS / S3 / etc. This would result in minimal impact
to SQL Server for the extracts (for that scenario, that was of primary
importance).
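
As an example, a minimal sketch of reading such an extract back in (this
assumes a pipe-delimited BCP dump in HDFS, a Spark 1.x pyspark shell where
sqlContext is already defined, and the spark-csv package; the path and
delimiter are placeholders):

    # load the nightly BCP extract as a DataFrame and expose it to Spark SQL
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("delimiter", "|")
          .option("header", "false")
          .load("hdfs:///extracts/sqlserver/mytable/2015-07-23/"))
    df.registerTempTable("mytable_extract")
    sqlContext.sql("SELECT COUNT(*) FROM mytable_extract").show()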

On Thu, Jul 23, 2015 at 16:42 vinod kumar  wrote:

> Hi Everyone,
>
> I need to use a table from MS SQL Server in Spark. Could anyone please
> share the optimized way to do that?
>
> Thanks in advance,
> Vinod
>
>


Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX or GraphFrames' subgraph support to build up
your subgraph.  A good example for GraphFrames can be found at:
http://graphframes.github.io/user-guide.html#subgraphs.  HTH!
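
For reference, a minimal sketch of the filter-then-rebuild approach using the
example edges from the original mail (assumes a Spark 2.x session named spark
and the graphframes package attached, e.g. via --packages):

    from graphframes import GraphFrame

    vertices = spark.createDataFrame([("1",), ("2",), ("3",)], ["id"])
    edges = spark.createDataFrame(
        [("1", "2", "p2p"), ("2", "3", "p2p"), ("2", "3", "c2p")],
        ["src", "dst", "relationship"])
    g = GraphFrame(vertices, edges)

    # keep only the p2p edges, then rebuild a graph restricted to them
    p2p_edges = g.edges.filter("relationship = 'p2p'")
    p2p_graph = GraphFrame(g.vertices, p2p_edges)

    # BFS over the restricted graph, starting from vertex "1"
    paths = p2p_graph.bfs(fromExpr="id = '1'", toExpr="id = '3'")
    paths.show()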

On Mon, Oct 10, 2016 at 9:32 PM cashinpj  wrote:

> Hello,
>
> I have a set of data representing various network connections.  Each vertex
> is represented by a single id, while the edges have a source id,
> destination id, and a relationship (peer to peer, customer to provider, or
> provider to customer).  I am trying to create a subgraph built around a
> single source node, following one type of edge as far as possible.
>
> For example:
> 1 2 p2p
> 2 3 p2p
> 2 3 c2p
>
> Following the p2p edges would give:
>
> 1 2 p2p
> 2 3 p2p
>
> I am pretty new to GraphX and GraphFrames, but was wondering if it is
> possible to get this behavior using the GraphFrames bfs() function or would
> it be better to modify the already existing Pregel implementation of bfs?
>
> Thank you for your time.
>
> Padraic
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-BFS-tp27876.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that you can join to that table, presuming it's
small enough to distribute?  Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html
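
A minimal PySpark sketch of that pattern (assumes a Spark 2.x session named
spark; the paths, table names, and join key are placeholders, and the lookup
side is assumed small enough to ship to every executor):

    from pyspark.sql.functions import broadcast

    # lookup_df: the small lookup table, read in however suits your source
    # (e.g., JDBC to HANA); a placeholder parquet path is used here
    lookup_df = spark.read.parquet("/data/lookup")
    facts_df = spark.read.parquet("/data/facts")

    # the broadcast() hint ships the lookup table to every executor, so the
    # join happens locally without shuffling the large side
    joined = facts_df.join(broadcast(lookup_df), on="lookup_key", how="left")
    joined.show()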

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit  wrote:

> I have a lookup table in a HANA database. I want to create a Spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as a data frame
> and convert the data frame into a broadcast variable?
>
> Thanks,
> Nishit
>


Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache
Spark at
https://www.edx.org/xseries/data-science-engineering-apacher-sparktm.

Note, a great quick start is the Getting Started with Apache Spark on
Databricks at https://databricks.com/product/getting-started-guide

HTH!

On Sun, Nov 6, 2016 at 22:20 Raghav  wrote:

> Can you please point out the right courses from EDX/Berkeley ?
>
> Many thanks.
>
> On Sun, Nov 6, 2016 at 6:08 PM, ayan guha  wrote:
>
> I would start with Spark documentation, really. Then you would probably
> start with some older videos from YouTube, especially Spark Summit
> 2014, 2015 and 2016 videos. Regarding practice, I would strongly suggest
> Databricks cloud (or download a prebuilt Spark from the Spark site). You can
> also take courses from edX/Berkeley, which are very good starter courses.
>
> On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
>
> I am a newbie in the world of big data analytics, and I want to teach myself
> Apache Spark, and want to be able to write scripts to tinker with data.
>
> I have some understanding of MapReduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance on starter material, or videos.
>
> Thanks.
>
> Raghav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good
starting point is the Apache Spark Documentation at:
http://spark.apache.org/documentation.html


The two books that immediately come to mind are

- Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do
(there's also a Chinese language version of this book)

- Advanced Analytics with Apache Spark:
http://shop.oreilly.com/product/mobile/0636920035091.do

You can also find a pretty decent listing of Apache Spark resources at:
https://sparkhub.databricks.com/resources/

HTH!


On Sun, Nov 6, 2016 at 19:00 litg <1933443...@qq.com> wrote:

>I'm a postgraduate from Shanghai Jiao Tong University, China.
> Recently, I
> carried out a project on the realization of artificial algorithms on
> Spark
> in Python. However, I am not familiar with this field. Furthermore, there are
> few Chinese books about Spark.
>  Actually, I strongly want to study this field further. I hope
> someone can kindly recommend me some books about the mechanics of Spark,
> or just give me suggestions about how to program with Spark.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hope-someone-can-recommend-some-books-for-me-a-spark-beginner-tp28033.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger file sizes due to the
overhead of opening up files.  Typical guidance is between 64MB-1GB;
personally I usually stick with 128MB-512MB with the default snappy
codec compression for Parquet.  A good reference is Vida Ha's
presentation Data Storage Tips for Optimal Spark Performance.
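
A minimal sketch of the usual fix - control the number of output files by
repartitioning (or coalescing) before the write; df stands in for whatever
DataFrame the app writes, and the partition count and path are placeholders
to tune for your data volume:

    # placeholder DataFrame; in practice this is the app's real output
    df = spark.range(0, 10000000).withColumnRenamed("id", "event_id")

    # aim for a handful of ~128MB-512MB files instead of thousands of tiny ones
    target_partitions = 16  # roughly total_data_size / desired_file_size

    (df.repartition(target_partitions)
       .write
       .mode("append")
       .parquet("/data/events/parquet"))

    # coalesce(n) is a cheaper alternative when only reducing the partition count
    # df.coalesce(target_partitions).write.mode("append").parquet("/data/events/parquet")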


On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:

> Hi Everyone,
> Does anyone know the best practice for writing Parquet files from
> Spark?
>
> As the Spark app writes data to Parquet, it shows that under that directory
> there are heaps of very small parquet files (such as
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only
> 15KB.
>
> Should it write bigger chunks of data (such as 128 MB) with a
> proper number of files?
>
> Has anyone found any performance changes when changing the data size of
> each parquet file?
>
> Thanks,
> Kevin.
>


Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com 
cont...@chrissmurphy.com
wrote:
PLEASE!!

Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com
wrote:

Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause
for Spark 2.x?

Thanks!


Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to
DocumentDB (https://github.com/Azure/azure-documentdb-spark) if sticking
with Scala?
On Thu, Apr 20, 2017 at 21:46 Nan Zhu  wrote:

> DocDB does have a java client? Anything prevent you using that?
>
> Get Outlook for iOS 
> --
> *From:* ayan guha 
> *Sent:* Thursday, April 20, 2017 9:24:03 PM
> *To:* Ashish Singh
> *Cc:* user
> *Subject:* Re: Azure Event Hub with Pyspark
>
> Hi
>
> yes, its only scala. I am looking for a pyspark version, as i want to
> write to documentDB which has good python integration.
>
> Thanks in advance
>
> best
> Ayan
>
> On Fri, Apr 21, 2017 at 2:02 PM, Ashish Singh 
> wrote:
>
>> Hi ,
>>
>> You can try https://github.com/hdinsight/spark-eventhubs, which is an
>> Event Hubs receiver for Spark Streaming.
>> We are using it, but there is only a Scala version, I guess.
>>
>>
>> Thanks,
>> Ashish Singh
>>
>> On Fri, Apr 21, 2017 at 9:19 AM, ayan guha  wrote:
>>
>>>
>>> Hi
>>>
>>> I am not able to find any connector to be used to connect Spark Streaming
>>> with Azure Event Hub using PySpark.
>>>
>>> Does anyone know if such a library/package exists?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector,
specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
see if you're still getting the same error?  i.e.

spark-shell --master yarn --jars
azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar


On Mon, May 8, 2017 at 5:01 AM ayan guha  wrote:

> Hi
>
> I am facing an issue while trying to use azure-document-db connector from
> Microsoft. Instructions/Github
> 
> .
>
> Error while trying to add jar in spark-shell:
>
> spark-shell --jars
> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
> SPARK_MAJOR_VERSION is set to 2, using Spark2
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> [init] error: error while loading , Error accessing
> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.NullPointerException
> at
> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
> at
> scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896)
> at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336)
> at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
sc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sshuser@ed0-svochd:~/azure-spark-docdb-test$
>
>
> On Mon, May 8, 2017 at 11:50 PM, Denny Lee  wrote:
>
>> This appears to be an issue with the Spark to DocumentDB connector,
>> specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
>> see if you're still getting the same error?  i.e.
>>
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>>
>> On Mon, May 8, 2017 at 5:01 AM ayan guha  wrote:
>>
>>> Hi
>>>
>>> I am facing an issue while trying to use azure-document-db connector
>>> from Microsoft. Instructions/Github
>>> <https://github.com/Azure/azure-documentdb-spark/wiki/Azure-DocumentDB-Spark-Connector-User-Guide>
>>> .
>>>
>>> Error while trying to add jar in spark-shell:
>>>
>>> spark-shell --jars
>>> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>> Exception in thread "main" java.lang.NullPointerException
>>> at
>>> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
>>> at
>>> scala.tools.nsc.interp

Re: Spark Shell issue on HDInsight

2017-05-14 Thread Denny Lee
Sorry for the delay - you just did, as I'm on the Azure CosmosDB (formerly
DocumentDB) team.  If you'd like to make it official, why not add an issue
to the GitHub repo at https://github.com/Azure/azure-documentdb-spark/issues.
HTH!

On Thu, May 11, 2017 at 9:08 PM ayan guha  wrote:

> Works for me tooyou are a life-saver :)
>
> But the question: should/how we report this to Azure team?
>
> On Fri, May 12, 2017 at 10:32 AM, Denny Lee  wrote:
>
>> I was able to repro your issue when I had downloaded the jars via blob
>> but when I downloaded them as raw, I was able to get everything up and
>> running.  For example:
>>
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> resulted in the error:
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> [init] error: error while loading , Error accessing
>> /home/sshuser/jars/test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>
>> Failed to initialize compiler: object java.lang.Object in compiler mirror
>> not found.
>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>> ** object programmatically, settings.usejavacp.value = true.
>>
>> But when running:
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> it was up and running:
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> 17/05/11 22:54:06 WARN SparkContext: Use an existing SparkContext, some
>> configuration may not take effect.
>> Spark context Web UI available at http://10.0.0.22:4040
>> Spark context available as 'sc' (master = yarn, app id =
>> application_1494248502247_0013).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2.2.5.4.0-121
>>   /_/
>>
>> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala>
>>
>> HTH!
>>
>>
>> On Wed, May 10, 2017 at 11:49 PM ayan guha  wrote:
>>
>>> Hi
>>>
>>> Thanks for reply, but unfortunately did not work. I am getting same
>>> error.
>>>
>>> sshuser@ed0-svochd:~/azure-spark-docdb-test$ spark-shell --jars
>>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>> Exception in thread &quo

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :)

On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
wrote:

> That's great!
>
>
>
> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>
>> Awesome! Congrats!!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>> *To:* user@spark.apache.org
>> *Subject:* With 2.2.0 PySpark is now available for pip install from PyPI
>> :)
>>
>> Hi wonderful Python + Spark folks,
>>
>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>> been a long time coming (previous releases included pip installable
>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>> if you (or your friends) want to be able to work with PySpark locally on
>> your laptop you've got an easier path getting started (pip install pyspark).
>>
>> If you are setting up a standalone cluster your cluster will still need
>> the "full" Spark packaging, but the pip installed PySpark should be able to
>> work with YARN or an existing standalone cluster installation (of the same
>> version).
>>
>> Happy Sparking y'all!
>>
>> Holden :)
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though as it does support PySpark.
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

> I can not find anything for graphx module in the python API document, does
> it mean it is not supported yet?
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
Most likely not, as most of the effort is currently on GraphFrames - a
great blog post on what GraphFrames offers can be found at:
https://databricks.com/blog/2016/03/03/introducing-graphframes.html.   Is
there a particular scenario or situation that you're addressing that
requires GraphX vs. GraphFrames?

On Sat, Feb 17, 2018 at 8:26 PM xiaobo  wrote:

> Thanks Denny, will it be supported in the near future?
>
>
>
> -- Original --
> *From:* Denny Lee 
> *Date:* Sun,Feb 18,2018 11:05 AM
> *To:* 94035420 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Does Pyspark Support Graphx?
>
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420  wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
Note the --packages option works for both PySpark and Spark (Scala).  For
the SparkLauncher class, you should be able to include packages like so:

spark.addSparkArg("--packages", "graphframes:graphframes:0.5.0-spark2.0-s_2.11")
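
For illustration, a rough SparkLauncher sketch along those lines (a minimal
example, not from this thread - the jar path, main class, and master URL are
placeholders, the package coordinate is carried over from above, and it
assumes the Spark 2.x launcher API):

import org.apache.spark.launcher.SparkLauncher

val spark = new SparkLauncher()
  .setAppResource("/path/to/your-app.jar")   // placeholder application jar
  .setMainClass("com.example.YourGraphApp")  // placeholder main class
  .setMaster("yarn")                         // or a spark:// / local master
  .addSparkArg("--packages", "graphframes:graphframes:0.5.0-spark2.0-s_2.11")

spark.launch()   // or spark.startApplication() to get a SparkAppHandle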


On Sun, Feb 18, 2018 at 3:30 PM xiaobo  wrote:

> Hi Denny,
> The pyspark script uses the --packages option to load graphframe library,
> what about the SparkLauncher class?
>
>
>
> -- Original --
> *From:* Denny Lee 
> *Date:* Sun,Feb 18,2018 11:07 AM
> *To:* 94035420 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Does Pyspark Support Graphx?
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420  wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1

On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:

> +1
>
> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>
>> +1 and the draft sounds good
>>
>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>
>>> Here is the draft announcement:
>>>
>>> ===
>>> Plan for dropping Python 2 support
>>>
>>> As many of you already knew, Python core development team and many
>>> utilized Python packages like Pandas and NumPy will drop Python 2 support
>>> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>>> since Spark 1.4 release in 2015. However, maintaining Python 2/3
>>> compatibility is an increasing burden and it essentially limits the use of
>>> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
>>> coming, we plan to eventually drop Python 2 support as well. The current
>>> plan is as follows:
>>>
>>> * In the next major release in 2019, we will deprecate Python 2 support.
>>> PySpark users will see a deprecation warning if Python 2 is used. We will
>>> publish a migration guide for PySpark users to migrate to Python 3.
>>> * We will drop Python 2 support in a future release in 2020, after
>>> Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
>>> used.
>>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>>> might not take patches that are specific to Python 2.
>>> ===
>>>
>>> Sean helped make a pass. If it looks good, I'm going to upload it to
>>> Spark website and announce it here. Let me know if you think we should do a
>>> VOTE instead.
>>>
>>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>>
 I created https://issues.apache.org/jira/browse/SPARK-27884 to track
 the work.

 On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
 wrote:

> We don’t usually reference a future release on website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when is Spark 3.0 coming out then.
> Might need to provide some clarity on that.
>

 We can say the "next major release in 2019" instead of Spark 3.0. Spark
 3.0 timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp 
> wrote:
>
>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to
>> continue support for python2.7 for spark 2.x as the feature sets won't be
>> expanding.
>>
>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
 On the development side, in a future release that drops Python 2
 support we can remove code that maintains python 2/3 compatibility and
 start using python 3 only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Denny Lee
There are a number of really good datasets already available including (but
not limited to):
- South Korea COVID-19 Dataset

- 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns
Hopkins CSSE 
- COVID-19 Open Research Dataset Challenge (CORD-19)


BTW, I had co-presented in a recent tech talk on Analyzing COVID-19: Can
the Data Community Help? 

In the US, there is a good resource Coronavirus in the United States:
Mapping the COVID-19 outbreak
 and
there are various global starter projects on Reddit's r/CovidProjects
.

There are a lot of good projects that we can all help with, individually or
together.  I would suggest seeing which hospitals/academic institutions
are doing analysis in your local region.  Even if you're analyzing public
worldwide data, how it acts in your local region will often be different.







On Thu, Mar 26, 2020 at 12:30 PM Rajev Agarwal 
wrote:

> Actually I thought these sites exist look at John's hopkins and
> worldometers
>
> On Thu, Mar 26, 2020, 2:27 PM Zahid Rahman  wrote:
>
>>
>> "We can then donate this to WHO or others and we can make it very modular
>> though microservices etc."
>>
>> I have no interest because there are 8 million muslims locked up in their
>> home for 8 months by the Hindutwa (Indians)
>> You didn't take any notice of them.
>> Now you are locked up in your home and you want to contribute to the WHO.
>> The same WHO and you who didn't take any notice of the 8 million Kashmiri
>> Muslims.
>> The daily rapes of women and the imprisonment and torture of  men.
>>
>> Indian is the most dangerous country for women.
>>
>>
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>> On Thu, 26 Mar 2020 at 14:53, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks but nobody claimed we can fix it. However, we can all contribute
>>> to it. When it utilizes the cloud then it become a global digitization
>>> issue.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 26 Mar 2020 at 14:43, Laurent Bastien Corbeil <
>>> bastiencorb...@gmail.com> wrote:
>>>
 People in tech should be more humble and admit this is not something
 they can fix. There's already plenty of visualizations, dashboards etc
 showing the spread of the virus. This is not even a big data problem, so
 Spark would have limited use.

 On Thu, Mar 26, 2020 at 10:37 AM Sol Rodriguez 
 wrote:

> IMO it's not about technology, it's about data... if we don't have
> access to the data there's no point throwing "microservices" and "kafka" 
> at
> the problem. You might find that the most effective analysis might be
> delivered through an excel sheet ;)
> So before technology I'd suggest to get access to sources and then
> figure out how to best exploit them and deliver the information to the
> right people
>
> On Thu, Mar 26, 2020 at 2:29 PM Chenguang He 
> wrote:
>
>> Have you taken a look at this (
>> https://coronavirus.1point3acres.com/en/test  )?
>>
>> They have a visualizer with a very basic analysis of the outbreak.
>>
>> On Thu, Mar 26, 2020 at 8:54 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks.
>>>
>>> Agreed, computers are not the end but means to an end. We all have
>>> to start from somewhere. It all helps.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>> for any loss, damage or destruction of data or any other property which 
>>> may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author w

Re: How to unsubscribe

2020-05-06 Thread Denny Lee
Hi Fred,

To unsubscribe, could you please email: user-unsubscr...@spark.apache.org
(for more information, please refer to
https://spark.apache.org/community.html).

Thanks!
Denny


On Wed, May 6, 2020 at 10:12 AM Fred Liu  wrote:

> Hi guys
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
> *From:* Fred Liu 
> *Sent:* Wednesday, May 6, 2020 10:10 AM
> *To:* user@spark.apache.org
> *Subject:* Unsubscribe
>
>
>
> *[External E-mail]*
>
> *CAUTION: This email originated from outside the organization. Do not
> click links or open attachments unless you recognize the sender and know
> the content is safe.*
>
>
>
>
>


Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place an "s" within the value
itself.  Both the retry.delay and socket.timeout values are in seconds, so
you should only need to place the integer value.

On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982  wrote:

> Hi,
>
> To order to let a local spark-shell connect to  a remote spark stand-alone
> cluster and access  hive tables there, I must put the hive-site.xml file
> into the local spark installation's conf path, but spark-shell even can't
> import the default settings there, I found two errors:
>
> <property>
>   <name>hive.metastore.client.connect.retry.delay</name>
>   <value>5s</value>
> </property>
>
> <property>
>   <name>hive.metastore.client.socket.timeout</name>
>   <value>1800s</value>
> </property>
> Spark-shell try to read 5s and 1800s and integers, they must be changed to
> 5 and 1800 to let spark-shell work, It's suggested to be fixed in future
> versions.
>


Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
Cool!  For all the times I had been modifying the hive-site.xml I had only
put in the integer values - learn something new every day, eh?!


On Sun Feb 01 2015 at 9:36:23 AM Ted Yu  wrote:

> Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :
>
>
> METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay",
> "1s",
> new TimeValidator(TimeUnit.SECONDS),
> "Number of seconds for the client to wait between consecutive
> connection attempts"),
>
> It seems having the 's' suffix is legitimate.
>
> On Sun, Feb 1, 2015 at 9:14 AM, Denny Lee  wrote:
>
>> I may be missing something here but typically when the hive-site.xml
>> configurations do not require you to place "s" within the configuration
>> itself.  Both the retry.delay and socket.timeout values are in seconds so
>> you should only need to place the integer value (which are in seconds).
>>
>>
>> On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982 
>> wrote:
>>
>>> Hi,
>>>
>>> To order to let a local spark-shell connect to  a remote spark
>>> stand-alone cluster and access  hive tables there, I must put the
>>> hive-site.xml file into the local spark installation's conf path, but
>>> spark-shell even can't import the default settings there, I found two
>>> errors:
>>>
>>> <property>
>>>   <name>hive.metastore.client.connect.retry.delay</name>
>>>   <value>5s</value>
>>> </property>
>>>
>>> <property>
>>>   <name>hive.metastore.client.socket.timeout</name>
>>>   <value>1800s</value>
>>> </property>
>>> Spark-shell try to read 5s and 1800s and integers, they must be changed
>>> to 5 and 1800 to let spark-shell work, It's suggested to be fixed in future
>>> versions.
>>>
>>
>


Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted
is at: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.
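
For reference, the saveToCassandra() call Jonathan mentions below looks
roughly like this with the DataStax spark-cassandra-connector (a sketch only -
the keyspace, table, and columns are placeholders, and it assumes
spark.cassandra.connection.host is set and the connector jar is on the
classpath):

import com.datastax.spark.connector._

// placeholder aggregate: (day, total) pairs written to analytics.daily_counts
val dailyAggs = sc.parallelize(Seq(("2015-02-01", 1234L), ("2015-02-02", 987L)))
dailyAggs.saveToCassandra("analytics", "daily_counts", SomeColumns("day", "total"))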

On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad  wrote:

> Write out the rdd to a cassandra table.  The datastax driver provides
> saveToCassandra() for this purpose.
>
> On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> Hi,
>>
>> After some research I have decided that Spark (SQL) would be ideal for
>> building an OLAP engine. My goal is to push aggregated data (to Cassandra
>> or other low-latency data storage) and then be able to project the results
>> on a web page (web service). New data will be added (aggregated) once a
>> day, only. On the other hand, the web service must be able to run some
>> fixed(?) queries (either on Spark or Spark SQL) at anytime and plot the
>> results with D3.js. Note that I can already achieve similar speeds while in
>> REPL mode by caching the data. Therefore, I believe that my problem must be
>> re-phrased as follows: "How can I automatically cache the data once a day
>> and make them available on a web service that is capable of running any
>> Spark or Spark (SQL)  statement in order to plot the results with D3.js?"
>>
>> Note that I have already some experience in Spark (+Spark SQL) as well as
>> D3.js but not at all with OLAP engines (at least in their traditional form).
>>
>> Any ideas or suggestions?
>>
>>
>> *// Adamantios*
>>
>>
>>


Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun,

I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely
for development purposes).  I had most recently installed them utilizing
Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+.  The
null\bin\winutils issue is addressed in a handy earlier thread at:
http://apache-spark-user-list.1001560.n3.nabble.com/Run-spark-unit-test-on-Windows-7-td8656.html
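
In short, the usual workaround from that thread is to point Spark at a
winutils.exe - a quick sketch (the C:\hadoop path is a placeholder and
winutils.exe is assumed to already sit under %HADOOP_HOME%\bin):

set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin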

Hope this helps a little bit!
Denny





On Tue Feb 03 2015 at 8:24:24 AM Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Hi Gen
>
>
>
> Thanks for your feedback. We do have a business reason to run spark on
> windows. We have an existing application that is built on C# .NET running
> on windows. We are considering adding spark to the application for parallel
> processing of large data. We want spark to run on windows so it integrate
> with our existing app easily.
>
>
>
> Has anybody use spark on windows for production system? Is spark reliable
> on windows?
>
>
>
> Ningjun
>
>
>
> *From:* gen tang [mailto:gen.tan...@gmail.com]
> *Sent:* Thursday, January 29, 2015 12:53 PM
>
>
> *To:* Wang, Ningjun (LNG-NPV)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Fail to launch spark-shell on windows 2008 R2
>
>
>
> Hi,
>
>
>
> Using spark under windows is a really bad idea, because even you solve the
> problems about hadoop, you probably will meet the problem of
> java.net.SocketException. connection reset by peer. It is caused by the
> fact we ask socket port too frequently under windows. In my knowledge, it
> is really difficult to solve. And you will find something really funny: the
> same code sometimes works and sometimes not, even in the shell mode.
>
>
>
> And I am sorry but I don't see the interest to run spark under windows and
> moreover using local file system in a business environment. Do you have a
> cluster in windows?
>
>
>
> FYI, I have used spark prebuilt on hadoop 1 under windows 7 and there is
> no problem to launch, but have problem of java.net.SocketException. If you
> are using spark prebuilt on hadoop 2, you should consider follow the
> solution provided by https://issues.apache.org/jira/browse/SPARK-2356
>
>
>
> Cheers
>
> Gen
>
>
>
>
>
>
>
> On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
> Install virtual box which run Linux? That does not help us. We have
> business reason to run it on Windows operating system, e.g. Windows 2008 R2.
>
>
>
> If anybody have done that, please give some advise on what version of
> spark, which version of Hadoop do you built spark against, etc…. Note that
> we only use local file system and do not have any hdfs file system at all.
> I don’t understand why spark generate so many error on Hadoop while we
> don’t even need hdfs.
>
>
>
> Ningjun
>
>
>
>
>
> *From:* gen tang [mailto:gen.tan...@gmail.com]
> *Sent:* Thursday, January 29, 2015 10:45 AM
> *To:* Wang, Ningjun (LNG-NPV)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Fail to launch spark-shell on windows 2008 R2
>
>
>
> Hi,
>
>
>
> I tried to use spark under windows once. However the only solution that I
> found is to install virtualbox
>
>
>
> Hope this can help you.
>
> Best
>
> Gen
>
>
>
>
>
> On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
> I deployed spark-1.1.0 on Windows 7 and was albe to launch the
> spark-shell. I then deploy it to windows 2008 R2 and launch the
> spark-shell, I got the error
>
>
>
> java.lang.RuntimeException: Error while running command to get file
> permissions : java.io.IOExceptio
>
> n: Cannot run program "ls": CreateProcess error=2, The system cannot find
> the file specified
>
> at java.lang.ProcessBuilder.start(Unknown Source)
>
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
>
> at org.apache.hadoop.util.Shell.run(Shell.java:182)
>
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil
>
> eSystem.java:443)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst
>
> em.java:418)
>
>
>
>
>
>
>
> Here is the detail output
>
>
>
>
>
> C:\spark-1.1.0\bin>   spark-shell
>
> 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
> ningjun.wang,
>
> 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
> ningjun.wang,
>
> 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled;
>
> users with view permissions: Set(ningjun.wang, ); users with modify
> permissions: Set(ningjun.wang, )
>
>
>
> 15/01/29 10:13:13 INFO HttpServer: Starting HTT

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
Some quick context behind how Tableau interacts with Spark / Hive can also
be found at https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
 - it's about how to connect from Tableau to the thrift server before the
official Tableau beta connector existed, but it should provide some of the
additional context called out.   HTH!

On Wed Feb 04 2015 at 10:47:23 PM İsmail Keskin 
wrote:

> Tableau connects to Spark Thrift Server via an ODBC driver. So, none of
> the RDD stuff applies, you just issue SQL queries from Tableau.
>
> The table metadata can come from Hive Metastore if you place your
> hive-site.xml to configuration directory of Spark.
>
> On Thu, Feb 5, 2015 at 8:11 AM, ashu  wrote:
>
>> Hi,
>> I am trying out the tableau beta connector to Spark SQL. I have few basics
>> question:
>> Will this connector be able to fetch the schemaRDDs into tableau.
>> Will all the schemaRDDs be exposed to tableau?
>> Basically I am not getting what tableau will fetch at data-source? Is it
>> existing files in HDFS? RDDs or something else.
>> Question may be naive but I did not get answer anywhere else. Would really
>> appreciate if someone has already tried it, can help me with this.
>>
>> Thanks,
>> Ashutosh
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Tableau-beta-connector-tp21512.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Tableau beta connector

2015-02-04 Thread Denny Lee
The context is that you would create your RDDs and then persist them in
Hive. Once in Hive, the data is accessible from the Tableau extract through
the Spark thrift server.
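
As a rough sketch of that flow (Spark 1.2/1.3-era syntax; the case class,
input path, and table names below are placeholders):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

case class Event(id: Int, name: String)                // placeholder schema
val events = sc.textFile("/data/events.csv")           // placeholder path
  .map(_.split(","))
  .map(p => Event(p(0).trim.toInt, p(1)))

events.registerTempTable("events_staging")
hiveContext.sql("CREATE TABLE events AS SELECT * FROM events_staging")
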
On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030) <
ashutosh.triv...@iiitb.org> wrote:

>  Thanks Denny and Ismail.
>
>
>  Denny ,I went through your blog, It was great help. I guess tableau beta
> connector also following the same procedure,you described in blog. I am
> building the Spark now.
>
> Basically what I don't get is, where to put my data so that tableau can
> extract.
>
>
>  So  Ismail,its just Spark SQL. No RDDs I think I am getting it now . We
> use spark for our big data processing and we want *processed data (Rdd)*
> into tableau. So we should put our data in hive metastore and tableau will
> extract it from there using this connector? Correct me if I am wrong.
>
>
>  I guess I have to look at how thrift server works.
>  --
> *From:* Denny Lee 
> *Sent:* Thursday, February 5, 2015 12:20 PM
> *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Tableau beta connector
>
>  Some quick context behind how Tableau interacts with Spark / Hive can
> also be found at
> https://www.concur.com/blog/en-us/connect-tableau-to-sparksql  - its for
> how to connect from Tableau to the thrift server before the official
> Tableau beta connector but should provide some of the additional context
> called out.   HTH!
>
> On Wed Feb 04 2015 at 10:47:23 PM İsmail Keskin 
> wrote:
>
>> Tableau connects to Spark Thrift Server via an ODBC driver. So, none of
>> the RDD stuff applies, you just issue SQL queries from Tableau.
>>
>>  The table metadata can come from Hive Metastore if you place your
>> hive-site.xml to configuration directory of Spark.
>>
>> On Thu, Feb 5, 2015 at 8:11 AM, ashu  wrote:
>>
>>> Hi,
>>> I am trying out the tableau beta connector to Spark SQL. I have few
>>> basics
>>> question:
>>> Will this connector be able to fetch the schemaRDDs into tableau.
>>> Will all the schemaRDDs be exposed to tableau?
>>> Basically I am not getting what tableau will fetch at data-source? Is it
>>> existing files in HDFS? RDDs or something else.
>>> Question may be naive but I did not get answer anywhere else. Would
>>> really
>>> appreciate if someone has already tried it, can help me with this.
>>>
>>> Thanks,
>>> Ashutosh
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Tableau-beta-connector-tp21512.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Tableau beta connector

2015-02-05 Thread Denny Lee
Could you clarify what you mean by "build another Spark and work through
Spark Submit"?

If you are referring to utilizing Spark and the thrift server, you could start
the Spark service and then have your spark-shell, spark-submit, and/or
thrift service aim at the master you have started.
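
For example, something along these lines (paths are relative to the Spark
home, the host name is a placeholder, and workers are assumed to be listed in
conf/slaves):

sbin/start-master.sh
sbin/start-slaves.sh
sbin/start-thriftserver.sh --master spark://master-host:7077
bin/spark-shell --master spark://master-host:7077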

On Thu Feb 05 2015 at 2:02:04 AM Ashutosh Trivedi (MT2013030) <
ashutosh.triv...@iiitb.org> wrote:

>  Hi Denny , Ismail one last question..
>
>
>  Is it necessary to build another Spark and work through Spark-submit ?
>
>
>  I work on IntelliJ using SBT as build script, I have Hive set up with
> postgres as metastore, I can run the hive server using command
>
> *hive --service metastore*
>
> *hive --service hiveserver2*
>
>
>  After that if I can use hive-context in my code
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
>
>  Do some processing on RDD and persist it on hive using  registerTempTable
>
> and tableau can extract that RDD persisted on hive.
>
>
>  Regards,
>
> Ashutosh
>
>
>  --
> *From:* Denny Lee 
>
> *Sent:* Thursday, February 5, 2015 1:27 PM
> *To:* Ashutosh Trivedi (MT2013030); İsmail Keskin
> *Cc:* user@spark.apache.org
> *Subject:* Re: Tableau beta connector
> The context is that you would create your RDDs and then persist them in
> Hive. Once in Hive, the data is accessible from the Tableau extract through
> Spark thrift server.
> On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030) <
> ashutosh.triv...@iiitb.org> wrote:
>
>>  Thanks Denny and Ismail.
>>
>>
>>  Denny ,I went through your blog, It was great help. I guess tableau
>> beta connector also following the same procedure,you described in blog. I
>> am building the Spark now.
>>
>> Basically what I don't get is, where to put my data so that tableau can
>> extract.
>>
>>
>>  So  Ismail,its just Spark SQL. No RDDs I think I am getting it now . We
>> use spark for our big data processing and we want *processed data (Rdd)*
>> into tableau. So we should put our data in hive metastore and tableau will
>> extract it from there using this connector? Correct me if I am wrong.
>>
>>
>>  I guess I have to look at how thrift server works.
>>  --
>> *From:* Denny Lee 
>> *Sent:* Thursday, February 5, 2015 12:20 PM
>> *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030)
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Tableau beta connector
>>
>> Some quick context behind how Tableau interacts with Spark / Hive
>> can also be found at
>> https://www.concur.com/blog/en-us/connect-tableau-to-sparksql  - its for
>> how to connect from Tableau to the thrift server before the official
>> Tableau beta connector but should provide some of the additional context
>> called out.   HTH!
>>
>> On Wed Feb 04 2015 at 10:47:23 PM İsmail Keskin <
>> ismail.kes...@dilisim.com> wrote:
>>
>>> Tableau connects to Spark Thrift Server via an ODBC driver. So, none of
>>> the RDD stuff applies, you just issue SQL queries from Tableau.
>>>
>>>  The table metadata can come from Hive Metastore if you place your
>>> hive-site.xml to configuration directory of Spark.
>>>
>>> On Thu, Feb 5, 2015 at 8:11 AM, ashu  wrote:
>>>
>>>> Hi,
>>>> I am trying out the tableau beta connector to Spark SQL. I have few
>>>> basics
>>>> question:
>>>> Will this connector be able to fetch the schemaRDDs into tableau.
>>>> Will all the schemaRDDs be exposed to tableau?
>>>> Basically I am not getting what tableau will fetch at data-source? Is it
>>>> existing files in HDFS? RDDs or something else.
>>>> Question may be naive but I did not get answer anywhere else. Would
>>>> really
>>>> appreciate if someone has already tried it, can help me with this.
>>>>
>>>> Thanks,
>>>> Ashutosh
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Tableau-beta-connector-tp21512.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>


Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide

(in github) I had a couple of quick questions:

1) Do we need to instantiate the SQLContext as per
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Within Spark 1.3 the sqlContext is already available, so we probably do not
need to make this call.

2) Importing org.apache.spark.sql._ should bring in the SQL data types,
struct types, and Row
// Import Spark SQL data types and Row.
import org.apache.spark.sql._

Currently with Spark 1.3 RC1, it appears org.apache.spark.sql._ only brings
in Row.

scala> import org.apache.spark.sql._

import org.apache.spark.sql._


scala> val schema =

 |   StructType(

 | schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, true)))

<console>:25: error: not found: value StructType

 StructType(

But if I also import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


scala> val schema =

 |   StructType(

 | schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, true)))

schema: org.apache.spark.sql.types.StructType =
StructType(StructField(DeviceMake,StringType,true),
StructField(Country,StringType,true))

Wondering if this is by design or perhaps a quick documentation / package
update is warranted.


Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Oh no worries at all. If you want, I'd be glad to make updates and PR for
anything I find, eh?!
On Fri, Feb 20, 2015 at 12:18 Michael Armbrust 
wrote:

> Yeah, sorry.  The programming guide has not been updated for 1.3.  I'm
> hoping to get to that this weekend / next week.
>
> On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee  wrote:
>
>> Quickly reviewing the latest SQL Programming Guide
>> <https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md>
>> (in github) I had a couple of quick questions:
>>
>> 1) Do we need to instantiate the SparkContext as per
>> // sc is an existing SparkContext.
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>
>> Within Spark 1.3 the sqlContext is already available so probably do not
>> need to make this call.
>>
>> 2) Importing org.apache.spark.sql._ should bring in both SQL data types,
>> struct types, and row
>> // Import Spark SQL data types and Row.
>> import org.apache.spark.sql._
>>
>> Currently with Spark 1.3 RC1, it appears org.apache.spark.sql._ only
>> brings in row.
>>
>> scala> import org.apache.spark.sql._
>>
>> import org.apache.spark.sql._
>>
>>
>> scala> val schema =
>>
>>  |   StructType(
>>
>>  | schemaString.split(" ").map(fieldName =>
>> StructField(fieldName, StringType, true)))
>>
>> <console>:25: error: not found: value StructType
>>
>>  StructType(
>>
>> But if I also import in org.apache.spark.sql.types_
>>
>> scala> import org.apache.spark.sql.types._
>>
>> import org.apache.spark.sql.types._
>>
>>
>> scala> val schema =
>>
>>  |   StructType(
>>
>>  | schemaString.split(" ").map(fieldName =>
>> StructField(fieldName, StringType, true)))
>>
>> schema: org.apache.spark.sql.types.StructType =
>> StructType(StructField(DeviceMake,StringType,true),
>> StructField(Country,StringType,true))
>>
>> Wondering if this is by design or perhaps a quick documentation / package
>> update is warranted.
>>
>>
>>
>>
>>
>


Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco,

Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular)
from SSAS to Spark? As a past SSAS guy myself, I'm definitely intrigued.

The one thing that you may run into is that the SQL generated by SSAS can
be quite convoluted. When we were doing the same thing to try to get SSAS
to connect to Hive (ref paper at
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
that was definitely a blocker. Note that Spark SQL is different from HiveQL,
but you may run into the same issue. If so, the trick you may want to use
is similar to the paper - use a SQL Server linked server connection and
have SQL Server be your "translator" for the SQL generated by SSAS.

HTH!
Denny

On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab  wrote:

> Hi Francisco,
> While I haven't tried this, have a look at the contents of
> start-thriftserver.sh - all it's doing is setting up a few variables and
> calling:
>
> /bin/spark-submit --class
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
>
> and passing some additional parameters. Perhaps doing the same would work?
>
> I also believe that this hosts a jdbc server (not odbc), but there's a
> free odbc connector from databricks built by Simba, with which I've been
> able to connect to a spark cluster hosted on linux.
>
> -Ashic.
>
> --
> To: user@spark.apache.org
> From: forch...@gmail.com
> Subject: Spark SQL odbc on Windows
> Date: Sun, 22 Feb 2015 09:45:03 +0100
>
>
> Hello,
> I work on a MS consulting company and we are evaluating including SPARK on
> our BigData offer. We are particulary interested into testing SPARK as
> rolap engine for SSAS but we cannot find a way to activate the odbc server
> (thrift) on a Windows custer. There is no start-thriftserver.sh command
> available for windows.
>
> Somebody knows if there is a way to make this work?
>
> Thanks in advance!!
> Francisco
>


Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at
http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
that may be useful as well.

On Sun Feb 22 2015 at 8:42:29 AM Denny Lee  wrote:

> Hi Francisco,
>
> Out of curiosity - why ROLAP mode using multi-dimensional mode (vs
> tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my
> interest.
>
> The one thing that you may run into is that the SQL generated by SSAS can
> be quite convoluted. When we were doing the same thing to try to get SSAS
> to connect to Hive (ref paper at
> http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
> that was definitely a blocker. Note that Spark SQL is different than HIVEQL
> but you may run into the same issue. If so, the trick you may want to use
> is similar to the paper - use a SQL Server linked server connection and
> have SQL Server be your "translator" for the SQL generated by SSAS.
>
> HTH!
> Denny
>
> On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab  wrote:
>
>> Hi Francisco,
>> While I haven't tried this, have a look at the contents of
>> start-thriftserver.sh - all it's doing is setting up a few variables and
>> calling:
>>
>> /bin/spark-submit --class org.apache.spark.sql.hive.
>> thriftserver.HiveThriftServer2
>>
>> and passing some additional parameters. Perhaps doing the same would work?
>>
>> I also believe that this hosts a jdbc server (not odbc), but there's a
>> free odbc connector from databricks built by Simba, with which I've been
>> able to connect to a spark cluster hosted on linux.
>>
>> -Ashic.
>>
>> --
>> To: user@spark.apache.org
>> From: forch...@gmail.com
>> Subject: Spark SQL odbc on Windows
>> Date: Sun, 22 Feb 2015 09:45:03 +0100
>>
>>
>> Hello,
>> I work on a MS consulting company and we are evaluating including SPARK
>> on our BigData offer. We are particulary interested into testing SPARK as
>> rolap engine for SSAS but we cannot find a way to activate the odbc server
>> (thrift) on a Windows custer. There is no start-thriftserver.sh command
>> available for windows.
>>
>> Somebody knows if there is a way to make this work?
>>
>> Thanks in advance!!
>> Francisco
>>
>


Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
Makes complete sense - I became a fan of Spark for pretty much the same
reasons.  Best of luck, eh?!

On Mon Feb 23 2015 at 12:08:49 AM Francisco Orchard 
wrote:

> Hi Denny & Ashic,
>
> You are putting us on the right direction. Thanks!
>
> We will try following your advice and provide feeback to the list.
>
> Regarding your question Denny. We feel  MS is lacking on an scalable
> solution for SSAS (tabular or multidim) so when it comes to big data, the
> only answer they have is their expensive appliance (APS) which can be used
> as a rolap engine. We are interesting into testing how Spark escalate to
> check if it can be offered as an less expensive alternative when a single
> machine is not enough to our client needs. The reason why we do not go with
> tabular in the first place is because its rolap mode (direct query) is
> still too limited. And thanks for writing the klout paper!! We were already
> using it as a guideline for our tests.
>
> Best regards,
> Francisco
> --
> From: Denny Lee 
> Sent: ‎22/‎02/‎2015 17:56
> To: Ashic Mahtab ; Francisco Orchard ;
> Apache Spark 
> Subject: Re: Spark SQL odbc on Windows
>
> Back to thrift, there was an earlier thread on this topic at
> http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
> that may be useful as well.
>
> On Sun Feb 22 2015 at 8:42:29 AM Denny Lee  wrote:
>
>> Hi Francisco,
>>
>> Out of curiosity - why ROLAP mode using multi-dimensional mode (vs
>> tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my
>> interest.
>>
>> The one thing that you may run into is that the SQL generated by SSAS can
>> be quite convoluted. When we were doing the same thing to try to get SSAS
>> to connect to Hive (ref paper at
>> http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
>> that was definitely a blocker. Note that Spark SQL is different than HIVEQL
>> but you may run into the same issue. If so, the trick you may want to use
>> is similar to the paper - use a SQL Server linked server connection and
>> have SQL Server be your "translator" for the SQL generated by SSAS.
>>
>> HTH!
>> Denny
>>
>> On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab  wrote:
>>
>>> Hi Francisco,
>>> While I haven't tried this, have a look at the contents of
>>> start-thriftserver.sh - all it's doing is setting up a few variables and
>>> calling:
>>>
>>> /bin/spark-submit --class org.apache.spark.sql.hive.
>>> thriftserver.HiveThriftServer2
>>>
>>> and passing some additional parameters. Perhaps doing the same would
>>> work?
>>>
>>> I also believe that this hosts a jdbc server (not odbc), but there's a
>>> free odbc connector from databricks built by Simba, with which I've been
>>> able to connect to a spark cluster hosted on linux.
>>>
>>> -Ashic.
>>>
>>> --
>>> To: user@spark.apache.org
>>> From: forch...@gmail.com
>>> Subject: Spark SQL odbc on Windows
>>> Date: Sun, 22 Feb 2015 09:45:03 +0100
>>>
>>>
>>> Hello,
>>> I work on a MS consulting company and we are evaluating including SPARK
>>> on our BigData offer. We are particulary interested into testing SPARK as
>>> rolap engine for SSAS but we cannot find a way to activate the odbc server
>>> (thrift) on a Windows custer. There is no start-thriftserver.sh command
>>> available for windows.
>>>
>>> Somebody knows if there is a way to make this work?
>>>
>>> Thanks in advance!!
>>> Francisco
>>>
>>


Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel,

My team is currently working with a lot of SQL Server databases as one of
our many data sources and ultimately we pull the data into HDFS from SQL
Server.  As we had a lot of SQL databases to hit, we used the jTDS driver
and SQOOP to extract the data out of SQL Server and into HDFS (small hit
against the SQL databases to extract the data out).  The reasons we had
done this were to 1) minimize the impact on our SQL Servers since these
were transactional databases and we didn't want our analytics queries to
interfere with the transactions, and 2) having the data within HDFS allowed
us to centralize our relational source data within one location so we could
join / mash it with other sources of data more easily.  Now that the data
is there, we just run our Spark queries against it and everything hums along
nicely.
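
As a rough illustration of that kind of jTDS + SQOOP pull (a sketch only - the
host, database, table, credentials, and target path are all placeholders):

sqoop import \
  --driver net.sourceforge.jtds.jdbc.Driver \
  --connect "jdbc:jtds:sqlserver://sqlhost:1433/SalesDb" \
  --username etl_user -P \
  --table Transactions \
  --target-dir /data/sqlserver/transactions \
  --num-mappers 4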

Saying this - I have not yet had a chance to try the Spark 1.3 JDBC data
sources.

Cheng, to confirm, the reference for JDBC is
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/java/org/apache/spark/sql/jdbc/package-tree.html
? In the past I have not been able to get SQL queries to run against SQL Server
without the use of the jTDS or Microsoft SQL Server JDBC driver, for various
reasons (e.g. authentication, T-SQL vs. ANSI-SQL differences, etc.). If I
needed to utilize an additional driver like jTDS, can I "plug it in" with
the JDBC source and/or potentially build something that will work with the
Data Sources API?
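
For context, the kind of call in question would look roughly like the
following (Spark 1.3 syntax; the jTDS URL, table, and credentials are
placeholders, the jTDS jar would still need to be on the classpath, and if the
"driver" option isn't honored a Class.forName on the driver class is the usual
fallback):

val df = sqlContext.load("jdbc", Map(
  "driver"  -> "net.sourceforge.jtds.jdbc.Driver",
  "url"     -> "jdbc:jtds:sqlserver://sqlhost:1433/SalesDb;user=etl_user;password=***",
  "dbtable" -> "dbo.Transactions"))
df.registerTempTable("transactions")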

Thanks!
Denny




On Tue Feb 24 2015 at 3:20:57 AM Cheng Lian  wrote:

>  There is a newly introduced JDBC data source in Spark 1.3.0 (not the
> JdbcRDD in Spark core), which may be useful. However, currently there's no
> SQL server specific logics implemented. I'd assume standard SQL queries
> should work.
>
>
> Cheng
>
>
> On 2/24/15 7:02 PM, Suhel M wrote:
>
>  Hey,
>
>  I am trying to work out what is the best way we can leverage Spark for
> crunching data that is sitting in SQL Server databases.
> Ideal scenario is being able to efficiently work with big data (10billion+
> rows of activity data).  We need to shape this data for machine learning
> problems and want to do ad-hoc & complex queries and get results in timely
> manner.
>
>  All our data crunching is done via SQL/MDX queries, but these obviously
> take a very long time to run over large data size. Also we currently don't
> have hadoop or any other distributed storage.
>
>  Keen to hear feedback/thoughts/war stories from the Spark community on
> best way to approach this situation.
>
>  Thanks
> Suhel
>
>
>


Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 -
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ?

On Tue, Feb 24, 2015 at 16:40 Xi Shen  wrote:

> Hi Sean,
>
> I launched the spark-shell on the same machine as I started YARN service.
> I don't think port will be an issue.
>
> I am new to spark. I checked the HDFS web UI and the YARN web UI. But I
> don't know how to check the AM. Can you help?
>
>
> Thanks,
> David
>
>
> On Tue, Feb 24, 2015 at 8:37 PM Sean Owen  wrote:
>
>> I don't think the build is at issue. The error suggests your App Master
>> can't be contacted. Is there a network port issue? did the AM fail?
>>
>> On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen  wrote:
>>
>>> Hi Arush,
>>>
>>> I got the pre-build from https://spark.apache.org/downloads.html. When
>>> I start spark-shell, it prompts:
>>>
>>> Spark assembly has been built with Hive, including Datanucleus jars
>>> on classpath
>>>
>>> So we don't have pre-build with YARN support? If so, how the
>>> spark-submit work? I checked the YARN log, and job is really submitted and
>>> ran successfully.
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>
>>>
>>>
>>> On Tue Feb 24 2015 at 6:35:38 PM Arush Kharbanda <
>>> ar...@sigmoidanalytics.com> wrote:
>>>
 Hi

 Are you sure that you built Spark for Yarn.If standalone works, not
 sure if its build for Yarn.

 Thanks
 Arush
 On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen 
 wrote:

> Hi,
>
> I followed this guide,
> http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
> start spark-shell with yarn-client
>
> ./bin/spark-shell --master yarn-client
>
>
> But I got
>
> WARN ReliableDeliverySupervisor: Association with remote system 
> [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated 
> for [5000] ms. Reason is: [Disassociated].
>
> In the spark-shell, and other exceptions in they yarn log. Please see
> http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
> for more detail.
>
>
> However, submitting to the this cluster works. Also, spark-shell as
> standalone works.
>
>
> My system:
>
> - ubuntu amd64
> - spark 1.2.1
> - yarn from hadoop 2.6 stable
>
>
> Thanks,
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> 
>   
>
>
 --

 [image: Sigmoid Analytics]
 

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com

>>>
>>


Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is:

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory or
unable to create one)

Could you verify that you (the user you are running under) have the rights
to create the necessary folders within HDFS?
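
For example, something along these lines for an HDFS-backed warehouse (the
path comes from the error above; if hive.metastore.warehouse.dir resolves to
the local filesystem instead - as the file: prefix suggests it might - the
equivalent local mkdir/chmod is the check):

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse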


On Tue, Feb 24, 2015 at 9:06 PM kundan kumar  wrote:

> Hi ,
>
> I have placed my hive-site.xml inside spark/conf and i am trying to
> execute some hive queries given in the documentation.
>
> Can you please suggest what wrong am I doing here.
>
>
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@3340a4b8
>
> scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> 15/02/25 10:30:59 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
> EXISTS src (key INT, value STRING)
> 15/02/25 10:30:59 INFO ParseDriver: Parse Completed
> 15/02/25 10:30:59 INFO HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/02/25 10:30:59 INFO ObjectStore: ObjectStore, initialize called
> 15/02/25 10:30:59 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/02/25 10:30:59 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/02/25 10:31:08 INFO ObjectStore: Setting MetaStore object pin classes
> with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 15/02/25 10:31:08 INFO MetaStoreDirectSql: MySQL check failed, assuming we
> are not on mysql: Lexical error at line 1, column 5.  Encountered: "@"
> (64), after : "".
> 15/02/25 10:31:09 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/02/25 10:31:09 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/02/25 10:31:15 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/02/25 10:31:15 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/02/25 10:31:17 INFO ObjectStore: Initialized ObjectStore
> 15/02/25 10:31:17 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 0.13.1aa
> 15/02/25 10:31:18 INFO HiveMetaStore: Added admin role in metastore
> 15/02/25 10:31:18 INFO HiveMetaStore: Added public role in metastore
> 15/02/25 10:31:18 INFO HiveMetaStore: No user is added in admin role,
> since config is empty
> 15/02/25 10:31:18 INFO SessionState: No Tez session required at this
> point. hive.execution.engine=mr.
> 15/02/25 10:31:18 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:18 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:18 INFO Driver: Concurrency mode is disabled, not creating
> a lock manager
> 15/02/25 10:31:18 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:18 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:18 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
> EXISTS src (key INT, value STRING)
> 15/02/25 10:31:18 INFO ParseDriver: Parse Completed
> 15/02/25 10:31:18 INFO PerfLogger:  start=1424840478985 end=1424840478986 duration=1
> from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:18 INFO PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:19 INFO SemanticAnalyzer: Starting Semantic Analysis
> 15/02/25 10:31:19 INFO SemanticAnalyzer: Creating table src position=27
> 15/02/25 10:31:19 INFO HiveMetaStore: 0: get_table : db=default tbl=src
> 15/02/25 10:31:19 INFO audit: ugi=spuser ip=unknown-ip-addr cmd=get_table
> : db=default tbl=src
> 15/02/25 10:31:19 INFO HiveMetaStore: 0: get_database: default
> 15/02/25 10:31:19 INFO audit: ugi=spuser ip=unknown-ip-addr cmd=get_database:
> default
> 15/02/25 10:31:19 INFO Driver: Semantic Analysis Completed
> 15/02/25 10:31:19 INFO PerfLogger:  start=1424840478986 end=1424840479063 duration=77
> from=org.apache.hadoop.hive.ql.Driver>
> 15/02/25 10:31:19 INFO Driver: Returning Hive schema:
> Schema(fieldSchemas:null, properties:null)
> 15/02/25 10:31:19 INFO PerfLogger:  start=1424840478970 end=1424840479069 duration=99
> from

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
That's all you should need to do. Saying this, I did run into a similar issue
when I was switching between Spark versions that were tied to different
default Hive versions (e.g. Spark 1.3 by default works with Hive 0.13.1). I'm
wondering if you may be hitting this issue because of that?
On Tue, Feb 24, 2015 at 22:40 kundan kumar  wrote:

> Hi Denny,
>
> yes the user has all the rights to HDFS. I am running all the spark
> operations with this user.
>
> and my hive-site.xml looks like this
>
>  
> hive.metastore.warehouse.dir
> /user/hive/warehouse
> location of default database for the
> warehouse
>   
>
> Do I need to do anything explicitly other than placing hive-site.xml in
> the spark.conf directory ?
>
> Thanks !!
>
>
>
> On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee  wrote:
>
>> The error message you have is:
>>
>> FAILED: Execution Error, return code 1 from 
>> org.apache.hadoop.hive.ql.exec.DDLTask.
>> MetaException(message:file:/user/hive/warehouse/src is not a directory
>> or unable to create one)
>>
>> Could you verify that you (the user you are running under) has the rights
>> to create the necessary folders within HDFS?
>>
>>
>> On Tue, Feb 24, 2015 at 9:06 PM kundan kumar 
>> wrote:
>>
>>> Hi ,
>>>
>>> I have placed my hive-site.xml inside spark/conf and i am trying to
>>> execute some hive queries given in the documentation.
>>>
>>> Can you please suggest what wrong am I doing here.
>>>
>>>
>>>
>>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> hiveContext: org.apache.spark.sql.hive.HiveContext =
>>> org.apache.spark.sql.hive.HiveContext@3340a4b8
>>>
>>> scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value
>>> STRING)")
>>> warning: there were 1 deprecation warning(s); re-run with -deprecation
>>> for details
>>> 15/02/25 10:30:59 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
>>> EXISTS src (key INT, value STRING)
>>> 15/02/25 10:30:59 INFO ParseDriver: Parse Completed
>>> 15/02/25 10:30:59 INFO HiveMetaStore: 0: Opening raw store with
>>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 15/02/25 10:30:59 INFO ObjectStore: ObjectStore, initialize called
>>> 15/02/25 10:30:59 INFO Persistence: Property datanucleus.cache.level2
>>> unknown - will be ignored
>>> 15/02/25 10:30:59 INFO Persistence: Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
>>> CLASSPATH (or one of dependencies)
>>> 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
>>> CLASSPATH (or one of dependencies)
>>> 15/02/25 10:31:08 INFO ObjectStore: Setting MetaStore object pin classes
>>> with
>>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>>> 15/02/25 10:31:08 INFO MetaStoreDirectSql: MySQL check failed, assuming
>>> we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@"
>>> (64), after : "".
>>> 15/02/25 10:31:09 INFO Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/02/25 10:31:09 INFO Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/02/25 10:31:15 INFO Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/02/25 10:31:15 INFO Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/02/25 10:31:17 INFO ObjectStore: Initialized ObjectStore
>>> 15/02/25 10:31:17 WARN ObjectStore: Version information not found in
>>> metastore. hive.metastore.schema.verification is not enabled so recording
>>> the schema version 0.13.1aa
>>> 15/02/25 10:31:18 INFO HiveMetaStore: Added admin role in metastore
>>> 15/02/25 10:31:18 INFO HiveMetaStore: Added public role in metastore
>>> 15/02/25 10:31:18 INFO HiveMetaStore: No user is added in admin role,
>>> si

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos
On Wed, Mar 4, 2015 at 19:11 lisendong  wrote:

> I'm sorry, but how do I look at the mesos logs?
> Where are they?
>
>
>
> On Mar 4, 2015, at 6:06 PM, Akhil Das  wrote:
>
>
> You can check in the mesos logs and see whats really happening.
>
> Thanks
> Best Regards
>
> On Wed, Mar 4, 2015 at 3:10 PM, lisendong  wrote:
>
>> 15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not
>> heard
>> from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket
>> connection and attempting reconnect
>> 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED
>> 15/03/04 09:26:36 INFO ZooKeeperLeaderElectionAgent: We have lost
>> leadership
>> 15/03/04 09:26:36 ERROR Master: Leadership has been revoked -- master
>> shutting down.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-master-shut-down-suddenly-tp21907.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:22)
...

The code for Spark from the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to
val initialCount = this.count()

And line 447 refers to
var samples = this.sample(withReplacement, fraction,
rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job
which is to generate the samples.
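
To see this yourself, here's a quick sketch you can paste into spark-shell,
using the same README.md example from your session:

  // takeSample first runs a count() job, then a sample()/collect() job,
  // so two jobs show up in the UI for a single call
  val lines = sc.textFile("README.md")
  lines.takeSample(false, 3).foreach(println)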

HTH!
Denny




On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica  wrote:

> Hello,
>
> I am using takeSample from the Scala Spark 1.2.1 shell:
>
> scala> sc.textFile("README.md").takeSample(false, 3)
>
>
> and I notice that two jobs are generated on the Spark Jobs page:
>
> Job Id Description
> 1 takeSample at :13
> 0  takeSample at :13
>
>
> Any ideas why the two jobs are needed?
>
> Thanks!
> Rares
>


Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
Out of curiosity, how are you running your Spark instance - via YARN or
standalone mode?  When connecting the Spark Thrift server to the Spark
service, have you allocated enough memory and CPU when executing with Spark?

On Sun, Mar 22, 2015 at 3:39 AM fanooos  wrote:

> We have cloudera CDH 5.3 installed on one machine.
>
> We are trying to use spark sql thrift server to execute some analysis
> queries against hive table.
>
> Without any changes in the configurations, we run the following query on
> both hive and spark sql thrift server
>
> *select * from tableName;*
>
> The time taken by Spark is larger than the time taken by Hive, which is not
> supposed to be like that.
>
> The hive table is mapped to json files stored on HDFS directory and we are
> using *org.openx.data.jsonserde.JsonSerDe* for
> serialization/deserialization.
>
> Why spark takes much more time to execute the query than hive ?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-sql-thrift-server-slower-than-
> hive-tp22177.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do
this: https://github.com/sigmoidanalytics/spork


On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin  wrote:

>  Hi, all
>
>
>
> Can spark use pig’s load function to load data?
>
>
>
> Best Regards,
>
> Kevin.
>


Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1  - I currently am doing what Marcelo is suggesting as I have a CDH 5.2
cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in
my cluster.

On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin  wrote:

> Since you're using YARN, you should be able to download a Spark 1.3.0
> tarball from Spark's website and use spark-submit from that
> installation to launch your app against the YARN cluster.
>
> So effectively you would have 1.2.0 and 1.3.0 side-by-side in your cluster.
>
> On Wed, Mar 18, 2015 at 11:09 AM, jaykatukuri  wrote:
> > Hi all,
> > I am trying to run my job which needs spark-sql_2.11-1.3.0.jar.
> > The cluster that I am running on is still on spark-1.2.0.
> >
> > I tried the following :
> >
> > spark-submit --class class-name --num-executors 100 --master yarn
> > application_jar--jars hdfs:///path/spark-sql_2.11-1.3.0.jar
> > hdfs:///input_data
> >
> > But, this did not work, I get an error that it is not able to find a
> > class/method that is in spark-sql_2.11-1.3.0.jar .
> >
> > org.apache.spark.sql.SQLContext.implicits()Lorg/
> apache/spark/sql/SQLContext$implicits$
> >
> > The question in general is how do we use a different version of spark
> jars
> > (spark-core, spark-sql, spark-ml etc) than the one's running on a
> cluster ?
> >
> > Thanks,
> > Jay
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Using-a-different-spark-jars-than-the-
> one-on-the-cluster-tp22125.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting
Hive, it is in effect hitting HDFS as well.  Hive provides a great
framework where the table structure is already well defined.  But
underneath it, Hive is just accessing files from HDFS, so you are hitting
HDFS either way.  HTH!
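
To make this concrete, a rough sketch of both access paths (the table name
and HDFS path below are just placeholders):

  // 1) through Hive - the schema is already defined in the metastore
  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  val viaHive = hc.sql("SELECT * FROM my_hive_table LIMIT 10")

  // 2) straight off HDFS - you bring your own structure
  val viaHdfs = sc.textFile("hdfs:///user/hive/warehouse/my_hive_table/")

  // either way the bytes are read from HDFS; Hive just supplies the schema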

On Tue, Mar 17, 2015 at 3:41 AM 李铖  wrote:

> Hi,everybody.
>
> I am new in spark. Now I want to do interactive sql query using spark sql.
> spark sql can run under hive or loading files from hdfs.
>
> Which is better or faster?
>
> Thanks.
>


Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar:
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
?



On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan <
harut.martiros...@gmail.com> wrote:

> What is performance overhead caused by YARN, or what configurations are
> being changed when the app is ran through YARN?
>
> The following example:
>
> sqlContext.sql("SELECT dayStamp(date),
> count(distinct deviceId) AS c
> FROM full
> GROUP BY dayStamp(date)
> ORDER BY c
> DESC LIMIT 10")
> .collect()
>
> runs on shell when we use standalone scheduler:
> ./spark-shell --master sparkmaster:7077 --executor-memory 20g
> --executor-cores 10  --driver-memory 10g --num-executors 8
>
> and fails due to losing an executor, when we run it through YARN.
> ./spark-shell --master yarn-client --executor-memory 20g --executor-cores
> 10  --driver-memory 10g --num-executors 8
>
> There are no evident logs, just messages that executors are being lost,
> and connection refused errors, (apparently due to executor failures)
> The cluster is the same, 8 nodes, 64Gb RAM each.
> Format is parquet.
>
> --
> RGRDZ Harut
>


Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5.x together with the
profile -Phadoop-2.4.

Please note earlier in the link the section:

# Apache Hadoop 2.4.X or 2.5.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4
profile (they were released after this version of Spark).


HTH!

On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel 
wrote:

>
> http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn
> does not list hadoop 2.5 in Hadoop version table table etc.
>
> I assume it is still OK to compile with  -Pyarn -Phadoop-2.5 for use with
> Hadoop 2.5 (cdh 5.3.2)
>
> Thanks,
>


Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way
spark-shell / hive can connect to the metastore?

For example, when I run my spark-shell instance in standalone mode, I use:
./spark-shell --master spark://servername:7077 --driver-class-path
/lib/mysql-connector-java-5.1.27.jar



On Fri, Mar 13, 2015 at 8:31 AM sandeep vura  wrote:

> Hi Sparkers,
>
> Can anyone please check the below error and give solution for this.I am
> using hive version 0.13 and spark 1.2.1 .
>
> Step 1 : I have installed hive 0.13 with local metastore (mySQL database)
> Step 2:  Hive is running without any errors and able to create tables and
> loading data in hive table
> Step 3: copied hive-site.xml in spark/conf directory
> Step 4: copied core-site.xml in spakr/conf directory
> Step 5: started spark shell
>
> Please check the below error for clarifications.
>
> scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.Hi
>  veContext@2821ec0c
>
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate or
>
>  g.apache.hadoop.hive.metastore.HiveMetaStoreClient
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav
>
>  a:346)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc
>
>  ala:235)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc
>
>  ala:231)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal
>
>  a:231)
> at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
> at
> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext
>
>  .scala:229)
> at
> org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
> at
> org.apache.spark.sql.hive.HiveMetastoreCatalog.(HiveMetastoreCa
>
>  talog.scala:55)
>
> Regards,
> Sandeep.v
>
>


Re: Errors in SPARK

2015-03-24 Thread Denny Lee
The error you're seeing typically means that you cannot connect to the Hive
metastore itself.  Some quick thoughts:
- If you were to run "show tables" (instead of the CREATE TABLE statement),
are you still getting the same error?

- To confirm, the Hive metastore (MySQL database) is up and running

- Did you download or build your version of Spark?




On Tue, Mar 24, 2015 at 10:48 PM sandeep vura  wrote:

> Hi Denny,
>
> Still facing the same issue.Please find the following errors.
>
> *scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)*
> *sqlContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@4e4f880c*
>
> *scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")*
> *java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient*
>
> Cheers,
> Sandeep.v
>
> On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura 
> wrote:
>
>> No I am just running ./spark-shell command in terminal I will try with
>> above command
>>
>> On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee 
>> wrote:
>>
>>> Did you include the connection to a MySQL connector jar so that way
>>> spark-shell / hive can connect to the metastore?
>>>
>>> For example, when I run my spark-shell instance in standalone mode, I
>>> use:
>>> ./spark-shell --master spark://servername:7077 --driver-class-path /lib/
>>> mysql-connector-java-5.1.27.jar
>>>
>>>
>>>
>>> On Fri, Mar 13, 2015 at 8:31 AM sandeep vura 
>>> wrote:
>>>
>>>> Hi Sparkers,
>>>>
>>>> Can anyone please check the below error and give solution for this.I am
>>>> using hive version 0.13 and spark 1.2.1 .
>>>>
>>>> Step 1 : I have installed hive 0.13 with local metastore (mySQL
>>>> database)
>>>> Step 2:  Hive is running without any errors and able to create tables
>>>> and loading data in hive table
>>>> Step 3: copied hive-site.xml in spark/conf directory
>>>> Step 4: copied core-site.xml in spakr/conf directory
>>>> Step 5: started spark shell
>>>>
>>>> Please check the below error for clarifications.
>>>>
>>>> scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>> sqlContext: org.apache.spark.sql.hive.HiveContext =
>>>> org.apache.spark.sql.hive.Hi
>>>>  veContext@2821ec0c
>>>>
>>>> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
>>>> STRING)")
>>>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>>>> instantiate or
>>>>g.apache.hadoop.hive.
>>>> metastore.HiveMetaStoreClient
>>>> at 
>>>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav
>>>>
>>>>a:346)
>>>> at 
>>>> org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc
>>>>
>>>>ala:235)
>>>> at 
>>>> org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc
>>>>
>>>>ala:231)
>>>> at scala.Option.orElse(Option.scala:257)
>>>> at 
>>>> org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal
>>>>
>>>>a:231)
>>>> at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.
>>>> scala:229)
>>>> at 
>>>> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext
>>>>
>>>>.scala:229)
>>>> at org.apache.spark.sql.hive.HiveContext.hiveconf(
>>>> HiveContext.scala:229)
>>>> at 
>>>> org.apache.spark.sql.hive.HiveMetastoreCatalog.(HiveMetastoreCa
>>>>
>>>>talog.scala:55)
>>>>
>>>> Regards,
>>>> Sandeep.v
>>>>
>>>>
>>
>


Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your
Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html).
Please reference the Spark Properties section noting that you can modify
these properties via the spark-defaults.conf or via SparkConf().
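
For example, a rough sketch of setting it programmatically in an application
(the 2g value is just an illustration - size it for your driver):

  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.driver.maxResultSize", "2g")  // or add the same line to spark-defaults.conf
  val sc = new SparkContext(conf)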

HTH!



On Wed, Mar 25, 2015 at 8:01 AM Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Hi
>
>
>
> I ran a spark job and got the following error. Can anybody tell me how to
> work around this problem? For example how can I increase
> spark.driver.maxResultSize? Thanks.
>
>  org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results
>
> of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0
> MB)
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobA
>
> ndIndependentStages(DAGScheduler.scala:1214)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
>
> 03)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
>
> 02)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
>
> .scala:696)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
>
> .scala:696)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(D
>
> AGScheduler.scala:1420)
>
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala
>
> :1375)
>
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
>at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala
>
> :393)
>
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID
> 6324, INT1-CAS01.pcc.lexi
>
> snexis.com): TaskKilled (killed intentionally)
>
>
>
> Ningjun
>
>
>


Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame
perspective:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E
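
As a rough workaround in the meantime, if you have a HiveContext you can lean
on Hive's built-in aggregate UDAFs; a small self-contained sketch with made-up
sample data:

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  import hc.implicits._
  // toy data - replace with your own DataFrame registered as a temp table
  val df = Seq(1.0, 2.0, 3.0, 4.0).map(Tuple1(_)).toDF("value")
  df.registerTempTable("my_table")
  hc.sql("""
    SELECT count(value), avg(value), var_pop(value), stddev_pop(value)
    FROM my_table
  """).show()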


On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang  wrote:

>  Hi,
>
>
>
> I have a DataFrame object and I want to do types of aggregations like
> count, sum, variance, stddev, etc.
>
>
>
> DataFrame has DSL to do simple aggregations like count and sum.
>
>
>
> How about variance and stddev?
>
>
>
> Thank you for any suggestions!
>
>
>


Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data
using hyperloglog in combination with Spark is atscale (http://atscale.com/).
It builds the aggregations and makes use of the speed of SparkSQL - all
within the context of a model that is accessible by Tableau or Qlik.

On Thu, Mar 26, 2015 at 8:55 AM Jörn Franke  wrote:

> As I wrote previously - indexing is not your only choice, you can
> preaggregate data during load or depending on your needs you  need to think
> about other data structures, such as graphs, hyperloglog, bloom filters
> etc. (challenge to integrate in standard bi tools)
> On 26 March 2015 at 13:34, "kundan kumar"  wrote:
>
> I was looking for some options and came across JethroData.
>>
>> http://www.jethrodata.com/
>>
>> This stores the data maintaining indexes over all the columns seems good
>> and claims to have better performance than Impala.
>>
>> Earlier I had tried Apache Phoenix because of its secondary indexing
>> feature. But the major challenge I faced there was, secondary indexing was
>> not supported for bulk loading process.
>> Only the sequential loading process supported the secondary indexes,
>> which took longer time.
>>
>>
>> Any comments on this ?
>>
>>
>>
>>
>> On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar 
>> wrote:
>>
>>> I looking for some options and came across
>>>
>>> http://www.jethrodata.com/
>>>
>>> On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke 
>>> wrote:
>>>
 You can also preaggregate results for the queries by the user -
 depending on what queries they use this might be necessary for any
 underlying technology
 On 26 March 2015 at 11:27, "kundan kumar"  wrote:

 Hi,
>
> I need to store terabytes of data which will be used for BI tools like
> qlikview.
>
> The queries can be on the basis of filter on any column.
>
> Currently, we are using redshift for this purpose.
>
> I am trying to explore things other than the redshift .
>
> Is it possible to gain better performance in spark as compared to
> redshift ?
>
> If yes, please suggest what is the best way to achieve this.
>
>
> Thanks!!
> Kundan
>

>>>
>>


Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what
are you using?

The error you are seeing is common when the correct driver that allows Spark
to connect to the Hive metastore isn't on the classpath.

As well, I noticed that you're using SPARK_CLASSPATH which has been
deprecated.  Depending on your scenario, you may want to use --jars,
--driver-class-path, or extraClassPath.  A good thread on this topic can be
found at
http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E
.

For example, when I connect to my own Hive metastore via Spark 1.3, I
reference the --driver-class-path where in my case I am using MySQL as my
Hive metastore:

./bin/spark-sql --master spark://$standalone$:7077 --driver-class-path
mysql-connector-$version$.jar

HTH!


On Thu, Mar 26, 2015 at 8:09 PM ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I do not use MySQL, i want to read Hive tables from Spark SQL and
> transform them in Spark SQL. Why do i need a MySQL driver ? If i still need
> it which version should i use.
>
> Assuming i need it, i downloaded the latest version of it from
> http://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.34 and
> ran the following commands, i do not see above exception , however i see a
> new one.
>
>
>
>
>
> export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
> export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
> export
> SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:
> */home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*
> export HADOOP_CONF_DIR=/apache/hadoop/conf
> cd $SPARK_HOME
> ./bin/spark-sql
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> ...
> ...
>
> spark-sql>
>
> spark-sql>
>
> spark-sql>
>
>
> show tables;
>
> 15/03/26 20:03:57 INFO metastore.HiveMetaStore: 0: get_tables: db=default
> pat=.*
>
> 15/03/26 20:03:57 INFO HiveMetaStore.audit: ugi=dvasthi...@corp.ebay.com
> ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
>
> 15/03/26 20:03:58 INFO spark.SparkContext: Starting job: collect at
> SparkPlan.scala:83
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Got job 1 (collect at
> SparkPlan.scala:83) with 1 output partitions (allowLocal=false)
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Final stage: Stage
> 1(collect at SparkPlan.scala:83)
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Missing parents: List()
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Submitting Stage 1
> (MapPartitionsRDD[3] at map at SparkPlan.scala:83), which has no missing
> parents
>
> 15/03/26 20:03:58 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
>
> 15/03/26 20:03:58 INFO scheduler.StatsReportListener: Finished stage:
> org.apache.spark.scheduler.StageInfo@2bfd9c4d
>
> 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Job 1 failed: collect at
> SparkPlan.scala:83, took 0.005163 s
>
> 15/03/26 20:03:58 ERROR thriftserver.SparkSQLDriver: Failed in [show
> tables]
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> serialization failed: java.lang.reflect.InvocationTargetException
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
>
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
>
> org.apache.spark.broadcast.TorrentBroadcast.org
> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
>
>
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:79)
>
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>
>
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
>
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:839)
>
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$sc

Re: Hive Table not from from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that the Hive metastore
you connect to via Hive is a MySQL database?  And also to confirm: when
you're running spark-shell and issuing a "show tables" statement, are you
getting the same error?


On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I tried the following
>
> 1)
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:
> *$SPARK_HOME/conf/hive-site.xml*  --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
> --num-executors 1 --driver-memory 4g --driver-java-options
> "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue
> hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
> spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
>
> This throws dw_bid not found. Looks like Spark SQL is unable to read my
> existing Hive metastore and creates its own and hence complains that table
> is not found.
>
>
> 2)
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>   --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:
> *$SPARK_HOME/conf/hive-site.xml* --num-executors 1 --driver-memory 4g
> --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
> --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
> This time i do not get above error, however i get MySQL driver not found
> exception. Looks like this is even before its able to communicate to Hive.
>
> Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
> the "BONECP" plugin to create a ConnectionPool gave an error : The
> specified datastore driver ("com.mysql.jdbc.Driver") was not found in the
> CLASSPATH. Please check your CLASSPATH specification, and the name of the
> driver.
>
> In both above cases, i do have hive-site.xml in Spark/conf folder.
>
> 3)
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>   --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar--num-executors
> 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G"
> --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
> I do not specify hive-site.xml in --jars or --driver-class-path. Its
> present in spark/conf folder as per
> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
> .
>
> In this case i get same error as #1. dw_bid table not found.
>
> I want Spark SQL to know that there are tables in Hive and read that data.
> As per guide it looks like Spark SQL has that support.
>
> Please suggest.
>
> Regards,
> Deepak
>
>
> On Thu, Mar 26, 2015 at 9:01 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
>
>> Stack Trace:
>>
>> 15/03/26 08:25:42 INFO ql.Driver: OK
>> 15/03/26 08:25:42 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/03/26 08:25:42 INFO log.PerfLogger: > start=1427383542966 end=1427383542966 duration=

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent,

This may be a case that you're missing a semi-colon after your CREATE
TEMPORARY TABLE statement.  I ran your original statement (missing the
semi-colon) and got the same error as you did.  As soon as I added it in, I
was good to go again:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "/samples/people.json"
);
-- above needed a semi-colon so the temporary table could be created first
SELECT * FROM jsonTable;

HTH!
Denny


On Sun, Mar 29, 2015 at 6:59 AM Vincent He 
wrote:

> No luck, it does not work, anyone know whether there some special setting
> for spark-sql cli so we do not need to write code to use spark sql? Anyone
> have some simple example on this? appreciate any help. thanks in advance.
>
> On Sat, Mar 28, 2015 at 9:05 AM, Ted Yu  wrote:
>
>> See
>> https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
>>
>> I haven't tried the SQL statements in above blog myself.
>>
>> Cheers
>>
>> On Sat, Mar 28, 2015 at 5:39 AM, Vincent He > > wrote:
>>
>>> thanks for your information . I have read it, I can run sample with
>>> scala or python, but for spark-sql shell, I can not get an exmaple running
>>> successfully, can you give me an example I can run with "./bin/spark-sql"
>>> without writing any code? thanks
>>>
>>> On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu  wrote:
>>>
 Please take a look at
 https://spark.apache.org/docs/latest/sql-programming-guide.html

 Cheers



 > On Mar 28, 2015, at 5:08 AM, Vincent He 
 wrote:
 >
 >
 > I am learning spark sql and try spark-sql example,  I running
 following code, but I got exception "ERROR CliDriver:
 org.apache.spark.sql.AnalysisException: cannot recognize input near
 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17", I have two
 questions,
 > 1. Do we have a list of the statement supported in spark-sql ?
 > 2. Does spark-sql shell support hiveql ? If yes, how to set?
 >
 > The example I tried:
 > CREATE TEMPORARY TABLE jsonTable
 > USING org.apache.spark.sql.json
 > OPTIONS (
 >   path "examples/src/main/resources/people.json"
 > )
 > SELECT * FROM jsonTable
 > The exception I got,
 > > CREATE TEMPORARY TABLE jsonTable
 >  > USING org.apache.spark.sql.json
 >  > OPTIONS (
 >  >   path "examples/src/main/resources/people.json"
 >  > )
 >  > SELECT * FROM jsonTable
 >  > ;
 > 15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY
 TABLE jsonTable
 > USING org.apache.spark.sql.json
 > OPTIONS (
 >   path "examples/src/main/resources/people.json"
 > )
 > SELECT * FROM jsonTable
 > NoViableAltException(241@[654:1: ddlStatement : (
 createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement |
 createTableStatement | dropTableStatement | truncateTableStatement |
 alterStatement | descStatement | showStatement | metastoreCheck |
 createViewStatement | dropViewStatement | createFunctionStatement |
 createMacroStatement | createIndexStatement | dropIndexStatement |
 dropFunctionStatement | dropMacroStatement | analyzeStatement |
 lockStatement | unlockStatement | lockDatabase | unlockDatabase |
 createRoleStatement | dropRoleStatement | grantPrivileges |
 revokePrivileges | showGrants | showRoleGrants | showRolePrincipals |
 showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
 > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
 > at org.antlr.runtime.DFA.predict(DFA.java:144)
 > at
 org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
 > at
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
 > at
 org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
 > at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
 > at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
 > at
 org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
 > at
 org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
 > at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
 > at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
 > at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 > at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 >at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 >at
 scala.util.parsi

Creating Partitioned Parquet Tables via SparkSQL

2015-03-31 Thread Denny Lee
Creating Parquet tables via .saveAsTable is great, but I was wondering if
there is an equivalent way to create partitioned Parquet tables.

Thanks!


Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :)
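
In the meantime, one rough workaround is to fall back to HiveQL for the table
definition and load - a sketch assuming a HiveContext, with hypothetical
table and column names:

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  // define the partitioned Parquet table in the Hive metastore
  hc.sql("""
    CREATE TABLE IF NOT EXISTS events_parquet (id INT, value STRING)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
  """)
  // load a single partition statically from an existing source table
  hc.sql("""
    INSERT OVERWRITE TABLE events_parquet PARTITION (dt = '2015-04-01')
    SELECT id, value FROM events_staging WHERE dt = '2015-04-01'
  """)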

On Wed, Apr 1, 2015 at 00:08 Felix Cheung  wrote:

> This is tracked by these JIRAs..
>
> https://issues.apache.org/jira/browse/SPARK-5947
> https://issues.apache.org/jira/browse/SPARK-5948
>
> --
> From: denny.g@gmail.com
> Date: Wed, 1 Apr 2015 04:35:08 +
> Subject: Creating Partitioned Parquet Tables via SparkSQL
> To: user@spark.apache.org
>
>
> Creating Parquet tables via .saveAsTable is great but was wondering if
> there was an equivalent way to create partitioned parquet tables.
>
> Thanks!
>
>


ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of:

[2015-04, ArrayBuffer(A, B, C, D)]

and I'd like to return it as:

2015-04, A
2015-04, B
2015-04, C
2015-04, D

What's the best way to do this?

Thanks in advance!


Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Thanks Michael - that was it!  I was drawing a blank on this one for some
reason - much appreciated!
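
For anyone searching the archives later, a minimal self-contained sketch of
that lateral view explode approach (assuming a HiveContext; the sample data
mirrors the question):

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  import hc.implicits._
  val df = Seq(("2015-04", Seq("A", "B", "C", "D"))).toDF("month", "items")
  df.registerTempTable("monthly")
  // one output row per element of the array column
  hc.sql("""
    SELECT month, item
    FROM monthly
    LATERAL VIEW explode(items) itemTable AS item
  """).show()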


On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
wrote:

> A lateral view explode using HiveQL.  I'm hopping to add explode shorthand
> directly to the df API in 1.4.
>
> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee  wrote:
>
>> Quick question - the output of a dataframe is in the format of:
>>
>> [2015-04, ArrayBuffer(A, B, C, D)]
>>
>> and I'd like to return it as:
>>
>> 2015-04, A
>> 2015-04, B
>> 2015-04, C
>> 2015-04, D
>>
>> What's the best way to do this?
>>
>> Thanks in advance!
>>
>>
>>
>


Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :)

On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler  wrote:

> A hack workaround is to use flatMap:
>
> rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
>
> For those of you who don't know Scala, the for comprehension iterates
> through the ArrayBuffer, named "array" and yields new tuples with the date
> and each element. The case expression to the left of the => pattern matches
> on the input tuples.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee  wrote:
>
>> Thanks Michael - that was it!  I was drawing a blank on this one for some
>> reason - much appreciated!
>>
>>
>> On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
>> wrote:
>>
>>> A lateral view explode using HiveQL.  I'm hopping to add explode
>>> shorthand directly to the df API in 1.4.
>>>
>>> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee  wrote:
>>>
>>>> Quick question - the output of a dataframe is in the format of:
>>>>
>>>> [2015-04, ArrayBuffer(A, B, C, D)]
>>>>
>>>> and I'd like to return it as:
>>>>
>>>> 2015-04, A
>>>> 2015-04, B
>>>> 2015-04, C
>>>> 2015-04, D
>>>>
>>>> What's the best way to do this?
>>>>
>>>> Thanks in advance!
>>>>
>>>>
>>>>
>>>
>


Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
 Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin  wrote:

> There is already an explode function on DataFrame btw
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712
>
> I think something like this would work. You might need to play with the
> type.
>
> df.explode("arrayBufferColumn") { x => x }
>
>
>
> On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee  wrote:
>
>> Thanks Dean - fun hack :)
>>
>> On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler 
>> wrote:
>>
>>> A hack workaround is to use flatMap:
>>>
>>> rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
>>>
>>> For those of you who don't know Scala, the for comprehension iterates
>>> through the ArrayBuffer, named "array" and yields new tuples with the date
>>> and each element. The case expression to the left of the => pattern matches
>>> on the input tuples.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee 
>>> wrote:
>>>
>>>> Thanks Michael - that was it!  I was drawing a blank on this one for
>>>> some reason - much appreciated!
>>>>
>>>>
>>>> On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
>>>> wrote:
>>>>
>>>>> A lateral view explode using HiveQL.  I'm hopping to add explode
>>>>> shorthand directly to the df API in 1.4.
>>>>>
>>>>> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee 
>>>>> wrote:
>>>>>
>>>>>> Quick question - the output of a dataframe is in the format of:
>>>>>>
>>>>>> [2015-04, ArrayBuffer(A, B, C, D)]
>>>>>>
>>>>>> and I'd like to return it as:
>>>>>>
>>>>>> 2015-04, A
>>>>>> 2015-04, B
>>>>>> 2015-04, C
>>>>>> 2015-04, D
>>>>>>
>>>>>> What's the best way to do this?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>


Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Denny Lee
At this time, the JDBC data source is not extensible, so it cannot support
SQL Server.  There have been some thoughts - credit to Cheng Lian for this -
about making the JDBC data source extensible for third-party support,
possibly via Slick.


On Mon, Apr 6, 2015 at 10:41 PM bipin  wrote:

> Hi, I am trying to pull data from ms-sql server. I have tried using the
> spark.sql.jdbc
>
> CREATE TEMPORARY TABLE c
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
> dbtable "Customer"
> );
>
> But it shows java.sql.SQLException: No suitable driver found for
> jdbc:sqlserver
>
> I have jdbc drivers for mssql but i am not sure how to use them I provide
> the jars to the sql shell and then tried the following:
>
> CREATE TEMPORARY TABLE c
> USING com.microsoft.sqlserver.jdbc.SQLServerDriver
> OPTIONS (
> url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
> dbtable "Customer"
> );
>
> But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
> class com.microsoft.sqlserver.jdbc.SQLServerDriver)
>
> Can anyone tell what is the proper way to connect to ms-sql server.
> Thanks
>
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-
> sql-tp22399.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time MS SQL Server is not supported through the
JDBC data source.  In my environment, we've been using Hadoop streaming to
extract data from multiple SQL Servers, pushing the data into HDFS, creating
the Hive tables and/or converting them into Parquet, so that Spark can access
them directly.  Due to my heavy use of SQL Server, I've been thinking about
seeing if I can help with extending the JDBC data source so it can be
supported - but alas, I haven't found the time yet ;)
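
Once those extracts have landed as Parquet on HDFS, the Spark side is pretty
simple - a rough sketch (the path and table name are placeholders):

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  val customers = hc.parquetFile("hdfs:///warehouse/sqlserver_extracts/customer")
  customers.registerTempTable("customer")
  hc.sql("SELECT count(*) FROM customer").show()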

On Tue, Apr 7, 2015 at 6:52 AM ARose  wrote:

> I am having the same issue with my java application.
>
> String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" +
> database + ";integratedSecurity=true";
> String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
>
> SparkConf conf = new
> SparkConf().setAppName(appName).setMaster(master);
> JavaSparkContext sc = new JavaSparkContext(conf);
> SQLContext sqlContext = new SQLContext(sc);
>
> Map options = new HashMap<>();
> options.put("driver", driver);
> options.put("url", url);
> options.put("dbtable", "tbTableName");
>
> DataFrame jdbcDF = sqlContext.load("jdbc", options);
> jdbcDF.printSchema();
> jdbcDF.show();
>
> It prints the schema of the DataFrame just fine, but as soon as it tries to
> evaluate it for the show() call, I get a ClassNotFoundException for the
> driver. But the driver is definitely included as a dependency, so is  MS
> SQL
> Server just not supported?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-
> from-spark-sql-tp22399p22404.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to
Hive 0.12 if you specify it in the profile when building Spark as per
https://spark.apache.org/docs/1.3.0/building-spark.html.

If you are downloading a pre built version of Spark 1.3 - then by default,
it is set to Hive 0.13.1.

HTH!

On Thu, Apr 9, 2015 at 10:03 AM ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Most likely you have an existing Hive installation with data in it. In
> this case i was not able to get Spark 1.3 communicate with existing Hive
> meta store. Hence when i read any table created in hive, Spark SQL used to
> complain "Data table not found"
>
> If you get it working, please share the steps.
>
> On Thu, Apr 9, 2015 at 9:25 PM, Arthur Chan 
> wrote:
>
>> Hi,
>>
>> I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark
>> 1.3.x
>>
>> Could anyone advise which Hive version should be used to match Spark
>> 1.3.x?
>> Can I use Hive 1.1.0 for Spark 1.3? or can I use Hive 0.14 for Spark 1.3?
>>
>> Regards
>> Arthur
>>
>
>
>
> --
> Deepak
>
>


Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive?  If you're getting the
same error within Hive, it sounds like a permissions issue as per Bojan.
More info can be found at:
http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error


On Thu, Apr 9, 2015 at 7:31 AM Bojan Kostic  wrote:

> I think it uses local dir, hdfs dir path starts with hdfs://
>
> Check permissions on folders, and also check logs. There should be more
> info
> about exception.
>
> Best
> Bojan
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-
> tp22435p22439.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing this in Scala per se, then you can probably just reference
Joda-Time or the Java Date / Time classes.  If you are using Spark SQL, then
you can use the various Hive date functions for conversion.
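
For instance, a rough sketch that normalizes the second format to the first
with plain Java date classes, registered as a UDF so it can be used in a
WHERE clause (the UDF name is just a suggestion):

  val toIsoDate = (s: String) => {
    // constructed inside the function since SimpleDateFormat is not thread-safe
    val in  = new java.text.SimpleDateFormat("dd-MMM-yy", java.util.Locale.ENGLISH)
    val out = new java.text.SimpleDateFormat("yyyy-MM-dd")
    out.format(in.parse(s))
  }
  toIsoDate("02-OCT-12")                      // "2012-10-02"
  sqlContext.udf.register("toIsoDate", toIsoDate)
  // then compare toIsoDate() of the file 2 column against the yyyy-MM-dd dates from file 1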

On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA  wrote:

> I need some help to convert the date pattern in my Scala code for Spark
> 1.3. I am reading the dates from two flat files having two different date
> formats.
>
> File 1:
> 2015-03-27
>
> File 2:
> 02-OCT-12
> 09-MAR-13
>
> This format of file 2 is not being recognized by my Spark SQL when I am
> comparing it in a WHERE clause on the date fields. Format of file 1 is
> being recognized better. How to convert the format in file 2 to match with
> the format in file 1?
>
> Regards
> Ananda
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org


Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use classpath.first, or perhaps
copy the jar to the slaves, would that actually do the trick?  The latter
isn't really all that efficient, but I'm curious whether it would work.


On Thu, Apr 16, 2015 at 7:14 AM ARose  wrote:

> I take it back. My solution only works when you set the master to "local".
> I
> get the same error when I try to run it on the cluster.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22525.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please
refer to SPARK-4226.

BTW, Spark 1.3 by default binds to Hive 0.13.1.
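
Until that lands, a common workaround is to rewrite the NOT IN predicate as a
left outer join / anti-join - a rough, untested sketch using the table names
from your query:

  val result = sqlContext.sql("""
    SELECT DISTINCT e.OutSwitchID
    FROM wtbECRTemp e
    LEFT OUTER JOIN tmpCDRSwitchIDs s
      ON e.OutSwitchID = s.SwitchID
    WHERE s.SwitchID IS NULL
  """)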




On Fri, Apr 17, 2015 at 09:18 ARose  wrote:

> So I'm trying to store the results of a query into a DataFrame, but I get
> the
> following exception thrown:
>
> Exception in thread "main" java.lang.RuntimeException: [1.71] failure:
> ``*''
> expected but `select' found
>
> SELECT DISTINCT OutSwitchID FROM wtbECRTemp WHERE OutSwtichID NOT IN
> (SELECT
> SwitchID FROM tmpCDRSwitchIDs)
>
> And it has a ^ pointing to the second SELECT. But according to this
> (
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
> ),
> subqueries should be supported with Hive 0.13.0.
>
> So which version is Spark using? And if subqueries are not currently
> supported, what would be a suitable alternative to this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Which-version-of-Hive-QL-is-Spark-1-3-0-using-tp22542.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Skipped Jobs

2015-04-19 Thread Denny Lee
The job is skipped because the results are available in memory from a prior
run.  More info at:
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E.
HTH!
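
A quick way to see this in the web UI - a toy sketch, assuming sc is an
existing SparkContext:

val counts = sc.parallelize(1 to 100000)
  .map(n => (n % 10, 1))
  .reduceByKey(_ + _)   // introduces a shuffle stage

counts.count()   // first action: both stages run
counts.count()   // second action: the shuffle output is reused,
                 // so that stage shows up as "skipped"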

On Sun, Apr 19, 2015 at 1:43 PM James King  wrote:

> In the web ui i can see some jobs as 'skipped' what does that mean? why
> are these jobs skipped? do they ever get executed?
>
> Regards
> jk
>


Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :)

On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra 
wrote:

> Almost.  Jobs don't get skipped.  Stages and Tasks do if the needed
> results are already available.
>
> On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee  wrote:
>
>> The job is skipped because the results are available in memory from a
>> prior run.  More info at:
>> http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E.
>> HTH!
>>
>> On Sun, Apr 19, 2015 at 1:43 PM James King  wrote:
>>
>>> In the web ui i can see some jobs as 'skipped' what does that mean? why
>>> are these jobs skipped? do they ever get executed?
>>>
>>> Regards
>>> jk
>>>
>>
>


Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself.  For example, my own Thrift
start command is in the form:

./sbin/start-thriftserver.sh --master spark://$myserver:7077
--driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host
$myserver --hiveconf hive.server2.thrift.port 1

HTH!


On Wed, Apr 22, 2015 at 5:27 AM Yiannis Gkoufas 
wrote:

> Hi Himanshu,
>
> I am using:
>
> ./start-thriftserver.sh --master spark://localhost:7077
>
> Do I need to specify something additional to the command?
>
> Thanks!
>
> On 22 April 2015 at 13:14, Himanshu Parashar 
> wrote:
>
>> what command are you using to start the Thrift server?
>>
>> On Wed, Apr 22, 2015 at 3:52 PM, Yiannis Gkoufas 
>> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to start the thriftserver and I get some errors.
>>> I have hive running and placed hive-site.xml under the conf directory.
>>> From the logs I can see that the error is:
>>>
>>> Call From localhost to localhost:54310 failed
>>>
>>> I am assuming that it tries to connect to the wrong port for the
>>> namenode, which in my case its running on 9000 instead of 54310
>>>
>>> Any help would be really appreciated.
>>>
>>> Thanks a lot!
>>>
>>
>>
>>
>> --
>> [HiM]
>>
>
>


Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do
the automation - it's a bit of work to set up, but well worth the effort.
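
For reference, the pieces such a manifest ends up laying down are roughly
the following (hostnames and the ZooKeeper quorum are placeholders):

# conf/spark-env.sh on every master host
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
 -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
 -Dspark.deploy.zookeeper.dir=/spark"

# then, on each of those hosts:
./sbin/start-master.sh

Workers and applications can then be pointed at
spark://master1:7077,master2:7077 and ZooKeeper handles the failover.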

On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler  wrote:

> It's mostly manual. You could try automating with something like Chef, of
> course, but there's nothing already available in terms of automation.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Fri, Apr 24, 2015 at 10:33 AM, James King 
> wrote:
>
>> Thanks Dean,
>>
>> Sure I have that setup locally and testing it with ZK.
>>
>> But to start my multiple Masters do I need to go to each host and start
>> there or is there a better way to do this.
>>
>> Regards
>> jk
>>
>> On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler 
>> wrote:
>>
>>> The convention for standalone cluster is to use Zookeeper to manage
>>> master failover.
>>>
>>> http://spark.apache.org/docs/latest/spark-standalone.html
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Fri, Apr 24, 2015 at 5:01 AM, James King 
>>> wrote:
>>>
 I'm trying to find out how to setup a resilient Spark cluster.

 Things I'm thinking about include:

 - How to start multiple masters on different hosts?
 - there isn't a conf/masters file from what I can see


 Thank you.

>>>
>>>
>>
>


Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: the Apache
Hive Language Manual DML - Delete) while Spark 1.3 defaults to Hive 0.13.
Perhaps rebuild Spark with Hive 0.14, or generate a new table filtering out
the values you do not want.
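
A sketch of that second option with a HiveContext (the table and column
names are hypothetical):

// keep everything except the rows you want "deleted", then write a new table
val kept = sqlContext.sql("SELECT * FROM my_table WHERE name <> 'xxx'")
kept.saveAsTable("my_table_cleaned")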

On Thu, May 14, 2015 at 3:26 AM  wrote:

> Hi guys
>
>i got to delete some data from a table by "delete from table where
> name = xxx", however "delete" is not functioning like the DML operation in
> hive.  I got a info like below:
>
> Usage: delete [FILE|JAR|ARCHIVE]  []*
>
> 15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor: Usage: delete
> [FILE|JAR|ARCHIVE]  []*
>
>
>
>I checked the list of "Supported Hive Features" , but not found if
> this dml is supported.
>
>So any comments will be appreciated.
>
> 
>
> Thanks&Best regards!
> San.Luo
>


Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the Hive skew flag will be
included within Spark SQL?

Thanks!
Denny


Seattle Spark Meetup: Machine Learning Streams with Spark 1.0

2014-06-05 Thread Denny Lee
If you're in the Seattle area on 6/24, come join us at the Madrona Ventures
building in downtown Seattle for the session: Machine Learning Streams with
Spark 1.0.

For more information, please check out our meetup event: 
http://www.meetup.com/Seattle-Spark-Meetup/events/187375042/

Enjoy!
Denny



Re: Run spark unit test on Windows 7

2014-07-02 Thread Denny Lee
By any chance do you have HDP 2.1 installed? You may need to install the
winutils and update the env variables per
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
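
For a local unit test, the gist of that workaround (also spelled out later in
this thread) is roughly the following - the d:\winutil path is only an example
and the folder just needs to contain bin\winutils.exe:

import org.apache.spark.{SparkConf, SparkContext}

// point Hadoop's Shell utilities at a folder that contains bin\winutils.exe
System.setProperty("hadoop.home.dir", "d:\\winutil\\")

val sc = new SparkContext("local", "test", new SparkConf())
try {
  val data = sc.parallelize(List("in1", "in2", "in3"))
  println(data.count())
} finally {
  sc.stop()
}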


> On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
>  wrote:
> 
> Hi Andrew,
> 
> it's windows 7 and I doesn't set up any env variables here 
> 
> The full stack trace:
> 
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
>   at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at junit.framework.TestCase.runTest(TestCase.java:168)
>   at junit.framework.TestCase.runBare(TestCase.java:134)
>   at junit.framework.TestResult$1.protect(TestResult.java:110)
>   at junit.framework.TestResult.runProtected(TestResult.java:128)
>   at junit.framework.TestResult.run(TestResult.java:113)
>   at junit.framework.TestCase.run(TestCase.java:124)
>   at junit.framework.TestSuite.runTest(TestSuite.java:232)
>   at junit.framework.TestSuite.run(TestSuite.java:227)
>   at 
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
> 
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> 
>> On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or  wrote:
>> Hi Konstatin,
>> 
>> We use hadoop as a library in a few places in Spark. I wonder why the path 
>> includes "null" though.
>> 
>> Could you provide the full stack trace?
>> 
>> Andrew
>> 
>> 
>> 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev 
>> :
>> 
>>> Hi all,
>>> 
>>> I'm trying to run some transformation on Spark, it works fine on cluster 
>>> (YARN, linux machines). However, when I'm trying to run it on local machine 
>>> (Windows 7) under unit test, I got errors:
>>> 
>>> 
>>> java.io.IOException: Could not locate executable null\bin\winutils.exe in 
>>> the Hadoop binaries.
>>> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>>> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>>> at org.apache.hadoop.util.Shell.(Shell.java:326)
>>> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>>> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>>> 
>>> My code is following:
>>> 
>>> 
>>> @Test
>>> def testETL() = {
>>> val conf = new SparkConf()
>>> val sc = new SparkContext("local", "test", conf)
>>> try {
>>> val etl = new IxtoolsDailyAgg() // empty constructor
>>> 
>>> val data = sc.parallelize(List("in1", "in2", "in3"))
>>> 
>>> etl.etl(data) // rdd transformation, no access to SparkCon

Re: Run spark unit test on Windows 7

2014-07-02 Thread Denny Lee
You don't actually need it per se - it's just that some of the Spark
libraries reference Hadoop libraries even if they ultimately do not call
them. When I was doing some early builds of Spark on Windows, I admittedly
had Hadoop on Windows running as well and had not run into this particular
issue.



On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> No, I don’t
>
> why do I need to have HDP installed? I don’t use Hadoop at all and I’d
> like to read data from local filesystem
>
> On Jul 2, 2014, at 9:10 PM, Denny Lee  wrote:
>
> By any chance do you have HDP 2.1 installed? you may need to install the
> utils and update the env variables per
> http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
>
>
> On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Hi Andrew,
>
> it's windows 7 and I doesn't set up any env variables here
>
> The full stack trace:
>
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in
> the Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
> at org.apache.hadoop.util.Shell.(Shell.java:326)
>  at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>  at org.apache.hadoop.security.Groups.(Groups.java:77)
> at
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>  at
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
> at
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>  at
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
> at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>  at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
> at org.apache.spark.SparkContext.(SparkContext.scala:228)
>  at org.apache.spark.SparkContext.(SparkContext.scala:97)
> at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
>  at junit.framework.TestCase.runTest(TestCase.java:168)
> at junit.framework.TestCase.runBare(TestCase.java:134)
>  at junit.framework.TestResult$1.protect(TestResult.java:110)
> at junit.framework.TestResult.runProtected(TestResult.java:128)
>  at junit.framework.TestResult.run(TestResult.java:113)
> at junit.framework.TestCase.run(TestCase.java:124)
>  at junit.framework.TestSuite.runTest(TestSuite.java:232)
> at junit.framework.TestSuite.run(TestSuite.java:227)
>  at
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
>  at
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
> at
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
>  at
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
>
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or  wrote:
>
>> Hi Konstatin,
>>
>> We use hadoop as a library in a few places in Spark. I wonder why the
>> path includes "null" though.
>>
>> Could you provide the full stack trace?
>>
>> Andrew
>>
>>
>> 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev <
>> kudryavtsev.konstan...@gmail.com>:
>>
>> Hi all,
>>>
>>> I'm trying to run some transformation on *Spark*, it works fine on
>>> cluster (YARN, linux machines). However, when I'm trying to run it on local
>&g

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Hi Konstantin,

Could you please create a jira item at: 
https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?

Thanks,
Denny


On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev 
(kudryavtsev.konstan...@gmail.com) wrote:

It sounds really strange...

I guess it is a bug, critical bug and must be fixed... at least some flag must 
be add (unable.hadoop)

I found the next workaround :
1) download compiled winutils.exe from 
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) put this file into d:\winutil\bin
3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")

after that test runs

Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee  wrote:
You don't actually need it per se - its just that some of the Spark libraries 
are referencing Hadoop libraries even if they ultimately do not call them. When 
I was doing some early builds of Spark on Windows, I admittedly had Hadoop on 
Windows running as well and had not run into this particular issue.



On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
 wrote:
No, I don’t

why do I need to have HDP installed? I don’t use Hadoop at all and I’d like to 
read data from local filesystem

On Jul 2, 2014, at 9:10 PM, Denny Lee  wrote:

By any chance do you have HDP 2.1 installed? you may need to install the utils 
and update the env variables per 
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows


On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
 wrote:

Hi Andrew,

it's windows 7 and I doesn't set up any env variables here 

The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.(Shell.java:326)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.(Groups.java:77)
at 
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at 
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.(SparkContext.scala:228)
at org.apache.spark.SparkContext.(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or  wrote:
Hi Konstatin,

We use hadoop as a library in a few places in Spark. I wonder why the path 
includes "null" though.

Could you provide the full stack trace?

Andrew


2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev 
:

Hi all,

I'm trying to run some transformatio

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Thanks!  Will take a look at this later today. HTH!



> On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev 
>  wrote:
> 
> Hi Denny,
> 
> just created https://issues.apache.org/jira/browse/SPARK-2356
> 
>> On Jul 3, 2014, at 7:06 PM, Denny Lee  wrote:
>> 
>> Hi Konstantin,
>> 
>> Could you please create a jira item at: 
>> https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?
>> 
>> Thanks,
>> Denny
>> 
>> 
>>> On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev 
>>> (kudryavtsev.konstan...@gmail.com) wrote:
>>> 
>>> It sounds really strange...
>>> 
>>> I guess it is a bug, critical bug and must be fixed... at least some flag 
>>> must be add (unable.hadoop)
>>> 
>>> I found the next workaround :
>>> 1) download compiled winutils.exe from 
>>> http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
>>> 2) put this file into d:\winutil\bin
>>> 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")
>>> 
>>> after that test runs
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>> 
>>> On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee  wrote:
>>> You don't actually need it per se - its just that some of the Spark 
>>> libraries are referencing Hadoop libraries even if they ultimately do not 
>>> call them. When I was doing some early builds of Spark on Windows, I 
>>> admittedly had Hadoop on Windows running as well and had not run into this 
>>> particular issue.
>>> 
>>> 
>>> 
>>>> On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
>>>>  wrote:
>>>> No, I don’t
>>>> 
>>>> why do I need to have HDP installed? I don’t use Hadoop at all and I’d 
>>>> like to read data from local filesystem
>>>> 
>>>>> On Jul 2, 2014, at 9:10 PM, Denny Lee  wrote:
>>>>> 
>>>>> By any chance do you have HDP 2.1 installed? you may need to install the 
>>>>> utils and update the env variables per 
>>>>> http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
>>>>> 
>>>>> 
>>>>>> On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
>>>>>>  wrote:
>>>>>> 
>>>>>> Hi Andrew,
>>>>>> 
>>>>>> it's windows 7 and I doesn't set up any env variables here 
>>>>>> 
>>>>>> The full stack trace:
>>>>>> 
>>>>>> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop 
>>>>>> library for your platform... using builtin-java classes where applicable
>>>>>> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in 
>>>>>> the hadoop binary path
>>>>>> java.io.IOException: Could not locate executable null\bin\winutils.exe 
>>>>>> in the Hadoop binaries.
>>>>>> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>>>>>> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>>>>>> at org.apache.hadoop.util.Shell.(Shell.java:326)
>>>>>> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>>>>>> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>>>>>> at org.apache.hadoop.security.Groups.(Groups.java:77)
>>>>>> at 
>>>>>> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>>>>>> at 
>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>>>>>> at 
>>>>>> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>>>>>> at 
>>>>>> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>>>>>> at 
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>>>>>> at 
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>>>>>> at org.apache.spark.SparkContext.(SparkContext.scala:228)
>>>>>> at org.apache.spark.SparkContext.

Seattle Spark Meetup slides: xPatterns, Fun Things, and Machine Learning Streams - next is Interactive OLAP

2014-07-07 Thread Denny Lee
Apologies for the delay, but we've had a bunch of great slides and sessions at
Seattle Spark Meetup these past couple of months, including Claudiu Barbura's
"xPatterns on Spark, Shark, Mesos, and Tachyon"; Paco Nathan's "Fun Things You
Can Do with Spark 1.0"; and "Machine Learning Streams with Spark 1.0" by the
fine folks at Ubix!

Come by for the next Seattle Spark Meetup session on Wednesday July 16th, 2014 
at the WhitePages office in downtown Seattle for Evan Chan’s “Interactive OLAP 
Queries using Cassandra and Spark”!

For more information, please reference this blog post http://wp.me/pHDEa-w4 or 
join the Seattle Spark Meetup at http://www.meetup.com/Seattle-Spark-Meetup/

Enjoy!
Denny


