One thing to note, if you are using Mesos, is that the version of Mesos
changed from 0.21 to 1.0.0. So taking a newer Spark might push you into
larger infrastructure upgrades.
On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D
wrote:
> Hello All,
>
> Currently our Batch ETL Jobs are in Spark 1.6.
Hi folks, I have created a table in the following manner:
CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
list of columns
)
COMMENT 'User Information'
PARTITIONED BY (account_id String,
product String,
group_id String,
year String,
month String,
day String)
STORED AS
Hi folks, trying to run the Spark 2.1.0 Thrift Server against an HSQLDB file
and it seems to... hang.
I am starting the Thrift Server with:
sbin/start-thriftserver.sh --driver-class-path ./conf/hsqldb-2.3.4.jar
(completely local setup)
hive-site.xml is like this:
hive.metastore.warehouse.d
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When
Spark reads these files, I end up with 315K tasks for a dataframe reading a
few days' worth of data.
I know that with a regular Spark job, I can use coalesce to come to a lower
number of tasks. Is there a way to tell HiveThriftserver
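For a plain DataFrame job, the coalesce approach mentioned above would look roughly like this (a minimal sketch; the path and partition count are made-up placeholders, and sqlContext is assumed to exist):
// Sketch: collapse many small input splits into fewer partitions before heavy work.
val df = sqlContext.read.parquet("/data/etl/*")
val fewer = df.coalesce(200)        // coalesce avoids a full shuffle; repartition(200) forces one
fewer.registerTempTable("etl_data") // illustrative table name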
e were 2 different versions of Derby and ensuring
> the metastore and Spark used the same version of Derby made the problem go
> away.
>
> Deenar
>
> On 6 January 2016 at 02:55, Yana Kadiyska wrote:
>
>> Deenar, I have not resolved this issue. Why do you think it's fr
ly it is down to different versions of Derby in the classpath, but
> I am unsure where the other version is coming from. The setup worked
> perfectly with Spark 1.3.1.
>
> Deenar
>
> On 20 December 2015 at 04:41, Deenar Toraskar
> wrote:
>
>> Hi Yana/All
>>
>
et, can you confirm that? If it falls into this
> category, you can probably set
> "spark.sql.thriftServer.incrementalCollect" to false;
>
>
>
> Hao
>
>
>
> *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
> *Sent:* Friday, November 13, 2015 8:30 AM
> *To
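For reference, the property named in the quoted reply above can be applied like this (a minimal sketch; it can equally go into conf/spark-defaults.conf as a key/value pair):
// Sketch: apply the Thrift Server collect setting suggested in the reply above.
sqlContext.setConf("spark.sql.thriftServer.incrementalCollect", "false")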
Hi folks, I'm starting a HiveServer2 from a HiveContext
(HiveThriftServer2.startWithContext(hiveContext)) and then connecting to
it via beeline. On the server side, I see the below error, which I think
is related to https://issues.apache.org/jira/browse/HIVE-6468
But I'd like to know:
1. why I
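For context, a minimal sketch of the startWithContext path described above (the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Sketch: embed a HiveServer2/Thrift endpoint in a running Spark application,
// then connect to it with beeline as described above.
val sc = new SparkContext(new SparkConf().setAppName("embedded-thriftserver"))
val hiveContext = new HiveContext(sc)
HiveThriftServer2.startWithContext(hiveContext)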
subtract is not the issue. Spark is lazy, so a lot of times you'd have many,
many lines of code which do not in fact run until you perform some action (in
your case, subtract). As you can see from the stacktrace, the NPE is from
Joda, which is used in the partitioner (I'm suspecting in Cassandra). But the
I then try to run spark-shell thusly:
bin/spark-shell --driver-class-path
/home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar
and I get an ugly stack trace like so...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
at
> +-----------+-------------+-------+-----+
> |customer_id|          uri|browser|epoch|
> +-----------+-------------+-------+-----+
> |        999|http://foobar|firefox| 1234|
> |        888|http://foobar|     ie|12343|
> +-----------+-------------+-------+-----+
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:11 PM, Ya
Hi folks,
I have a need to "append" two dataframes -- I was hoping to use unionAll,
but it seems that this operation treats the underlying dataframes as a
sequence of columns, rather than a map.
In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my c
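One workaround is to force one DF into the other's column order before the union -- a minimal sketch, assuming both DFs share the same column names (df1/df2 are placeholder names):
import org.apache.spark.sql.functions.col

// Sketch: unionAll matches columns by position, so align df2's columns to df1's order first.
val aligned = df2.select(df1.columns.map(col): _*)
val appended = df1.unionAll(aligned)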
For this issue in particular ( ERROR XSDB6: Another instance of Derby may
have already booted the database /spark/spark-1.4.1/metastore_db) -- I
think it depends on where you start your application and HiveThriftserver
from. I've run into a similar issue running a driver app first, which would
crea
.10.jar
>> -rw-r--r-- hbase/hadoop339666 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-api-jdo-3.2.6.jar
>> -rw-r--r-- hbase/hadoop 1809447 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-rdbms-3.2.9.jar
>>
>>
In 1.4, ./make-distribution.sh produces a .tgz file in the root directory (the
same directory that make-distribution.sh is in).
On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi wrote:
> Hi,
>
> The ./make_distribution task completed. However, I can't seem to locate the
> .tar.gz file.
>
> Where does Spark
thank you so much! You are correct. This is the second time I've made this
mistake :(
On Mon, Oct 26, 2015 at 11:36 AM, java8964 wrote:
> Maybe you need the Hive part?
>
> Yong
>
> --
> Date: Mon, 26 Oct 2015 11:34:30 -0400
> Subject: Problem with make-distribution.sh
Hi folks,
the "Building Spark" instructions (
http://spark.apache.org/docs/latest/building-spark.html) suggest that
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
should produce a distribution similar to the ones found on the "Downloads"
page.
I noticed that the tgz I built u
h. We have kind of partitioned our data on the basis of batch_ids folder.
>
> How did you get around it?
>
> Thanks for help. :)
>
> On Sat, Oct 10, 2015 at 7:55 AM, Yana Kadiyska
> wrote:
>
>> can you show the output of df.printSchema? Just a guess but I think I ran
>> i
"Job has not accepted resources" is a well-known error message -- you can
search the Internet. 2 common causes come to mind:
1) you already have an application connected to the master -- by default a
driver will grab all resources so unless that application disconnects,
nothing else is allowed to c
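A common way to address the first cause is to cap what each application takes from the master -- a minimal sketch (the values are arbitrary):
// Sketch: limit how many cores and how much memory this application grabs,
// so other drivers can also be allocated resources.
val conf = new org.apache.spark.SparkConf()
  .setAppName("polite-app")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "2g")
val sc = new org.apache.spark.SparkContext(conf)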
can you show the output of df.printSchema? Just a guess but I think I ran
into something similar with a column that was part of a path in parquet.
E.g. we had an account_id in the parquet file data itself which was of type
string but we also named the files in the following manner
/somepath/account
o check to see if offsets 0 through 100 are still actually
> present in the kafka logs.
>
> On Tue, Sep 22, 2015 at 9:38 AM, Yana Kadiyska
> wrote:
>
>> Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka
>> queue into HDFS. Being very new to Kafka, not
Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka
queue into HDFS. Being very new to Kafka, I'm not sure if I'm messing something
up on that side... My hope is to read the messages presently in the queue
(or at least the first 100 for now)
Here is what I have:
Kafka side:
./bin/
Hopefully someone will give you a more direct answer but whenever I'm
having issues with log4j I always try -Dlog4j.debug=true. This will tell you
which log4j settings are getting picked up from where. I've spent countless
hours due to typos in the file, for example.
On Mon, Sep 7, 2015 at 11:47 AM
tions in parallel. It will run out of
> memory if (number of partitions) times (Parquet block size) is greater than
> the available memory. You can try to decrease the number of partitions. And
> could you share the value of "parquet.block.size" and your available memory?
>
Hi folks, I have a strange issue. Trying to read a 7G file and do fairly
simple stuff with it:
I can read the file/do simple operations on it. However, I'd prefer to
increase the number of partitions in preparation for more memory-intensive
operations (I'm happy to wait, I just need the job to com
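For what it's worth, a minimal sketch of asking for more partitions (the path and counts are made-up placeholders):
// Sketch: request more splits up front, or repartition after the read.
val raw = sc.textFile("/data/big-file.txt", minPartitions = 200)
val wider = raw.repartition(400)  // full shuffle, but a smaller per-task memory footprint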
If memory serves me correctly in 1.3.1 at least there was a problem with
when the driver was added -- the right classloader wasn't picking it up.
You can try searching the archives, but the issue is similar to these
threads:
http://stackoverflow.com/questions/30940566/connecting-from-spark-pyspark-
The PermGen space error is controlled with the MaxPermSize parameter. I run
with this in my pom, I think copied pretty literally from Spark's own
tests... I don't know what the sbt equivalent is but you should be able to
pass it...possibly via SBT_OPTS?
org.scalatest
sca
I'm having trouble with refreshTable, I suspect because I'm using it
incorrectly.
I am doing the following:
1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet
2. use registerTempTable to register my dataframe
3. A new file is dropped under /foo/bar/
4. Call hiveContext.refres
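For context, the flow described above looks roughly like this (a minimal sketch; the table name is illustrative):
// Sketch: register a DF read from a wildcard path, then ask for a metadata refresh
// after new files land under the same path.
val df = hiveContext.read.parquet("/foo/bar/*.parquet")
df.registerTempTable("mytable")
// ... new file dropped under /foo/bar/ ...
hiveContext.refreshTable("mytable")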
When I run spark-shell, I run with "--master
> mesos://cluster-1:5050" parameter which is the same with "spark-submit".
> Confused here.
>
>
>
> 2015-07-22 20:01 GMT-05:00 Yana Kadiyska :
>
>> Is it complaining about "collect" or "toMap"
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In
other words, I'd like to figure out how to write the following query:
"select collect_set(b),a from mytable where c in (1,2,3) group by a"
I've started with
someDF
.where( -- not sure what to do for c here ---
.groupBy($
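A possible DataFrame translation is sketched below -- this assumes a Spark version where isin and collect_set exist as DataFrame functions (roughly 1.5+/1.6+); otherwise passing the raw SQL string to sqlContext.sql is the fallback. Column names follow the query above:
import org.apache.spark.sql.functions.{col, collect_set}

// Sketch of: select collect_set(b), a from mytable where c in (1,2,3) group by a
val result = someDF
  .where(col("c").isin(1, 2, 3))
  .groupBy(col("a"))
  .agg(collect_set(col("b")))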
Is it complaining about "collect" or "toMap"? In either case this error is
usually indicative of an old version -- any chance you have an old
installation of Spark somehow? Or Scala? You can try running spark-submit
with --verbose. Also, when you say it runs with spark-shell, do you run
spark shell
Have you tried to examine what clean_cols contains -- I'm suspicious of this
part: mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...
if you get a type error you need to work on clean_cols (I suspect yours is
of type String at the moment and presents itself to Spark as a single
col
Hi, could someone point me to the recommended way of using
countApproxDistinctByKey with DataFrames?
I know I can map to a pair RDD, but I'm wondering if there is a simpler
method? If someone knows whether this operation is expressible in SQL, that
information would be most appreciated as well.
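One option that stays in the DataFrame API is the approxCountDistinct aggregate function -- a minimal sketch (column names are made up):
import org.apache.spark.sql.functions.approxCountDistinct

// Sketch: approximate distinct "value" count per "key" without dropping to a pair RDD.
val approx = df.groupBy("key").agg(approxCountDistinct("value"))
approx.show()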
Have you seen this SO thread:
http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin
This seems to be more related to the plugin than Spark, looking at the
stack trace
On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif
wrote:
> I m still looking forward for the answer. I
Oh, this is very interesting -- can you explain about your dependencies --
I'm running Tomcat 7 and ended up using spark-assembly from WEB-INF/lib and
removing the javax/servlet package from it... but it's a pain in the neck.
If I'm reading your first message correctly you use hadoop common and
sp
t: Row): Validator = {
>
> var check1: Boolean = if (input.getDouble(shortsale_in_pos) >
> 140.0) true else false
>
> if (check1) this else Nomatch
>
> }
>
> }
>
>
>
> Saif
>
>
>
> *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.co
It's a bit hard to tell from the snippets of code, but it's likely related
to the fact that when you serialize instances, the enclosing class, if any,
also gets serialized, as well as any other place where fields used in the
closure come from... e.g. check this discussion:
http://stackoverflow.com/ques
Have you seen https://issues.apache.org/jira/browse/SPARK-6910? I opened
https://issues.apache.org/jira/browse/SPARK-6984, which I think is related
to this as well. There are a bunch of issues attached to it but basically
yes, Spark interactions with a large metastore are bad...very bad if your
me
| 32| 0|
| 1| 32| 1|
+---+---+---+
On Thu, Jul 9, 2015 at 11:54 AM, ayan guha wrote:
> Can you please post result of show()?
> On 10 Jul 2015 01:00, "Yana Kadiyska" wrote:
>
>> Hi folks, I just re-wrote a query from using UNION ALL to use "with
>> ro
Hi folks, I just re-wrote a query from using UNION ALL to use "with rollup"
and I'm seeing some unexpected behavior. I'll open a JIRA if needed but
wanted to check if this is user error. Here is my code:
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i=>KeyValue(
Hi folks, suffering from a pretty strange issue:
Is there a way to tell what object is being successfully
serialized/deserialized? I have a maven-installed jar that works well when
fat jarred within another, but shows the following stack when marked as
provided and copied to the runtime classpath.
r replying
Original message From: Akhil Das
Date:07/01/2015 2:27 AM (GMT-05:00)
To: Yana Kadiyska Cc:
user@spark.apache.org Subject: Re: Difference between
spark-defaults.conf and SparkConf.set
.addJar works for me when I run it as a stand-alone application (without
using sp
I wonder if this could be a side effect of SPARK-3928. Does ending the path
with *.parquet work?
Original message From: Exie
Date:06/30/2015 9:20 PM (GMT-05:00)
To: user@spark.apache.org Subject: 1.4.0
So I was delighted with Spark 1.3.1 using Parquet 1.6.0 which would
"par
Hi folks, running into a pretty strange issue:
I'm setting
spark.executor.extraClassPath
spark.driver.extraClassPath
to point to some external JARs. If I set them in spark-defaults.conf
everything works perfectly.
However, if I remove spark-defaults.conf and just create a SparkConf and
call
.set(
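For reference, the programmatic route being attempted looks roughly like this (a minimal sketch; the jar path is a placeholder). Note that spark.driver.extraClassPath generally has to be known before the driver JVM starts (spark-defaults.conf or --driver-class-path), which is one reason the SparkConf route can behave differently:
// Sketch: the same settings as in spark-defaults.conf, set on a SparkConf instead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraClassPath", "/opt/jars/external.jar")
  .set("spark.driver.extraClassPath", "/opt/jars/external.jar")
val sc = new org.apache.spark.SparkContext(conf)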
Pass that debug string to your executor like this: --conf
spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761".
When your executor is launched it will send debug information on port 7761.
When you attach the Eclipse debugger, you need to have the
IP
try?
>
> Thanks
> Best Regards
>
> On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska
> wrote:
>
>> Hi folks, I have been using Spark against an external Metastore service
>> which runs Hive with CDH 4.6
>>
>> In Spark 1.2, I was able to successfully connect
I can't tell immediately, but you might be able to get more info with the
hint provided here:
http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator
(short version, set -Dsun.io.serialization.extendedDebugInfo=true)
Also, unless you're simplifying your exam
Hi folks, I have been using Spark against an external Metastore service
which runs Hive with CDH 4.6
In Spark 1.2, I was able to successfully connect by building with the
following:
./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0
-Phive-thriftserver -Phive-0.12.0
I see that in S
Can you show some code for how you're doing the reads? Have you successfully
read other stuff from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues, or are you trying to
figure out the right way to do a read)?
What version of Spark and Cassandra
When all else fails, look at the source ;)
Looks like createJDBCTable is deprecated, but otherwise goes to the same
implementation as insertIntoJDBC...
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
You can also look at DataFrameWriter in t
Hi folks,
running into a pretty strange issue -- I have a ClassNotFound exception
from a closure?! My code looks like this:
val jRdd1 = table.map(cassRow=>{
val lst = List(cassRow.get[Option[Any]](0),cassRow.get[Option[Any]](1))
Row.fromSeq(lst)
})
println(s"This one worked .
John, I took the liberty of reopening because I have sufficient JIRA
permissions (not sure if you do). It would be good if you can add relevant
comments/investigations there.
On Thu, Jun 11, 2015 at 8:34 AM, John Omernik wrote:
> Hey all, from my other post on Spark 1.3.1 issues, I think we foun
; ... 28 more
>>
>>
>>
>> The Spark Cassandra Connector is trying to use a method, which does not
>> exists. That means your assembly jar has the wrong version of the library
>> that SCC is trying to use. Welcome to jar hell!
>>
>>
>>
my code from datastax spark-cassandra-connector
> <https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-demos/simple-demos/src/main/java/com/datastax/spark/connector/demo/JavaApiDemo.java>
> .
>
> Thanx alot.
> yasemin
>
> 2015-06
160
>>
>> I check the port with "nc -z localhost 9160; echo $?" and it returns "0". I
>> think it is closed -- should I open this port?
>>
>> 2015-06-09 16:55 GMT+03:00 Yana Kadiyska :
>>
>>> Is your cassandra installation actually listeni
# port for Thrift to listen for clients on
rpc_port: 9160
On Tue, Jun 9, 2015 at 7:36 AM, Yasemin Kaya wrote:
> I couldn't find any solution. I can write but I can't read from Cassandra.
>
> 2015-06-09 8:52 GMT+03:00 Yasemin Kaya :
>
>> Thanks alot Mohammed, Gerard
Yes, whatever you put for listen_address in cassandra.yaml. Also, you
should try to connect to your Cassandra cluster via bin/cqlsh to make sure
you have connectivity before you try to make a connection via Spark.
On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya wrote:
> Hi,
> I run my project on
n compile my app to run this without -Dconfig.file=alt_
> reference1.conf?
>
> 2015-06-02 15:43 GMT+02:00 Yana Kadiyska :
>
>> This looks like your app is not finding your Typesafe config. The config
>> should usually be placed in a particular folder under your app to be seen
>>
This looks like your app is not finding your Typesafe config. The config
should usually be placed in a particular folder under your app to be seen
correctly. If it's in a non-standard location you can
pass -Dconfig.file=alt_reference1.conf to java to tell it where to look.
If this is a config that b
Like this... sqlContext should be a HiveContext instance:
case class KeyValue(key: Int, value: String)
val df=sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key,0.5) from table").show()
On Tue, Jun 2, 2015 at 8:07 AM,
1. Yes, if two tasks depend on each other they can't be parallelized.
2. Imagine something like a web application driver. You only get to have one
SparkContext but now you want to run many concurrent jobs. They have nothing
to do with each other; no reason to keep them sequential.
Hope this helps
-
are you able to connect to your Cassandra installation via
cassandra_home$ ./bin/cqlsh
This exception generally means that your Cassandra instance is not
reachable/accessible
On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco
wrote:
> Hi all,
> I have in a single server installed spark 1.3.1 a
> 2015/05/27
> 12:44:03 steve FINISHED 6 s
> app-20150527123822- TestRunner: power-iteration-clustering 8 512.0 MB
> 2015/05/27
> 12:38:22 steve FINISHED 6 s
>
>
>
> 2015-05-27 11:42 GMT-07:00 Stephen Boesch :
>
> Thanks Yana,
>>
>>My current ex
I have not run into this particular issue but I'm not using the latest bits
in production. However, testing your theory should be easy -- MySQL is just a
database, so you should be able to use a regular mysql client and see how
many connections are active. You can then compare to the maximum allowed
co
Hi folks, for those of you working with Cassandra, wondering if anyone has
been successful processing a mix of Cassandra and hdfs data. I have a
dataset which is stored partially in HDFS and partially in Cassandra
(schema is the same in both places)
I am trying to do the following:
val dfHDFS = s
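The rough shape of this is sketched below, assuming the DataFrame read support in the spark-cassandra-connector; the keyspace, table, and path names are made up:
// Sketch: read the same schema from both stores and union the results.
val dfHDFS = sqlContext.read.parquet("/data/events/")
val dfCass = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "events"))
  .load()
val all = dfHDFS.unionAll(dfCass)  // assumes identical column order in both sources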
Todd, I don't have any answers for you...other than the file is actually
named spark-defaults.conf (not sure if you made a typo in the email or
misnamed the file...). Do any other options from that file get read?
I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3
.0-SNAPS
There is an open JIRA on Spark not pushing predicates to the metastore. I have
a large dataset with many partitions, but doing anything with it is very
slow... But I am surprised Spark 1.2 worked for you: it has this problem...
Original message From: Andrew Otto
Date:05/22/2015 3:5
I have not seen this error but have seen another user have weird parser
issues before:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E
I would attach a debugger and see what is going on -- if I'm looking a
I'm afraid you misunderstand the purpose of hive-site.xml. It configures
access to the Hive metastore. You can read more here:
http://www.hadoopmaterial.com/2013/11/metastore.html.
So the MySQL DB in hive-site.xml would be used to store hive-specific data
such as schema info, partition info, etc.
But if I'm reading his email correctly he's saying that:
1. The master and slave are on the same box (so network hiccups are an
unlikely culprit)
2. The failures are intermittent -- i.e. the program works for a while, then
the worker gets disassociated...
Is it possible that the master restarted? We used to h
Just a guess, but are you using HiveContext in one case vs SqlContext in another?
You don't show a stacktrace, but this looks like a parser error... which would make
me guess a different context or a different Spark version on the cluster you are
submitting to...
Sent on the new Sprint Network from my Sa
er all (still if I go to spark-shell and try to print out the SQL
>> settings that I put in hive-site.xml, it does not print them).
>>
>>
>> On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska
>> wrote:
>>
>>> My point was more to how to verify that propert
59:33 INFO HiveMetaStore: Added admin role in metastore
> 15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore
> 15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role,
> since config is empty
> 15/05/15 17:59:35 INFO SessionState: No Tez session required at this
>
Thanks Sean, with the added permissions I do now have this extra option.
On Fri, May 15, 2015 at 11:20 AM, Sean Owen wrote:
> (I made you a Contributor in JIRA -- your yahoo-related account of the
> two -- so maybe that will let you do so.)
>
> On Fri, May 15, 2015 at 4:19 PM, Y
Hi, two questions
1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not appear that I can reopen issues. What is the proper protocol to follow
if we discover regressions?
2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO
thread possibly even in 1.3.0
This should work. Which version of Spark are you using? Here is what I do
-- make sure hive-site.xml is in the conf directory of the machine you're
using the driver from. Now let's run spark-shell from that machine:
scala> val hc= new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.
Hi folks, I'm trying to use automatic partition discovery as described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
/data/year=2014/file.parquet
/data/year=2015/file.parquet
…
SELECT * FROM table WHERE year = 2015
I have an official 1.3.1 CDH4 bui
2015 at 8:43 AM, yana wrote:
> I believe this is a regression. Does not work for me either. There is a
> Jira on parquet wildcards which is resolved, I'll see about getting it
> reopened
>
>
> Sent on the new Sprint Network from my Samsung Galaxy S®4.
>
>
> ---
I believe this is a regression. Does not work for me either. There is a Jira on
parquet wildcards which is resolved, I'll see about getting it reopened
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: Vaxuki
Date:05/07/2015 7:38 AM (GMT-05:0
Hi folks, we have been using a JDBC connection to Spark's Thrift Server
so far and using JDBC prepared statements to escape potentially malicious
user input.
I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to genera
You can try pulling the jar with wget and using it with --jars with spark-shell.
I used 1.0.3 with Spark 1.3.0 but with a different version of Scala. From the
stack trace it looks like spark-shell is just not seeing the csv jar...
Sent on the new Sprint Network from my Samsung Galaxy S®4.
-
Yes. The fair scheduler only helps concurrency within an application. With
multiple shells you'd either need something like YARN/Mesos or careful math on
resources, as you said.
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: Arun Patel
Date:04/2
Hi Sparkers,
hoping for insight here:
running a simple describe mytable here where mytable is a partitioned Hive
table.
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02,
SQL query: 72.831, Reading results: 0.189
Whereas Hive over the same
Hi Spark users,
Trying to upgrade to Spark 1.2 and running into the following:
seeing some very slow queries, and wondering if someone can point me in the
right direction for debugging. My Spark UI shows a job with duration 15s
(see attached screenshot), which would be great, but client-side measurem
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):
15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
at org.apach
If you're going to do it this way, I would output dayOfdate.substring(0,7),
i.e. the month part, and instead of weatherCond, you can use
(month,(minDeg,maxDeg,meanDeg)) -- i.e. a PairRDD. So weathersRDD:
RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in
multiple Spark examples..Y
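A minimal sketch of that PairRDD shape (the data and names are illustrative; the mean is carried as a sum plus a count so it can be combined correctly):
// Sketch: per-day records -> monthly (min, max, mean of the daily means).
val daily = sc.parallelize(Seq(
  ("2015-04-01", (1.0, 9.0, 5.0)),
  ("2015-04-02", (2.0, 8.0, 6.0))))
val byMonth = daily.map { case (dayOfdate, (minDeg, maxDeg, meanDeg)) =>
  (dayOfdate.substring(0, 7), (minDeg, maxDeg, meanDeg, 1L)) }
val monthly = byMonth
  .reduceByKey { case ((mn1, mx1, s1, n1), (mn2, mx2, s2, n2)) =>
    (math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2, n1 + n2) }
  .mapValues { case (mn, mx, s, n) => (mn, mx, s / n) }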
Hi folks, currently have a DF that has a factor variable -- say gender.
I am hoping to use the RandomForest algorithm on this data and it appears
that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
features need to be double-encoded.
I see https://issues.apache.org/jira/browse/S
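In the meantime, a minimal manual-encoding sketch (column names and category values are made up; this only shows the double-encoding idea, and assumes a Spark where Row.getAs by field name is available):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: map the factor level to a double, then build LabeledPoint(label, features).
val genderIndex = Map("male" -> 0.0, "female" -> 1.0)
val labeled = df.map { row =>
  val label = row.getAs[Double]("label")
  val gender = genderIndex(row.getAs[String]("gender"))
  val age = row.getAs[Double]("age")
  LabeledPoint(label, Vectors.dense(gender, age))
}
When training RandomForest, the encoded column can then be declared categorical via categoricalFeaturesInfo (feature index -> number of categories).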
Hi folks, having some seemingly noob issues with the dataframe API.
I have a DF which came from the csv package.
1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a csv. I see a schema getter
but not setter on DF
2. I am trying to
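For question 1, a common approach is withColumn plus cast -- a minimal sketch (the column name and target type are placeholders):
// Sketch: overwrite the string column with a typed version of itself.
val typed = df.withColumn("price", df("price").cast("double"))
typed.printSchema()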
You might also want to see if the Windows Task Scheduler helps with that. I
have not used it with Windows 2008 R2, but it generally does allow you to
schedule a bat file to run on startup.
On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
>
> Thanks for the suggestio
I think the problem is that you are using an alias in the HAVING clause. I am
not able to try just now, but see if HAVING count(*) > 2 works (i.e. don't use cnt).
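In other words, something along these lines (a minimal sketch; table and column names are placeholders):
// Sketch: repeat the aggregate in HAVING instead of referencing the cnt alias.
sqlContext.sql(
  "select key, count(*) as cnt from mytable group by key having count(*) > 2"
).show()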
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: shahab
Date:03/04/2015 7:22 AM (G
he current
directory, it is under /user/hive. In my case this directory was owned by
'root' and no one else had write permissions. Changing the permissions works
if you need to get unblocked quickly... But it does seem like a bug to me...
On Fri, Feb 27, 2015 at 11:21 AM, sandeep vura
wrot
I think you're mixing two things: the docs say "When *not* configured by
the hive-site.xml, the context automatically creates metastore_db and
warehouse in the current directory.". AFAIK if you want a local metastore,
you don't put hive-site.xml anywhere. You only need the file if you're
going to p
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
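For reference, a minimal sketch of adjusting that setting (the value is arbitrary):
// Sketch: lower the number of post-shuffle partitions (the default of 200 is where
// the 200 tasks in those stages come from).
sqlContext.setConf("spark.sql.shuffle.partitions", "64")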
On Thu, Feb 26, 2015 at 5:51 PM, java8964 wrote:
> Imran, thanks for your explaining about the parallelism. That is very
> helpful.
>
> In my test case, I am only use one box cl
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.
Is it possible that wi
Can someone confirm if they can run UDFs in GROUP BY in Spark 1.2?
I have two builds running -- one from a custom build from early December
(commit 4259ca8dd12) which works fine, and Spark1.2-RC2.
On the latter I get:
jdbc:hive2://XXX.208:10001> select
from_unixtime(epoch,'-MM-dd-HH'),count(
rt my application, I can see that it is using that
scheduler in the UI -- go to master UI and click on your application. Then
you should see this (note the scheduling mode is shown as Fair):
On Wed, Feb 25, 2015 at 4:06 AM, Harika Matha
wrote:
> Hi Yana,
>
> I tried running the program
Shark used to have a shark.map.tasks variable. Is there an equivalent for
Spark SQL?
We are trying a scenario with heavily partitioned Hive tables. We end up
with a UnionRDD with a lot of partitions underneath and hence too many
tasks:
https://github.com/apache/spark/blob/master/sql/hive/src/main/sc
It's hard to tell. I have not run this on EC2 but this worked for me:
The only thing that I can think of is that the scheduling mode is set to
- Scheduling Mode: FAIR
val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
while_loop to get curr_job
pool.execute(new ReportJ
memory.
I thought deleting the checkpoints in a checkpointed application is the
last thing that you want to do (as you lose all state). Seems a bit harsh
to have to do this just to increase the amount of memory?
On Mon, Feb 23, 2015 at 11:12 PM, Tathagata Das wrote:
> Hey Yana,
>
> I
Hi all,
I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...
and observing that the executor memory is still at the old 512 setting