ache/spark/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py
--
Ruslan Dautkhanov
On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin wrote:
> We probably should have the alias. Is this still a problem on master
> branch?
>
> On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov
Getting
ValueError: Could not parse datatype: bigint
Looks like pyspark.sql.types doesn't know anything about bigint..
Should it be aliased to LongType in pyspark.sql.types?
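As a workaround sketch (the alias table below is made up, not part of
pyspark.sql.types), something like this maps the dtype strings onto type
objects before building the schema:

from pyspark.sql.types import LongType, IntegerType, StringType, StructType, StructField

# hypothetical alias table; extend with whatever names your schema source uses
TYPE_ALIASES = {"bigint": LongType(), "int": IntegerType(), "string": StringType()}

def to_spark_type(name):
    return TYPE_ALIASES[name]

# example: build a schema where "bigint" means LongType
schema = StructType([StructField("id", to_spark_type("bigint"), nullable=True)])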
Thanks
On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov
wrote:
> Hello,
>
> Looking at
>
>
IntegerType() for "integer" etc? If it doesn't exist it would be great to
have such a
mapping function.
Thank you.
P.S. I have a DataFrame, and use its dtypes to loop through all columns to
fix a few
columns' data types as a workaround for SPARK-13866.
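For reference, a rough sketch of that loop (the column names and the target
type are made-up examples, not from SPARK-13866 itself):

# recast selected columns based on what df.dtypes reports
cols_to_fix = {"amount", "score"}        # hypothetical column names
for col_name, col_type in df.dtypes:     # df is the existing DataFrame
    if col_name in cols_to_fix and col_type == "string":
        df = df.withColumn(col_name, df[col_name].cast("double"))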
--
Ruslan Dautkhanov
Spark session dies out after ~40 hours when running against Hadoop Secure
cluster.
spark-submit has --principal and --keytab, so Kerberos ticket renewal works
fine according to the logs.
Something happens with the HDFS dfs connection?
These messages come up every second:
See complete stack: http://pastebi
known and well documented.
--
Ruslan Dautkhanov
Turns out it is a Spark issue:
https://issues.apache.org/jira/browse/SPARK-13478
--
Ruslan Dautkhanov
On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov
wrote:
> Hi Romain,
>
> Thank you for your response.
>
> Adding Kerberos support might be as simple as
> https://i
For a Spark job that only does shuffling
(e.g. Spark SQL with joins, group bys, analytical functions, order bys),
but with no explicitly persisted RDDs or DataFrames (there are no .cache() calls
in the code),
what would be the lowest recommended setting
for spark.storage.memoryFraction?
spark.storage.memor
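For illustration only, a sketch of lowering it when building the context
(0.1 is just an example value, not a recommendation):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-job")                # example app name
        .set("spark.storage.memoryFraction", "0.1"))    # illustrative value only
sc = SparkContext(conf=conf)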
Yep, I tried that. It seems you're right. Got an error that execution
engine has to be set to mr.
hive.execution.engine = mr
I did not keep the exact error message/stack. It's probably disabled explicitly.
--
Ruslan Dautkhanov
On Thu, Jan 28, 2016 at 7:03 AM, Todd wrote:
> Did yo
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
There are quite a lot of knobs to tune for Hive on Spark.
The above page recommends the following settings:
mapreduce.input.fileinputformat.split.maxsize=75000
hive.vectorized.execution.enabled=true
hive.cbo.enable
I took the liberty and created an issue: https://github.com/cloudera/livy/issues/36
Feel free to close it if it doesn't belong to the Livy project.
I really don't know if this is a Spark or a Livy/Sentry problem.
Any ideas for possible workarounds?
Thank you.
--
Ruslan Dautkhanov
On Mon, Jan 1
roupInformation.doAs() hence
the error.
So Sentry isn't compatible with Spark in kerberized clusters? Is there any
workaround for this problem?
--
Ruslan Dautkhanov
On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux wrote:
> Livy does not support any Kerberos yet
> https://issues.cloudera.org
) is
allowed to impersonate other users.
So very convenient for Spark Notebooks.
Any information to help solve this will be highly appreciated.
--
Ruslan Dautkhanov
Livy's build test from master fails with the problem below. I can't track it down.
YARN shows Livy Spark yarn application as running.
Although an attempt to connect to the application master shows connection refused:
HTTP ERROR 500
> Problem accessing /proxy/application_1448640910222_0046/. Reason:
> Connection
Spark
> 1.3.1 does not provide integration with Phoenix for kerberized cluster.
>
> Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured
> cluster or not?
>
> Thanks,
> Akhilesh
>
> On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov
> wrote:
>
pired)
kerberos ticket for authentication to pass.
--
Ruslan Dautkhanov
On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:
> Hi,
>
> I am running spark job on yarn in cluster mode in secured cluster. I am
> trying to run Spark on Hbase using Phoen
-table-in-hive/34059289#34059289
--
Ruslan Dautkhanov
On Mon, Dec 7, 2015 at 11:27 AM, Test One wrote:
> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
> |-- memberUuid: string (nullable = true)
> |-
An interesting compaction approach of small files is discussed recently
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
AFAIK Spark supports views too.
--
Ruslan Dautkhanov
On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyig
java#welcome-to-livy-the-rest-spark-server>
"
Although that post is from April 2015, not sure if it's still accurate.
--
Ruslan Dautkhanov
On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar wrote:
> Hi
>
> I had the same question. Anyone having used Livy and/or SparkJ
more even distribution you could use a hash function
of that, not just a remainder.
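For example, a small sketch of that idea (the key and the bucket count are
made up):

import hashlib

def bucket(key, num_buckets=32):    # num_buckets is an arbitrary example
    # hash first, then take the remainder, instead of a plain key % num_buckets
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets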
--
Ruslan Dautkhanov
On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin
wrote:
> I will answer my own question, since I figured it out. Here is my answer
> in case anyone else has the same issue.
You could write your own UDF isdate().
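A minimal sketch of such a UDF in PySpark (the date format and the sqlContext
variable are assumptions):

from datetime import datetime
from pyspark.sql.types import BooleanType

def is_date(s, fmt="%Y-%m-%d"):     # the format is an assumption
    try:
        datetime.strptime(s, fmt)
        return True
    except (TypeError, ValueError):
        return False

sqlContext.udf.register("isdate", is_date, BooleanType())
# then e.g.: SELECT * FROM t WHERE isdate(some_col)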
--
Ruslan Dautkhanov
On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote:
> Hi Ted Yu,
>
> Thanks for your response. Is any other way to achieve in Spark Query?
>
>
> Regards,
> Ravi
>
> On Tue, Nov 17, 2015 at 10:26 AM,
thought its primary use
is for Hue and similar services, which use impersonation quite heavily in
kerberized clusters.
--
Ruslan Dautkhanov
On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu wrote:
> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0]
> hdfs.KeyPr
https://issues.apache.org/jira/browse/SPARK-8992
Should be in 1.6?
--
Ruslan Dautkhanov
On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss wrote:
> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
> B, 2013 4
>
>
> I need
Thank you Richard and Matthew.
DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned
earlier that we're on CDH 5.4 / Spark 1.3. No options for this version?
Best regards,
Ruslan Dautkhanov
On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas wrote:
> Hi Ruslan,
>
>
/apache/spark/sql/SQLContext.html
and can't find anything relevant.
Thanks!
--
Ruslan Dautkhanov
T2: max_fails=3;
to "Machine B, where Spark is installed"
3)
You may want to adjust "location /static/" so that it fits your Spark Web UI.
4)
With a few more config lines you can set up SSL offloading too.
--
Ruslan Dautkhanov
On Thu, Sep 17, 2015 at 3:06 AM, Renato Perini
wr
).
--
Ruslan Dautkhanov
On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov
wrote:
> Wanted to take something like this
>
> https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
> and create a Hive UDAF to create an aggregate function that returns a data
> typ
Wanted to take something like this
https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
and create a Hive UDAF to create an aggregate function that returns a data
type guess.
Am I reinventing the wheel?
Does Spark have something like this already built-in?
Would be very useful f
Similar setup for Hue
http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/
Might give you an idea.
--
Ruslan Dautkhanov
On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 wrote:
> Hello!
> I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI.
> I have 2 machines:
Thank you Alexander.
Sounds like quite a lot of good and exciting changes are slated for Spark's ANN.
Looking forward to it.
--
Ruslan Dautkhanov
On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander
wrote:
> Thank you, Feynman, this is helpful. The paper that I linked claims a big
> s
Sathish,
Thanks for pointing to that.
https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm
That must be part of Oracle's BDA codebase only, not open-source Hive,
right?
--
Ruslan Dautkhanov
On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@g
a_bind_variables.html
This point is more relevant for OLTP-like queries, which Spark is probably
not yet good at (e.g. returning a few rows quickly, within a few ms).
--
Ruslan Dautkhanov
On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust
wrote:
> Either that or use the DataFrame API, which
You can also Sqoop Oracle data in:
$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl
--username MOVIEDEMO --password welcome1 --table ACTIVITY
http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/
--
Ruslan Dautkhanov
On Tue
Found a dropout commit from avulanov:
https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d
It probably hasn't made its way into MLlib (yet?).
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang wrote:
> Unfortunately, not yet... Deep
/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
should read B&C :)
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang
wrote:
> Backprop is used to compute the gradient here
> <https://github.com/apache/spark/blob/master/mllib/s
Read the response from Cheng Lian on Aug 27th - it
looks like the same problem.
Workarounds:
1. write that parquet file in Spark;
2. upgrade to Spark 1.5.
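For workaround 1, a rough sketch (the path is made up; df is the DataFrame
holding the data read from the original source):

df.write.parquet("/user/hive/warehouse/table_rewritten")       # Spark 1.4+
# on Spark 1.3: df.saveAsParquetFile("/user/hive/warehouse/table_rewritten")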
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote:
> No, it was created in Hive by CTAS, but any help is apprecia
That Parquet table wasn't created in Spark, was it?
There was a recent discussion on this list that complex data types in Spark
prior to 1.5 are often incompatible with Hive, for example, if I remember
correctly.
On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote:
> I am trying to read an (array typed) p
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
The implementation seems to be missing backpropagation?
Was there a good reason to omit BP?
What are the drawbacks of a pure feedforward-only ANN?
Thanks!
--
Ruslan Dautkhanov
est/topics/sg_hdfs_sentry_sync.html
--
Ruslan Dautkhanov
On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz
wrote:
> Hi Matei,
>
> Thanks for your answer.
>
> My question is regarding simple authenticated Spark-on-YARN only, without
> Kerberos. So when I run Spark on YARN and H
https://issues.apache.org/jira/browse/SPARK-7660 ?
--
Ruslan Dautkhanov
On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio wrote:
> Right after upgraded to 1.4.1, we started seeing this exception and yes we
> picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there
> a
There is no Spark master in YARN mode; that's standalone-mode terminology.
In YARN cluster mode, Spark's Application Master (the Spark driver runs in it)
will be restarted
automatically by the RM, up to yarn.resourcemanager.am.max-retries
times (the default is 2).
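For reference, the attempt limit can also be requested per application from the
Spark side; a sketch, assuming your Spark version supports
spark.yarn.maxAppAttempts (it cannot exceed the RM-wide maximum above):

from pyspark import SparkConf

# ask YARN for up to 4 AM attempts for this one application
conf = SparkConf().set("spark.yarn.maxAppAttempts", "4")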
--
Ruslan Dautkhanov
On Fri, Jul 17,
Spark?
--
Ruslan Dautkhanov
On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu
wrote:
> We are in the middle of figuring that out. At the high level, we want to
> combine the best parts of existing workflow solutions.
>
> On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone wrote:
>
>>
2.6.
--
Ruslan Dautkhanov
On Mon, Aug 10, 2015 at 3:53 PM, YaoPau wrote:
> I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
> iPython Notebook. I'm getting collect() to work just fine, but take()
> errors. (I'm having issues with collect() on
is not network bandwidth-bound, I can see it'll be a few
percent to no improvement.
--
Ruslan Dautkhanov
On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus wrote:
> H
>
> 2% huh.
>
>
> -- ttfn
> Simon Edelhaus
> California 2015
>
> On Sat, Aug 1,
You should also take into account the amount of memory that you plan to use.
It's advised not to give too much memory to each executor .. otherwise GC
overhead will go up.
Btw, why prime numbers?
--
Ruslan Dautkhanov
On Wed, Jul 29, 2015 at 3:31 AM, ponkin wrote:
> Hi Rahul,
>
>
ks are with their Tez
project,
but Tez has components that more closely resemble the "buffer cache" or "in-memory
columnar storage" caching
of traditional RDBMS systems, and may get better and/or more predictable
performance on
BI queries.
--
Ruslan Dautkhanov
On Mon, Jul 20, 2
Or Spark on HBase )
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
--
Ruslan Dautkhanov
On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu wrote:
> bq. that is, key-value stores
>
> Please consider HBase for this purpose :-)
>
> On Tue, Jul 14, 2015 at 5:55 PM
>> the executor receives a SIGTERM (from whom???)
From the YARN Resource Manager.
Check if YARN fair-scheduler preemption and/or speculative execution are
turned on;
if so, it's quite possible and not a bug.
--
Ruslan Dautkhanov
On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim wrot
Should be part of Spark 1.4
https://issues.apache.org/jira/browse/SPARK-1442
I don't see it in the documentation though
https://spark.apache.org/docs/latest/sql-programming-guide.html
--
Ruslan Dautkhanov
On Mon, Jul 6, 2015 at 5:06 AM, gireeshp
wrote:
> Is there any equivalent of
You can see what Spark SQL functions are supported in Spark by doing the
following in a notebook:
%sql show functions
https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html
I think Spark SQL's Hive support is currently around Hive ~0.11?
--
Ruslan
Good question. I'd like to know the same. Although I think you'll lose
supportability.
--
Ruslan Dautkhanov
On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt wrote:
>
> Hi,
> I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4.
> I checked the docu
I don't think it is a Zeppelin problem.. RDDs are "immutable".
Unless you integrate something like IndexedRDD
http://spark-packages.org/package/amplab/spark-indexedrdd
into Zeppelin, I don't think it's possible.
--
Ruslan Dautkhanov
On Wed, Jul 8, 2015 at 3:24 PM, Brandon Whit
Hi Akhil,
I'm curious whether RDDs are stored internally in a columnar format as well,
or whether an RDD is converted to columnar format only when it is cached in a
SQL context.
What about DataFrames?
Thanks!
--
Ruslan Dautkhanov
On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das
wrote:
>
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html
--
Ruslan Dautkhanov
On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin
wrote:
> Hi,
>
> I'd like to specify the total sum of cores / mem
Scala used to run on .NET
http://www.scala-lang.org/old/node/10299
--
Ruslan Dautkhanov
On Thu, Jul 2, 2015 at 1:26 PM, pedro wrote:
> You might try using .pipe() and installing your .NET program as a binary
> across the cluster (or using addFile). Its not ideal to pipe things in/out
&
You can run
hadoop checknative -a
and see if bzip2 is detected correctly.
--
Ruslan Dautkhanov
On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin
wrote:
> What master are you using? If this is not a "local" master, you'll need to
> set LD_LIBRARY_PATH on the
https://spark.apache.org/docs/latest/streaming-flume-integration.html
Yep, avro sink is the correct one.
--
Ruslan Dautkhanov
On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid
wrote:
> Hi!
>
>
> I want to integrate flume with spark streaming. I want to know which sink
> ty
Vadim,
You could edit /etc/fstab, then issue mount -o remount to give more shared
memory online.
I didn't know Spark uses shared memory.
Hope this helps.
On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy
wrote:
> Hello Spark Experts,
>
> I've been running a standalone Spark cluster on EC2 for a fe
Got it. Thanks!
--
Ruslan Dautkhanov
On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng wrote:
> ChiSqSelector calls an RDD of labeled points, where the label is the
> target. See
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mlli
Thank you Xiangrui.
Oracle's attribute importance mining function has a target variable.
"Attribute importance is a supervised function that ranks attributes
according to their significance in predicting a target."
MLlib's ChiSqSelector does not have a target variable.
-
ce in predicting a target.
Best regards,
Ruslan Dautkhanov
?
Best regards,
Ruslan Dautkhanov
+ OS + etc. >> the amount of memory the node has.
--
Ruslan Dautkhanov
On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen wrote:
> Hi all,
> I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and
> Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory.
> One is configured
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf
--
Ruslan Dautkhanov
On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg
wrote:
> Thank you, Tathagata, Cody, O
ction *spark.shuffle.safetyFraction)/
spark.executor.cores. Memory fraction and safety fraction default to 0.2
and 0.8 respectively.
I'd test spark.executor.cores with 2, 4, 8 and 16 and see what makes your job
run faster.
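As a rough worked example with assumed numbers (10 GB executor memory, the
0.2 and 0.8 defaults, 4 cores):

# back-of-the-envelope sketch; memory and cores are assumptions
executor_memory_gb = 10.0        # assumption
shuffle_memory_fraction = 0.2    # default
shuffle_safety_fraction = 0.8    # default
executor_cores = 4               # assumption

per_task_gb = (executor_memory_gb * shuffle_memory_fraction
               * shuffle_safety_fraction) / executor_cores
print(per_task_gb)               # 0.4 GB of shuffle memory per concurrent task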
--
Ruslan Dautkhanov
On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo
wrote:
>
yarn application logs.
--
Ruslan Dautkhanov
On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets
wrote:
> Doesn't work for me so far ,
>using command but got such output. What should I check to fix the
> issue? Any configuration parameters ...
>
>
> [root@sdo-
Oleg,
You can see applicationId in your Spark History Server.
Go to http://historyserver:18088/
Also check
https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application
It should be no different with PySpark.
--
Ruslan Dautkhanov
On Wed, May 20, 2015 at 2:12 PM, Oleg
You could use
yarn logs -applicationId application_1383601692319_0008
--
Ruslan Dautkhanov
On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets
wrote:
> Hi ,
>
> I am executing PySpark job on yarn ( hortonworks distribution).
>
> Could someone pointing me where is th
Had the same question on stackoverflow recently
http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark
Lomig Mégard had a detailed answer on how to do this without using LATERAL
VIEW.
On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh
wrote:
> Hi ,
> I am trying