PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
I am trying to define a UDF to calculate the octet_length of a string, but I am having some trouble getting it right. Does anyone have a working version of this already, or any pointers? I am using Spark 1.5.2/Python 2.7. Thanks

Re: PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
Meant to include: I have this function, which seems to work, but I am not sure if it is always correct:

    def octet_length(s):
        return len(s.encode('utf8'))

    sqlContext.registerFunction('octet_length', lambda x: octet_length(x))

On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News) wrote:
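One refinement worth noting (an assumption on my part, not from the thread): registerFunction defaults to a string return type, so an integer byte count is better registered with an explicit IntegerType, and a None guard avoids failures on null rows. A minimal sketch:

    from pyspark.sql.types import IntegerType

    def octet_length(s):
        # UTF-8 encoding yields the byte (octet) count of the string
        return len(s.encode('utf8')) if s is not None else None

    sqlContext.registerFunction('octet_length', octet_length, IntegerType())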

Dropping nested dataframe column

2016-03-10 Thread Ross.Cramblit
Is there any support for dropping a nested column in a dataframe? I have tried dropping with the Column reference as well as a string of the column name, but the returned dataframe is unchanged.

    >>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", "col2": "b"}}']))
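Spark 1.5/1.6 has no direct API for dropping a field inside a struct, so the usual workaround is to rebuild the struct without the unwanted field. A minimal sketch against the example above, keeping col1 and dropping col2:

    from pyspark.sql.functions import col, struct

    # Rebuild 'properties' with only the fields to keep; select()
    # replaces the original nested column in the resulting dataframe.
    df2 = df.select(struct(col('properties.col1').alias('col1')).alias('properties'))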

Re: confusing about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the Spark docs: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
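If the input really is one large multi-line JSON document per file, a common workaround (a sketch with a placeholder path, assuming each file holds a single JSON object) is to parse it outside Spark SQL and re-serialize each object onto a single line:

    import json

    # wholeTextFiles yields (path, file_contents) pairs, so each multi-line
    # document is parsed whole and re-emitted as one JSON line.
    raw = sc.wholeTextFiles('/path/to/multiline_json/')  # placeholder path
    one_line = raw.map(lambda kv: json.dumps(json.loads(kv[1])))
    df = sqlContext.jsonRDD(one_line)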

Re: Window function in Spark SQL

2015-12-11 Thread Ross.Cramblit
Hey Sourav, Window functions require using a HiveContext rather than the default SQLContext. See here: http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext HiveContext provides all of the same functionality as SQLContext, as well as extra features like Window functions.
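A minimal sketch of the switch (the table and column names are made up for illustration):

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)  # window functions need HiveContext in Spark 1.5
    df.registerTempTable('events')
    ranked = sqlContext.sql("""
        SELECT device_id,
               event_time,
               row_number() OVER (PARTITION BY device_id ORDER BY event_time) AS rn
        FROM events
    """)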

Re: troubleshooting "Missing an output location for shuffle"

2015-12-14 Thread Ross.Cramblit
Hey Veljko, I ran into this error a few days ago, and it turned out I was out of disk space on a couple of nodes. I am not sure if this was the direct cause of the error, but it stopped being thrown once I cleared out some unneeded large files. On Dec 14, 2015, at 5:32 PM, Veljko Skarich wrote:

Spark-avro issue in 1.5.2

2016-02-24 Thread Ross.Cramblit
I’m trying to save a data frame in Avro format but am getting the following error:

    java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;

I found the following workaround https://github.com/databricks/spark
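This NoSuchMethodError is typically a classpath conflict: an older Avro pulled in by the Hadoop distribution shadows the Avro 1.7.7 that spark-avro expects. One commonly suggested fix, sketched here with placeholder paths and an assumed locally available avro-1.7.7.jar, is to put the newer jar first on both driver and executor classpaths, e.g. in spark-defaults.conf:

    spark.driver.extraClassPath     /path/to/avro-1.7.7.jar
    spark.executor.extraClassPath   /path/to/avro-1.7.7.jar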

Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Ross.Cramblit
Hadoop 2.6.0 included? spark-assembly-1.5.2-hadoop2.6.0.jar On Feb 24, 2016, at 4:08 PM, Koert Kuipers wrote: does your spark version come with batteries (hadoop included), or is it built with hadoop provided and you are adding hadoop binaries to the classpath? On Wed, Feb

Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
Hello Spark community - I am running a Spark SQL query to calculate the difference in time between consecutive events, using lag(event_time) over a window -

    SELECT device_id, unix_time, event_id, unix_time - lag(unix_time) OVER (PARTITION BY device_id ORDER B
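The query is cut off above; reconstructed in full, it presumably reads something like the following (the ORDER BY column and the alias are assumptions based on the description of computing deltas between consecutive events):

    SELECT device_id,
           unix_time,
           event_id,
           unix_time - lag(unix_time) OVER (PARTITION BY device_id ORDER BY unix_time) AS seconds_since_prev
    FROM events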

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
I am using Spark 1.5.0 on YARN. On Nov 2, 2015, at 3:16 PM, Yin Huai wrote: Hi Ross, What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both issues have been fixed and will be released wi

PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL transforms on a JSON data set that I load into a data frame. The data set is not large (~100GB) and most stages execute without any issues. However, some more complex stages tend to lose executors/nodes regularly. What wou

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm, I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation? On Nov 19, 2015, at 11:07 AM, Ted Yu wrote: Do you have YARN log aggregation enabled? You can try retrieving the log for the container using t
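Log aggregation is a YARN-side setting rather than a Spark one; a minimal fragment for yarn-site.xml (assuming you can restart the YARN daemons afterwards) would be:

    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>

With that enabled, the logs for the application above should be retrievable with yarn logs -applicationId application_1445957755572_0176.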

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the logs:

    WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 25.4 GB of 25.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

On Nov 19, 2015, at 12:20 PM, Ted
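For anyone hitting the same warning: the overhead has to be set before the application launches. A sketch in PySpark (the 2048 MB value is an arbitrary starting point, not a figure from this thread):

    from pyspark import SparkConf, SparkContext

    # Reserve extra off-heap memory per executor for YARN's accounting
    conf = SparkConf().set('spark.yarn.executor.memoryOverhead', '2048')
    sc = SparkContext(conf=conf)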

Spark SQL Save CSV with JSON Column

2015-11-24 Thread Ross.Cramblit
I am generating a set of tables in pyspark SQL from a JSON source dataset. I am writing those tables to disk as CSVs using df.write.format('com.databricks.spark.csv').save(…). I have a schema like:

    root
     |-- col_1: string (nullable = true)
     |-- col_2: string (nullable = true)
     |-- col_3: timestamp
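Assuming the issue is a nested struct column that the CSV writer cannot serialize: Spark 1.5 has no built-in to_json, so one sketch is to flatten the struct to a JSON string with a UDF before writing. Here 'json_col' is a hypothetical column name:

    import json
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Serialize one level of struct (a Row) into a JSON string so the
    # CSV writer sees a plain string column; default=str handles timestamps.
    to_json_str = udf(lambda r: json.dumps(r.asDict(), default=str) if r is not None else None,
                      StringType())
    df = df.withColumn('json_col', to_json_str(df['json_col']))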

Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe. With both df.dropDuplicates() and df.sqlContext.sql(
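For reference, the two approaches the message is comparing, spelled out (the table name here is made up):

    # Dataframe API
    deduped = df.dropDuplicates()

    # Spark SQL equivalent
    df.registerTempTable('events')
    deduped = df.sqlContext.sql('SELECT DISTINCT * FROM events')

Both trigger a full shuffle of the dataset, which is what surfaces the lost-executor errors discussed in the replies below.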

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNINGs or ERRORs - the executors just seem to stop logging. I am running Spark 1.5.2 on YARN. On Dec 7, 2015, at 1:20 PM, Ted Yu wrote: bq. complete a shuffle stage due to lost executors Have you taken a look at t

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line:

    [Stage 4:> (60 + 60) / 200]
    15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
    15/12/07 18:59:40 WARN YarnS

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay, maybe these errors are more helpful -

    WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723
    java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(