I am trying to define a UDF to calculate the octet length of a string, but I am
having some trouble getting it right. Does anyone already have a working
version of this, or any pointers?
I am using Spark 1.5.2/Python 2.7.
Thanks
Meant to include:
I have this function which seems to work, but I am not sure if it is always
correct:
def octet_length(s):
    return len(s.encode('utf8'))

sqlContext.registerFunction('octet_length', lambda x: octet_length(x))
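Here is a minimal null-safe sketch of what I am considering registering instead (assuming Python 2 string semantics and the 1.5 registerFunction API; the returnType argument keeps the result an integer rather than a string):

from pyspark.sql.types import IntegerType

def octet_length(s):
    # Return None for SQL NULL inputs instead of raising on s.encode.
    if s is None:
        return None
    # On Python 2, unicode values are encoded to UTF-8 before counting bytes;
    # a plain str is assumed to already hold the raw bytes.
    if isinstance(s, unicode):
        return len(s.encode('utf-8'))
    return len(s)

sqlContext.registerFunction('octet_length', octet_length, IntegerType())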
> On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News) wrote:
Is there any support for dropping a nested column in a dataframe? I have tried
dropping with the Column reference as well as a string of the column name, but
the returned dataframe is unchanged.
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", "col2": "b"}}']))
>>>
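One workaround sketch I have been experimenting with (not an official drop for nested fields, just rebuilding the struct from the fields I want to keep and overwriting the original column):

from pyspark.sql.functions import col, struct

# Hypothetical example: drop properties.col2 by reconstructing the struct
# from the remaining nested fields.
props_type = [f.dataType for f in df.schema.fields if f.name == 'properties'][0]
kept = [f.name for f in props_type.fields if f.name != 'col2']
df2 = df.withColumn(
    'properties',
    struct(*[col('properties.' + name).alias(name) for name in kept]))
df2.printSchema()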
You are correct that it does not take the standard JSON file format. From the
Spark Docs:
"Note that the file that is offered as a json file is not a typical JSON file.
Each line must contain a separate, self-contained valid JSON object. As a
consequence, a regular multi-line JSON file will most often fail."
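If the input really has to stay multi-line, here is a rough sketch of the workaround I have used (the path is made up, and it assumes each file holds a single top-level JSON object):

import json

# wholeTextFiles yields (filename, full file contents) pairs, so each
# multi-line JSON document can be parsed as a whole.
raw = sc.wholeTextFiles('hdfs:///path/to/multiline_json/')
records = raw.map(lambda kv: json.loads(kv[1]))
# Re-serialize to one self-contained JSON object per record for jsonRDD.
df = sqlContext.jsonRDD(records.map(json.dumps))
df.printSchema()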
Hey Sourav,
Window functions require using a HiveContext rather than the default
SQLContext. See here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext
HiveContext provides all the same functionality as SQLContext, as well as extra
features like window functions.
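A minimal sketch of the switch (assuming a pyspark shell on 1.5 where sc already exists; the table and column names are made up for illustration):

from pyspark.sql import HiveContext
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# HiveContext is a drop-in replacement for SQLContext and enables window functions.
sqlContext = HiveContext(sc)

events = sqlContext.table('events')  # hypothetical table
w = Window.partitionBy('device_id').orderBy('unix_time')
events.select('device_id', 'unix_time',
              F.lag('unix_time').over(w).alias('prev_time')).show()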
Hey Veljko,
I ran into this error a few days ago, and it turned out I was out of disk space
on a couple of nodes. I am not sure if that was the direct cause of the error,
but it stopped being thrown once I cleared out some unneeded large files.
On Dec 14, 2015, at 5:32 PM, Veljko Skarich wrote:
I’m trying to save a data frame in Avro format but am getting the following
error:
java.lang.NoSuchMethodError:
org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
I found the following workaround
https://github.com/databricks/spark
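The shape of the fix, as I understand it (a sketch; the classpath-first settings and paths below are assumptions about my setup, not something taken from the spark-avro docs), is to make the newer Avro classes win over the ones bundled in the Spark/Hadoop assembly and then write through the spark-avro data source:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName('avro-write')
        # Prefer user-supplied jars (e.g. a newer avro jar passed via --jars)
        # over the old Avro classes baked into the assembly.
        .set('spark.driver.userClassPathFirst', 'true')
        .set('spark.executor.userClassPathFirst', 'true'))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.json('hdfs:///path/to/input.json')  # hypothetical path
df.write.format('com.databricks.spark.avro').save('hdfs:///path/to/output_avro')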
Hadoop 2.6.0 included?
spark-assembly-1.5.2-hadoop2.6.0.jar
On Feb 24, 2016, at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote:
does your spark version come with batteries (hadoop included), or is it built
with hadoop provided and you are adding the hadoop binaries to the classpath?
On Wed, Feb
Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between
consecutive events, using lag(event_time) over window -
SELECT device_id,
unix_time,
event_id,
unix_time - lag(unix_time)
OVER
(PARTITION BY device_id ORDER B
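For reference, the full query I am describing looks roughly like this when run through sqlContext.sql on a HiveContext (a sketch; the table name is made up and ordering by unix_time is an assumption about my data):

# Hypothetical table name 'events'; ordering by unix_time is an assumption.
query = """
    SELECT device_id,
           unix_time,
           event_id,
           unix_time - lag(unix_time)
               OVER (PARTITION BY device_id ORDER BY unix_time) AS secs_since_prev
    FROM events
"""
diffs = sqlContext.sql(query)
diffs.show()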
I am using Spark 1.5.0 on Yarn
On Nov 2, 2015, at 3:16 PM, Yin Huai <yh...@databricks.com> wrote:
Hi Ross,
What version of Spark are you using? There were two issues that affected the
results of window functions in the Spark 1.5 branch. Both issues have been
fixed and will be released wi
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
transforms on a JSON data set that I load into a data frame. The data set is
not large (~100GB) and most stages execute without any issues. However, some
more complex stages tend to lose executors/nodes regularly. What wou
Hmm, I guess I do not. I get 'application_1445957755572_0176 does not have any
log files.' Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
Do you have YARN log aggregation enabled?
You can try retrieving log for the container using t
Thank you Ted and Sandy for getting me pointed in the right direction. From the
logs:
WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits.
25.4 GB of 25.3 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
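In case it helps the next person who hits this, the change on my side amounted to giving the overhead more room; a sketch of how that can be set (the numbers are illustrative, not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('json-transforms')
        # The executor heap plus off-heap overhead must fit inside the YARN
        # container; the default overhead (max(384 MB, 10% of executor memory))
        # was too small for this job.
        .set('spark.executor.memory', '22g')
        .set('spark.yarn.executor.memoryOverhead', '3072'))  # MB, illustrative
sc = SparkContext(conf=conf)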
On Nov 19, 2015, at 12:20 PM, Ted
I am generating a set of tables in pyspark SQL from a JSON source dataset. I am
writing those tables to disk as CSVs using
df.write.format('com.databricks.spark.csv').save(…). I have a schema like:
root
|-- col_1: string (nullable = true)
|-- col_2: string (nullable = true)
|-- col_3: timestamp
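The write itself looks roughly like this; rendering the timestamp as text before the save is a sketch of one way to control how col_3 ends up in the CSV (the format pattern and output path are made up):

from pyspark.sql import functions as F

# Format the timestamp explicitly so spark-csv does not fall back to its
# default string representation; the pattern is just an example.
out = df.withColumn('col_3', F.date_format('col_3', 'yyyy-MM-dd HH:mm:ss'))
(out.write
    .format('com.databricks.spark.csv')
    .options(header='true')
    .save('/path/to/output_csv'))  # hypothetical output path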
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and
it turns out there are a number of duplicate JSON objects in the source data. I
am trying to find the best way to remove these duplicates before using the
dataframe.
With both df.dropDuplicates() and df.sqlContext.sql(
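For completeness, the two approaches I am comparing look roughly like this (a sketch; the temp table name is made up):

# DataFrame API: keep one row per fully identical record
# (a subset of columns can be passed to dedupe on a key instead).
deduped = df.dropDuplicates()

# SQL equivalent through a temporary table.
df.registerTempTable('raw_events')  # hypothetical name
deduped_sql = sqlContext.sql('SELECT DISTINCT * FROM raw_events')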
I have looked through the logs and do not see any WARNINGs or ERRORs; the
executors just seem to stop logging.
I am running Spark 1.5.2 on YARN.
On Dec 7, 2015, at 1:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:
bq. complete a shuffle stage due to lost executors
Have you taken a look at t
Here is the trace I get from the command line:
[Stage 4:> (60 + 60) / 200]
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnS
Okay, maybe these errors are more helpful:
WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(