Encoding issue reading text file

2018-10-18 Thread Masf
Hi everyone, I'm trying to read a text file with UTF-16LE but I'm getting weird characters like this: �� W h e n My code is this one: sparkSession .read .format("text") .option("charset", "UTF-16LE") .load("textfile.txt") I'm using Spark 2.3.1. Any idea how to fix it?
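A hedged workaround sketch (not verified against Spark 2.3.1, where the text source may silently ignore the charset option): the CSV source exposes an "encoding" option, so the file can be read through it as a single column. The separator and multiLine settings below are assumptions.

    val df = sparkSession.read
      .format("csv")
      .option("encoding", "UTF-16LE")   // the CSV source honours encoding; on some 2.x versions only with multiLine
      .option("multiLine", "true")
      .option("sep", "\u0001")          // a separator that should not occur, so each line lands in one column
      .load("textfile.txt")
    df.show(5, false)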

Dataset error with Encoder

2018-05-12 Thread Masf
Hi, I have the following issue: case class Item (c1: String, c2: String, c3: Option[BigDecimal]) import sparkSession.implicits._ val result = df.as[Item].groupByKey(_.c1).mapGroups((key, value) => { value }) But I get the following error at compile time: Unable to find encoder for type stor
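A hedged sketch of one likely cause and fix: the lambda returns the raw Iterator[Item], for which Spark has no encoder; returning a concrete encodable value per group compiles.

    import sparkSession.implicits._

    val result = df.as[Item]
      .groupByKey(_.c1)
      .mapGroups { (key, values) =>
        values.toSeq   // Seq[Item] has an encoder provided by implicits._; the raw Iterator does not
      }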

Hbase and Spark

2017-01-29 Thread Masf
I'm trying to build an application where it is necessary to do bulkGets and bulkLoads on HBase. I think that I could use this component https://github.com/hortonworks-spark/shc *Is it a good option?* But *I can't import it in my project*. sbt cannot resolve the HBase connector. This is my build.sbt:
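A hedged build.sbt fragment; the repository URL and artifact coordinates below are assumptions taken from the shc README and must be matched to the Spark/Scala/HBase versions in use.

    resolvers += "Hortonworks repository" at "https://repo.hortonworks.com/content/groups/public/"

    libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"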

Testing with spark testing base

2015-12-05 Thread Masf
Hi. I'm testing "spark testing base". For example: class MyFirstTest extends FunSuite with SharedSparkContext{ def tokenize(f: RDD[String]) = { f.map(_.split(" ").toList) } test("really simple transformation"){ val input = List("hi", "hi miguel", "bye") val expected = List(List(
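For reference, a complete minimal sketch of the same test, assuming spark-testing-base's SharedSparkContext (which provides sc):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.apache.spark.rdd.RDD
    import org.scalatest.FunSuite

    class MyFirstTest extends FunSuite with SharedSparkContext {
      def tokenize(f: RDD[String]) = f.map(_.split(" ").toList)

      test("really simple transformation") {
        val input = List("hi", "hi miguel", "bye")
        val expected = List(List("hi"), List("hi", "miguel"), List("bye"))
        assert(tokenize(sc.parallelize(input)).collect().toList === expected)
      }
    }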

Re: Debug Spark

2015-12-02 Thread Masf
i.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ >> >> Thanks >> Best Regards >> >> On Sun, Nov 29, 2015 at 9:48 PM, Masf wrote: >> >>> Hi >>> >>> Is it possible to debug spark locally with IntelliJ or another IDE? >>> >>> Thanks >>> >>> -- >>> Regards. >>> Miguel Ángel >>> >> >> > -- Regards. Miguel Ángel

Re: Debug Spark

2015-11-29 Thread Masf
Hi Ardo, is there some tutorial on debugging with IntelliJ? Thanks. Regards, Miguel. On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR wrote: > hi, > > IntelliJ is just great for that! > > cheers, > Ardo. > > On Sun, Nov 29, 2015 at 5:18 PM, Masf wrote: > >> Hi >> &

Debug Spark

2015-11-29 Thread Masf
Hi Is it possible to debug Spark locally with IntelliJ or another IDE? Thanks -- Regards. Miguel Ángel
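A common approach, sketched here with assumed names: run the driver in-process with a local master so IDE breakpoints are hit directly.

    import org.apache.spark.{SparkConf, SparkContext}

    object DebugLocally {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("debug-locally").setMaster("local[*]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).sum())   // set a breakpoint here and run with the IDE debugger
        sc.stop()
      }
    }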

Re: SQLContext load. Filtering files

2015-08-27 Thread Masf
filter function > <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext> > as the second argument. > > Thanks > Best Regards > > On Wed, Aug 19, 2015 at 10:46 PM, Masf wrote: > >> Hi. >> >> I'd like

Spark 1.3. Insert into hive parquet partitioned table from DataFrame

2015-08-20 Thread Masf
Hi. I have a DataFrame and I want to insert this data into a partitioned Parquet table in Hive. In Spark 1.4 I can use df.write.partitionBy("x","y").format("parquet").mode("append").saveAsTable("tbl_parquet") but in Spark 1.3 I can't. How can I do it? Thanks -- Regards Miguel
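A hedged sketch of what would have to be done on Spark 1.3, where DataFrameWriter.partitionBy does not exist: register the DataFrame as a temporary table and use a dynamic-partition INSERT through HiveContext. Table and column names are placeholders.

    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    df.registerTempTable("tmp_src")
    hiveContext.sql(
      "INSERT INTO TABLE tbl_parquet PARTITION (x, y) SELECT col1, col2, x, y FROM tmp_src")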

SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi. I'd like to read Avro files using this library https://github.com/databricks/spark-avro I need to load several files from a folder, not all files. Is there some functionality to filter the files to load? And... Is it possible to know the names of the files loaded from a folder? My problem is
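A hedged sketch of one way to do both: list the folder with the Hadoop FileSystem API, keep only the files you want (the predicate below is an example), and load each path. The filtered list also answers the second question, since it is exactly the set of file names that were loaded.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val wanted = fs.listStatus(new Path("/data/avro/folder"))
      .map(_.getPath.toString)
      .filter(_.endsWith(".avro"))        // example predicate; filter however you need

    val df = wanted
      .map(p => sqlContext.read.format("com.databricks.spark.avro").load(p))
      .reduce(_ unionAll _)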

Dataframe Partitioning

2015-05-28 Thread Masf
Hi. I have 2 DataFrames with 1 and 12 partitions respectively. When I do an inner join between these DataFrames, the result contains 200 partitions. *Why?* df1.join(df2, df1("id") === df2("id"), "Inner") => returns 200 partitions Thanks!!! -- Regards. Miguel Ángel
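The 200 comes from spark.sql.shuffle.partitions, which defaults to 200 and controls how many partitions a Spark SQL shuffle (join or aggregation) produces. A sketch of changing it (the value 24 is only an example):

    sqlContext.setConf("spark.sql.shuffle.partitions", "24")

    val joined = df1.join(df2, df1("id") === df2("id"), "inner")   // now 24 post-shuffle partitions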

Re: Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I think that it's possible to do: *df.select($"*", lit(null).as("col17"), lit(null).as("col18"), lit(null).as("col19"), ..., lit(null).as("col26"))* Any other advice? Miguel. On Wed, May 27, 2015 at 5:02 PM, Masf wrote: > Hi. >
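A hedged generalisation of the same idea, so the missing columns and their types are taken from df2's schema rather than written out by hand (the cast keeps unionAll happy when column types matter):

    import org.apache.spark.sql.functions.lit

    val missing = df2.schema.filterNot(f => df1.columns.contains(f.name))
    val df1Padded = missing.foldLeft(df1) { (acc, f) =>
      acc.withColumn(f.name, lit(null).cast(f.dataType))   // add each missing column as a typed null
    }
    val unioned = df1Padded.select(df2.columns.map(df1Padded(_)): _*).unionAll(df2)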

Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). I want to do a UnionAll, so I want to add 10 columns to df1 in order to have the same number of columns in both DataFrames. Is there some alternative to "withColumn"? Thanks -- Regards. Miguel Ángel

Re: DataFrame. Conditional aggregation

2015-05-27 Thread Masf
3) minimum, > sum(case when endrscp>100 then 1 else 0 end test from j' > > Let me know if this works. > On 26 May 2015 23:47, "Masf" wrote: > >> Hi >> I don't know how it works. For example: >> >> val result = joinedData.groupBy("co

Re: DataFrame. Conditional aggregation

2015-05-26 Thread Masf
o it? Thanks Regards. Miguel. On Tue, May 26, 2015 at 12:35 AM, ayan guha wrote: > Case when col2>100 then 1 else col2 end > On 26 May 2015 00:25, "Masf" wrote: > >> Hi. >> >> In a dataframe, How can I execution a conditional sentence in a >>

DataFrame. Conditional aggregation

2015-05-25 Thread Masf
Hi. In a DataFrame, how can I execute a conditional expression in an aggregation? For example, can I translate this SQL statement to the DataFrame API?: SELECT name, SUM(IF(table.col2 > 100, 1, table.col1)) FROM table GROUP BY name Thanks -- Regards. Miguel
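A hedged DataFrame translation using when/otherwise (org.apache.spark.sql.functions.when, available from Spark 1.4); the DataFrame and column names follow the SQL above:

    import org.apache.spark.sql.functions.{sum, when}

    val result = df.groupBy("name")
      .agg(sum(when(df("col2") > 100, 1).otherwise(df("col1"))).as("cond_sum"))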

Inserting Nulls

2015-05-05 Thread Masf
Hi. I have a Spark application where I store the results into a table (with HiveContext). Some of these columns allow nulls. In Scala, these columns are represented through Option[Int] or Option[Double], depending on the data type. For example: *val hc = new HiveContext(sc)* *var col1: Option[Integer
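A minimal sketch of the pattern, assuming the goal is to let nullable columns flow into Hive as Options; the case class, table and column names are placeholders.

    import hc.implicits._

    case class Rec(col1: Option[Int], col2: Option[Double], name: String)

    val df = sc.parallelize(Seq(Rec(Some(1), None, "a"), Rec(None, Some(2.5), "b"))).toDF()
    df.registerTempTable("tmp_rec")
    hc.sql("INSERT INTO TABLE my_table SELECT col1, col2, name FROM tmp_rec")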

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read Parquet files, I've observed that Spark generates as many partitions as there are Parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x => number of partitions. Depending on your case, repartition could be a heavy task. Regards. Miguel

Re: Opening many Parquet files = slow

2015-04-15 Thread Masf
Hi guys. Regarding Parquet files: I have Spark 1.2.0 and reading 27 Parquet files (250 MB/file) takes 4 minutes. I have a cluster with 4 nodes and that seems too slow to me. The "load" function is not available in Spark 1.2, so I can't test it. Regards. Miguel. On Mon, Apr 13, 2015 at 8:12 PM,

Re: Increase partitions reading Parquet File

2015-04-14 Thread Masf
00)? > > --- Original Message --- > > From: "Masf" > Sent: April 9, 2015 11:45 PM > To: user@spark.apache.org > Subject: Increase partitions reading Parquet File > > Hi > > I have this statement: > > val file = > SqlContext.parquetfile("hd

Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 < -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0)) ,SUM(I

Re: Error reading smallint in hive table with parquet format

2015-04-02 Thread Masf
2015 at 7:53 AM, Masf wrote: > >> >> Hi. >> >> In Spark SQL 1.2.0, with HiveContext, I'm executing the following >> statement: >> >> CREATE TABLE testTable STORED AS PARQUET AS >> SELECT >> field1 >> FROM table1 >> >>

Error reading smallint in hive table with parquet format

2015-04-01 Thread Masf
Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1 *field1 is SMALLINT. If table1 is in text format everything is OK, but if table1 is in Parquet format, Spark returns the following error*: 15/04/

Re: Error in Delete Table

2015-03-31 Thread Masf
Hi Ted. Spark 1.2.0 and Hive 0.13.1. Regards. Miguel Angel. On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu wrote: > Which Spark and Hive release are you using ? > > Thanks > > > > > On Mar 27, 2015, at 2:45 AM, Masf wrote: > > > > Hi. > > > > In HiveC

Re: Too many open files

2015-03-30 Thread Masf
y/limits.conf set the next values: > > Have you done the above modification on all the machines in your Spark > cluster ? > > If you use Ubuntu, be sure that the /etc/pam.d/common-session file > contains the following line: > > session required pam_limits.so > > > On M

Re: Too many open files

2015-03-30 Thread Masf
the machines to get the ulimit effect (or > relogin). What operation are you doing? Are you doing too many > repartitions? > > Thanks > Best Regards > > On Mon, Mar 30, 2015 at 4:52 PM, Masf wrote: > >> Hi >> >> I have a problem with temp data in Spark.

Too many open files

2015-03-30 Thread Masf
Hi. I have a problem with temp data in Spark. I have set spark.shuffle.manager to "SORT". In /etc/security/limits.conf I set the following values: * soft nofile 100 * hard nofile 100 In spark-env.sh I set ulimit -n 100 I've restarted the Spark service and it

Error in Delete Table

2015-03-27 Thread Masf
Hi. In HiveContext, when I issue the statement "DROP TABLE IF EXISTS TestTable" and TestTable doesn't exist, Spark returns an error: ERROR Hive: NoSuchObjectException(message:default.TestTable table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
ow function support in 1.4.0. But it's not a promise > yet. > > Cheng > > On 3/26/15 7:27 PM, Arush Kharbanda wrote: > > Its not yet implemented. > > https://issues.apache.org/jira/browse/SPARK-1442 > > On Thu, Mar 26, 2015 at 4:39 PM, Masf wrote: > >

Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Hi. Are the windowing and analytics functions supported in Spark SQL (with HiveContext or not)? For example, in Hive they are supported: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Is there some tutorial or documentation where I can see all the features supported by Spark SQ
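For reference, once window-function support landed (Spark 1.4 with HiveContext, per the reply above), a Hive-style query runs directly; a hedged sketch with placeholder table and column names:

    val ranked = hiveContext.sql(
      "SELECT name, value, ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS rn FROM t")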

Re: Issues with SBT and Spark

2015-03-19 Thread Masf
Hi. Spark 1.2.1 uses Scala 2.10. Because of this, your program fails with Scala 2.11. Regards On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan wrote: > My current simple.sbt is > > name := "SparkEpiFast" > > version := "1.0" > > scalaVersion := "2.11.4" > > libraryDependencies += "org.apach
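A hedged sketch of the corrected simple.sbt, aligning scalaVersion with the Scala 2.10 build of Spark 1.2.1 (the exact dependency string is an assumption and should be adapted to the build in question):

    name := "SparkEpiFast"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"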

Re: Spark SQL. Cast to Bigint

2015-03-17 Thread Masf
it). > Can you try HiveContext for now? > > On Fri, Mar 13, 2015 at 4:48 AM, Masf wrote: > >> Hi. >> >> I have a query in Spark SQL and I can not covert a value to BIGINT: >> CAST(column AS BIGINT) or >> CAST(0 AS BIGINT) >> >> The output is:

Hive error on partitioned tables

2015-03-17 Thread Masf
Hi. I'm running Spark 1.2.0. I have HiveContext and I execute the following query: select sum(field1 / 100) from table1 group by field2; field1 in the Hive metastore is a smallint. The schema detected by HiveContext is an int32: fileSchema: message schema { optional int32 field1; ...

Re: Parquet and repartition

2015-03-16 Thread Masf
t; means here. > > On Mon, Mar 16, 2015 at 11:11 AM, Masf wrote: > > Hi all. > > > > When I specify the number of partitions and save this RDD in parquet > format, > > my app fail. For example > > > > selectTest.coalesce(28).saveAsParquetFile("hdfs

Parquet and repartition

2015-03-16 Thread Masf
Hi all. When I specify the number of partitions and save this RDD in Parquet format, my app fails. For example: selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput") However, it works well if I store the data as text: selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput") M
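A hedged workaround often suggested in this situation, not a confirmed fix: use repartition (which shuffles) instead of coalesce, so each output task writes a smaller, more uniform Parquet file.

    selectTest.repartition(28).saveAsParquetFile("hdfs://vm-clusterOutput")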

Spark SQL. Cast to Bigint

2015-03-13 Thread Masf
Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT: CAST(column AS BIGINT) or CAST(0 AS BIGINT) The output is: Exception in thread "main" java.lang.RuntimeException: [34.62] failure: ``DECIMAL'' expected but identifier BIGINT found Thanks!! Regards. Miguel Ángel
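A hedged workaround in line with the reply above: the HiveQL parser of HiveContext accepts BIGINT in CAST, while the plain SQLContext parser of that era did not. Table and column names are placeholders.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT CAST(my_col AS BIGINT) AS my_col_big FROM my_table")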

Re: Read parquet folders recursively

2015-03-12 Thread Masf
read recursively. > > > You could give it a try > https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz > > Thanks > Best Regards > > On Wed, Mar 11, 2015 at 9:45 PM, Masf wrote: > >> Hi all >> >> Is it possible to read recursively folders to read parquet files? >> >> >> Thanks. >> >> -- >> >> >> Regards. >> Miguel Ángel >> > > -- Regards. Miguel Ángel

Read parquet folders recursively

2015-03-11 Thread Masf
Hi all. Is it possible to read folders recursively in order to read Parquet files? Thanks. -- Regards. Miguel Ángel
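A hedged sketch: Spark's Parquet reader accepts Hadoop glob patterns, so nested folders a known number of levels deep can be picked up without true recursion; the paths below are examples.

    val oneLevel  = sqlContext.parquetFile("hdfs:///data/base/*/")
    val twoLevels = sqlContext.parquetFile("hdfs:///data/base/*/*/")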