[jira] [Created] (ZEPPELIN-5222) Zeppelin hangs on simple query with datasets

Pasha Finkeshteyn (Jira) Mon, 25 Jan 2021 01:31:15 -0800

Pasha Finkeshteyn created ZEPPELIN-5222:
-------------------------------------------


             Summary: Zeppelin hangs on simple query with datasets
                 Key: ZEPPELIN-5222
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-5222
             Project: Zeppelin
          Issue Type: Bug
          Components: spark
    Affects Versions: 0.9.0, 0.8.2
         Environment: OS: Linux (tried Manjaro and Ubuntu)
                      Zeppelin version: 0.9 release and 0.8.2
                      Java version: 8 and 11
            Reporter: Pasha Finkeshteyn


Query


{code:scala}
case class Movie(movieId: Long, title: String, genres: String)

case class MovieWithGenresAndYear(movieId: Long, title: String, genres: List[String], year: Integer)
case class MovieExploded(movieId: Long, title: String, genres: List[String])

case class MovieAggregate(year: Int, count: Long)

import spark.implicits._

val df = spark
        .read
        .option("header", true)
        .option("inferSchema", true)
        .option("mode", "DROPMALFORMED")
        .csv("/home/finkel/Downloads/ml-latest/movies.csv")
        .as[Movie]
        .map(it => MovieExploded(it.movieId, it.title, it.genres.split('|').map(_.trim).toList))
        .map {
            case MovieExploded(movieId, title, genres) =>
                if (!title.matches("\"?.*\\(\\d{4}\\)\\s*\"?"))
                    MovieWithGenresAndYear(movieId, title, genres, null)
                else {
                    val lastOpen = title.lastIndexOf('(')
                    val year = title.substring(lastOpen + 1).replace(")", "").replace("\"", "").trim.toInt
                    MovieWithGenresAndYear(movieId, title.substring(0, lastOpen), genres, year)
                }
        }
        .filter(_.year != null)
        .groupByKey(_.year)
        .mapGroups((k, v) =>
            (k, v.size)
        )
        .show(300, false)
{code}


It hangs forever on a simple dataset:


{code:csv}
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
{code}

The very same query completes almost instantly in Spark Shell.

I can't reproduce the hang on Mac.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
