Pasha Finkeshteyn created ZEPPELIN-5222:
-------------------------------------------
Summary: Zeppelin hangs on simple query with datasets
Key: ZEPPELIN-5222
URL: https://issues.apache.org/jira/browse/ZEPPELIN-5222
Project: Zeppelin
Issue Type: Bug
Components: spark
Affects Versions: 0.9.0, 0.8.2
Environment: OS: Linux (tried Manjaro and Ubuntu)
Zeppelin version: 0.9 release and 0.8.2
Java version: 8 and 11
Reporter: Pasha Finkeshteyn
Query:
{code:scala}
case class Movie(movieId: Long, title: String, genres: String)
case class MovieWithGenresAndYear(movieId: Long, title: String, genres: List[String], year: Integer)
case class MovieExploded(movieId: Long, title: String, genres: List[String])
case class MovieAggregate(year: Int, count: Long)
import spark.implicits._
val df = spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .option("mode", "DROPMALFORMED")
  .csv("/home/finkel/Downloads/ml-latest/movies.csv")
  .as[Movie]
  .map(it => MovieExploded(it.movieId, it.title, it.genres.split('|').map(_.trim).toList))
  .map {
    case MovieExploded(movieId, title, genres) =>
      if (!title.matches("\"?.*\\(\\d{4}\\)\\s*\"?"))
        MovieWithGenresAndYear(movieId, title, genres, null)
      else {
        val lastOpen = title.lastIndexOf('(')
        val year = title.substring(lastOpen + 1).replace(")", "").replace("\"", "").trim.toInt
        MovieWithGenresAndYear(movieId, title.substring(0, lastOpen), genres, year)
      }
  }
  .filter(_.year != null)
  .groupByKey(_.year)
  .mapGroups((k, v) => (k, v.size))
  .show(300, false)
{code}
It hangs forever, even on a trivial dataset:
{code:csv}
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
{code}
The very same query completes almost instantly in Spark Shell.
I can't reproduce the hang on macOS.