Thanks Jacek.

The old way, using the Databricks spark-csv package:

scala> val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
df: org.apache.spark.sql.DataFrame = [Transaction Date: string, Transaction Type: string ... 7 more fields]

Now, with Spark 2.0's native reader, I can simply do:

scala> val df2 = spark.read.option("header", true).csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
df2: org.apache.spark.sql.DataFrame = [Transaction Date: string, Transaction Type: string ... 7 more fields]
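
As you say below, format("csv") is the long-form equivalent, so this should behave the same (a sketch, untested; df2b is just an illustrative name):

val df2b = spark.read.format("csv").
  option("header", true).
  load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")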

On the schema: the first read passed inferSchema, so Spark worked out the column types itself, while the second did not, so every column comes back as string:

scala> df.printSchema
root
 |-- Transaction Date: string (nullable = true)
 |-- Transaction Type: string (nullable = true)
 |-- Sort Code: string (nullable = true)
 |-- Account Number: integer (nullable = true)
 |-- Transaction Description: string (nullable = true)
 |-- Debit Amount: double (nullable = true)
 |-- Credit Amount: double (nullable = true)
 |-- Balance: double (nullable = true)
 |-- _c8: string (nullable = true)

scala> df2.printSchema
root
 |-- Transaction Date: string (nullable = true)
 |-- Transaction Type: string (nullable = true)
 |-- Sort Code: string (nullable = true)
 |-- Account Number: string (nullable = true)
 |-- Transaction Description: string (nullable = true)
 |-- Debit Amount: string (nullable = true)
 |-- Credit Amount: string (nullable = true)
 |-- Balance: string (nullable = true)
 |-- _c8: string (nullable = true)
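
Passing inferSchema to the native reader should bring df2 in line with df. A minimal sketch (df3 is just an illustrative name, untested here):

// inferSchema makes Spark take an extra pass over the data to work out
// column types, which is why df above came back with integer and double
// columns while df2 is all strings.
val df3 = spark.read.
  option("header", true).
  option("inferSchema", true).
  csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")

Alternatively, an explicit schema can be supplied with .schema(...) to skip the inference pass over the data altogether.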

Cheers
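
p.s. On the tsv question that started the thread: my understanding is that the same native reader takes a separator option, and the Java API is the same DataFrameReader call chain, so it should work there too. A sketch, assuming the "sep" option and an illustrative path:

val tsv = spark.read.
  option("header", true).
  option("sep", "\t").   // the old spark-csv name "delimiter" should also be accepted
  csv("hdfs://rhes564:9000/data/stg/somefile.tsv")   // illustrative path, not a real file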

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 September 2016 at 13:12, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Mich,
>
> CSV is now one of the 7 formats supported by SQL in 2.0. No need to
> use "com.databricks.spark.csv" and --packages. A mere format("csv") or
> csv(path: String) would do it. The options are the same.
>
> p.s. Yup, when I read TSV I thought about time series data that I
> believe got its own file format and support @ spark-packages.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Sep 10, 2016 at 8:00 AM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
> > I gather the title should say CSV as opposed to tsv?
> >
> > Also, when the term spark-csv is used, is it a reference to the Databricks stuff?
> >
> > val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load......
> >
> > or is it something new in 2.0, like spark-sql etc.?
> >
> > Thanks
> >
> >
> > On 10 September 2016 at 12:37, Jacek Laskowski <ja...@japila.pl> wrote:
> >>
> >> Hi,
> >>
> >> If Spark 2.0 supports a format, use it. For CSV it's csv() or
> >> format("csv"). It should be supported by Scala and Java. If the API's
> >> broken for Java (but works for Scala), you'd have to create a "bridge"
> >> yourself or report an issue in Spark's JIRA @
> >> https://issues.apache.org/jira/browse/SPARK.
> >>
> >> Have you run into any issues with CSV and Java? Share the code.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> ----
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Sat, Sep 10, 2016 at 7:30 AM, Muhammad Asif Abbasi
> >> <asif.abb...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I would like to know what is the most efficient way of reading tsv in
> >> > Scala,
> >> > Python and Java with Spark 2.0.
> >> >
> >> > I believe that with Spark 2.0, CSV is a native source based on the
> >> > spark-csv module, and we can potentially read a "tsv" file by specifying
> >> >
> >> > 1. Option ("delimiter","\t") in Scala
> >> > 2. sep declaration in Python.
> >> >
> >> > However, I am unsure what the best way is to achieve this in Java.
> >> > Furthermore, are the above the best ways to read a tsv file?
> >> >
> >> > Appreciate a response on this.
> >> >
> >> > Regards.
> >>
> >>
> >
>
