Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Han-Cheol Cho
at 3:09 PM, Takeshi Yamamuro wrote:
> Hi,
>
> Since the csv source currently supports only ASCII-compatible charsets, I
> guess shift-jis also works well.
> You could check Hyukjin's comment in
> https://issues.apache.org/jira/browse/SPARK-21289 for more info.
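A short plain-Python sketch of why an "ASCII-compatible" charset such as Shift-JIS can work with a byte-oriented CSV reader (this illustrates the general property, not Spark's internal code): the delimiter bytes (comma, newline, quote) have the same values in ASCII and Shift-JIS, and valid multi-byte Shift-JIS sequences never contain those delimiter bytes.

```python
# Delimiters encode to identical bytes under ASCII and Shift-JIS,
# so byte-level line/field splitting stays safe.
ascii_chars = ",\n\"abc123"
assert ascii_chars.encode("ascii") == ascii_chars.encode("shift_jis")

# A non-ASCII field round-trips through Shift-JIS, and its encoded
# bytes contain no delimiter byte that could confuse the splitter.
field = "8月データ"
raw = field.encode("shift_jis")
assert raw.decode("shift_jis") == field
assert b"," not in raw and b"\n" not in raw
```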

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
```
…でやってみよう|
+------+

spark.read.option("encoding", "sjis").option("multiLine", true).csv("b.txt").show(1)
+------+
|   _c0|
+------+
|8月データだけでやってみよう|
+------+
```
I am still digging into the root cause and will share it later :-) Best
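A hedged plain-Python sketch of the suspected symptom: if the multiLine code path decodes the file as UTF-8 regardless of the "encoding" option, Shift-JIS bytes come out garbled or fail to decode. (This demonstrates the decoding mismatch itself, not Spark's internals.)

```python
# The sample row from the mail, encoded as the file on disk would be.
text = "8月データだけでやってみよう"
raw = text.encode("shift_jis")

# Decoding with the declared charset works as expected...
assert raw.decode("shift_jis") == text

# ...but forcing UTF-8 on the same bytes fails, which would explain
# the broken output when the encoding option is ignored.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert utf8_ok is False
```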

Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
… && file.start == 0
UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
}

It seems like a bug. Is there anyone who has had the same problem before?

Best wishes,
Han-Cheol

--
Han-Cheol Cho, Ph.D.
Data Scientist, Data Science Team, Data Lab
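The quoted condition comes from Spark's CSV reading path. A minimal plain-Python sketch of what `shouldDropHeader && file.start == 0` expresses (names here are illustrative, not Spark's API): a split of a CSV file should drop the header line only when it is the first split, i.e. when its byte offset in the file is 0.

```python
def parse_split(lines, header, start_offset):
    """Drop the header only for the split that begins the file."""
    drop = header and start_offset == 0
    return lines[1:] if drop else lines

rows = ["col_a,col_b", "1,2", "3,4"]
# First split of the file: header is dropped.
assert parse_split(rows, header=True, start_offset=0) == ["1,2", "3,4"]
# A later split of the same file keeps all of its lines.
assert parse_split(["5,6", "7,8"], header=True, start_offset=128) == ["5,6", "7,8"]
```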

strange usage of tempfile.mkdtemp() in PySpark mllib.recommendation doctest

2017-03-02 Thread Han-Cheol Cho
data in HDFS. After all, the doctest removes only the LOCAL temp directory using shutil.rmtree(). Shouldn't we delete the temporary directory in HDFS too?

Best wishes,
HanCheol

Han-Cheol Cho
Data Laboratory / Data Scientist
Shinjuku Eastside Square 13F, 6-27-30 Shinjuku, Shinjuku-ku, Tokyo 160-0022
Email hancheol@nhn-techorus.com
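A small sketch of the pattern the question is about: the doctest creates a path with tempfile.mkdtemp() and cleans up with shutil.rmtree(), but rmtree only touches the local filesystem. If the same path string was also used to save a model to HDFS (as in the doctest), the HDFS copy would survive the cleanup; the lines below show only the local half of that pattern.

```python
import os
import shutil
import tempfile

path = tempfile.mkdtemp()
assert os.path.isdir(path)

# ...the doctest would call something like model.save(sc, path) here,
# which on a cluster writes under this path string in HDFS...

shutil.rmtree(path)  # deletes only the LOCAL directory
assert not os.path.exists(path)
```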

A question about inconsistency during dataframe creation with RDD/dict in PySpark

2017-02-01 Thread Han-Cheol Cho
urdd.collect()
[{'k2': 'v1.2', 'k1': 'v1.1'}, {'k1': 'v2.1'}]
spark.createDataFrame(urdd).show()
+----+----+
|  k1|  k2|
+----+----+
|v1.1|v1.2|
|v2.1|null|
+----+----+
urdd.toDF
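A plain-Python sketch of the schema inference the output above shows (illustrative, not Spark's implementation): when rows are dicts with differing keys, the columns come from the union of the keys and any missing key becomes null. Note that PySpark itself warns that inferring a schema from dicts is deprecated and recommends Row objects instead.

```python
rows = [{'k2': 'v1.2', 'k1': 'v1.1'}, {'k1': 'v2.1'}]

# Columns = union of all keys seen across the rows.
columns = sorted({k for r in rows for k in r})
# A key missing from a row becomes None (rendered as "null" by show()).
table = [[r.get(c) for c in columns] for r in rows]

assert columns == ['k1', 'k2']
assert table == [['v1.1', 'v1.2'], ['v2.1', None]]
```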

null values returned by max() over a window function

2016-11-28 Thread Han-Cheol Cho
…Tablet 6500 null
+--+--+---+---+
As you can see, the last column calculates the max value among the current row, the two rows before it, and the two rows after it, partitioned by the category column. However, the result for the last two rows in each category partition is null…
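A plain-Python sketch of the window frame being described, i.e. rowsBetween(-2, 2) within a partition (this models the expected semantics, not Spark's implementation): at the partition edges the frame simply shrinks, so a correct evaluation still returns a value for every row. That is why nulls in the last two rows of each partition look like a bug rather than expected behavior.

```python
def window_max(values, before=2, after=2):
    """Max over [current - before, current + after], clipped to the partition."""
    return [max(values[max(0, i - before): i + after + 1])
            for i in range(len(values))]

prices = [100, 6500, 200, 300]
# The frame shrinks at both edges but never becomes empty,
# so every row gets a (non-null) result.
assert window_max(prices) == [6500, 6500, 6500, 6500]
assert window_max([1, 2, 3, 4, 5]) == [3, 4, 5, 5, 5]
```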