Hi Ewan,

To start up the cluster I simply ran ./sbin/start-master.sh from the master
node and ./sbin/start-slave.sh <master-spark-URL> from the slave. I didn't
configure HDFS explicitly. Is there something additional that has to be done?
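To confirm the diagnosis, here is a minimal check that can be run from the
notebook (a sketch only: it goes through PySpark's internal _jsc handle, which
is not a public API, and assumes sc is the SparkContext the notebook created)
to see which filesystem relative paths resolve to:

# sc is assumed to be the notebook's SparkContext; _jsc is a private
# PySpark handle, so treat this as illustrative rather than supported API.
hadoop_conf = sc._jsc.hadoopConfiguration()
# "file:///" here means each node writes its output to its own local disk.
print(hadoop_conf.get("fs.defaultFS"))

If that prints file:///, it would explain the split output shown below: each
node keeps only the task files it wrote locally.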
On Fri, Sep 4, 2015 at 12:42 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> From that, I'd guess that HDFS isn't set up between the nodes, or for some
> reason writes are defaulting to file:///path/ rather than hdfs:///path/
>
> ------ Original message------
>
> *From: *Amila De Silva
> *Date: *Thu, 3 Sep 2015 17:12
> *To: *Ewan Leith;
> *Cc: *user@spark.apache.org;
> *Subject: *Re: Problem while loading saved data
>
> Hi Ewan,
>
> Yes, 'people.parquet' is from the first attempt, and in that attempt it
> tried to save the same people.json.
>
> It seems that the same folder is created on both nodes and the contents
> of the files are distributed between the two servers.
>
> On the master node (the same node which runs the IPython Notebook) this
> is what I have:
>
> people.parquet
> └── _SUCCESS
>
> On the slave I get:
>
> people.parquet
> └── _temporary
>     └── 0
>         ├── task_201509030057_4699_m_000000
>         │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>         ├── task_201509030057_4699_m_000001
>         │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>         └── _temporary
>
> I have zipped and attached both folders.
>
> On Thu, Sep 3, 2015 at 5:58 PM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
>> Your error log shows you attempting to read from 'people.parquet2', not
>> 'people.parquet' as you've put below; is that just from a different attempt?
>>
>> Otherwise, it's an odd one! There aren't _SUCCESS, _common_metadata and
>> _metadata files under the people.parquet you've listed below, which would
>> normally be created when the write completes. Can you show us your write
>> output?
>>
>> Thanks,
>>
>> Ewan
>>
>> *From:* Amila De Silva [mailto:jaa...@gmail.com]
>> *Sent:* 03 September 2015 05:44
>> *To:* Guru Medasani <gdm...@gmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Problem while loading saved data
>>
>> Hi Guru,
>>
>> Thanks for the reply.
>>
>> Yes, I checked whether the file exists. But instead of a single file, what
>> I found was a directory with the following structure:
>>
>> people.parquet
>> └── _temporary
>>     └── 0
>>         ├── task_201509030057_4699_m_000000
>>         │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>>         ├── task_201509030057_4699_m_000001
>>         │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>>         └── _temporary
>>
>> On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani <gdm...@gmail.com> wrote:
>>
>> Hi Amila,
>>
>> The error says that the 'people.parquet' file does not exist. Can you
>> manually check to see if that file exists?
>>
>> Py4JJavaError: An error occurred while calling o53840.parquet.
>> : java.lang.AssertionError: assertion failed: No schema defined, and no
>> Parquet data file or summary file found under
>> file:/home/ubuntu/ipython/people.parquet2.
>>
>> Guru Medasani
>> gdm...@gmail.com
>>
>> On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I have a two-node Spark cluster, to which I'm connecting using an IPython
>> notebook.
>>
>> To see how data saving/loading works, I simply created a dataframe from
>> people.json using the code below:
>>
>> df = sqlContext.read.json("examples/src/main/resources/people.json")
>>
>> Then called the following to save the dataframe as a parquet file:
>> >> df.write.save("people.parquet") >> >> >> >> Tried loading the saved dataframe using; >> >> df2 = sqlContext.read.parquet('people.parquet'); >> >> >> >> But this simply fails giving the following exception >> >> >> >> --------------------------------------------------------------------------- >> >> Py4JJavaError Traceback (most recent call last) >> >> <ipython-input-97-35f91873c48f> in <module>() >> >> ----> 1 df2 = sqlContext.read.parquet('people.parquet2'); >> >> >> >> /srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path) >> >> 154 [('name', 'string'), ('year', 'int'), ('month', 'int'), >> ('day', 'int')] >> >> 155 """ >> >> --> 156 return >> self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path))) >> >> 157 >> >> 158 @since(1.4) >> >> >> >> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> __call__(self, *args) >> >> 536 answer = self.gateway_client.send_command(command) >> >> 537 return_value = get_return_value(answer, self.gateway_client, >> >> --> 538 self.target_id, self.name) >> >> 539 >> >> 540 for temp_arg in temp_args: >> >> >> >> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in >> get_return_value(answer, gateway_client, target_id, name) >> >> 298 raise Py4JJavaError( >> >> 299 'An error occurred while calling {0}{1}{2}.\n'. >> >> --> 300 format(target_id, '.', name), value) >> >> 301 else: >> >> 302 raise Py4JError( >> >> >> >> Py4JJavaError: An error occurred while calling o53840.parquet. >> >> : java.lang.AssertionError: assertion failed: No schema defined, and no >> Parquet data file or summary file found under >> file:/home/ubuntu/ipython/people.parquet2. >> >> at scala.Predef$.assert(Predef.scala:179) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) >> >> at scala.Option.orElse(Option.scala:257) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) >> >> at org.apache.spark.sql.parquet.ParquetRelation2.org >> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126) >> >> at org.apache.spark.sql.parquet.ParquetRelation2.org >> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) >> >> at scala.Option.getOrElse(Option.scala:120) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165) >> >> at >> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506) >> >> at >> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505) >> >> at >> org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30) >> >> at >> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438) >> >> at >> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264) >> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> >> at >> 
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:601)
>>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>>     at py4j.Gateway.invoke(Gateway.java:259)
>>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>     at java.lang.Thread.run(Thread.java:722)
>>
>> I'm using spark-1.4.1-bin-hadoop2.6 with Java 1.7.
>>
>> Thanks,
>> Amila
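As a workaround until HDFS is configured between the nodes, a sketch like the
following avoids the ambiguity Ewan describes by writing and reading with an
explicit hdfs:// URI, so the path cannot silently resolve to file:/// (the
namenode host "master" and port 9000 below are hypothetical placeholders, and
an HDFS daemon must actually be running and reachable from every node):

df = sqlContext.read.json("examples/src/main/resources/people.json")

# With an explicit hdfs:// URI, the driver and every executor resolve the
# same directory ("master" and 9000 are placeholder values for the namenode).
df.write.parquet("hdfs://master:9000/user/ubuntu/people.parquet")

df2 = sqlContext.read.parquet("hdfs://master:9000/user/ubuntu/people.parquet")
df2.show()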