Hi Ewan,

To start up the cluster I simply ran ./sbin/start-master.sh from the master
node and ./sbin/start-slave.sh <master-spark-URL> from the slave. I didn't
configure HDFS explicitly. Is there something additional that has to be done?
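To confirm the diagnosis, here is a minimal check that can be run from the
notebook (a sketch only: it goes through PySpark's internal _jsc handle, which
is not a public API, and assumes sc is the SparkContext the notebook created)
to see which filesystem relative paths resolve to:

# sc is assumed to be the notebook's SparkContext; _jsc is a private
# PySpark handle, so treat this as illustrative rather than supported API.
hadoop_conf = sc._jsc.hadoopConfiguration()
# "file:///" here means each node writes its output to its own local disk.
print(hadoop_conf.get("fs.defaultFS"))

If that prints file:///, it would explain the split output shown below: each
node keeps only the task files it wrote locally.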
On Fri, Sep 4, 2015 at 12:42 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> From that, I'd guess that HDFS isn't set up between the nodes, or for some
> reason writes are defaulting to file:///path/ rather than hdfs:///path/
>
> ------ Original message------
>
> *From: *Amila De Silva
> *Date: *Thu, 3 Sep 2015 17:12
> *To: *Ewan Leith;
> *Cc: *user@spark.apache.org;
> *Subject: *Re: Problem while loading saved data
>
> Hi Ewan,
>
> Yes, 'people.parquet' is from the first attempt, and in that attempt it
> tried to save the same people.json.
>
> It seems that the same folder is created on both nodes and the contents
> of the files are distributed between the two servers.
>
> On the master node (the same node which runs the IPython Notebook) this
> is what I have:
>
> people.parquet
> └── _SUCCESS
>
> On the slave I get:
>
> people.parquet
> └── _temporary
>     └── 0
>         ├── task_201509030057_4699_m_000000
>         │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>         ├── task_201509030057_4699_m_000001
>         │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>         └── _temporary
>
> I have zipped and attached both folders.
>
> On Thu, Sep 3, 2015 at 5:58 PM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
>> Your error log shows you attempting to read from 'people.parquet2', not
>> 'people.parquet' as you've put below; is that just from a different attempt?
>>
>> Otherwise, it's an odd one! There aren't _SUCCESS, _common_metadata and
>> _metadata files under the people.parquet you've listed below, which would
>> normally be created when the write completes. Can you show us your write
>> output?
>>
>> Thanks,
>>
>> Ewan
>>
>> *From:* Amila De Silva [mailto:jaa...@gmail.com]
>> *Sent:* 03 September 2015 05:44
>> *To:* Guru Medasani <gdm...@gmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Problem while loading saved data
>>
>> Hi Guru,
>>
>> Thanks for the reply.
>>
>> Yes, I checked whether the file exists. But instead of a single file, what
>> I found was a directory with the following structure:
>>
>> people.parquet
>> └── _temporary
>>     └── 0
>>         ├── task_201509030057_4699_m_000000
>>         │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>>         ├── task_201509030057_4699_m_000001
>>         │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
>>         └── _temporary
>>
>> On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani <gdm...@gmail.com> wrote:
>>
>> Hi Amila,
>>
>> The error says that the 'people.parquet' file does not exist. Can you
>> manually check to see if that file exists?
>>
>> Py4JJavaError: An error occurred while calling o53840.parquet.
>> : java.lang.AssertionError: assertion failed: No schema defined, and no
>> Parquet data file or summary file found under
>> file:/home/ubuntu/ipython/people.parquet2.
>>
>> Guru Medasani
>> gdm...@gmail.com
>>
>> On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I have a two-node Spark cluster, to which I'm connecting using an IPython
>> notebook.
>>
>> To see how data saving/loading works, I simply created a dataframe from
>> people.json using the code below:
>>
>> df = sqlContext.read.json("examples/src/main/resources/people.json")
>>
>> Then called the following to save the dataframe as a parquet file:
>> >> df.write.save("people.parquet") >> >> >> >> Tried loading the saved dataframe using; >> >> df2 = sqlContext.read.parquet('people.parquet'); >> >> >> >> But this simply fails giving the following exception >> >> >> >> --------------------------------------------------------------------------- >> >> Py4JJavaError Traceback (most recent call last) >> >> <ipython-input-97-35f91873c48f> in <module>() >> >> ----> 1 df2 = sqlContext.read.parquet('people.parquet2'); >> >> >> >> /srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path) >> >> 154 [('name', 'string'), ('year', 'int'), ('month', 'int'), >> ('day', 'int')] >> >> 155 """ >> >> --> 156 return >> self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path))) >> >> 157 >> >> 158 @since(1.4) >> >> >> >> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> __call__(self, *args) >> >> 536 answer = self.gateway_client.send_command(command) >> >> 537 return_value = get_return_value(answer, self.gateway_client, >> >> --> 538 self.target_id, self.name) >> >> 539 >> >> 540 for temp_arg in temp_args: >> >> >> >> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in >> get_return_value(answer, gateway_client, target_id, name) >> >> 298 raise Py4JJavaError( >> >> 299 'An error occurred while calling {0}{1}{2}.\n'. >> >> --> 300 format(target_id, '.', name), value) >> >> 301 else: >> >> 302 raise Py4JError( >> >> >> >> Py4JJavaError: An error occurred while calling o53840.parquet. >> >> : java.lang.AssertionError: assertion failed: No schema defined, and no >> Parquet data file or summary file found under >> file:/home/ubuntu/ipython/people.parquet2. >> >> at scala.Predef$.assert(Predef.scala:179) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) >> >> at scala.Option.orElse(Option.scala:257) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) >> >> at org.apache.spark.sql.parquet.ParquetRelation2.org >> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126) >> >> at org.apache.spark.sql.parquet.ParquetRelation2.org >> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) >> >> at scala.Option.getOrElse(Option.scala:120) >> >> at >> org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165) >> >> at >> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506) >> >> at >> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505) >> >> at >> org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30) >> >> at >> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438) >> >> at >> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264) >> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> >> at >> 
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:601)
>>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>>     at py4j.Gateway.invoke(Gateway.java:259)
>>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>     at java.lang.Thread.run(Thread.java:722)
>>
>> I'm using spark-1.4.1-bin-hadoop2.6 with Java 1.7.
>>
>> Thanks,
>> Amila
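As a workaround until HDFS is configured between the nodes, a sketch like the
following avoids the ambiguity Ewan describes by writing and reading with an
explicit hdfs:// URI, so the path cannot silently resolve to file:/// (the
namenode host "master" and port 9000 below are hypothetical placeholders, and
an HDFS daemon must actually be running and reachable from every node):

df = sqlContext.read.json("examples/src/main/resources/people.json")

# With an explicit hdfs:// URI, the driver and every executor resolve the
# same directory ("master" and 9000 are placeholder values for the namenode).
df.write.parquet("hdfs://master:9000/user/ubuntu/people.parquet")

df2 = sqlContext.read.parquet("hdfs://master:9000/user/ubuntu/people.parquet")
df2.show()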