I agree you'd be inviting consistency issues if you maintained a separate 
note-id-to-note-name mapping file.

But I'm still not comfortable with note ids in the name of the notebook itself. 
Those names would look ugly if you shared your notebooks on GitHub, for 
example. You don't see Jupyter notebooks with names like that. If you have to 
keep the note ids with the notebooks, could you not simply put the note id at 
the top of the notebook, as Ruslan suggested? Then you'd only have to read the 
first line of each notebook.

Presumably if you copied the notebooks to another Zeppelin server they would be 
restored with the same note ids there too? And hopefully there would be no id 
clash with notebooks already on that server…

From: Jeff Zhang <zjf...@gmail.com>
Sent: 14 August 2018 03:49
To: users@zeppelin.apache.org
Subject: EXT: Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of 
[NOTEID]/note.json


Thanks for the discussion.
>>> I'm worried about non-Latin symbols in folder and note names. And what 
>>> about hieroglyphs?
AFAIK, Linux allows any character in a file name except `\0` and '/'. I can 
create a file name with Chinese characters on Linux, so I guess you can use 
Russian as well.
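
To make that rule concrete, here is a minimal Python sketch of a per-component 
check (not Zeppelin code: the function name is hypothetical, and the 255-byte 
limit is an assumption based on the usual NAME_MAX of common Linux 
filesystems). A note name containing folders would be split on '/' first and 
each part checked separately.

```python
def is_valid_linux_file_name(name: str) -> bool:
    """Check one path component (not a full path) against the Linux rules."""
    # Empty names and the special entries "." and ".." cannot be used.
    if name in ("", ".", ".."):
        return False
    # The only forbidden characters are NUL and the path separator.
    if "\0" in name or "/" in name:
        return False
    # Assumption: the filesystem limits names to 255 bytes (typical NAME_MAX).
    # The limit applies to bytes, so non-Latin names use it up faster.
    return len(name.encode("utf-8")) <= 255
```

Under this check, Cyrillic or Chinese names pass, while a name such as "a/b" 
would be rejected, and Zeppelin could then refuse to create the note.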

>>> If I understand correctly, this is being done solely to speed up loading 
>>> list of notebooks? What if a list of notebook names, their ids, folder 
>>> structure, etc can be *cached* in a separate small json file? Or perhaps in 
>>> a small embedded key-value store, like http://www.mapdb.org/ 
>>> would do? Just thinking out loud. This would require a way to lazily 
>>> re-sync the cache.

This is not only to speed up loading but also to make the system architecture 
easier to maintain. For now we have to build the folder structure of notes in 
memory, and a lot of code in Zeppelin does this (personally I don't think we 
would need any code for this function if we could get the folder structure from 
the note file storage system). Using another storage to keep the mapping 
between note name and note id would bring another classic distributed-systems 
problem: consistency. How do we ensure consistency between the real note file 
and this mapping component? If we create/rename/remove a note, we have to 
update both the notebook repo and the mapping storage. In my experience, any 
bug in the code would lead to inconsistency issues.




Ruslan Dautkhanov <dautkha...@gmail.com> wrote on Tue, Aug 14, 2018 at 3:58 AM:
Thanks for bringing this up for discussion. My 2 cents below.

I am with Maksim and Felix on the concerns about special characters now being 
allowed in notebook names, and also about different charsets. Russian, for 
example, most commonly uses the iso-8859-5, koi8-r/u and windows-1251 charsets. 
This seems likely to bring a whole new set of localization issues.

If I understand correctly, this is being done solely to speed up loading list 
of notebooks? What if a list of notebook names, their ids, folder structure, 
etc can be *cached* in a separate small json file? Or perhaps in a small 
embedded key-value store, like http://www.mapdb.org would do? 
Just thinking out loud. This would require a way to lazily re-sync the cache.

Another way to speed up json reads is to somehow force the "name" attribute to 
be at the top of the json document that's written to disk, and then 
re-implement the json file reader to read just the header of the file and do a 
partial json parse (or, for lack of other options, grab the "name" attribute 
from the json file header with a regex, for example).
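
A rough Python sketch of the regex variant, just to show the idea (this is not 
Zeppelin code; the function name and the 4096-byte header size are made up for 
illustration, and it assumes "name" really is written before any other key that 
could contain a nested "name"):

```python
import json
import re

# Matches the first "name": "..." pair, allowing escaped characters inside.
_NAME_RE = re.compile(r'"name"\s*:\s*"((?:[^"\\]|\\.)*)"')


def read_note_name(path, header_bytes=4096):
    """Return the note name found in the first header_bytes of the file, or None."""
    with open(path, "rb") as f:
        # errors="ignore" drops a multi-byte character cut off at the boundary.
        header = f.read(header_bytes).decode("utf-8", errors="ignore")
    match = _NAME_RE.search(header)
    if match is None:
        return None
    # Re-wrap in quotes so json.loads decodes escapes such as \" or \u0416.
    return json.loads('"' + match.group(1) + '"')
```

This reads a fixed-size prefix instead of the whole note, so the cost no longer 
depends on how large the paragraphs and their outputs are, but it still opens 
every file once.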

Back to filenames and charsets: I think the issue may be more complicated if 
you store notebooks on a remote filesystem (NFS, Samba etc.). What if the 
remote server and the local NFS client differ in their default fs charsets?

Ideally all filesystems would use UTF-8, for example, but I am not certain 
that's a good assumption to make. Also, exposing notebook names in file names 
can bring some other issues; for instance, I know some users occasionally add 
trailing/leading spaces etc.


On Mon, Aug 13, 2018 at 10:38 AM Belousov Maksim Eduardovich 
<m.belou...@tinkoff.ru> wrote:
The use of Russian and other specific letters in the note name is a big 
advantage of Zeppelin. I would not like to give up this functionality.

I support the idea about `zpln` file extension.
The folder structure also sounds good.

I'm worried about non-Latin symbols in folder and note names. And what about 
hieroglyphs?

Apache Zeppelin may be the first to use Russian letters in the file system in 
our company.
I see a lot of risks in using non-Latin symbols and a lot of issues in making 
the new folder structure stable.





________________________________
From: Jeff Zhang <zjf...@gmail.com>
Sent: 13 August 2018 12:50
To: users@zeppelin.apache.org
Subject: Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of 
[NOTEID]/note.json

>>> Do we need the note id in the file name at all? What’s wrong with just 
>>> note_name.zpln?
The reason I keep the note id is that we currently use noteId to identify a 
note, e.g. in both the websocket API and the REST API. It is almost impossible 
to remove noteId in the current architecture. If we put the note id into the 
file content of note_name.zpln, then we would have to read the note file every 
time, and we would hit the issues I mentioned above again.

>>> If the file content is json then why not use note_name.json instead of 
>>> .zpln? That would make it easier for editors to know how to load/highlight 
>>> the file contents.
I have no strong preference for *.zpln, but I think one purpose is to help 
third parties identify a Zeppelin note properly, e.g. GitHub can identify a 
Jupyter notebook (*.ipynb) and render it properly.

>>> Is there any reason for not using real folders or directories for 
>>> organising the notebooks rather than embedding the folder hierarchy in the 
>>> names of the notebooks?  If someone wants to ‘move’ the notebooks to 
>>> another folder they’d have to manually rename all the files/notebooks at 
>>> present.  That’s not very user-friendly.

Actually my proposal is to use real folders. What users see in the Zeppelin 
note menu is the actual notes folder structure. If they want to move notebooks 
to another folder, they can change the folder name just as they would in a 
file system.





Partridge, Lucas (GE Aviation) <lucas.partri...@ge.com> wrote on Mon, Aug 13, 
2018 at 4:43 PM:
Hi Jeff,
I have some questions about this proposal (I can’t edit the design doc):


  1.  Do we need the note id in the file name at all? What’s wrong with just 
note_name.zpln?
  2.  If the file content is json then why not use note_name.json instead of 
.zpln? That would make it easier for editors to know how to load/highlight the 
file contents.
  3.  Is there any reason for not using real folders or directories for 
organising the notebooks rather than embedding the folder hierarchy in the 
names of the notebooks?  If someone wants to ‘move’ the notebooks to another 
folder they’d have to manually rename all the files/notebooks at present.  
That’s not very user-friendly.

Thanks, Lucas.
From: Jeff Zhang <zjf...@gmail.com>
Sent: 13 August 2018 09:06
To: users@zeppelin.apache.org
Cc: dev <d...@zeppelin.apache.org>
Subject: EXT: Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of 
[NOTEID]/note.json

In that case, Zeppelin should fail to create the note.

Felix Cheung <felixcheun...@hotmail.com> wrote on Mon, Aug 13, 2018 at 3:47 PM:
Perhaps one concern is users having characters in note name that are invalid 
for file name/file path?


________________________________
From: Mohit Jaggi <mohitja...@gmail.com>
Sent: Sunday, August 12, 2018 6:02 PM
To: users@zeppelin.apache.org
Cc: dev
Subject: Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of 
[NOTEID]/note.json

sounds like a good idea!

On Sun, Aug 12, 2018 at 5:34 PM Jeff Zhang <zjf...@gmail.com> wrote:
Motivation

   The motivation of ZEPPELIN-2619 is to change the notes storage structure. 
Previously we stored notes as {noteId}/note.json; we'd like to change that to 
{note_name}_{note_id}.zpln. There are several reasons for this change.


  1.  {noteId}/note.json is not scalable. We put all notes in one root folder 
in a flat structure, and when the Zeppelin server starts we need to read every 
note.json to get the note name and build the note folder structure (the note 
name needed to build the notebook menu is stored in note.json). This becomes a 
nightmare when you have a large number of notes.
  2.  {noteId}/note.json is not maintainable. It is difficult for a 
developer/administrator to find a note file based on the note name.
  3.  {noteId}/note.json has no folder structure. Currently Zeppelin has to 
build the folder structure internally in memory according to the note name, 
which is a big overhead.

New Approach

   As I mentioned above, I propose to change the note storage structure to 
{note_name}_{note_id}.zpln. note_name could contain folders, e.g. 
folder_1/mynote_abcd.zpln

This kind of note storage structure could bring several benefits.

  1.  We don't need to load all notes when Zeppelin starts. We just need to 
list each folder to get the note name and note_id.
  2.  It is much more maintainable: it is easy to find the note file based on 
the note name.
  3.  It has the folder structure already, which can be mapped to the note 
folder structure.
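
To illustrate the first benefit, here is a minimal Python sketch of building 
the note list from directory listings alone (not the actual Zeppelin 
implementation, which is Java; the function name is hypothetical, and it 
assumes the note id never contains an underscore, so everything after the last 
'_' in the file name is the id):

```python
import os


def list_notes(root):
    """Map note id -> note path (folders + note name) without opening any file."""
    notes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            stem, ext = os.path.splitext(filename)
            if ext != ".zpln" or "_" not in stem:
                continue  # not a note file in the proposed layout
            # Assumption: the id is everything after the LAST underscore.
            note_name, note_id = stem.rsplit("_", 1)
            folder = os.path.relpath(dirpath, root)
            parts = [] if folder == "." else folder.split(os.sep)
            notes[note_id] = "/".join(parts + [note_name])
    return notes
```

With the example above, a file stored as folder_1/mynote_abcd.zpln would be 
listed under id "abcd" with the note path "folder_1/mynote".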

Side Effect

This approach only works for file system storage, which means we have to drop 
support for MongoNotebookRepo. I think that is OK because I haven't seen any 
users talk about it in the community, so I assume no one is using it.



This is the overall design; any comments and feedback are welcome. Thanks.



Here's the Google doc; you can also comment on it there.

https://docs.google.com/document/d/126egAQmhQOL4ynxJ3AQJQRBBLdW8TATYcGkDL1DNZoE/edit?usp=sharing


