Re: Use guix to distribute data & reproducible (data) science

2018-02-18 Thread Ricardo Wurmus
Amirouche Boubekki writes: > Then, in a follow up mail, you reply to Konrad: > >>> Konrad Hinsen skribis: >> >> [...] >> >>> It would be nice if big datasets could conceptually be handled in the >>> same way while being stored elsewhere - a bit like git-annex does for >>> git. And for parallel

Re: Use guix to distribute data & reproducible (data) science

2018-02-18 Thread Ludovic Courtès
Hi Amirouche, Amirouche Boubekki skribis: > On 2018-02-09 18:13, ludovic.cour...@inria.fr wrote: >> Hi! >> >> Amirouche Boubekki skribis: >> >>> tl;dr: Distribution of data and software seems similar. >>>Data is more and more important in software and reproducible >>>science. Da

Re: Use guix to distribute data & reproducible (data) science

2018-02-17 Thread Roel Janssen
Amirouche Boubekki writes: > Hello again Ludovic, > > On 2018-02-09 18:13, ludovic.cour...@inria.fr wrote: >> Hi! >> >> Amirouche Boubekki skribis: >> >>> tl;dr: Distribution of data and software seems similar. >>>Data is more and more important in software and reproducible >>>

Re: Use guix to distribute data & reproducible (data) science

2018-02-16 Thread Amirouche Boubekki
Hello again Ludovic, On 2018-02-09 18:13, ludovic.cour...@inria.fr wrote: Hi! Amirouche Boubekki skribis: tl;dr: Distribution of data and software seems similar. Data is more and more important in software and reproducible science. Data science ecosystem lacks resource sharing

Re: Use guix to distribute data & reproducible (data) science

2018-02-16 Thread Konrad Hinsen
Hi George, myg...@gmail.com writes: >> The three missing pieces are: >> >> - Dealing with measurements, which might involve interacting with >>experimental equipment or databases. Moreover, since data from >>such sources can change, its hash in the store must be computed >>from the c

Re: Use guix to distribute data & reproducible (data) science

2018-02-16 Thread myglc2
Hi Konrad, On 02/16/2018 at 10:28 Konrad Hinsen writes: > Whether for software or for data, dependencies are DAGs whose terminal > nodes are measurements (for data) or human-supplied information (code, > parameters, methodological choices). Guix handles the latter very well. > > The three missing

Re: Use guix to distribute data & reproducible (data) science

2018-02-16 Thread Amirouche Boubekki
On Thu, Feb 15, 2018 at 6:11 PM zimoun wrote: > Hi, > > Thank you for this food for thought. > > > I agree that the frontier between code and data is arbitrary. > > However, I am not sure I get the picture about data management in > the context of Reproducible Science. What is the issue? > >

Re: Use guix to distribute data & reproducible (data) science

2018-02-16 Thread Konrad Hinsen
Hi, > In other words, on paper, what are the benefits of managing > some piece of data in the store? For example for the applications of > weights of a trained neural network; or of the positions of the atoms in > protein structure. Provenance tracking. In a complex data processing wo

Re: Use guix to distribute data & reproducible (data) science

2018-02-15 Thread zimoun
Hi, Thank you for this food for thought. I agree that the frontier between code and data is arbitrary. However, I am not sure I get the picture about data management in the context of Reproducible Science. What is the issue? So, I catch your invitation to explore your idea. :-) Let thin

Re: Use guix to distribute data & reproducible (data) science

2018-02-14 Thread Ludovic Courtès
Hello, Konrad Hinsen skribis: > It would be nice if big datasets could conceptually be handled in the > same way while being stored elsewhere - a bit like git-annex does for > git. And for parallel computing, we could have special build daemons. Exactly. I think we need a git-annex/git-lfs-lik

Re: Use guix to distribute data & reproducible (data) science

2018-02-12 Thread Konrad Hinsen
Hi everyone, zimoun writes: > From my point of view, there are 2 kinds of datasets: > a- the ones which are part of the software, e.g., used to pass the > tests. Therefore, they are usually small, not always; > b- the ones which are applied to the software and somehow they are > not in the sourc

Re: Use guix to distribute data & reproducible (data) science

2018-02-10 Thread zimoun
Hi, Thank you for the topic feeding my thoughts. And thank you Ricardo for your explanations. > What I was thinking about, is to use guix to distribute data packages just like > we distribute software from pypi. The advantage of using guix seems > obvious, > but apparently it's not desirable or pos

Re: Use guix to distribute data & reproducible (data) science

2018-02-10 Thread Amirouche Boubekki
On Fri, Feb 9, 2018 at 8:16 PM Konrad Hinsen wrote: > Hi, > > On 09/02/2018 18:13, Ludovic Courtès wrote: > > > Amirouche Boubekki skribis: > > > >> tl;dr: Distribution of data and software seems similar. > >> Data is more and more important in software and reproducible > >> scie

Re: Use guix to distribute data & reproducible (data) science

2018-02-09 Thread Ricardo Wurmus
zimoun writes: > I do not know so much, but an idea would be to write a workflow: you > fetch the data, you clean it and you check by hashing that the > result is the expected one. Only the software used to do that is in > the store. The input and output data are not, but your workflow check >

Re: Use guix to distribute data & reproducible (data) science

2018-02-09 Thread zimoun
Hi, > I'd say it depends on the data and how it is used inside and outside of a > workflow. Some data could very well be stored in the store, and then > distributed via standard channels (Zenodo, ...) after export by "guix pack". > For big datasets, some other mechanism is required. I am not sure to

Re: Use guix to distribute data & reproducible (data) science

2018-02-09 Thread Konrad Hinsen
Hi, On 09/02/2018 18:13, Ludovic Courtès wrote: Amirouche Boubekki skribis: tl;dr: Distribution of data and software seems similar. Data is more and more important in software and reproducible science. Data science ecosystem lacks resource sharing. I think guix can h

Re: Use guix to distribute data & reproducible (data) science

2018-02-09 Thread zimoun
Dear, From my understanding, what you are describing is what bioinfo guys call a workflow: 1- fetch data here and there 2- clean and prepare data 3- compute stuff with these data 4- obtain an answer and loop several times on several data sets. Guix Workflow Language allows one to implement the

Re: Use guix to distribute data & reproducible (data) science

2018-02-09 Thread Ludovic Courtès
Hi! Amirouche Boubekki skribis: > tl;dr: Distribution of data and software seems similar. >Data is more and more important in software and reproducible >science. Data science ecosystem lacks resource sharing. >I think guix can help. I think some of us especially Guix-HP