Thank you both - issue has just been opened.

Merci Hervé for pointing out the direct use of the `List()` constructor.

Laurent

________________________________________
From: Michael Lawrence <lawrence.mich...@gene.com>
Sent: 21 October 2020 19:13
To: Pages, Herve
Cc: Laurent Gatto; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] merging DFrames

Laurent,

Thanks for bringing this up and offering to help. Yes, please raise an issue. 
There's an opportunity to implement faster matching than base::merge(), using 
stuff like matchIntegerQuads(), findMatches(), and grouping().

grouping() can be really fast for character vectors, since it takes advantage 
of string internalization. For example, let's say you're merging on three 
character vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then 
call grouping(k1, k2, k3) and you effectively have a matching. Should be way 
faster than the paste() approach used by base::merge(). Would be interesting to 
see.
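
A minimal base-R sketch of the idea above, to make the steps concrete. It 
is only an illustration under stated assumptions, not the S4Vectors 
implementation: match_rows() and its arguments are hypothetical names, and 
it relies on the "ends" attribute of the object returned by grouping(), 
whose representation the help page describes (as far as I recall) as 
experimental.

```r
# Sketch only: match rows of 'x' to rows of 'y' on several character keys
# using base::grouping(). 'match_rows' and its arguments are hypothetical
# names, not part of S4Vectors.
match_rows <- function(x_keys, y_keys) {
  stopifnot(length(x_keys) == length(y_keys), length(x_keys) > 0L)
  nx <- length(x_keys[[1L]])
  ny <- length(y_keys[[1L]])

  # Concatenate the keys of 'y' onto the keys of 'x', one key at a time.
  keys <- unname(Map(c, x_keys, y_keys))

  # grouping() returns a permutation that puts identical key tuples next to
  # each other; the "ends" attribute marks where each group stops.
  g <- do.call(grouping, keys)
  ends <- attr(g, "ends")

  # Turn the permutation into one integer group id per stacked row.
  id <- integer(nx + ny)
  id[as.integer(g)] <- rep(seq_along(ends), diff(c(0L, ends)))

  # Rows share a key tuple exactly when they share a group id, so matching
  # the ids of 'x' against the ids of 'y' gives the row matching.
  match(id[seq_len(nx)], id[nx + seq_len(ny)])
}

# Two keys per table: rows 1 and 3 of 'x' have a partner in 'y', row 2 has none.
x_keys <- list(c("a", "b", "c"), c("u", "v", "w"))
y_keys <- list(c("c", "a", "z"), c("w", "u", "u"))
hits <- match_rows(x_keys, y_keys)
hits
```

Note that match() only reports the first hit in 'y' for each row of 'x'; a 
full merge with duplicated keys would need all matching pairs rather than 
the first one.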

Michael

On Wed, Oct 21, 2020 at 9:37 AM Pages, Herve 
<hpa...@fredhutch.org> wrote:
Hi Laurent,

I think the current implementation was just an expedient to have
something that works (in most cases). I don't know if a proper
implementation that doesn't go thru data.frame is on the TODO list. Michael?

I suggest you open an issue on GitHub under S4Vectors.

Cheers,
H.

PS: Note that you can pass the list elements directly to the List()
constructor, no need to construct an ordinary list first:

   List(1, 1:2, 1:3)  # same as List(list(1, 1:2, 1:3))


On 10/21/20 08:35, Laurent Gatto wrote:
> When merging DFrame instances, the *List types are lost:
>
> The following two instances have NumericList columns (y and z):
> d1 <- DataFrame(x = letters[1:3], y = List(list(1, 1:2, 1:3)))
> d2 <- DataFrame(x = letters[1:3], z = List(list(1:3, 1:2, 1)))
>
> d1
> ## DataFrame with 3 rows and 2 columns
> ##             x             y
> ##   <character> <NumericList>
> ## 1           a             1
> ## 2           b           1,2
> ## 3           c         1,2,3
>
> They are, however, converted to list when merged:
>
> merge(d1, d2, by = "x")
> ## DataFrame with 3 rows and 3 columns
> ##             x      y      z
> ##   <character> <list> <list>
> ## 1           a      1  1,2,3
> ## 2           b    1,2    1,2
> ## 3           c  1,2,3      1
>
> Looking at merge,DataTable,DataTable (from which merge,DFrame,DFrame 
> inherits), this makes sense given that they are converted to data.frames, 
> merged with merge,data.frame,data.frame and the result is coerced back to 
> DFrame:
>
>> getMethod("merge", c("DataTable", "DataTable"))
> Method Definition:
>
> function (x, y, ...)
> {
>      .local <- function (x, y, by, ...)
>      {
>          if (is(by, "Hits")) {
>              return(.mergeByHits(x, y, by, ...))
>          }
>          as(merge(as(x, "data.frame"), as(y, "data.frame"), by,
>              ...), class(x))
>      }
>      .local(x, y, ...)
> }
> <bytecode: 0x556dd0032ca8>
> <environment: namespace:S4Vectors>
>
> Signatures:
>          x           y
> target  "DataTable" "DataTable"
> defined "DataTable" "DataTable"
>
> I would like not to lose the *List classes in the individual DFrames.
>
> Am I missing something? Is this something that is on the todo list, or that I 
> could help with?
>
> Best wishes,
>
> Laurent
>
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Michael Lawrence
Senior Scientist, Data Science and Statistical Computing
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com


