Thank you both - the issue has just been opened. Thanks, Hervé, for pointing out the direct use of the `List()` constructor.
Laurent

________________________________________
From: Michael Lawrence <lawrence.mich...@gene.com>
Sent: 21 October 2020 19:13
To: Pages, Herve
Cc: Laurent Gatto; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] merging DFrames

Laurent,

Thanks for bringing this up and offering to help. Yes, please raise an issue. There's an opportunity to implement faster matching than base::merge(), using stuff like matchIntegerQuads(), findMatches(), and grouping(). grouping() can be really fast for character vectors, since it takes advantage of string internalization.

For example, let's say you're merging on three character vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then call grouping(k1, k2, k3) and you effectively have a matching. Should be way faster than the paste() approach used by base::merge(). Would be interesting to see.

Michael

On Wed, Oct 21, 2020 at 9:37 AM Pages, Herve <hpa...@fredhutch.org> wrote:

Hi Laurent,

I think the current implementation was just an expedient to have something that works (in most cases). I don't know if a proper implementation that doesn't go thru data.frame is on the TODO list. Michael?

I suggest you open an issue on GitHub under S4Vectors.

Cheers,
H.

PS: Note that you can pass the list elements directly to the List() constructor, no need to construct an ordinary list first:

  List(1, 1:2, 1:3)  # same as List(list(1, 1:2, 1:3))

On 10/21/20 08:35, Laurent Gatto wrote:
> When merging DFrame instances, the *List types are lost:
>
> The following two instances have NumericList columns (y and z)
> d1 <- DataFrame(x = letters[1:3], y = List(list(1, 1:2, 1:3)))
> d2 <- DataFrame(x = letters[1:3], z = List(list(1:3, 1:2, 1)))
>
> d1
> ## DataFrame with 3 rows and 2 columns
> ##             x             y
> ##   <character> <NumericList>
> ## 1           a             1
> ## 2           b           1,2
> ## 3           c         1,2,3
>
> They are, however, converted to list when merged
>
> merge(d1, d2, by = "x")
> ## DataFrame with 3 rows and 3 columns
> ##             x      y      z
> ##   <character> <list> <list>
> ## 1           a      1  1,2,3
> ## 2           b    1,2    1,2
> ## 3           c  1,2,3      1
>
> Looking at merge,DataTable,DataTable (from which merge,DFrame,DFrame
> inherits), this makes sense given that they are converted to data.frames,
> merged with merge,data.frame,data.frame and the result is coerced back to
> DFrame:
>
>> getMethod("merge", c("DataTable", "DataTable"))
> Method Definition:
>
> function (x, y, ...)
> {
>     .local <- function (x, y, by, ...)
>     {
>         if (is(by, "Hits")) {
>             return(.mergeByHits(x, y, by, ...))
>         }
>         as(merge(as(x, "data.frame"), as(y, "data.frame"), by,
>             ...), class(x))
>     }
>     .local(x, y, ...)
> }
> <bytecode: 0x556dd0032ca8>
> <environment: namespace:S4Vectors>
>
> Signatures:
>         x           y
> target  "DataTable" "DataTable"
> defined "DataTable" "DataTable"
>
> I would like not to lose the *List classes in the individual DFrames.
>
> Am I missing something? Is this something that is on the todo list, or that I
> could help with?
>
> Best wishes,
>
> Laurent
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319

--
Michael Lawrence
Senior Scientist, Data Science and Statistical Computing
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com
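
Michael's suggestion in the thread (concatenate the keys of 'y' onto the keys of 'x', then call grouping() on them) can be illustrated with a small base R sketch. This is only one possible reading of the idea, not how S4Vectors implements or plans to implement merge(). The helper name match_keys() and its arguments are hypothetical; the sketch assumes the keys contain no NAs and that both inputs are lists holding the same number of key vectors of compatible types.

```r
# Hypothetical helper (not part of S4Vectors or base R): match rows of 'x'
# to rows of 'y' on several key vectors using base::grouping().
# Assumes no NAs in the keys.
match_keys <- function(xkeys, ykeys) {
    nx <- length(xkeys[[1L]])
    ny <- length(ykeys[[1L]])
    # Concatenate the keys of 'y' onto the keys of 'x', key by key.
    keys <- unname(Map(c, xkeys, ykeys))
    # grouping() returns a permutation that places identical key tuples
    # next to each other; as.vector() drops its class and attributes.
    ord <- as.vector(do.call(grouping, keys))
    n <- nx + ny
    # A new group starts wherever any key differs from the previous row.
    is_new <- seq_len(n) == 1L
    for (k in keys) {
        k <- k[ord]
        is_new[-1L] <- is_new[-1L] | (k[-1L] != k[-n])
    }
    # Group id of every row, mapped back to the original row order.
    gid <- integer(n)
    gid[ord] <- cumsum(is_new)
    gx <- gid[seq_len(nx)]
    gy <- gid[nx + seq_len(ny)]
    # Rows of 'y' in each group, then the 'y' rows hit by each row of 'x'.
    ypos <- split(seq_len(ny), factor(gy, levels = seq_len(sum(is_new))))
    hits <- ypos[gx]
    data.frame(x = rep(seq_len(nx), lengths(hits)),
               y = as.integer(unlist(hits, use.names = FALSE)))
}

# Two character keys per table: row 1 of 'x' pairs with row 2 of 'y',
# and row 3 of 'x' pairs with row 1 of 'y'.
xk <- list(c("a", "b", "c"), c("u", "v", "w"))
yk <- list(c("c", "a", "z"), c("w", "u", "u"))
match_keys(xk, yk)
```

The returned index pairs could then be used to subset the two tables directly and bind the results column-wise, which avoids the round trip through data.frame that drops the *List column classes. Whether this is actually faster than the paste() approach would need benchmarking.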