On Sep 26, 2017, at 4:41 PM, Thomale, Jason <[email protected]> wrote:
>>>> Does anybody here know how to access a Python compressed sparse row format
>>>> (CSR) object? [1]
>>>>
>>>> [1] CSR - http://bit.ly/2fPj42V
>>>
>>> Do you have a link to the code you're using?
>>
>> Yes, thank you. See -->
>> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py --ELM
>
> I'm not familiar with the APIs in question, but--if I'm looking at this
> right, your CSR matrix (tfidf) looks like it would have columns corresponding
> with topics and rows corresponding with documents. If that's the case, you
> could maybe do something like this:
>
>   1. Use tfidf.getcol() to get the column corresponding
>      to your chosen topic. Looks like that should give you a
>      single-column matrix of all document scores for that
>      topic.
>
>   2. Cast that to an array of scores using .toarray(),
>      flatten it with .ravel(), and then make a list with
>      .tolist(). (I think?)
>
>   3. Use a list comprehension and "enumerate" to generate
>      explicit doc IDs based on each document's position in
>      the list, creating a list of 2-element lists or tuples,
>      (doc_id, score). While you're at it, you could filter
>      the list comprehension to give you only the documents
>      with scores that are greater than 0, or some other
>      threshold.
>
>   4. Pass the results through the built-in "sorted"
>      function to sort your list of tuples based on score,
>      highest first.
>
>   >>> topic = 9497
>   >>> score_thresh = 0
>   >>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
>   >>> docs_and_scores = [(doc_id, score) for doc_id, score in
>   ...                    enumerate(topic_scores) if score > score_thresh]
>   >>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1],
>   ...                             reverse=True)
>
> The resulting "most_relevant_docs" variable should be a list of tuples that
> looks something like this (for example):
>   [(102, 0.9), (33, 0.875), (365, 0.874), ...]
>
> Not sure if that's helpful...? There's probably a more numpy/scipy way of
> doing the above using actual numpy array methods (especially the 4th line).

Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! More later. code4lib++

--Eric Morgan
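
For reference, the four quoted steps can probably be collapsed into a few numpy calls. The sketch below is only an illustration under stated assumptions: the helper name top_docs_for_column and the toy matrix are made up for this example, and it assumes (as the quoted message does) that rows are documents and that the chosen column holds the scores to rank by. It uses plain column slicing on the CSR matrix rather than getcol().

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_docs_for_column(matrix, col, threshold=0.0):
    """Return (doc_id, score) pairs for one column, highest score first.

    Hypothetical helper: assumes rows are documents and `col` is the
    column (topic or term) to rank documents by.
    """
    # Slice out one column and densify only that column, not the whole matrix.
    scores = matrix[:, col].toarray().ravel()
    # Row numbers (doc IDs) whose score exceeds the threshold.
    doc_ids = np.flatnonzero(scores > threshold)
    # Order the surviving doc IDs by descending score.
    ranked = doc_ids[np.argsort(-scores[doc_ids], kind="stable")]
    return [(int(i), float(scores[i])) for i in ranked]


# Toy stand-in for the real tfidf matrix: 4 documents x 3 columns.
toy = csr_matrix(np.array([
    [0.0, 0.2, 0.0],
    [0.5, 0.9, 0.0],
    [0.0, 0.0, 0.1],
    [0.0, 0.4, 0.7],
]))

result = top_docs_for_column(toy, 1)
print(result)
```

With the real data, the call would presumably look like top_docs_for_column(tfidf, 9497), provided the matrix really is laid out the way the quoted message guesses.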
