Hi there, I have a number of questions about the pandas exercises in the book Python for Data Analysis by Wes McKinney. Specifically, these exercises are from Chapter 6 of the book. It'd be much appreciated if you could answer the following questions!
1. [code]
Input: pd.read_csv('ch06/ex2.csv', header=None)
Output:
   X.1  X.2  X.3  X.4    X.5
0    1    2    3    4  hello
1    5    6    7    8  world
2    9   10   11   12    foo
[/code]
Does the header appear as "X.#" by default when it is set to None?

2. [code]
Input: chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
Input: chunker
Output: <pandas.io.parsers.TextParser at 0x8398150>
[/code]
Please explain the idea of chunksize and what this output means.

3. [code]
The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)

tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.order(ascending=False)

We have then:

In [877]: tot[:10]
Out[877]:
E    368
X    364
L    346
O    343
Q    340
M    338
J    337
F    335
K    334
H    330
[/code]
I couldn't run the Series function successfully. Is something missing from this code?

4. [code]
Data can also be exported to delimited format. Let's consider one of the CSV files read above:

In [878]: data = pd.read_csv('ch06/ex5.csv')

In [879]: data
Out[879]:
  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       two  5   6 NaN   8   world
2     three  9  10  11  12     foo

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [883]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
[/code]
An error occurred when I tried to run this code with sys.stdout.

5. [code]
class of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'

reader = csv.reader(f, dialect=my_dialect)
[/code]
An error occurred when I tried to run this code: "quotechar must be an 1-character integer..." Please explain.

6.
[code]
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
[/code]
An error occurred when I ran this code. Please explain the cause of the error.

7. [code]
But these are objects representing HTML elements; to get the URL and link text you have to use each element's get method (for the URL) and text_content method (for the display text):

In [908]: lnk = links[28]

In [909]: lnk
Out[909]: <Element a at 0x6c48dd0>

In [910]: lnk.get('href')
Out[910]: 'http://biz.yahoo.com/special.html'

In [911]: lnk.text_content()
Out[911]: 'Special Editions'

Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:

In [912]: urls = [lnk.get('href') for lnk in doc.findall('.//a')]

In [913]: urls[-10:]
Out[913]:
['http://info.yahoo.com/privacy/us/yahoo/finance/details.html',
 'http://info.yahoo.com/relevantads/',
 'http://docs.yahoo.com/info/terms/',
 'http://docs.yahoo.com/info/copyright/copyright.html',
 'http://help.yahoo.com/l/us/yahoo/finance/forms_index.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',
[/code]
An error related to line 912 occurred when I tried to run the code. Please explain.

8. [code]
Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:

from lxml import objectify

path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
[/code]
An error occurred when I tried to run the code to access the XML file. Please explain.

9. [code]
XML data can get much more complicated than this example. Each tag can have metadata, too.
Consider an HTML link tag which is also valid XML:

from StringIO import StringIO
tag = '<a href="http://www.google.com">Google</a>'

root = objectify.parse(StringIO(tag)).getroot()

You can now access any of the fields (like href) in the tag or the link text:

In [930]: root
Out[930]: <Element a at 0x88bd4b0>

In [931]: root.get('href')
Out[931]: 'http://www.google.com'

In [932]: root.text
Out[932]: 'Google'
[/code]
The outputs I get for lines 930 and 931 are the same as for line 932 (i.e., Google). Please explain.

10. [code]
One of the easiest ways to store data efficiently in binary format is using Python's builtin pickle serialization. Conveniently, pandas objects all have a save method which writes the data to disk as a pickle:

In [933]: frame = pd.read_csv('ch06/ex1.csv')

In [934]: frame
Out[934]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [935]: frame.save('ch06/frame_pickle')

You read the data back into Python with pandas.load, another pickle convenience function:

In [936]: pd.load('ch06/frame_pickle')
Out[936]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
[/code]
I couldn't run this code successfully. Please explain.

11. [code]
HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations. pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas objects:

In [937]: store = pd.HDFStore('mydata.h5')

In [938]: store['obj1'] = frame

In [939]: store['obj1_col'] = frame['a']

In [940]: store
Out[940]:
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
obj1         DataFrame
obj1_col     Series

Objects contained in the HDF5 file can be retrieved in a dict-like fashion:

In [941]: store['obj1']
Out[941]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
[/code]
Do I need to import PyTables in order to run this code successfully?

12.
[code]
We can then make a list of the tweet fields of interest then pass the results list to DataFrame:

In [951]: tweet_fields = ['created_at', 'from_user', 'id', 'text']

In [952]: tweets = DataFrame(data['results'], columns=tweet_fields)

In [953]: tweets
Out[953]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns:
created_at    15 non-null values
from_user     15 non-null values
id            15 non-null values
text          15 non-null values
dtypes: int64(1), object(3)

Each row in the DataFrame now has the extracted data from each tweet:

In [121]: tweets.ix[7]
Out[121]:
created_at    Thu, 23 Jul 2012 09:54:00 +0000
from_user     deblike
id            227419585803059201
text          pandas: powerful Python data analysis toolkit
Name: 7
[/code]
I got an error message: KeyError: 'Results'. Please explain.

13. [code]
Storing and Loading Data in MongoDB

NoSQL databases take many different forms. Some are simple dict-like key-value stores like BerkeleyDB or Tokyo Cabinet, while others are document-based, with a dict-like object being the basic unit of storage. I've chosen MongoDB (http://mongodb.org) for my example. I started a MongoDB instance locally on my machine, and connect to it on the default port using pymongo, the official driver for MongoDB:

import pymongo
con = pymongo.Connection('localhost', port=27017)
[/code]
I have trouble importing pymongo as it is a database package. How do I import MongoDB properly into Python 2.7?

Again, your help will be greatly appreciated!

--
https://mail.python.org/mailman/listinfo/python-list
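
P.S. Regarding questions 5 and 6, below is a minimal sketch of the dialect round trip as I would expect it to work. It is not the book's code. It assumes Python 3, where the csv module works on text, so io.StringIO stands in for the file on disk. It also assumes the dialect class needs its quoting-related attributes set explicitly, which the class quoted above does not do. The names SemicolonDialect, write_rows and read_rows are hypothetical and only used for illustration.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Hypothetical stand-in for my_dialect with every attribute spelled out."""
    delimiter = ';'
    quotechar = '"'
    lineterminator = '\n'
    # Assumption: these three are the attributes missing from the quoted class.
    quoting = csv.QUOTE_MINIMAL
    doublequote = True
    skipinitialspace = False


def write_rows(rows):
    """Write rows to an in-memory text buffer and return the resulting text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, dialect=SemicolonDialect)
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse semicolon-delimited text back into a list of rows."""
    reader = csv.reader(io.StringIO(text, newline=''), dialect=SemicolonDialect)
    return list(reader)


rows = [('one', 'two', 'three'), ('1', '2', '3'), ('4', '5', '6'), ('7', '8', '9')]
text = write_rows(rows)
parsed = read_rows(text)
```

If this variant runs where the quoted class does not, the missing attributes would be my first guess at the cause of the errors in questions 5 and 6, but I have not confirmed that.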