Questions on Pandas

Tommy C Thu, 25 Jun 2015 21:37:06 -0700

Hi there, I have a number of questions about the pandas exercises in the book
Python for Data Analysis by Wes McKinney, specifically those in Chapter 6. It
would be much appreciated if you could answer the following questions!


1.
[code]
Input: pd.read_csv('ch06/ex2.csv', header=None)
Output:
   X.1  X.2  X.3  X.4    X.5
0    1    2    3    4  hello
1    5    6    7    8  world
2    9   10   11   12    foo

[/code]

Do the column labels default to "X.#" when header is set to None?
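
For comparison, here is a minimal sketch of reading a headerless CSV, assuming a recent pandas release. The inline sample text is a made-up stand-in for ch06/ex2.csv. With header=None, current releases label the columns with integers starting at 0, so the "X.#" labels in the book's output appear to come from an older version. Passing names= sets the labels explicitly.

```python
import io

import pandas as pd

# Hypothetical stand-in for ch06/ex2.csv: three data rows, no header line.
sample = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None tells pandas that the first line is data, not column names.
df_default = pd.read_csv(io.StringIO(sample), header=None)
print(list(df_default.columns))

# names= supplies explicit column labels instead of the defaults.
df_named = pd.read_csv(io.StringIO(sample), names=["a", "b", "c", "d", "message"])
print(list(df_named.columns))
```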

2.
[code]
Input: chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
Input: chunker
Output: <pandas.io.parsers.TextParser at 0x8398150>

[/code]

Please explain what chunksize does and what this output means.
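
A small sketch of what chunksize does, assuming a recent pandas release. The ten-row sample is invented and stands in for ch06/ex6.csv. With chunksize set, read_csv returns a lazy reader object (called TextParser in the book's pandas version) rather than a DataFrame, and each iteration yields the next DataFrame of at most chunksize rows.

```python
import io

import pandas as pd

# Hypothetical data standing in for ch06/ex6.csv: a header plus 10 rows.
rows = ["key,value"] + ["%s,%d" % ("AB"[i % 2], i) for i in range(10)]
sample = "\n".join(rows) + "\n"

# With chunksize set, read_csv returns a reader object, not a DataFrame.
chunker = pd.read_csv(io.StringIO(sample), chunksize=4)

# Iterating over the reader yields DataFrames of at most 4 rows each.
chunk_lengths = [len(piece) for piece in chunker]
print(chunk_lengths)
```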


3.
[code]
The TextParser object returned by read_csv allows you to iterate over the parts
of the file according to the chunksize. For example, we can iterate over ex6.csv,
aggregating the value counts in the 'key' column like so:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
 tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.order(ascending=False)
We have then:
In [877]: tot[:10]
Out[877]:
E 368
X 364
L 346
O 343
Q 340
M 338
J 337
F 335
K 334
H 330

[/code]

I couldn't get the Series call to run successfully. Is something missing from
this code?
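
One possible reading, sketched under assumptions: the excerpt calls Series without importing it, so it needs either `from pandas import Series` or the `pd.Series` spelling. The sketch below also assumes a recent pandas release, where the order method has been replaced by sort_values. The sample keys are made up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the 'key' column of ch06/ex6.csv.
sample = "key\nA\nB\nA\nC\nA\nB\n"
chunker = pd.read_csv(io.StringIO(sample), chunksize=2)

# Start from an empty Series; pd.Series avoids the missing bare name Series.
tot = pd.Series(dtype=float)
for piece in chunker:
    # fill_value=0 treats keys missing on either side as a count of zero.
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

# sort_values is the current replacement for the older Series.order.
tot = tot.sort_values(ascending=False)
print(tot)
```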

4.
[code]
Data can also be exported to delimited format. Let's consider one of the CSV
files read above:
In [878]: data = pd.read_csv('ch06/ex5.csv')
In [879]: data
Out[879]:
  something  a   b    c   d message
0       one  1   2    3   4     NaN
1       two  5   6  NaN   8   world
2     three  9  10   11  12     foo

Missing values appear as empty strings in the output. You might want to denote
them by some other sentinel value:
In [883]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


[/code]

An error occurred when I tried to run this code with sys.stdout.
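
A hedged sketch, assuming the failure was a NameError because sys had not been imported (the excerpt does not show that import). The inline sample is an invented stand-in for ch06/ex5.csv.

```python
import io
import sys

import pandas as pd

# Hypothetical stand-in for ch06/ex5.csv, with two missing fields.
sample = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)
data = pd.read_csv(io.StringIO(sample))

# sys must be imported before sys.stdout can be used as the output target.
data.to_csv(sys.stdout, na_rep="NULL")

# Without a target, to_csv returns the CSV text as a string.
csv_text = data.to_csv(na_rep="NULL")
```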


5.
[code]
class of csv.Dialect:
class my_dialect(csv.Dialect):
 lineterminator = '\n'
 delimiter = ';'
 quotechar = '"'
reader = csv.reader(f, dialect=my_dialect)

[/code]

An error occurred when I tried to run this code: "quotechar must be an
1-character integer". Please explain.
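
One possible cause, sketched under assumptions: csv.Dialect leaves quoting unset, and the csv module rejects a dialect whose quoting is not an integer constant. Adding a quoting attribute is an addition to the excerpt and may not be the cause of the exact message quoted above. The in-memory text stands in for the file object f, which the excerpt does not define.

```python
import csv
import io


class my_dialect(csv.Dialect):
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    # Assumption: the excerpt omits this attribute; validation requires it.
    quoting = csv.QUOTE_MINIMAL


# Hypothetical input standing in for the open file f.
f = io.StringIO('a;b;c\n1;"2;x";3\n')
rows = list(csv.reader(f, dialect=my_dialect))
print(rows)
```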


6.
[code]
with open('mydata.csv', 'w') as f:
 writer = csv.writer(f, dialect=my_dialect)
 writer.writerow(('one', 'two', 'three'))
 writer.writerow(('1', '2', '3'))
 writer.writerow(('4', '5', '6'))
 writer.writerow(('7', '8', '9'))
[/code]

An error occurred when I ran this code. Please explain the cause of the error.
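
Since this snippet depends on my_dialect, a failure in that class definition would carry over here. A self-contained sketch, assuming the dialect gets a quoting attribute (not present in the excerpt) and writing to an in-memory buffer instead of mydata.csv:

```python
import csv
import io


class my_dialect(csv.Dialect):
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    # Assumption: added so that the dialect passes the csv module's validation.
    quoting = csv.QUOTE_MINIMAL


# An in-memory buffer stands in for open('mydata.csv', 'w').
buf = io.StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2", "3"))

written = buf.getvalue()
print(written)
```

When writing to a real file on Python 3, the csv documentation recommends opening it with newline=''.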


7.
[code]
But these are objects representing HTML elements; to get the URL and link text
you have to use each element's get method (for the URL) and text_content method
(for the display text):
In [908]: lnk = links[28]
In [909]: lnk
Out[909]: <Element a at 0x6c48dd0>
In [910]: lnk.get('href')
Out[910]: 'http://biz.yahoo.com/special.html'
In [911]: lnk.text_content()
Out[911]: 'Special Editions'
Thus, getting a list of all URLs in the document is a matter of writing this 
list comprehension:
In [912]: urls = [lnk.get('href') for lnk in doc.findall('.//a')]
In [913]: urls[-10:]
Out[913]:
['http://info.yahoo.com/privacy/us/yahoo/finance/details.html',
 'http://info.yahoo.com/relevantads/',
 'http://docs.yahoo.com/info/terms/',
 'http://docs.yahoo.com/info/copyright/copyright.html',
 'http://help.yahoo.com/l/us/yahoo/finance/forms_index.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',

[/code]

An error related to line 912 occurred when I tried to run the code. Please
explain.

8.
[code]
Using lxml.objectify, we parse the file and get a reference to the root node of
the XML file with getroot:
from lxml import objectify
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

[/code]

An error occurred when I tried to run the code to access the XML file. Please
explain.


9.
[code]
XML data can get much more complicated than this example. Each tag can have
metadata, too. Consider an HTML link tag which is also valid XML:
from StringIO import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields (like href) in the tag or the link text:
In [930]: root
Out[930]: <Element a at 0x88bd4b0>
In [931]: root.get('href')
Out[931]: 'http://www.google.com'
In [932]: root.text
Out[932]: 'Google'
[/code]

The outputs for lines 930 and 931 are the same as for line 932 (i.e., Google).
Please explain.
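
As a cross-check, the same tag can be parsed with the standard library's xml.etree.ElementTree. This is a swapped-in parser, not the lxml.objectify API from the excerpt, so behavior may differ. In this sketch the element itself, its href attribute, and its text are three distinct values.

```python
import xml.etree.ElementTree as ET

tag = '<a href="http://www.google.com">Google</a>'

# fromstring parses the markup and returns the root element directly.
root = ET.fromstring(tag)

href = root.get("href")   # attribute lookup on the element
link_text = root.text     # text between the opening and closing tags
print(root.tag, href, link_text)
```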


10.

[code]
One of the easiest ways to store data efficiently in binary format is using
Python's builtin pickle serialization. Conveniently, pandas objects all have a
save method which writes the data to disk as a pickle:
In [933]: frame = pd.read_csv('ch06/ex1.csv')
In [934]: frame
Out[934]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [935]: frame.save('ch06/frame_pickle')
You read the data back into Python with pandas.load, another pickle convenience
function:
In [936]: pd.load('ch06/frame_pickle')
Out[936]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

[/code]

I couldn't run this code successfully. Please explain.
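
A sketch of the same round trip, assuming a newer pandas release in which the save method and pandas.load have been replaced by to_pickle and read_pickle. The frame contents and the temporary path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame resembling the one read from ch06/ex1.csv.
frame = pd.DataFrame(
    {"a": [1, 5, 9], "b": [2, 6, 10], "message": ["hello", "world", "foo"]}
)

# Write to a temporary directory rather than ch06/frame_pickle.
path = os.path.join(tempfile.mkdtemp(), "frame_pickle")
frame.to_pickle(path)

# read_pickle loads the pickled object back from disk.
restored = pd.read_pickle(path)
print(restored)
```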


11.
[code]
HDF5 to provide multiple flexible data containers, table indexing, querying
capability, and some support for out-of-core computations.
pandas has a minimal dict-like HDFStore class, which uses PyTables to store
pandas objects:
In [937]: store = pd.HDFStore('mydata.h5')
In [938]: store['obj1'] = frame
In [939]: store['obj1_col'] = frame['a']
In [940]: store
Out[940]:
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
obj1 DataFrame
obj1_col Series
Objects contained in the HDF5 file can be retrieved in a dict-like fashion:
In [941]: store['obj1']
Out[941]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

[/code]

Do I need to import PyTables in order to successfully run this code?


12.
[code]
We can then make a list of the tweet fields of interest then pass the results 
list to DataFrame:
In [951]: tweet_fields = ['created_at', 'from_user', 'id', 'text']
In [952]: tweets = DataFrame(data['results'], columns=tweet_fields)
In [953]: tweets
Out[953]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns:
created_at 15 non-null values
from_user 15 non-null values
id 15 non-null values
text 15 non-null values
dtypes: int64(1), object(3)
Each row in the DataFrame now has the extracted data from each tweet:
In [121]: tweets.ix[7]
Out[121]:
created_at Thu, 23 Jul 2012 09:54:00 +0000
from_user deblike
id 227419585803059201
text pandas: powerful Python data analysis toolkit
Name: 7
[/code]

An error message occurred:
KeyError: 'Results'

Please explain.
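
A small illustration of one possible cause, assuming the lookup was typed as data['Results']. Dictionary keys are case-sensitive, and the excerpt uses the lowercase 'results'. The payload below is invented, and the real response may also simply lack that key, so printing the available keys first can help.

```python
import json

# Hypothetical payload shaped like the search response in the excerpt.
payload = '{"results": [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]}'
data = json.loads(payload)

# Inspect which top-level keys are actually present.
print(sorted(data.keys()))

# 'Results' and 'results' are different keys, so this lookup fails.
try:
    data["Results"]
except KeyError as exc:
    missing = exc.args[0]

# The lowercase key works; get() avoids a KeyError if it is absent.
texts = [item["text"] for item in data.get("results", [])]
print(texts)
```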

13.
[code]
Storing and Loading Data in MongoDB
NoSQL databases take many different forms. Some are simple dict-like key-value
stores like BerkeleyDB or Tokyo Cabinet, while others are document-based, with a
dict-like object being the basic unit of storage. I've chosen MongoDB
(http://mongodb.org) for my example. I started a MongoDB instance locally on my
machine, and connect to it on the default port using pymongo, the official driver
for MongoDB:
import pymongo
con = pymongo.Connection('localhost', port=27017)
[/code]

I have trouble importing pymongo, as it is a database package. How do I import
MongoDB properly into Python 2.7?

Again, your help will be greatly appreciated!
-- 
https://mail.python.org/mailman/listinfo/python-list
