Re: best way to read a huge ascii file.
Hi,

Yes, working with binary formats is the way to go when you have large data.
But for future reference, Dask [1] fits your use case perfectly. See below how I
process a 7 GB text file in under 17 seconds (on a laptop: MBP + quad-core + SSD).

# Create roughly 7 GB worth of text data.
In [40]: import numpy as np

In [41]: x = np.random.random((60, 500))

In [42]: %time np.savetxt('data.txt', x)
CPU times: user 4min 28s, sys: 14.8 s, total: 4min 43s
Wall time: 5min

In [43]: %time y = np.loadtxt('data.txt')
CPU times: user 6min 31s, sys: 1min, total: 7min 31s
Wall time: 7min 44s

# Then we use dask to read the big file. The key here is to
# use a block size so we process the file in ~120 MB chunks (approx. one line).
# By default, Dask uses the line separator \n to ensure the partitions don't
# break the lines.
In [1]: import dask.bag

In [2]: data = dask.bag.read_text('data.txt', blocksize=120*1024*1024)

In [3]: data
dask.bag

# Rather than passing the entire 100+ MB line to np.loadtxt, we slice the first
# 128 bytes, which is enough to grab the first 4 columns.
# You could speed this up further by not reading the entire line and instead
# reading just 128 bytes from each line offset.
In [4]: from io import StringIO

In [5]: def to_array(line):
   ...:     return np.loadtxt(StringIO(line[:128]))[:4]
   ...:

In [6]: %time y = np.asarray(data.map(to_array).compute())
CPU times: user 190 ms, sys: 60.8 ms, total: 251 ms
Wall time: 16.9 s

In [7]: y.shape
(60, 4)

In [8]: y[:2, :]
array([[ 0.17329305,  0.36584998,  0.01356046,  0.6814617 ],
       [ 0.3352684 ,  0.83274823,  0.24399607,  0.30103352]])

You can also use dask to convert the entire file to hdf5.

Regards,

[1] http://dask.pydata.org/

Rolando

On Wed, Nov 30, 2016 at 1:16 PM, Heli wrote:
> Hi all,
>
> Writing my ASCII file once to either of pickle or npy or hdf data types
> and then working afterwards on the result binary file reduced the read time
> from 80(min) to 2 seconds.
>
> Thanks everyone for your help.
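The slicing trick used in to_array does not depend on Dask or NumPy. Below is a minimal sketch of the same idea in plain Python, assuming whitespace-separated values as np.savetxt writes them by default. The helper name first_columns, its default arguments, and the sample line are illustrative and not part of the original post.

```python
def first_columns(line, ncols=4, prefix=128):
    """Parse the first `ncols` floats from the start of a long text line.

    Only the first `prefix` characters are examined, so the cost does not
    grow with the length of the line. Assumption: `prefix` is large enough
    to hold `ncols` complete values (about 25 characters per value in the
    default '%.18e' format).
    """
    tokens = line[:prefix].split()
    # The last token may be cut off by the slice; taking only the first
    # `ncols` tokens discards it as long as `prefix` is generous enough.
    return [float(tok) for tok in tokens[:ncols]]


# Hypothetical sample line in the default np.savetxt format.
sample = " ".join("%.18e" % v for v in [0.5, 0.25, 0.125, 1.0, 2.0, 3.0])
print(first_columns(sample))
```

If prefix is too small for ncols complete values, the last returned number can be silently truncated, so it is worth sizing it from the known column width.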
--
https://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
# http://gist.github.com/271661

import lxml.html
import re

src = """
lksjdfls kdjff lsdfs sdjfls
sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
"""

# Raw string so that \d reaches the regex engine unchanged.
regex = re.compile(r'amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

# Select only the divs whose id starts with "amazon_", then pull out the number.
for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print match.groups()[0]

On Thu, Jan 7, 2010 at 4:42 PM, Aahz wrote:
> In article
> <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>,
> Oltmans wrote:
>>
>>I've written this regex that's kind of working
>>re.findall("\w+\s*\W+amazon_(\d+)",str)
>>
>>but I was just wondering that there might be a better RegEx to do that
>>same thing. Can you kindly suggest a better/improved Regex. Thank you
>>in advance.
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.'
> --Jamie Zawinski
>
> Take the advice other people gave you and use BeautifulSoup.
> --
> Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/
>
> "If you think it's expensive to hire a professional to do the job, wait
> until you hire an amateur." --Red Adair

--
Rolando Espinoza La fuente
www.rolandoespinoza.info
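For comparison, here is a regex-only sketch in Python 3 that anchors on the id attribute instead of on the surrounding words. The markup is hypothetical, since the HTML from the original message did not survive, and the pattern assumes double-quoted id attributes.

```python
import re

# Hypothetical markup standing in for the page being scraped.
html = """
<div id="amazon_345343">welcome</div>
<div id="other_1">ignored</div>
<div id="amazon_1234">hello</div>
"""

# Capture only the digits that follow the "amazon_" prefix of an id attribute.
AMAZON_ID = re.compile(r'id="amazon_(\d+)"')

ids = AMAZON_ID.findall(html)
print(ids)
```

This stays readable for a quick script, but a real HTML parser (as in the lxml version above) copes better with attribute order, quoting styles and broken markup.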
Re: BeautifulSoup
Hi,

You can also check out a high-level framework for scraping:

http://scrapy.org/

Their docs include an example of extracting torrent data from mininova:

http://doc.scrapy.org/intro/overview.html

You will need to understand regular expressions, XPath expressions,
callbacks, etc.

The FAQ explains how Scrapy compares to BeautifulSoup:

http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml

Regards,

On Wed, Jan 13, 2010 at 8:46 AM, yamamoto wrote:
> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!
> [snip]

--
Rolando Espinoza La fuente
www.rolandoespinoza.info
Re: ConfigParser is not parsing
read() does not return the config object. It returns the list of file names
that were successfully parsed; the data itself stays in the parser object:

>>> import ConfigParser
>>> config = ConfigParser.SafeConfigParser()
>>> config.read('S3Files.conf')
['S3Files.conf']
>>> config.sections()
['main']
>>> config.get('main', 'taskName')
'FileConfigDriver'

Regards,

Rolando Espinoza La fuente
www.rolandoespinoza.info

On Fri, Feb 12, 2010 at 10:18 PM, felix gao wrote:
> Hi all,
> I am trying to get the some configuration file read in by Python, however,
> after the read command it return a list with the filename that I passed in.
> what is going on?
> Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import ConfigParser
> >>> p = ConfigParser.SafeConfigParser()
> >>> cfg = p.read("S3Files.conf")
> >>> cfg
> ['S3Files.conf']
>
> cat S3Files.conf
> [main]
> taskName=FileConfigDriver
> lastProcessed=2010-01-31
> dateFromat=%Y-%m-%d
> skippingValue=86400
> skippingInterval=seconds
> Thanks in advance.
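The same behaviour can be seen in Python 3, where the module is named configparser. The sketch below writes a cut-down copy of the config to a temporary directory (an assumption made only to keep the example self-contained) and shows that read() reports which files it parsed, while values come from the parser object.

```python
import configparser
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "S3Files.conf")
    with open(path, "w") as fh:
        fh.write("[main]\ntaskName=FileConfigDriver\nskippingValue=86400\n")

    config = configparser.ConfigParser()
    # read() silently skips files it cannot open and returns the ones it parsed.
    loaded = config.read([path, os.path.join(tmpdir, "missing.conf")])

    # Values are queried on the parser, not on the return value of read().
    task_name = config.get("main", "taskName")
    skipping_value = config.getint("main", "skippingValue")

print(loaded == [path], task_name, skipping_value)
```

Checking the returned list is a cheap way to detect a missing or misnamed config file before asking for sections that are not there.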
Re: isinstance(False, int)
On Fri, Mar 5, 2010 at 2:00 PM, Steve Holden wrote:
[...]
>
> Just a brainfart from the BDFL - he decided (around 2.2.3, IIRC) that it
> would be a good ideal for Booleans to be a subclass of integers.
>

I would never have figured that out:

>>> bool.__bases__
(<type 'int'>,)

Doesn't it have side effects, not knowing that False/True are ints?

Regards,

Rolando
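A short Python 3 sketch of the kind of side effects the question is about; the variable names are only illustrative.

```python
# bool is a subclass of int, so True and False behave as 1 and 0.
print(bool.__bases__)

# Handy side effect: summing booleans counts the True values.
flags = [True, False, True, True]
count = sum(flags)

# Surprising side effect: True and 1 are the same dictionary key,
# because True == 1 and hash(True) == hash(1).
lookup = {1: "one"}
lookup[True] = "true"

print(count, lookup)
```

The dictionary case is the one most likely to bite: the second assignment replaces the value stored under the existing key 1 instead of adding a new entry.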
Re: isinstance(False, int)
On Fri, Mar 5, 2010 at 2:32 PM, mk wrote:
> Arnaud Delobelle wrote:
>
>> >>> 1 == True
>> True
>> >>> 0 == False
>> True
>>
>> So what's your question?
>
> Well nothing I'm just kind of bewildered: I'd expect smth like that in Perl,
> but not in Python.. Although I can understand the rationale after skimming
> PEP 285, I still don't like it very much.
>

So, the pythonic way to check for True/False should be:

>>> 1 is True
False
>>> 0 is False
False

instead of ==, right?

Regards,

Rolando
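To make the three kinds of check concrete, here is a small Python 3 sketch; the function name describe is hypothetical. In most code a plain truth test (if value:) is the usual idiom, and an identity test against True is reserved for cases where the bool singleton itself must be told apart from other truthy values.

```python
def describe(value):
    """Classify a value by identity with True first, then by truthiness."""
    if value is True:        # identity: only the bool singleton True matches
        return "exactly True"
    if value:                # truthiness: non-zero numbers, non-empty containers
        return "truthy"
    return "falsy"


one = 1
print(one == True)   # equality compares values
print(one is True)   # identity compares objects
print(describe(1), describe(True), describe(0), describe([]))
```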
Re: installing something to a virtualenv when it's already in site-packages
On Fri, Mar 19, 2010 at 4:05 AM, nbv4 wrote:
> I have ipython installed via apt. I can go to the command line and
> type 'ipython' and it will work. If I try to install ipython to a
> virtualenv, I get this:
>
> $ pip install -E env/ ipython
> Requirement already satisfied: ipython in /usr/share/pyshared
> Installing collected packages: ipython
> Successfully installed ipython
>
> I want ipython in both site-packages as well as in my virtualenv. This
> is bad because when I activate the virtualenv, site-packages
> disappears and ipython is not available. A work around is to uninstall
> ipython from apt, install to the virtualenv, then reinstall in apt. Is
> there a better way?

I use -U (--upgrade) to force the installation within the virtualenv, e.g.:

$ pip install -E env/ -U ipython

Regards,

Rolando
Re: How to access args as a list?
On Sat, Apr 3, 2010 at 6:28 PM, kj wrote:
> Is there a way to refer, within the function, to all its arguments
> as a single list? (I.e. I'm looking for Python's equivalent of
> Perl's @_ variable.)
>

def spam(*args, **kwargs):
    print args
    print kwargs

class Spam:
    def __init__(self, *args, **kwargs):
        print args
        print kwargs

Is that what you are looking for?

Regards,

~Rolando
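A Python 3 version of the same idea, returning the collected arguments instead of printing them. Note that *args arrives as a tuple, so the sketch converts it when a real list is wanted; the argument values are arbitrary.

```python
def spam(*args, **kwargs):
    # args is a tuple of the positional arguments,
    # kwargs is a dict of the keyword arguments.
    return list(args), kwargs


positional, named = spam(1, 2, 3, colour="red")
print(positional, named)
```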
Re: Incorrect scope of list comprehension variables
On Sun, Apr 4, 2010 at 5:20 PM, Paul Rubin wrote:
[...]
>
> d[r] = list(r for r in [4,5,6])
>

This has a slight performance difference. I think it comes mainly from the
generator's next() calls.

In [1]: %timeit list(r for r in range(1))
100 loops, best of 3: 2.78 ms per loop

In [2]: %timeit [r for r in range(1)]
100 loops, best of 3: 1.93 ms per loop

~Rolando
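The comparison can be repeated outside IPython with the standard timeit module. This is a rough sketch: the input size (1000) and repeat count are arbitrary choices, and the absolute numbers depend on the machine and the Python version.

```python
import timeit

# Time building the same list through a generator expression
# and through a list comprehension.
gen_time = timeit.timeit("list(r for r in range(1000))", number=1000)
comp_time = timeit.timeit("[r for r in range(1000)]", number=1000)

print("generator + list(): %.4f s" % gen_time)
print("list comprehension: %.4f s" % comp_time)
```

Both expressions build the same list; only the route differs, so any gap measured here is overhead of the generator protocol rather than of the loop body.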
Re: Difficulty w/json keys
On Fri, Apr 23, 2010 at 10:20 AM, Red wrote:
[...]
> for line in f:
>     j = json.loads(line)
>     if 'text' in j:
>         if 'lang' in j:
>             lang = j['lang']
>             print "language", lang
>         text = j['text']

The "lang" key is in the "user" dict, not at the top level:

>>> tweet['text']
'tech managers what size for your teams? better to have 10-20 ppl per
manager or 2-5 and have the managers be more hands on?'

>>> tweet['lang']
[...]
KeyError: 'lang'

>>> tweet['user']['lang']
'en'

~Rolando
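A small Python 3 sketch of reading the nested key safely. The JSON line is a made-up, cut-down record with the same shape as the tweet in the message (a "text" key at the top level and "lang" inside "user").

```python
import json

# Hypothetical record shaped like the tweet discussed above.
line = '{"text": "hello", "user": {"lang": "en", "name": "someone"}}'

tweet = json.loads(line)

# "lang" lives one level down; chained .get() calls avoid a KeyError
# when either "user" or "lang" is missing.
lang = tweet.get("user", {}).get("lang")
top_level_lang = tweet.get("lang")

print(lang, top_level_lang)
```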
understanding the mro (long)
TL;DR: if you want to stay sane, don't inherit from two classes that share
the same inheritance graph.

I recently got puzzled by a bug in a legacy lib (ClientForm)
which has this code:

class ParseError(sgmllib.SGMLParseError,
                 HTMLParser.HTMLParseError,
                 ):
    pass

It fails because it takes __init__ from sgmllib and __str__ from HTMLParser,
where __str__ uses attributes set by HTMLParser's __init__.

At first look, I thought it was just a matter of swapping the inherited
classes. But a deeper look led me to the Python MRO document:
http://www.python.org/download/releases/2.3/mro/

To reproduce the error I wrote this:

class Foo(object):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Foo: ' + self.msg

class Bar(Exception):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Bar: ' + self.msg

class A(Exception):
    pass

class B(RuntimeError):
    pass

class AFoo(A, Foo): pass
class ABar(A, Bar): pass

class BFoo(B, Foo): pass
class BBar(B, Bar): pass

print AFoo('ok') # ok
print ABar('ok') # Bar: ok

print BFoo('ok') # ok
print BBar('fail') # AttributeError: ... no attribute 'msg'

# EOF

After running the code I was still confused, so I carefully read the MRO
document again and ended up drawing this inheritance tree:

       object (__init__, __str__)
          |    \
          |     Foo (__init__, __str__)
          |
       BaseException (__init__, __str__)
          |
          |
          |
       Exception (__init__)
         /|  \
        A |   Bar (__init__, __str__)
          |
       StandardError (__init__)
          |
          |
          |
       RuntimeError (__init__)
         /
        B

Then I figured out the method resolution by following the inheritance graph:

* AFoo(A, Foo):
    __init__ from Exception
    __str__ from BaseException

* ABar(A, Bar):
    __init__ from Bar
    __str__ from Bar

* BFoo(B, Foo):
    __init__ from RuntimeError
    __str__ from BaseException

* BBar(B, Bar):
    __init__ from RuntimeError
    __str__ from Bar

Finally everything makes sense, and it makes me think about being careful
when using multiple inheritance.

Any thoughts?

~Rolando
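The hand-drawn resolution order can be checked against the interpreter. The sketch below is a Python 3 port of the class layout from the message (an adaptation, not the original code): StandardError no longer exists in Python 3, so RuntimeError sits directly under Exception. The helper mro_names is a hypothetical convenience.

```python
class Foo(object):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Foo: ' + self.msg


class Bar(Exception):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Bar: ' + self.msg


class A(Exception): pass
class B(RuntimeError): pass

class AFoo(A, Foo): pass
class ABar(A, Bar): pass
class BFoo(B, Foo): pass
class BBar(B, Bar): pass


def mro_names(cls):
    """Return the class names in method resolution order."""
    return [klass.__name__ for klass in cls.__mro__]


for cls in (AFoo, ABar, BFoo, BBar):
    print(cls.__name__, '->', ' > '.join(mro_names(cls)))
```

In BBar the built-in RuntimeError comes before Bar, which is why an attribute lookup can reach a built-in implementation before the one Bar defines.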
Re: understanding the mro (long)
On Sat, Jul 24, 2010 at 12:28 AM, Benjamin Kaplan wrote:
[...]
>
> And second, not to in any way diminish the work you did tracing out
> the inheritance tree and working through the inheritance, but Python
> has easier ways of doing it :)
>
> >>> BBar.__mro__
> (<class '__main__.BBar'>, <class '__main__.B'>, <type
> 'exceptions.RuntimeError'>, <type 'exceptions.StandardError'>, <class
> '__main__.Bar'>, <type 'exceptions.Exception'>, <type
> 'exceptions.BaseException'>, <type 'object'>)

Yes, actually I looked at __mro__ to confirm that I was right.

> >>> '__str__' in BBar.__dict__
> False
> >>> '__str__' in Bar.__dict__
> True

I see! I couldn't figure out how to find out whether a method is defined
within a given class.

> >>> for cls in BBar.__mro__ :
> ...     if '__str__' in cls.__dict__ :
> ...         print cls
> ...         break
> ...

This is a good one! It could save time figuring out where a method comes from.
Anyway, it was a good exercise to figure out the MRO by hand :)

Thanks for your comments, Benjamin and Steven.

~Rolando
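The loop from the quoted reply can be wrapped in a reusable helper. This is a Python 3 sketch; the function name defining_class and the three demo classes are hypothetical and deliberately avoid built-in exceptions so the result does not depend on interpreter internals.

```python
def defining_class(cls, name):
    """Return the first class in cls.__mro__ whose own namespace defines `name`."""
    for klass in cls.__mro__:
        if name in vars(klass):   # vars(klass) is klass.__dict__
            return klass
    return None


class Base(object):
    def __str__(self):
        return 'Base'


class Mixin(object):
    def __init__(self, msg=None):
        self.msg = msg


class Child(Base, Mixin):
    pass


print(defining_class(Child, '__str__').__name__)
print(defining_class(Child, '__init__').__name__)
```

Looking in each class's own namespace matters here: hasattr() or getattr() would report inherited attributes as well and so could not tell where the method was actually written.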
Re: python interview quuestions
On Fri, Aug 6, 2010 at 10:31 PM, Tim Chase wrote:
[...]
>> More over, it can be done in just a single line of Python.
>>
>> 7 if you're not very familiar with Python.
>
> While it *can* be done in one line, I'm not sure it's the most legible
> solution. Though I must say I like this one-line python version:
>
> for i in range(1, 101): print ((i%3==0 and 'fizz' or '') + (i%5==0 and
> 'buzz' or '')) or i
>
> (adjust "3" and "5" for your local flavor of fizzbuzz)
>
> I'm not sure I'd hire a candidate that proposed this as a solution in
> earnest, but I'd have fun chatting with them :)

I didn't believe it could take more than 5 minutes, but this took me
~10 minutes, even though I'm familiar with Python and had written the
FizzBuzz one-liners before:

http://gist.github.com/518370

Well... I tried to use generators to make it "cool" but changed it to a
test-friendly approach. I'd find it hard to remember the one-liners in an
interview and get them right.

Rolando Espinoza La fuente
www.insophia.com
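As an illustration of what a test-friendly version might look like, here is a Python 3 sketch. It is not the code from the gist linked above; it simply moves the per-number logic of the quoted one-liner into a function that returns a value and can be asserted on.

```python
def fizzbuzz(n):
    """Return 'fizz', 'buzz', 'fizzbuzz' or the number as a string."""
    word = ('fizz' if n % 3 == 0 else '') + ('buzz' if n % 5 == 0 else '')
    # An empty string is falsy, so fall back to the number itself.
    return word or str(n)


def fizzbuzz_lines(start=1, stop=100):
    """Yield the FizzBuzz output for every number from start to stop inclusive."""
    for i in range(start, stop + 1):
        yield fizzbuzz(i)


print('\n'.join(fizzbuzz_lines(1, 15)))
```

Keeping the decision in a pure function means an interviewer (or a unit test) can check individual numbers without capturing printed output.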
Re: Source code for itertools
On Mon, Aug 30, 2010 at 11:06 PM, vsoler wrote:
> On 31 ago, 04:42, Paul Rubin wrote:
>> vsoler writes:
>> > I was expecting an itertools.py file, but I don't see it in your list.
>>
>> ./python3.1-3.1.2+20100829/Modules/itertoolsmodule.c
>>
>> looks promising. Lots of stdlib modules are written in C for speed or
>> access to system facilities.
>
> Lawrence, Paul,
>
> You seem to be running a utility I am not familiar with. Perhaps this
> is because I am using Windows, and most likely you are not.
>
> How could I have found the answer in a windows environment?

Hard question. They are using standard Unix utilities.

But you can find the source file of a Python module from within Python:

>>> import itertools
>>> print(itertools.__file__)
/usr/lib/python2.6/lib-dynload/itertools.so

Yours should point to a Windows path. If the file name ends with ".py", you
can open the file with any editor. If it ends with ".so" or something else,
it is most likely a compiled module written in C, and you should look for it
in the source distribution, not the binary distribution.

Hope it helps.

Regards,

Rolando Espinoza La fuente
www.insophia.com
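The same check can be written so that it also copes with modules that have no file at all, which is the case for modules compiled directly into the interpreter. A minimal Python 3 sketch; the helper name module_location is hypothetical, and whether itertools reports a path or nothing depends on how the interpreter was built.

```python
import importlib


def module_location(name):
    """Return the file a module was loaded from, or None if it has none."""
    module = importlib.import_module(name)
    # Modules built into the interpreter have no __file__ attribute.
    return getattr(module, '__file__', None)


for name in ('json', 'itertools'):
    location = module_location(name)
    print(name, '->', location or 'no file (built into the interpreter)')
```

A path ending in ".py" can be opened in an editor; anything else points at compiled code whose source lives in the CPython source tree.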