Re: best way to read a huge ascii file.

2016-11-30 Thread Rolando Espinoza
Hi,

Yes, working with binary formats is the way to go when you have large data.
But for future reference, Dask[1] fits your use case perfectly. See below how
I process a 7 GB text file in under 17 seconds (on a laptop: MBP, quad-core,
SSD).

# Create roughly 7 GB worth of text data.

In [40]: import numpy as np

In [41]: x = np.random.random((60, 5000000))

In [42]: %time np.savetxt('data.txt', x)
CPU times: user 4min 28s, sys: 14.8 s, total: 4min 43s
Wall time: 5min

In [43]: %time y = np.loadtxt('data.txt')
CPU times: user 6min 31s, sys: 1min, total: 7min 31s
Wall time: 7min 44s

# Then we proceed to use dask to read the big file. The key here is to
# use a block size so we process the file in ~120 MB chunks (approx. one
# line). By default, Dask splits on the line separator \n to ensure the
# partitions don't break the lines.

In [1]: import dask.bag

In [2]: data = dask.bag.read_text('data.txt', blocksize=120*1024*1024)

In [3]: data
dask.bag

# Rather than passing the entire 100+ MB line to np.loadtxt, we slice the
# first 128 bytes, which is enough to grab the first 4 columns.
# You could speed this up further by not reading the entire line and
# instead reading just 128 bytes from each line offset.

In [4]: from io import StringIO

In [5]: def to_array(line):
   ...:     return np.loadtxt(StringIO(line[:128]))[:4]
   ...:

In [6]: %time y = np.asarray(data.map(to_array).compute())
CPU times: user 190 ms, sys: 60.8 ms, total: 251 ms
Wall time: 16.9 s

In [7]: y.shape
(60, 4)

In [8]: y[:2, :]

array([[ 0.17329305,  0.36584998,  0.01356046,  0.6814617 ],
       [ 0.3352684 ,  0.83274823,  0.24399607,  0.30103352]])
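
For reference, the core trick (parse only a short prefix of each line instead
of all of its columns) can be sketched with the standard library alone. This is
a minimal sketch, not the dask pipeline above: the function name and its
parameters are made up for illustration, it assumes whitespace-separated
values, and it still reads every line from disk, so it only saves the parsing
cost.

```python
def first_columns(path, ncols=4, nbytes=128):
    """Parse the first `ncols` floats of each line, looking only at a prefix.

    Assumes whitespace-separated values and that `nbytes` characters are
    enough to cover `ncols` complete fields.
    """
    rows = []
    with open(path) as f:
        for line in f:
            fields = line[:nbytes].split()[:ncols]  # ignore the rest of the line
            if fields:                              # skip blank lines
                rows.append([float(v) for v in fields])
    return rows
```

With the default np.savetxt format each value takes about 25 characters, so
128 characters comfortably cover 4 columns.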

You can also use dask to convert the entire file to hdf5.

Regards,

[1] http://dask.pydata.org/

Rolando

On Wed, Nov 30, 2016 at 1:16 PM, Heli  wrote:

> Hi all,
>
>  Writing my ASCII file once to either of pickle or npy or hdf data types
> and then working afterwards on the result binary file reduced the read time
> from 80(min) to 2 seconds.
>
> Thanks everyone for your help.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Regex help needed!

2010-01-07 Thread Rolando Espinoza La Fuente
# http://gist.github.com/271661

import lxml.html
import re

src = """
lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

regex = re.compile(r'amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

# Select only the divs whose id starts with "amazon_", then pull out the digits.
for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print(match.group(1))
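
The same idea can be tried without lxml. The sketch below uses only the
standard library (html.parser plus re); the markup, the ids and the class name
are made up for illustration and are not the original poster's data.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: these ids and texts are invented for the example.
SRC = """
<div id="amazon_345343">welcome</div>
<div id="other_1">ignored</div>
<div id="amazon_777">hello, my age is 86 years old</div>
"""

ID_RE = re.compile(r"amazon_(\d+)$")


class AmazonIdCollector(HTMLParser):
    """Collect the numeric part of every <div id="amazon_NNN">."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        match = ID_RE.match(dict(attrs).get("id") or "")
        if match:
            self.ids.append(match.group(1))


collector = AmazonIdCollector()
collector.feed(SRC)
print(collector.ids)
```

The parser does the structural work (finding div tags and their id
attribute), so the regex only has to deal with a short, well-defined string.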



On Thu, Jan 7, 2010 at 4:42 PM, Aahz  wrote:
> In article 
> <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>,
> Oltmans   wrote:
>>
>>I've written this regex that's kind of working
>>re.findall("\w+\s*\W+amazon_(\d+)",str)
>>
>>but I was just wondering that there might be a better RegEx to do that
>>same thing. Can you kindly suggest a better/improved Regex. Thank you
>>in advance.
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions."  Now they have two problems.'
> --Jamie Zawinski
>
> Take the advice other people gave you and use BeautifulSoup.
> --
> Aahz (a...@pythoncraft.com)           <*>         http://www.pythoncraft.com/
>
> "If you think it's expensive to hire a professional to do the job, wait
> until you hire an amateur."  --Red Adair



-- 
Rolando Espinoza La fuente
www.rolandoespinoza.info


Re: BeautifulSoup

2010-01-13 Thread Rolando Espinoza La Fuente
Hi,

Also you can check out a high-level framework for scraping:
http://scrapy.org/

Their docs include an example of extracting torrent data from mininova:
http://doc.scrapy.org/intro/overview.html

You will need to understand regular expressions, XPath expressions,
callbacks, etc.
The FAQ explains how Scrapy compares to BeautifulSoup:
http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml

Regards,

On Wed, Jan 13, 2010 at 8:46 AM, yamamoto  wrote:
> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!
>
[snip]

-- 
Rolando Espinoza La fuente
www.rolandoespinoza.info


Re: ConfigParser is not parsing

2010-02-12 Thread Rolando Espinoza La Fuente
read() does not return the config object; it returns the list of files it
successfully parsed. Query the parser itself:

>>> import ConfigParser
>>> config = ConfigParser.SafeConfigParser()
>>> config.read('S3Files.conf')
['S3Files.conf']
>>> config.sections()
['main']
>>> config.get('main', 'taskName')
'FileConfigDriver'
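
For anyone on Python 3, where the module is named configparser, the same
behaviour can be checked with the sketch below. The file contents mirror the
config quoted in this thread; the temporary directory and the "missing.conf"
name are only there to make the example self-contained.

```python
import configparser
import os
import tempfile

# Trimmed copy of the config quoted in this thread.
CONF_TEXT = "[main]\ntaskName=FileConfigDriver\nlastProcessed=2010-01-31\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "S3Files.conf")
    with open(path, "w") as f:
        f.write(CONF_TEXT)

    config = configparser.ConfigParser()
    # read() accepts several candidate files and silently skips missing ones.
    parsed = config.read([path, os.path.join(tmpdir, "missing.conf")])

# The return value only says which files were parsed ...
print(parsed == [path])
# ... while the values live on the parser object.
task_name = config.get("main", "taskName")
print(config.sections(), task_name)
```

Comparing the returned list with the list you passed in is a cheap way to
detect a config file that was not found.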

Regards,

Rolando Espinoza La fuente
www.rolandoespinoza.info



On Fri, Feb 12, 2010 at 10:18 PM, felix gao  wrote:
> Hi all,
> I am trying to get a configuration file read in by Python; however,
> after the read command it returns a list with the filename that I passed in.
> What is going on?
> Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import ConfigParser
> >>> p = ConfigParser.SafeConfigParser()
> >>> cfg = p.read("S3Files.conf")
> >>> cfg
> ['S3Files.conf']
>
>  cat S3Files.conf
> [main]
> taskName=FileConfigDriver
> lastProcessed=2010-01-31
> dateFromat=%Y-%m-%d
> skippingValue=86400
> skippingInterval=seconds
> Thanks in advance.


Re: isinstance(False, int)

2010-03-05 Thread Rolando Espinoza La Fuente
On Fri, Mar 5, 2010 at 2:00 PM, Steve Holden  wrote:
[...]
>
> Just a brainfart from the BDFL - he decided (around 2.2.3, IIRC) that it
> would be a good idea for Booleans to be a subclass of integers.
>

I would never have figured that out:

>>> bool.__bases__
(<type 'int'>,)

Doesn't it have side effects, not knowing that False/True are ints?
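
A few of the side effects in question, as a small sketch. The helper
is_real_int is a made-up name for illustration, not something from the
standard library.

```python
# bool is a subclass of int, so True and False behave like 1 and 0.
print(issubclass(bool, int), isinstance(False, int))

flags = [True, False, True, True]
true_count = sum(flags)          # booleans add up like integers

lookup = {1: "int one"}
lookup[True] = "bool true"       # True == 1 and hash(True) == hash(1): same key


def is_real_int(value):
    """Accept ints but reject booleans, which isinstance alone lets through."""
    return isinstance(value, int) and not isinstance(value, bool)


print(true_count, lookup, is_real_int(True), is_real_int(1))
```

The dictionary case is the one most likely to surprise: the second assignment
overwrites the value stored under the key 1 instead of adding a new entry.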

Regards,

Rolando


Re: isinstance(False, int)

2010-03-05 Thread Rolando Espinoza La Fuente
On Fri, Mar 5, 2010 at 2:32 PM, mk  wrote:
> Arnaud Delobelle wrote:
>
>> >>> 1 == True
>> True
>> >>> 0 == False
>> True
>>
>> So what's your question?
>
> Well nothing I'm just kind of bewildered: I'd expect smth like that in Perl,
> but not in Python.. Although I can understand the rationale after skimming
> PEP 285, I still don't like it very much.
>

So, the pythonic way to check for True/False should be:

>>> 1 is True
False

>>> 0 is False
False

instead of ==, right?
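
To make the difference concrete, here is a small sketch comparing the three
checks side by side. The function name describe is made up for the example.

```python
def describe(value):
    """Return (equals True, is the True object, truthiness) for a value."""
    return (value == True, value is True, bool(value))  # noqa: E712


for candidate in (True, 1, 2, "x", 0, False):
    print(repr(candidate), describe(candidate))
```

`==` compares values, so 1 passes; `is` only accepts the True singleton
itself; and plain truthiness (what `if value:` tests) accepts any non-empty,
non-zero object, which is usually what is wanted.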

Regards,

Rolando


Re: installing something to a virtualenv when it's already in site-packages

2010-03-19 Thread Rolando Espinoza La Fuente
On Fri, Mar 19, 2010 at 4:05 AM, nbv4  wrote:
> I have ipython installed via apt. I can go to the command line and
> type 'ipython' and it will work. If I try to install ipython to a
> virtualenv, I get this:
>
> $ pip install -E env/ ipython
> Requirement already satisfied: ipython in /usr/share/pyshared
> Installing collected packages: ipython
> Successfully installed ipython
>
> I want ipython in both site-packages as well as in my virtualenv. This
> is bad because when I activate the virtualenv, site-packages
> disappears and ipython is not available. A work around is to uninstall
> ipython from apt, install to the virtualenv, then reinstall in apt. Is
> there a better way?

I use -U (--upgrade) to force the installation within the virtualenv, e.g.:

$ pip install -E env/ -U ipython

Regards,

Rolando


Re: How to access args as a list?

2010-04-03 Thread Rolando Espinoza La Fuente
On Sat, Apr 3, 2010 at 6:28 PM, kj  wrote:
> Is there a way to refer, within the function, to all its arguments
> as a single list?  (I.e. I'm looking for Python's equivalent of
> Perl's @_ variable.)
>

def spam(*args, **kwargs):
    print(args)    # tuple of positional arguments
    print(kwargs)  # dict of keyword arguments

class Spam:
    def __init__(self, *args, **kwargs):
        print(args)
        print(kwargs)

Is that what you are looking for?
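
A quick usage sketch: inside the function, args is a tuple, so convert it if
you really need a list. The argument values below are arbitrary.

```python
def spam(*args, **kwargs):
    """Return the positional arguments as a list and the keyword ones as a dict."""
    return list(args), dict(kwargs)


positional, named = spam(1, "two", colour="red")
print(positional)
print(named)
```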

Regards,

~Rolando


Re: Incorrect scope of list comprehension variables

2010-04-06 Thread Rolando Espinoza La Fuente
On Sun, Apr 4, 2010 at 5:20 PM, Paul Rubin  wrote:
[...]
>
>    d[r] = list(r for r in [4,5,6])
>

This has a slight performance difference, I think mainly due to the
generator's next() calls.

In [1]: %timeit list(r for r in range(1))
100 loops, best of 3: 2.78 ms per loop

In [2]: %timeit [r for r in range(1)]
100 loops, best of 3: 1.93 ms per loop
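
Outside IPython, the comparison can be repeated with the timeit module. This is
only a sketch: the input size N and the repeat count are arbitrary choices, and
the timings depend on the machine and Python version.

```python
import timeit

N = 1000  # arbitrary input size for the comparison


def with_generator():
    # Builds the list by pulling items out of a generator expression.
    return list(r for r in range(N))


def with_comprehension():
    # Builds the list directly.
    return [r for r in range(N)]


gen_time = timeit.timeit(with_generator, number=200)
comp_time = timeit.timeit(with_comprehension, number=200)
print("list(genexp): %.4f s" % gen_time)
print("list comp:    %.4f s" % comp_time)
```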

~Rolando


Re: Difficulty w/json keys

2010-04-23 Thread Rolando Espinoza La Fuente
On Fri, Apr 23, 2010 at 10:20 AM, Red  wrote:
[...]
> for line in f:
>        j = json.loads(line)
>        if 'text' in j:
>                if 'lang' in j:
>                        lang = j['lang']
>                        print "language", lang
>                text = j['text']

The "lang" key is inside the "user" dict:

>>> tweet['text']
'tech managers what size for your teams? better to have 10-20 ppl per manager or 2-5 and have the managers be more hands on?'

>>> tweet['lang']
[...]
KeyError: 'lang'

>>> tweet['user']['lang']
'en'
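
A defensive way to read the nested key, as a sketch. The JSON line below is a
heavily trimmed, made-up stand-in for a real tweet.

```python
import json

# Hypothetical, trimmed tweet; real payloads carry many more fields.
line = '{"text": "tech managers what size for your teams?", "user": {"lang": "en"}}'

tweet = json.loads(line)
text = tweet.get("text")
# Fall back to an empty dict so a missing "user" key does not raise KeyError.
lang = tweet.get("user", {}).get("lang")
print(lang, text)
```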

~Rolando


understanding the mro (long)

2010-07-23 Thread Rolando Espinoza La Fuente
TL;DR: if you want to stay sane, don't inherit from two classes that share
the same inheritance graph

I was recently puzzled by a bug in a legacy lib (ClientForm),
which has this code:

class ParseError(sgmllib.SGMLParseError,
                 HTMLParser.HTMLParseError,
                 ):
    pass

It fails because it takes __init__ from sgmllib and __str__ from HTMLParser,
and that __str__ uses attributes set by HTMLParser's __init__.

At first glance, I thought it was just a matter of swapping the inherited
classes. But a deeper look took me to the Python MRO documentation:
http://www.python.org/download/releases/2.3/mro/

To reproduce the error I wrote this:

class Foo(object):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Foo: ' + self.msg

class Bar(Exception):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Bar: ' + self.msg

class A(Exception):
    pass

class B(RuntimeError):
    pass

class AFoo(A, Foo): pass
class ABar(A, Bar): pass

class BFoo(B, Foo): pass
class BBar(B, Bar): pass

print(AFoo('ok'))    # ok
print(ABar('ok'))    # Bar: ok

print(BFoo('ok'))    # ok
print(BBar('fail'))  # AttributeError: ... has no attribute 'msg'

# EOF

After running the code I was still confused, so I carefully read the MRO
document again and ended up drawing this inheritance tree:

        object (__init__, __str__)
          |    \
          |     Foo (__init__, __str__)
          |
    BaseException (__init__, __str__)
          |
          |
      Exception (__init__)
       /  |  \
      A   |   Bar (__init__, __str__)
          |
    StandardError (__init__)
          |
          |
     RuntimeError (__init__)
       /
      B

Then I figured out the method resolution by following the inheritance graph:

  * AFoo(A, Foo):
      __init__ from Exception
      __str__  from BaseException

  * ABar(A, Bar):
      __init__ from Bar
      __str__  from Bar

  * BFoo(B, Foo):
      __init__ from RuntimeError
      __str__  from BaseException

  * BBar(B, Bar):
      __init__ from RuntimeError
      __str__  from Bar


Finally everything makes sense. And it makes me think about being careful
when doing multiple inheritance.
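
The same trap can be reproduced without the built-in exceptions, which makes
it independent of the Python version. In this sketch Base and Middle are
made-up stand-ins for BaseException and RuntimeError: Middle overrides only
__init__, so the MRO picks __init__ from one branch and __str__ from the other.

```python
class Base(object):
    def __init__(self, msg):
        self.args = (msg,)

    def __str__(self):
        return str(self.args[0])


class Middle(Base):
    # Overrides __init__ only, like the exception classes in the tree above.
    def __init__(self, msg):
        Base.__init__(self, msg)


class Bar(Base):
    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return 'Bar: ' + self.msg


class B(Middle):
    pass


class BBar(B, Bar):
    pass


mro_names = [cls.__name__ for cls in BBar.__mro__]
print(mro_names)

error = None
try:
    # __init__ comes from Middle (never sets msg), __str__ comes from Bar.
    str(BBar('fail'))
except AttributeError as exc:
    error = exc
    print('AttributeError:', exc)
```

Middle sits before Bar in the linearization, so Bar.__init__ is never called
even though Bar.__str__ is the first __str__ found.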

Any thoughts?

~Rolando


Re: understanding the mro (long)

2010-07-23 Thread Rolando Espinoza La Fuente
On Sat, Jul 24, 2010 at 12:28 AM, Benjamin Kaplan
 wrote:
[...]
>
> And second, not to in any way diminish the work you did tracing out
> the inheritance tree and working through the inheritance, but Python
> has easier ways of doing it :)
>
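> >>> BBar.__mro__
> (<class '__main__.BBar'>, <class '__main__.B'>,
>  <type 'exceptions.RuntimeError'>, <type 'exceptions.StandardError'>,
>  <class '__main__.Bar'>, <type 'exceptions.Exception'>,
>  <type 'exceptions.BaseException'>, <type 'object'>)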

Yes, actually I looked at __mro__ to confirm that I was right.

> >>> '__str__' in BBar.__dict__
> False
> >>> '__str__' in Bar.__dict__
> True

I see! I couldn't figure out how to find out whether a method is defined
within a given class.

> >>> for cls in BBar.__mro__:
> ...     if '__str__' in cls.__dict__:
> ...         print cls
> ...         break
> ...
> <class '__main__.Bar'>

This is a good one! It could save time figuring out where a method comes from.
Anyway, it was a good exercise to figure out the MRO by hand :)
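
Wrapped up as a reusable helper, the loop could look like the sketch below.
The function name defining_class and the three toy classes are made up for
illustration.

```python
def defining_class(cls, name):
    """Return the first class in cls.__mro__ whose own __dict__ defines `name`."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


class P(object):
    def __str__(self):
        return 'P'


class Q(P):
    pass


class R(Q):
    def __init__(self):
        pass


print(defining_class(R, '__str__').__name__)
print(defining_class(R, '__init__').__name__)
print(defining_class(R, 'no_such_attribute'))
```

Checking the class's own __dict__ (via vars) is what distinguishes "defined
here" from "inherited"; hasattr would answer True for every class in the chain.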

Thanks for your comments Benjamin and Steven.

~Rolando


Re: python interview quuestions

2010-08-10 Thread Rolando Espinoza La Fuente
On Fri, Aug 6, 2010 at 10:31 PM, Tim Chase
 wrote:
[...]
>> More over, it can be done in just a single line of Python.
>>
>> 7 if you're not very familiar with Python.
>
> While it *can* be done in one line, I'm not sure it's the most legible
> solution.  Though I must say I like this one-line python version:
>
> for i in range(1, 101): print ((i%3==0 and 'fizz' or '') + (i%5==0 and
> 'buzz' or '')) or i
>
> (adjust "3" and "5" for your local flavor of fizzbuzz)
>
> I'm not sure I'd hire a candidate that proposed this as a solution in
> earnest, but I'd have fun chatting with them :)

I didn't believe it could take more than 5 minutes, but this took me
~10 minutes, even though I'm familiar with Python and had done the FizzBuzz
one-liners before:
http://gist.github.com/518370

Well... I tried to use generators to make it "cool" but switched to
a test-friendly approach.
I'd find it hard to remember the one-liners in an interview and get them right.
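
To illustrate what "test-friendly" can mean here (this is a sketch written for
this note, not the contents of the gist): keep the rule in a pure function that
returns a value, and do the printing separately.

```python
def fizzbuzz(n):
    """Return 'fizz', 'buzz', 'fizzbuzz', or the number itself as a string."""
    word = ('fizz' if n % 3 == 0 else '') + ('buzz' if n % 5 == 0 else '')
    return word or str(n)


def fizzbuzz_lines(start=1, stop=100):
    """Return the FizzBuzz output for start..stop (inclusive) as a list."""
    return [fizzbuzz(i) for i in range(start, stop + 1)]


for line in fizzbuzz_lines(1, 15):
    print(line)
```

Because fizzbuzz() returns instead of printing, each rule can be checked with
a plain assert, e.g. that fizzbuzz(15) gives 'fizzbuzz'.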

Rolando Espinoza La fuente
www.insophia.com


Re: Source code for itertools

2010-08-30 Thread Rolando Espinoza La Fuente
On Mon, Aug 30, 2010 at 11:06 PM, vsoler  wrote:
> On 31 ago, 04:42, Paul Rubin  wrote:
>> vsoler  writes:
>> > I was expecting an itertools.py file, but I don't see it in your list.
>> >> ./python3.1-3.1.2+20100829/Modules/itertoolsmodule.c
>>
>> looks promising.  Lots of stdlib modules are written in C for speed or
>> access to system facilities.
>
> Lawrence, Paul,
>
> You seem to be running a utility I am not familiar with. Perhaps this
> is because I am using Windows, and most likely you are not.
>
> How could I have found the answer in a windows environment?

Hard question. They are using standard Unix utilities.

But you can find the source file of a Python module from within Python:

>>> import itertools
>>> print(itertools.__file__)
/usr/lib/python2.6/lib-dynload/itertools.so

Yours should point to a Windows path. If the file ends with ".py",
you can open it with any editor. If it ends with ".so" or something else,
it is likely a module compiled from C, and you should look in the source
distribution, not the binary distribution.
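
The check can be wrapped in a small platform-independent helper, sketched
below. The function name source_location is made up; note that on some builds
a module such as itertools is compiled into the interpreter and has no
__file__ at all, which the sketch treats as a third case.

```python
import importlib


def source_location(module_name):
    """Return (path, kind) for a module; path is None for built-in modules."""
    module = importlib.import_module(module_name)
    path = getattr(module, '__file__', None)
    if path is None:
        return None, 'built-in (compiled into the interpreter)'
    if path.endswith(('.py', '.pyc')):
        return path, 'pure Python'
    return path, 'compiled extension'


for name in ('json', 'itertools'):
    print(name, source_location(name))
```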

Hope it helps.

Regards,


Rolando Espinoza La fuente
www.insophia.com