Asyncio for Python and Scala

Asyncio for Python and Scala

2020, Dec 13    

The python part are mainly from “A Web Crawler With asyncio Coroutines”

Let begin with simple web crawler task, the task is to download urls from website. As we known, the download part is a IO-bound operation.

Python asyncio

Start with block-io

def fetch(url):
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)

    # Page is now downloaded.
    links = parse_links(response)
    q.add(links)

There block-io is the connect, send and recv for socket.

What’s will happen if use multipe-thread or multipe-process for fetch operation?

Multipe-thread will have the same issue.

Multipe-process will create resource overhead and partially resolve the issue.

Use non-blocking socket

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass
request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
encoded = request.encode('ascii')

while True:
    try:
        sock.send(encoded)
        break  # Done.
    except OSError as e:
        pass

What’s will happen if use multipe-thread?

Use select-like function

selector = DefaultSelector()

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

def connected():
    selector.unregister(sock.fileno())
    print('connected!')

selector.register(sock.fileno(), EVENT_WRITE, connected)


def loop():
    while True:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()

There will have some improvement with Future class defined to handle callback.

Use asyncio lib

@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()

Scala way

Scala Future

f1:A=> Future[B] f2:B=> Future[C]

List[A].map{ A=>f1(A).flatMap(B=>f2(B)) }

Scala Akka

Event driven

Doradilla lib

Python use asyncio lib to handle IO-bound task, and yield from in Python is similar to flatMap in Scala. For python’s GIL feature, only one CPU core will be used in one process, so asyncio can’t well handle CPU-bound task without use multipe-process lib.

Scala has two ways(Future and Akka) to handle IO-bound task, and can use multipe-CPU by default, also can handle CPU-bound task. But when CPU-bound tasks overflows, these two ways will also create more jobs which will introduce context swith overhead.

Doradilla lib could handle above scenarios easily.