Stream Peekaboo With Python
The Python standard library provides a reasonably adequate module for reading delimited data streams, and there are modules available for reading everything from XLS and DIF documents to MARC data. One deficiency of many of these modules is their inability to deal gracefully with whack data; in the real world data is never clean, never correctly structured, and you are lucky if it is accurate even on the rare occasion that it is correctly encoded.
For example, when Python's CSV reader meets a garbled line in a file it raises an exception and stops, and you're done. It does not hand you the record it could not parse; all you have is a traceback. Perhaps you can look at the last record in the output and guess that the error lies one record beyond that... maybe.
Fortunately most of these modules work with file-like objects. As long as the object they receive properly implements iteration, they will work. Using this strength it is possible to implement a Peekaboo on the input stream, which lets us see the unit of work currently being processed, or even pre-mangle that chunk before the reader sees it.
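As a minimal illustration of that point, `csv.reader` accepts any iterable that yields strings, so a plain generator can sit between the file and the reader. The `strip_nuls` helper and the sample data are hypothetical, made up only for this sketch.

```python
import csv
import io


def strip_nuls(lines):
    """Yield each line with NUL characters removed (a simple pre-mangle)."""
    for line in lines:
        yield line.replace('\x00', '')


# io.StringIO stands in for an open text file.
source = io.StringIO('a,b,c\n1,\x002,3\n')
rows = list(csv.reader(strip_nuls(source)))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

The reader never knows it is not talking to a real file; it only ever asks for the next line.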
Aside: The hardest part, at least for data that is not line-oriented, is defining the unit of work.
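One possible way to define a unit of work for such data is to split the stream on a record terminator. The sketch below assumes a text stream whose records end with a single terminator character; the `iter_records` name, the `'\x1d'` default, and the chunk size are all assumptions chosen for illustration, and real formats define their own record boundaries.

```python
import io


def iter_records(handle, terminator='\x1d', chunk_size=4096):
    """Yield terminator-delimited records from a text stream."""
    buffer = ''
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while terminator in buffer:
            record, buffer = buffer.split(terminator, 1)
            yield record
    if buffer:
        # Trailing data that never received a terminator.
        yield buffer


# A tiny chunk size forces records to straddle chunk boundaries.
records = list(iter_records(io.StringIO('aaa\x1dbb\x1dc'), chunk_size=2))
print(records)  # ['aaa', 'bb', 'c']
```

A generator like this can be wrapped by a Peekaboo in the same way as a line-oriented file, since all the wrapper needs is iteration.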
For example, here is a simple Peekaboo that makes it easy to report the line read by the CSV reader whenever that line does not contain the expected data:
import csv

class Peekaboo(object):
    """Wrap a file-like object and remember the most recently read line."""

    def __init__(self, handle):
        self._h = handle
        self._h.seek(0)
        self._c = None

    def __iter__(self):
        for row in iter(self._h):
            self._c = row
            yield self._c

    @property
    def current(self):
        return self._c

class RecordFormatException(Exception):
    pass

def import_record(record):
    # verify record data, check field types, field count, etc...
    valid = bool(record)  # placeholder check; substitute real validation here
    if not valid:
        raise RecordFormatException()
    return record

if __name__ == '__main__':
    # Text mode with newline='' is what the csv module expects on Python 3.
    with open('testfile.csv', newline='') as rfile:
        peekaboo = Peekaboo(rfile)
        # Hand the wrapper, not the raw file, to the CSV reader.
        for record in csv.reader(peekaboo):
            try:
                data = import_record(record)
            except RecordFormatException:
                print('Format Exception Processing Record:\n{0}'.format(peekaboo.current))
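To try the idea without a file on disk, the sketch below feeds the same kind of wrapper from an in-memory stream. The sample data, the three-field rule, and the `bad_lines` list are assumptions made for illustration only.

```python
import csv
import io


class Peekaboo(object):
    """Wrap a file-like object and remember the most recently read line."""

    def __init__(self, handle):
        self._h = handle
        self._c = None

    def __iter__(self):
        for row in self._h:
            self._c = row
            yield row

    @property
    def current(self):
        return self._c


# The second line is short one field.
source = io.StringIO('1,alice,30\n2,bob\n3,carol,41\n')
peekaboo = Peekaboo(source)
bad_lines = []
for record in csv.reader(peekaboo):
    if len(record) != 3:
        # current is the raw text of the line that produced this record.
        bad_lines.append(peekaboo.current)
print(bad_lines)  # ['2,bob\n']
```

One caveat: when a quoted field spans several physical lines, `current` holds only the last line read for that record.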
Another use for a Peekaboo with the CSV reader is reading a delimited file that contains comments: lines starting with a hash ("#") are to be ignored when reading the file.
class Peekaboo(object):
    """Wrap a file-like object, optionally skipping comment lines."""

    def __init__(self, handle, comment_prefix=None):
        self._h = handle
        self._h.seek(0)
        self._c = None
        self._comment_prefix = comment_prefix

    def __iter__(self):
        for row in iter(self._h):
            self._c = row
            if self._comment_prefix and self._c.startswith(self._comment_prefix):
                # skip the line
                continue
            yield self._c

    @property
    def current(self):
        return self._c

if __name__ == '__main__':
    # Text mode, so that startswith() compares str with str on Python 3.
    rfile = open('testfile.csv', newline='')
    peekaboo = Peekaboo(rfile, comment_prefix="#")
    ...
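A short, self-contained run of the comment-skipping idea might look like the following. The in-memory sample data is an assumption for the sketch; the class mirrors the one above without the `seek(0)` call.

```python
import csv
import io


class Peekaboo(object):
    """Wrap a file-like object, optionally skipping comment lines."""

    def __init__(self, handle, comment_prefix=None):
        self._h = handle
        self._c = None
        self._comment_prefix = comment_prefix

    def __iter__(self):
        for row in self._h:
            self._c = row
            if self._comment_prefix and row.startswith(self._comment_prefix):
                continue  # the CSV reader never sees this line
            yield row

    @property
    def current(self):
        return self._c


source = io.StringIO('# exported rows\nid,name\n# a comment\n1,alice\n')
rows = list(csv.reader(Peekaboo(source, comment_prefix='#')))
print(rows)  # [['id', 'name'], ['1', 'alice']]
```

Because the filtering happens before parsing, only lines that begin with the prefix are dropped; a "#" appearing inside a field is left alone.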
The Peekaboo is nothing revolutionary; to experienced developers it is likely just obvious. But I've introduced it to enough Python developers to believe it worthy of a mention.