pyaccuwage/scripts/pyaccuwage-pdfparse
Binh Nguyen 1c7533973a Parsing all the way through the pdf appears to work. Next we need
to track the beginning/ending points for each record and append
continuation records onto the previous. There's some issue in
the pyaccuwage-pdfparse script causing it to have problems reading
the last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it
if it's determined to hold junk data?

The record builder seems to handle everything just fine.

Added a function to the field name parsing to replace ampersands
with an "and" string so as not to cause problems with variable names.
2012-11-13 15:53:41 -06:00
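
The ampersand replacement mentioned in the commit message can be illustrated with a minimal sketch. The helper name `field_name_to_identifier` and the exact cleanup rules are assumptions made for illustration; they are not taken from `pyaccuwage.parser`, whose actual function may differ.

```python
import re


def field_name_to_identifier(name):
    """Turn a human-readable field name into a Python-friendly identifier.

    Hypothetical helper, not the function used by pyaccuwage.
    """
    # Replace ampersands with the word "and" first, so the conjunction
    # survives the punctuation stripping below.
    name = name.replace("&", " and ")
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Drop underscores left over at either end and normalise the case.
    return name.strip("_").lower()
```

Doing the ampersand substitution before the generic cleanup matters: stripping punctuation first would silently drop the "&" and merge the words on either side of it.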

54 lines
1.7 KiB
Python
Executable file

#!/usr/bin/python
from pyaccuwage.parser import RecordBuilder
from pyaccuwage.pdfextract import PDFRecordFinder
import argparse
import sys
import os
import re
parser = argparse.ArgumentParser(description="Parse and convert contents of IRS files into pyaccuwage e-file classes.")
parser.add_argument("-i", "--input", nargs=1, required=True, metavar="file", type=argparse.FileType('r'), help="Source PDF file, e.g. p1220.pdf")
parser.add_argument("-f", "--full", help="Generate full python file, including related imports.", action="store_true")
args = parser.parse_args()

def generate_imports():
    return "\n".join([
        "from pyaccuwage import model",
        "from pyaccuwage.fields import *",
        "",
        "",
    ])

def generate_class_begin(name):
    # The generated imports bring in `model`, so the base class is model.Model.
    return "class %s(model.Model):\n" % name

if args.full:
    sys.stdout.write(generate_imports())
source_file = os.path.abspath(args.input[0].name)
doc = PDFRecordFinder(source_file)
records = doc.records()
builder = RecordBuilder()
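
def record_begins_at(record):
    # First number of the position range (split on '-') held by the entry at index 1.
    return int(list(record[1][1].data.values())[0].split('-')[0], 10)

def record_ends_at(record):
    # Last number of the position range held by the final entry.
    # This can raise when the final entry holds free text instead of a range.
    return int(list(record[1][-1].data.values())[0].split('-')[-1], 10)
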
for rec in records:
    print(record_begins_at(rec))  # , 'to', record_ends_at(rec)
    # FIXME record_ends_at is randomly exploding due to record data being
    #       a lump of text and not necessarily a field entry. I assume
    #       this is cleaned out by the record builder class.
    sys.stdout.write("class %s(object):\n" % re.sub(r'[^\w]', '', rec[0]))

    for field in builder.load(map(lambda x: x.tuple, rec[1][1:])):
        sys.stdout.write('\t' + field + '\n')
        #print field
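
The FIXME above and the commit message's plan to append continuation records onto the previous record could be approached roughly as follows. This is only a sketch under stated assumptions: `parse_position_range` and `merge_continuations` are hypothetical names, the `(name, begin, fields)` tuples are a simplified stand-in for whatever `PDFRecordFinder` actually yields, and the rule "a record that does not begin at position 1 is a continuation" is a guess rather than something the source confirms.

```python
import re

# Matches a whole string such as "7", "12-15" or "544 - 750".
RANGE_RE = re.compile(r"^\s*(\d+)(?:\s*-\s*(\d+))?\s*$")


def parse_position_range(text):
    """Return (begin, end) for a range string, or None for any other text."""
    match = RANGE_RE.match(text)
    if match is None:
        # Free text (a "lump of text") is reported instead of raising.
        return None
    begin = int(match.group(1), 10)
    # A single position such as "7" begins and ends at the same place.
    end = int(match.group(2), 10) if match.group(2) else begin
    return begin, end


def merge_continuations(records):
    """Append continuation records onto the record that precedes them.

    `records` is an iterable of (name, begin, fields) tuples. Assumption:
    a record whose first field does not begin at position 1 continues the
    previous record.
    """
    merged = []
    for name, begin, fields in records:
        if merged and begin != 1:
            # Continuation: extend the field list of the previous record.
            merged[-1][1].extend(fields)
        else:
            merged.append((name, list(fields)))
    return merged
```

Returning `None` for non-range text lets a caller skip junk entries instead of crashing on `int()`, which is the failure described in the FIXME.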