the sequence comments are returned as string tuples. The next step
is to take these results, convert them to integers, and verify that
they occur in the expected linear order.
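A minimal sketch of that conversion and ordering check, assuming the
comments parse as ("start", "end") string pairs (the exact shape may
differ):

    def parse_sequence_numbers(raw_pairs):
        """Convert ("start", "end") string tuples to ints and check that
        each pair picks up right after the previous one."""
        pairs = [(int(start), int(end)) for start, end in raw_pairs]
        for (_, prev_end), (next_start, _) in zip(pairs, pairs[1:]):
            if next_start != prev_end + 1:
                raise ValueError("sequence break: %d does not follow %d"
                                 % (next_start, prev_end))
        return pairs

    # parse_sequence_numbers([("1", "9"), ("10", "19"), ("20", "29")])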
There's an issue parsing p1220 at line 2570. Making the parser ignore
full-width lines during parsing might fix the problem, if there's some
way to measure the length of a row while only counting single-spaced words.
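One possible heuristic, assuming rows arrive as plain strings: columnar
rows separate their cells with runs of two or more spaces, so a row that
splits into a single long cell of single-spaced words is probably
full-width prose (the word-count threshold here is arbitrary):

    import re

    def is_full_width(row, min_words=8):
        # columnar rows have 2+ space gaps between cells; prose rows
        # are one long run of single-spaced words
        cells = [c for c in re.split(r"\s{2,}", row.strip()) if c]
        return len(cells) == 1 and len(cells[0].split()) >= min_words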
backwards to determine the record name. We also added the ability to "break" from
reading a series of field definitions at certain break points such as
"Record Layout". There is currently an error in p1220 at line 2704, caused
by the column data starting in the 4th column, "Description and Remarks".
If ColumnCollectors started from the field titles, and were aware of the column
positions derived from them, it might be possible to at least read the following
record fields without auto-adjusting them.
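ColumnCollector's real interface lives in the project; this standalone
sketch just shows the idea of seeding column spans from the title row:

    import re

    def title_column_spans(title_row):
        # character spans of each title in a header row such as
        # "Field Position   Field Title   Length   Description and Remarks"
        return [m.span() for m in re.finditer(r"\S+(?: \S+)*", title_row)]

    def slice_by_spans(row, spans):
        # cut a data row at the title start positions; each cell runs
        # from its title's start to the next title's start
        starts = [start for start, _ in spans] + [len(row)]
        return [row[a:b].strip() for a, b in zip(starts, starts[1:])]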
proper information prior to getting passed into the ColumnCollector.
It seems like some things are getting stripped out due to blank lines,
or perhaps the annoying "Record Layout" pages. If we could extract the
"Record Layout" sections first, things might be simpler.
are overlapping. I'm assuming this is due to a missing continue
or something similar inside the ColumnCollector. I added a couple of new
IsNextRecord exceptions in response to blank rows, but these may be causing
more problems than expected. The next step is probably to check the records
returned and verify that nothing is being duplicated. Some of the duplicates
may be filtered out by the RecordBuilder class, or during the fields filtering
in the pyaccuwage-pdfparse script (see: fields).
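A quick way to run that check; keying on repr here is just a stand-in
for whatever record identity actually makes sense:

    from collections import Counter

    def find_duplicates(records, key=repr):
        # report anything the extractor returned more than once
        counts = Counter(key(rec) for rec in records)
        return [k for k, n in counts.items() if n > 1]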
to track the beginning/ending points for each record and append
continuation records onto the previous one. There's some issue in
the pyaccuwage-pdfparse script that causes problems reading the
last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it,
if it's determined to hold junk data?
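Something like this at the end of the extraction loop, with is_junk
standing in for whatever test makes sense (too few columns, no parsable
range, and so on):

    def finish_extraction(collectors, is_junk):
        # drop a trailing collector that holds junk instead of returning it
        if collectors and is_junk(collectors[-1]):
            collectors = collectors[:-1]
        return collectors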
The record builder seems to handle everything just fine.
Added a function to the field name parsing that replaces ampersands
with the string "and" so they don't cause problems in variable names.
an issue where full-page-width blocks are being interpreted as a
single large column, and subsequent field definition columns are
then being folded into it as truncated subcolumns.
The current problematic line in p1220 is 1598.
Maybe add some functionality that lets us specify the number of
columns we're most interested in, or automatically discard 1-column
ColumnCollectors?
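The filter could be as simple as this, assuming a collector exposes
its columns (that attribute name is a guess):

    MIN_COLUMNS = 2  # a 1-column collector is almost certainly full-width text

    def useful_collectors(collectors):
        return [c for c in collectors if len(c.columns) >= MIN_COLUMNS]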
very confusing to keep track of, due to global iterators being passed around
and iterated over in chunks.
I've added a located_heading_rows method that scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
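The scan itself is presumably just a pattern match over every row;
something like the following, where the heading pattern is a guess at
how p1220 labels its record definitions:

    import re

    HEADING_RE = re.compile(r"Record Name:")  # guessed heading marker

    def located_heading_rows(rows):
        # row numbers that look like record definition headings; pairs
        # of consecutive hits give the spans to feed the row columnizer
        return [i for i, row in enumerate(rows) if HEADING_RE.search(row)]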
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.
One idea is to iterate through the entire token set looking for range
tokens. Right now, when a range token correctly continues the sequence,
it is assumed to start a new record. Instead, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude those as
possible field identifiers. In the sequence of range starts below, the
starred values are the out-of-order ones (sketch after the list):
1
10*
10
20
30
90*
40
10*
50
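One way to realize that whole-list scan: treat it as a
longest-increasing-subsequence problem over the range start values.
This is a sketch; the real tokens presumably carry more than bare
numbers, and ties between equal values are picked arbitrarily:

    def in_order_ranges(values):
        # longest strictly increasing subsequence, O(n^2); everything
        # off that subsequence is flagged out of order and excluded
        if not values:
            return [], []
        best = [1] * len(values)
        prev = [-1] * len(values)
        for i, v in enumerate(values):
            for j in range(i):
                if values[j] < v and best[j] + 1 > best[i]:
                    best[i], prev[i] = best[j] + 1, j
        i = max(range(len(values)), key=best.__getitem__)
        keep = set()
        while i != -1:
            keep.add(i)
            i = prev[i]
        accepted = [v for k, v in enumerate(values) if k in keep]
        excluded = [v for k, v in enumerate(values) if k not in keep]
        return accepted, excluded

    # in_order_ranges([1, 10, 10, 20, 30, 90, 40, 10, 50])
    # -> ([1, 10, 20, 30, 40, 50], [10, 90, 10])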