Commit graph

108 commits

Author SHA1 Message Date
6e4a975cfb Changed the way records are found: we now search for field headers and then work
backwards to determine the record name. We also added the ability to "break" from
reading a series of field definitions at certain break points, such as
"Record Layout". There is currently an error in p1220 line 2704, caused by
the column data starting on the 4th column, "Description and Remarks".

If ColumnCollectors started with the field titles and knew the column
positions from those, it might be possible to at least read the following
record fields without auto-adjusting them.
2012-12-04 16:04:08 -06:00
8995f142e5 Merge branch 'master' of brimstone.klowner.com:pyaccuwage
Conflicts:
	pyaccuwage/pdfextract.py
2012-12-04 14:57:20 -06:00
6e1d02db8d trying new header location method 2012-12-04 14:54:10 -06:00
e9a6dc981f Refer to the previous log, but also verify that records are returning
proper information prior to being passed into the ColumnCollector.
It seems like some things are getting stripped out due to blank lines,
or perhaps the annoying "Record Layout" pages. If we could extract the
"Record Layout" sections, things might be simpler.
2012-11-27 16:01:00 -06:00
31ff97db8a Almost have things working. It seems like some of the record results
are overlapping. I'm assuming this is due to missing a continue
or something inside the ColumnCollector. I added a couple new IsNextRecord
exceptions in response to blank rows, but this may be causing more problems
than expected. Next step is probably to check the records returned, and verify
that nothing is being duplicated. Some of the duplicates may be filtered out
by the RecordBuilder class, or during the fields filtering in the pyaccuwage-pdfparse
script (see: fields).
2012-11-20 16:05:36 -06:00
1c7533973a Parsing all the way through the pdf appears to work. Next we need
to track the beginning/ending points for each record and append
continuation records onto the previous. There's some issue in
the pyaccuwage-pdfparse script causing it to have problems reading
the last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it
if it's determined to hold junk data?

The record builder seems to handle everything just fine.

Added a function to the field name parsing to replace ampersands
with an "and" string so as not to cause problems with variable names.
2012-11-13 15:53:41 -06:00
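The ampersand replacement mentioned above might look something like this (a sketch only; `clean_field_name` and the exact normalization rules are assumptions, not the project's actual code):

```python
import re

def clean_field_name(raw):
    # Replace ampersands with "and" so the generated field name
    # stays a valid Python identifier.
    name = raw.replace("&", " and ")
    # Collapse any remaining non-alphanumeric runs into underscores.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
```

For example, `clean_field_name("Description & Remarks")` yields `"description_and_remarks"`.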
fe4bd20bad Record detection seems to be working much better. We currently have
an issue where full-page-width blocks are being interpreted as a
single large column, and subsequent field definition columns are
then being truncated into subcolumns.

The current problematic line in p1220 is 1598.

Maybe add some functionality which lets us specify the number of
columns we're most interested in? Automatically discard 1-column
ColumnCollectors maybe?
2012-11-06 15:34:35 -06:00
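The idea floated above, discarding 1-column ColumnCollectors, could be a simple filter (a sketch; the `columns` attribute is an assumption about the real ColumnCollector class):

```python
def useful_collectors(collectors, min_columns=2):
    # Drop collectors that only ever saw one column: full-page-width
    # text blocks tend to be misread as a single large column.
    return [c for c in collectors if len(c.columns) >= min_columns]
```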
46755dd90d updated VERSION 2012-10-16 13:22:44 -05:00
820f71b3f5 Merge branch 'master' of brimstone.klowner.com:pyaccuwage 2012-10-09 15:36:11 -05:00
6abfa5b345 fixed missing field, updated for 2012 2012-10-09 15:35:13 -05:00
30376a54f3 fixed missing field, updated for 2012 2012-10-09 15:31:35 -05:00
717f929015 updated records to match 2012 definitions 2012-09-25 15:45:00 -05:00
40fcbdc8b8 getting closer, added a FIXME to one of the fields. Having issues with columns in description fields 2012-07-17 15:44:28 -05:00
5dde3be536 forgot to convert tuple to list for the missing description field fix, derrrp 2012-07-17 14:16:28 -05:00
0dc55ab3dd fixed reading fields that don't have descriptions 2012-07-17 14:10:34 -05:00
b3aed20388 fixed rangetoken issue with single byte values 2012-07-10 15:41:47 -05:00
e8145c5616 adding new pdf extract capability 2012-07-10 15:24:13 -05:00
b77b80e485 We need to remove some of the yield statements because they make iteration
very confusing to keep track of, with global iterators being passed around
and iterated over in chunks.

I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
2012-06-30 15:21:05 -05:00
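The heading-row scan described above could be sketched like this (the regex and the function shape are assumptions; the commit's actual `located_heading_rows` implementation may differ):

```python
import re

# Assumed pattern for record-definition headings such as
# "RECORD NAME: Code RA" in the IRS p1220 text.
HEADING_RE = re.compile(r"record\s+name\s*:", re.IGNORECASE)

def located_heading_rows(lines):
    # Return the row numbers whose text looks like a record heading;
    # the spans between consecutive hits bound each record definition
    # and can be fed into the row columnizer.
    return [i for i, line in enumerate(lines) if HEADING_RE.search(line)]
```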
6b5eb30f34 added ColumnCollector, fixed column parsing by scanning for whitespace before separating 2012-06-26 15:55:18 -05:00
fecd14db59 adding pdfextract for column extraction 2012-06-19 15:37:17 -05:00
770aeb0d2b Ranges in descriptions are ignored, except in cases where the
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.

One idea is to iterate through the entire token set and look for
range tokens. When a range token correctly continues the sequence,
it is assumed to begin a new record. Instead, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude them as
possible field identifiers.

For example (asterisks mark the out-of-order ranges to exclude):

1
10*
10
20
30
90*
40
10*
50
2012-06-06 14:46:17 -05:00
04b3c3f273 Added pyaccuwage-parse script.
We encountered a problem with the parser where a description contained
a range value and the parser thought it was the beginning of a new field
definition. We should be able to exclude the incorrect range values
by looking at our last good range, and if the range does not continue
the previous range, then it is probably incorrect and can be discarded.

These changes can probably be performed in the tokenize section of the
parser.
2012-06-02 15:16:13 -05:00
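The last-good-range check proposed above might be sketched as follows (assumptions: tokens are (kind, value) pairs, and a "range" value is a (start, end) byte span; the real tokenizer's representation may differ):

```python
def filter_out_of_order_ranges(tokens):
    """Keep only range tokens that continue the previous good range."""
    last_end = 0
    kept = []
    for kind, value in tokens:
        if kind == "range":
            start, end = value
            if start == last_end + 1:
                # Continues the sequence: treat as a real field definition.
                last_end = end
                kept.append((kind, value))
            # Otherwise it's a range quoted inside a description; drop it.
        else:
            kept.append((kind, value))
    return kept
```

A range like 5-7 appearing after a field ending at byte 9 does not continue the sequence, so it is discarded rather than starting a bogus field.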
69da154e59 attempting to add a commandline script 2012-06-02 14:18:48 -05:00
ad5262e37e added length checking to field matching criteria for parser 2012-05-08 14:08:39 -05:00
2c9551f677 Fixed issue with last item not being insert into tokens. Now able to convert PDF text into record field definitions pretty reliably. Need to add additional field type detection rules. 2012-04-18 14:51:59 -05:00
027b44b65c Parser is mostly working, there's an issue with the last grouping of tokens
not being parsed. This can probably be fixed by yielding an end-marker from the
tokenizer generator so the compiler knows to clear out the last item.
2012-04-13 14:39:02 -05:00
6e9b8041b9 adding a simple parser for reading stuff from pdfs 2012-04-05 15:19:00 -05:00
97a74c09f9 fixed some field types, misc 2011-11-12 15:26:17 -06:00
7772ec679f Renamed "verify" functions to "validate".
Another idea for defining the fields in records
would be to create a class method that would instantiate
the individual fields at instance creation rather than
during class definition. This would use less memory when
there are no Record objects being used.

Storing each Field, after it's instantiated, in a List as well as a
Dict would remove the need to count the Field instantiation order,
since the List would hold them in their proper order.
2011-11-12 13:50:14 -06:00
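The List-plus-Dict idea above might look like the following (a sketch under assumed names, not the actual pyaccuwage classes):

```python
class Field:
    def __init__(self, name):
        self.name = name
        self.value = None

class Record:
    @classmethod
    def build_fields(cls):
        # Instantiate fields at instance creation, not at class
        # definition, so no memory is used until a Record exists.
        return [Field("record_type"), Field("tax_year")]

    def __init__(self):
        self.field_list = self.build_fields()                     # keeps order
        self.field_dict = {f.name: f for f in self.field_list}    # lookup by name
```

The List preserves the definition order for serialization, while the Dict gives by-name access without counting instantiation order.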
ea492c2f56 renamed NumericField to IntegerField 2011-11-05 14:12:47 -05:00
a3f89e3790 fixed a couple field types being wrong, improved validation, auto-truncate over-length fields 2011-11-05 14:11:37 -05:00
076efd4036 0.0.6, fixed field types 2011-10-29 14:58:59 -05:00
7cb8bed61e Bumped version to 0.0.5
Fixed problem where fields contained shared values by
performing a shallow copy on all fields during Record instantiation.
That way, each record has its own copy of the field instances, rather
than the shared class-wide instance provided by the definition.
2011-10-29 14:03:03 -05:00
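The shallow-copy fix described above can be sketched like this (assumed class shapes, not the project's actual definitions):

```python
import copy

class Field:
    def __init__(self, name):
        self.name = name
        self.value = None

class Record:
    # Class-level definition: a single shared Field instance...
    fields = {"record_type": Field("record_type")}

    def __init__(self):
        # ...so shallow-copy every Field at instantiation, giving each
        # Record its own value store instead of the class-wide one.
        self.fields = {name: copy.copy(field)
                       for name, field in type(self).fields.items()}
```

With the copy in place, setting a value on one Record no longer leaks into every other Record of the same class.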
4023d46b4a Changed a few fields to be optional.
Found a fairly difficult bug involving Field instances
being shared across Records. The issue is that Field instances
are static. I either need to implement a way to instantiate
copies of all the Fields per-record, or write a wrapping
interface which provides a unique value store on a per-Record
basis.
2011-10-25 14:54:22 -05:00
775d3d3700 bump to v0.0.3 2011-09-24 15:40:06 -05:00
3dfcf030e7 I can't type 2011-09-24 15:27:55 -05:00
c8965afab5 changing to version 0.0.2 2011-09-24 13:32:31 -05:00
93d7465e1a promoting to v0.2 2011-09-24 13:29:42 -05:00
1a0f4183e7 Everything works, or seems to. The package is now installable as
a regular python module through pip or whatever. Now our apps can
assemble data objects to be converted into accuwage files.
2011-09-17 11:22:04 -05:00
6f5d29faab moved everything into pyaccuwage subdir 2011-06-25 15:08:38 -05:00
5eb8925032 added __init__ to setup 2011-06-25 15:02:06 -05:00
78f8b845fe fixed set>setup 2011-06-25 14:59:18 -05:00
3d6a64db1d added test setup.py 2011-06-25 14:57:30 -05:00
ab16399e19 made enum names consistent 2011-06-25 14:33:28 -05:00
5f9211f30a fixed two silly syntax errors 2011-06-25 14:31:52 -05:00
0646bf7b9b Added record validation functions for everything (that we saw in the PDF),
we should go over them once more to make sure we didn't miss anything, but
testing validation should probably be done after that. Verify that the
record ordering enforcement code is correct, then start thinking of how
to get data from external sources into the record generator.
2011-06-11 14:45:12 -05:00
7dcbd6305b Added country code list to enums 2011-06-04 15:52:48 -05:00
a0014ca451 Added a MonthYear field, fixed some field required values and fixed
validation functions. Added numeric state abbreviation capability.
So far everything appears to be working well.
2011-06-04 15:46:41 -05:00
5781cbf335 Finished up most of the record order validation and also checking
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper
string format for EFW2 files (hopefully), this should still be
tested next week.
2011-05-07 15:19:48 -05:00
f30237a90d added custom field-record validator support, not using it yet though 2011-04-23 14:57:18 -05:00