
One problem with the Australian Archives is that it charges like a wounded bull for photocopying - 50c per page. So when I found a 350-page document that I was interested in, I knew that getting them to photocopy it for me was not the answer.

However, the archives will make a digital copy of five documents per person per year for free. So I asked for a digital copy to be made.

This was on 17 October. Yesterday, they put it up so it could be accessed online from RecordSearch. Three hundred and fifty colour JPGs -- one per page!

So I wrote a Perl script to suck them all down onto my hard disk. Then I coded up a C# program to turn them into a single PDF document.
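For anyone wanting to try the same thing, the download step might look roughly like the following. This is a minimal sketch in Python rather than the Perl actually used, and it makes assumptions: the URL template is a made-up placeholder (not the real RecordSearch address), and the names `page_url`, `fetch_bytes` and `download_pages` are hypothetical. It only fetches the page images; stitching them into a PDF is a separate step not shown here.

```python
from pathlib import Path
from urllib.request import urlopen


def page_url(template: str, page: int) -> str:
    """Fill the page number into a URL template such as '.../{page}.jpg'."""
    return template.format(page=page)


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def download_pages(template, page_count, out_dir, fetch=fetch_bytes):
    """Save pages 1..page_count as page_001.jpg, page_002.jpg, ... in out_dir.

    Pages already on disk are skipped, so an interrupted run can be resumed.
    The fetch argument can be swapped out, e.g. for testing without a network.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in range(1, page_count + 1):
        target = out_dir / f"page_{page:03d}.jpg"
        if not target.exists():
            target.write_bytes(fetch(page_url(template, page)))
        saved.append(target)
    return saved


if __name__ == "__main__":
    # Placeholder address: replace with the real per-page image URL.
    download_pages("http://example.invalid/scan/{page}.jpg", 350, "scans")
```

Zero-padded file names keep the pages in the right order when the folder is sorted, which matters for whatever program later turns them into a single document.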

What on Earth people who can't program do, I wouldn't know. Maybe they just make everybody do CompSci 101 these days.


( 4 comments )
12th Nov, 2003 02:17 (UTC)
Re: Ex-Digitate
"Three hundred and fifty colour JPGs -- one per page!"
(goggles) And they call that a digital copy? Haven't they heard of Optical Character Recognition? Or was the thing hand-written or something?
Does your single PDF document consist of 350 images, or did you manage to do more?
12th Nov, 2003 12:55 (UTC)
The curious can view the document here. You'll need JavaScript turned on.

My PDF contains exactly 351 pages -- they digitised the cover as well. It's typed. So I will give the OCR program a go and see what it comes up with. It will require a script too, of course.
12th Nov, 2003 19:03 (UTC)
Re: Ex-Digitate
Looking at the document now, I can almost see why they didn't bother trying to OCR it. Yeah, it's typed, but that image quality is so poor (low contrast and blurry), I could barely read it myself. I don't think a poor OCR program is going to have much of a chance.
13th Nov, 2003 12:23 (UTC)
The original was a simple typed page! I don't know where the blueness came from -- somewhere in their digitisation process.

They don't normally OCR documents because so many have handwriting. The choice of resolution is, I think, to make the scans smaller.