Tuesday, July 13, 2004

Senate Iraq Report Censored... With a Scanner?

From Technology Review

posted by Simson Garfinkel @ 7/10/2004 12:02:56 PM

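Over the past few years there have been numerous cases in which classified information has leaked to the public because it was censored using Adobe Acrobat’s "black box" feature. The box hides the text on screen, but the text itself stays in the file underneath the image, where it can be selected and copied back out.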
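To see why a black box by itself hides nothing, here is a toy model in Python. It is only a sketch: it is not Acrobat's file format or API, and the names `Page`, `draw_black_box`, `visible_text` and `extract_text` are made up for illustration. The assumption is simply that a page keeps its text in one layer and draws shapes on top of it.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical model: a text layer plus shapes drawn on top of it.
    text_runs: list = field(default_factory=list)  # (x, y, string)
    overlays: list = field(default_factory=list)   # (x0, y0, x1, y1)


def draw_black_box(page, x0, y0, x1, y1):
    """Cover a region visually; the text layer is left untouched."""
    page.overlays.append((x0, y0, x1, y1))


def visible_text(page):
    """What a reader sees: runs whose origin lies under a box are hidden."""
    def covered(x, y):
        return any(x0 <= x <= x1 and y0 <= y <= y1
                   for x0, y0, x1, y1 in page.overlays)
    return [s for x, y, s in page.text_runs if not covered(x, y)]


def extract_text(page):
    """What a text extractor sees: every run, with the boxes ignored."""
    return [s for _, _, s in page.text_runs]


page = Page(text_runs=[(10, 10, "The source was"), (10, 30, "SECRET NAME")])
draw_black_box(page, 0, 25, 200, 40)

print(visible_text(page))   # only the uncovered run
print(extract_text(page))   # both runs, including the "censored" one
```

In this model the only way to really remove the text is to delete it from `text_runs`, or to throw the text layer away entirely and keep just a picture of the page.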

Well, you won’t be able to find text-under-the-image on the version of the report handed out by the Senate intelligence committee (http://intelligence.senate.gov/). That’s because the report on their home page was scanned and the scan was put up for download!

This is one way to make sure that nobody can recover the underlying material. Unfortunately, it also produces a report that’s 23.4MB in size, probably 10x larger than it needs to be. And, even worse, the report isn’t searchable.
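A rough way to tell a scan-only PDF from one that carries text is to look at the raw bytes for font resources. The sketch below is a heuristic, not a parser: the function name `looks_image_only` is hypothetical, and it assumes the file's page dictionaries are stored uncompressed, as they commonly are in older PDFs. A file that packs its objects into compressed streams can fool it either way.

```python
def looks_image_only(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF is page images with no text layer.

    Heuristic only: a page that draws text needs a /Font resource,
    while a scanned page normally references only image XObjects.
    """
    has_font = b"/Font" in pdf_bytes
    has_image = b"/Image" in pdf_bytes
    return has_image and not has_font


# Synthetic fragments standing in for real files (assumed, for illustration).
scanned = b"%PDF-1.4\n1 0 obj << /Type /XObject /Subtype /Image >> endobj\n"
with_text = (b"%PDF-1.4\n1 0 obj << /Type /Page /Resources "
             b"<< /Font << /F1 2 0 R >> /XObject << /Im0 3 0 R >> >> >> endobj\n"
             b"3 0 obj << /Type /XObject /Subtype /Image >> endobj\n")

print(looks_image_only(scanned))
print(looks_image_only(with_text))
```

To try it on a real file, read it with `open(path, "rb").read()` and pass the bytes in. An OCR'ed copy that keeps the text underneath the page images should come out like the second fragment, since it contains both images and fonts.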

As a public service, I have OCR’ed the report and put up two versions for download.

http://web.mit.edu/simsong/www/iraqreport2-textunder.pdf is a copy of the scan but with OCR applied, with the text underneath the original images. It has all the fidelity of the original report but you can search it. No clue why this version of the report is half the size of the original.

http://web.mit.edu/simsong/www/iraqreport2-ocr.pdf is just the OCR’ed text. It’s 4.3MB in size. There are many random OCR errors, including occasional bold text that should be something else, but it’s pretty reasonable, easy to search, and quick to download.