Nomoa.com

Paving the way for .NET in Tonga

Ubuntu - the straw that broke the camel's back

Soap Box
Posted by: Samiuela LV Taufa on May 30, 2007 2:57:24 PM

Otherwise known as: How can I view those Microsoft Office 2003 Scanned documents in Unix?

[Update 2007.06.14 to include gnome2/nautilus-script and hopefully clarified some text]

The desktop replacement Ubuntu box I've been putting together for my father-in-law's office has ground to a halt because of a very simple problem:

I can't find a graphics viewer for the TIFF files created by Microsoft Office Document Imaging, the MS Office 2003 tool for managing scanners.

Technically, Ubuntu/Linux can view the multiple images embedded in the TIFF file, but it is a song and dance affair at the moment that is doable for a techno-dweeb, but not yet accessible to mere humans.

Scanning: YES we can scan documents under Ubuntu by using XSane Image Scanner, but I'm interested in viewing TIFF documents created by business partners.

Background

Microsoft Office Document Imaging (MDI) is part of the Microsoft Office 2003 suite and provides a generic tool for scanning images into your machine (most notably for attaching printed documents to email). It is far easier to use than the driver-based tools provided by scanner manufacturers, and it gives users a single application to be trained on for scanning documents and pictures.

We currently use it for scanning contracts, forms and sent faxes to forward between business partners. Likewise, our major business partner uses it extensively when sending us printed forms and faxes.

MDI's scanning tool saves multi-page scanned documents as a single TIFF file. Within this TIFF file are:

  1. JPEG images of the scanned pages
  2. JPEG thumbnails of the scanned pages
  3. OCR'd versions of the above pages (OCR - Optical Character Recognition - an attempt to recognise the text in your document)

The TIFF/TIF file format has been extensively documented, including in Microsoft's own promotional blurb about Microsoft Office Document Imaging:

TIFF is a commonly used format for various imaging applications, including those that scan and fax. Microsoft Office Document Imaging uses the TIFF format, utilizing the format's capability to contain text recognized by optical character recognition (OCR) (OCR: Translates images of text, such as scanned documents, into actual text characters. Also known as text recognition.) When you scan new documents, they are saved in TIFF format (with a .tif extension), and any OCR text is stored in the TIFF file along with the image.

You can open and edit TIFF files created with Office Document Imaging by using many other graphics applications. When you do so, any OCR text that the file contains is lost. You will have to rerun OCR if you want to access the text in the TIFF file again in Office Document Imaging.

It seems that Microsoft is using legitimate extensions to TIFF 6.0 and its extensions, but not nearly enough programmers out there have access to the documentation on these extensions or can cut the code for them.

There are some further notes on the TIFF format in the Unix section below, but even within Microsoft Windows there are problems viewing these scanned multi-page documents.

Viewing:

On an MS Windows XP desktop, you can view these multiple pages using:

  • Microsoft's MDI application's viewer, or
  • the free application IrfanView, or
  • the free application XnView (my preferred tool at the moment)

On Unix/Linux there's a convoluted way to get at the images, shown later in this post.

Viewing Limited:

Microsoft's Office Picture Manager (12.0.4518) can view the first image in the TIFF file, but I can't see any way of viewing the rest of the images, and there is no indication that the file contains multiple images (leading you to conclude that the single image you see is the only relevant one).

OT: Weird limitation considering the product is shipped by the same team, only further highlighting how big Office development/programming has become.

Viewing NOT:

You cannot, however, view the images using Microsoft's own current tools and other popular tools.

Unfortunately I don't have a copy of Adobe Photoshop on my machine to give people more information.

Similarly, I have tried to view multipage TIFFs on Linux with the following applications, all of which failed, most with errors complaining about the TIFF format:

  • F-Spot Photo Manager 0.3.5 (crashes on import, and fails to display image/s)
  • GIMP Image Editor 2.2.13 (multiple "unknown field tag" error messages on loading file)
  • GNU Paint gpaint-2 0.3.0-pre5 (error: cannot open file)
  • gThumb Image Viewer 2.10.2 (no errors, but no image view)
  • Gwenview 1.4.1 (multiple "unknown field tag" errors and "Invalid YCbCr subsampling")
  • xloadimage 4.1 (same error message as tiffinfo shown below)

Viewing GNU Linux:

Thanks to a post by Michael R. Head, there is a way to view the multipage TIFF files, but there is some command-line magic you have to walk through.

Let's first take a look at an indicator that we have a TIFF file created by Microsoft's MDI by using LibTIFF's tiffinfo tool. We first transport 2 multipage TIFF files (multipage.tif and multipage2.tif) from our Windows box to Ubuntu Linux.

$ ls
multipage2.tif  multipage.tif
$ file multipage.tif multipage2.tif
multipage.tif:  TIFF image data, little-endian
multipage2.tif: TIFF image data, little-endian

The Unix file utility tells us that both files in this example have the format "TIFF image data, little-endian".
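What file reports can also be checked by hand. A TIFF file starts with a two-byte byte-order mark (II for little-endian, MM for big-endian) followed by the number 42. Below is a minimal sketch of that check; the function name tiff_byte_order is my own, made up for illustration.

```shell
# tiff_byte_order FILE: print "little-endian", "big-endian" or "not-tiff"
# (hypothetical helper; it only inspects the first four bytes of the file)
tiff_byte_order() {
    # Dump the first 4 bytes as hex and strip the whitespace od adds
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        49492a00) echo "little-endian" ;;  # "II" then 42 stored low byte first
        4d4d002a) echo "big-endian" ;;     # "MM" then 42 stored high byte first
        *)        echo "not-tiff" ;;
    esac
}
```

Running tiff_byte_order multipage.tif should agree with what file printed above, although it says nothing about whether the rest of the file is readable.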

$ tiffinfo multipage.tif
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 513 (0x201) encountered.
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 514 (0x202) encountered.
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 37680 (0x9330) encountered.
multipage.tif: Invalid YCbCr subsampling.
TIFFReadDirectory: multipage.tif: cannot handle zero strip size.

tiffinfo does not recognise portions of multipage.tif, and as the next run shows, it trips over what seem to be the equivalent fields in multipage2.tif.

$ tiffinfo multipage2.tif
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 513 (0x201) encountered.
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 514 (0x202) encountered.
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 37680 (0x9330) encountered.
multipage2.tif: Invalid YCbCr subsampling.
TIFFReadDirectory: multipage2.tif: cannot handle zero strip size.
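Tags 513 and 514 are the JPEGInterchangeFormat and JPEGInterchangeFormatLength fields from the original TIFF 6.0 JPEG scheme. Tag 37680 shows up in both of my samples, so it might serve as a rough fingerprint for MDI output. That is only an assumption drawn from two files, and the function name looks_like_mdi is made up for this sketch.

```shell
# looks_like_mdi: read tiffinfo output on stdin and succeed if it contains
# the "unknown field with tag 37680" warning seen in both sample files.
# Assumption: this tag is specific to MDI-written TIFFs.
looks_like_mdi() {
    grep -q 'unknown field with tag 37680'
}

# Usage (assumes tiffinfo is installed); the warnings go to stderr,
# so merge the streams before piping:
# tiffinfo multipage.tif 2>&1 | looks_like_mdi && echo "probably an MDI TIFF"
```

A newer libtiff that knows this tag would no longer print the warning, so treat this as a quick hint and not a reliable detector.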

Seeing the error messages displayed by tiffinfo helps us to understand some of the error messages displayed by the above image viewers. The errors imply that these viewers use the libtiff library and inherit its limitations. It should be pointed out here that libtiff.org documents:

TIFF 6.0 Specification Coverage

The library is capable of dealing with images that are written to follow the 5.0 or 6.0 TIFF spec. There is also considerable support for some of the more esoteric portions of the 6.0 TIFF spec.
...
Note that there is no support for the JPEG-related tags defined in the 6.0 specification; the JPEG support is based on the post-6.0 proposal given in TIFF Technical Note #2.
...
The JPEG-related tag is specified in TIFF Technical Note #2 which defines a revised JPEG-in-TIFF scheme (revised over that appendix that was part of the TIFF 6.0 specification).

I am not sure how relevant the above is to the Microsoft MDI problem, but suffice it to say I don't know enough to blame anyone for why so much open source software lacks support for viewing MDI multi-page TIFF files.

Unix: Extracting the Images

We now know that the TIFF file could be a legitimate TIFF file, but we can't view the images without resorting to a Windows box. Thanks again to Michael R. Head's article, the solution is a forensics tool called Foremost.

Foremost is a console program to recover files based on their headers, footers, and internal data structures. This process is commonly referred to as data carving. Foremost can work on image files, such as those generated by dd, Safeback, Encase, etc, or directly on a drive. The headers and footers can be specified by a configuration file or you can use command line switches to specify built-in file types. These built-in types look at the data structures of a given file format allowing for a more reliable and faster recovery.

Foremost seems to recognise the JPEG and OLE data embedded in the TIFF written by Microsoft's MDI, so it can extract the separate streams/images and store them to disk for later processing. Using foremost is rather simple, as shown below on our two multipage files.

$ ls
multipage2.tif  multipage.tif
$ foremost -i multipage.tif -o multipage
Processing: multipage.tif
|*|
$ foremost -i multipage2.tif -o multipage2
Processing: multipage2.tif
|*|

foremost creates the output directory given with -o, containing the subdirectories jpg and ole: jpg holds the images (both full images and thumbnails), and ole holds the OCR'd versions of the pages.

$ ls -R
.:
multipage  multipage2  multipage2.tif  multipage.tif
./multipage:
audit.txt  jpg  ole
./multipage/jpg:
00000000.jpg  00000545.jpg  00000937.jpg  00001543.jpg  00002127.jpg
00000538.jpg  00000931.jpg  00001535.jpg  00002120.jpg  00002682.jpg
./multipage/ole:
00002692.ole
./multipage2:
audit.txt  jpg  ole
./multipage2/jpg:
00000000.jpg  00002941.jpg  00006432.jpg  00009274.jpg  00011870.jpg  00014243.jpg  00016827.jpg
00001609.jpg  00004364.jpg  00006444.jpg  00009284.jpg  00011879.jpg  00014252.jpg  00016836.jpg
00001622.jpg  00004375.jpg  00007880.jpg  00010598.jpg  00012939.jpg  00015470.jpg  00018163.jpg
00002931.jpg  00004954.jpg  00007891.jpg  00010608.jpg  00012948.jpg  00015481.jpg
./multipage2/ole:
00018172.ole

The jpg files, being thumbnails and full images, should have distinctive sizes, as the long listing below shows.

$ ls -lR
./multipage:
total 12
-rw-r--r-- 1 samt samt 1178 2007-05-30 14:47 audit.txt
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 jpg
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 ole
./multipage/jpg:
total 1380
-rw-r--r-- 1 samt samt 275019 2007-05-30 14:47 00000000.jpg
-rw-r--r-- 1 samt samt   3709 2007-05-30 14:47 00000538.jpg
-rw-r--r-- 1 samt samt 197089 2007-05-30 14:47 00000545.jpg
-rw-r--r-- 1 samt samt   3011 2007-05-30 14:47 00000931.jpg
-rw-r--r-- 1 samt samt 305575 2007-05-30 14:47 00000937.jpg
-rw-r--r-- 1 samt samt   4002 2007-05-30 14:47 00001535.jpg
-rw-r--r-- 1 samt samt 294723 2007-05-30 14:47 00001543.jpg
-rw-r--r-- 1 samt samt   3442 2007-05-30 14:47 00002120.jpg
-rw-r--r-- 1 samt samt 284052 2007-05-30 14:47 00002127.jpg
-rw-r--r-- 1 samt samt   4793 2007-05-30 14:47 00002682.jpg
./multipage/ole:
total 8
-rw-r--r-- 1 samt samt 5632 2007-05-30 14:47 00002692.ole
./multipage2:
total 12
-rw-r--r-- 1 samt samt 1998 2007-05-30 14:47 audit.txt
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 jpg
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 ole
./multipage2/jpg:
total 9200
-rw-r--r-- 1 samt samt 823649 2007-05-30 14:47 00000000.jpg
-rw-r--r-- 1 samt samt   6345 2007-05-30 14:47 00001609.jpg
-rw-r--r-- 1 samt samt 669597 2007-05-30 14:47 00001622.jpg
-rw-r--r-- 1 samt samt   5344 2007-05-30 14:47 00002931.jpg
-rw-r--r-- 1 samt samt 728014 2007-05-30 14:47 00002941.jpg
-rw-r--r-- 1 samt samt   5365 2007-05-30 14:47 00004364.jpg
-rw-r--r-- 1 samt samt 296251 2007-05-30 14:47 00004375.jpg
-rw-r--r-- 1 samt samt 756384 2007-05-30 14:47 00004954.jpg
-rw-r--r-- 1 samt samt   6134 2007-05-30 14:47 00006432.jpg
-rw-r--r-- 1 samt samt 734716 2007-05-30 14:47 00006444.jpg
-rw-r--r-- 1 samt samt   5064 2007-05-30 14:47 00007880.jpg
-rw-r--r-- 1 samt samt 707892 2007-05-30 14:47 00007891.jpg
-rw-r--r-- 1 samt samt   4973 2007-05-30 14:47 00009274.jpg
-rw-r--r-- 1 samt samt 672318 2007-05-30 14:47 00009284.jpg
-rw-r--r-- 1 samt samt   4854 2007-05-30 14:47 00010598.jpg
-rw-r--r-- 1 samt samt 645537 2007-05-30 14:47 00010608.jpg
-rw-r--r-- 1 samt samt   4784 2007-05-30 14:47 00011870.jpg
-rw-r--r-- 1 samt samt 542300 2007-05-30 14:47 00011879.jpg
-rw-r--r-- 1 samt samt   4081 2007-05-30 14:47 00012939.jpg
-rw-r--r-- 1 samt samt 662687 2007-05-30 14:47 00012948.jpg
-rw-r--r-- 1 samt samt   4416 2007-05-30 14:47 00014243.jpg
-rw-r--r-- 1 samt samt 623235 2007-05-30 14:47 00014252.jpg
-rw-r--r-- 1 samt samt   5299 2007-05-30 14:47 00015470.jpg
-rw-r--r-- 1 samt samt 688888 2007-05-30 14:47 00015481.jpg
-rw-r--r-- 1 samt samt   4436 2007-05-30 14:47 00016827.jpg
-rw-r--r-- 1 samt samt 678824 2007-05-30 14:47 00016836.jpg
-rw-r--r-- 1 samt samt   4619 2007-05-30 14:47 00018163.jpg
./multipage2/ole:
total 8
-rw-r--r-- 1 samt samt 5632 2007-05-30 14:47 00018172.ole

I don't know what the sequencing issues are with the file names, but it seems obvious that the larger files will be the full image, with one of the smaller files being a thumbnail of the same (presumably the nearest higher order number.)
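One guess at the sequencing: the file names look like offsets into the TIFF counted in 512-byte blocks. 00000000.jpg is 275019 bytes, which is just over 537 blocks, and the next file is 00000538.jpg. Either way, size is a more direct signal than position. In the multipage2 listing, 00004375.jpg (296251 bytes) and 00004954.jpg (756384 bytes) are both large with no thumbnail between them, so assuming a strict full-image/thumbnail alternation may pick the wrong files there. Below is a minimal sketch of a size-based filter. The function name list_full_pages and the 50000-byte default threshold are my own assumptions, chosen only because every thumbnail above is under 7 KB and every full image is over 190 KB.

```shell
# list_full_pages DIR [MIN_BYTES]: print the .jpg files in DIR that are larger
# than MIN_BYTES (default 50000, an assumed cut-off between thumbnails and
# full pages). Hypothetical helper, not part of foremost.
list_full_pages() {
    dir=$1
    min_bytes=${2:-50000}
    # The glob expands in sorted order, so pages come out in file-name order
    for f in "$dir"/*.jpg; do
        [ -f "$f" ] || continue
        # $(( )) normalises any padding some wc implementations add
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -gt "$min_bytes" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, list_full_pages multipage/jpg would print the five large files from the first listing. Scans at a much lower resolution might need a different threshold.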

Unix: Automating extraction and viewability

In a comment on Michael R. Head's article, typhoncore posted a nice bash script that uses ImageMagick's 'convert' utility and pdftk to create a multipage PDF file from the larger images. It is listed here with a few minor modifications I have inserted (for better or worse).

#!/bin/bash
# Usage: TIFFtoPDF.sh FILE.tif   (run from the directory containing FILE.tif)
[ -n "$1" ] || { echo "Usage: $0 FILE.tif" >&2; exit 1; }
DOC_COUNT=0
arg1=$1
arg_out=$arg1.out
echo "Extracting Images from $arg1 using foremost to $arg_out"
foremost -i "$arg1" -o "$arg_out"
echo "Done"
cd "$arg_out/jpg" || exit 1
echo "Converting Single Images to PDF"
# The glob lists the extracted files in name order; every other file
# (even count) is assumed to be a full page, the rest thumbnails
for i in *.jpg; do
   if [ $((DOC_COUNT % 2)) -eq 0 ] ; then
        echo -n "  >  $i to $i.pdf"
        convert "$i" "$i.pdf"
        echo " - done"
   fi
   DOC_COUNT=$((DOC_COUNT + 1))
done
echo -n "Merging separate single page PDF's to a multipage PDF"
pdftk *.pdf cat output merged.pdf
mv merged.pdf "../../$arg1.pdf"
echo "  - done"
cd ../..
echo -n "Removing temporary directory $arg_out"
rm -Rf "$arg_out"
echo "  - done"

My bastardisation of typhoncore's script adds console progress indicators (which double as documentation within the script) for us noobs.

Output of the script will look something like the below.

$ sh TIFFtoPDF.sh multipage.tif
Extracting Images from multipage.tif using foremost to multipage.tif.out
Processing: multipage.tif
|*|
Done
Converting Single Images to PDF
  >  00000000.jpg to 00000000.jpg.pdf - done
  >  00000545.jpg to 00000545.jpg.pdf - done
  >  00000937.jpg to 00000937.jpg.pdf - done
  >  00001543.jpg to 00001543.jpg.pdf - done
  >  00002127.jpg to 00002127.jpg.pdf - done
Merging separate single page PDF's to a multipage PDF  - done
Removing temporary directory multipage.tif.out  - done
$
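The script relies on three external programs (foremost, ImageMagick's convert and pdftk), and it fails in confusing ways if one of them is missing. Below is a small sketch of a check that could run first; the function name missing_tools is hypothetical.

```shell
# missing_tools NAME...: print each named command that is not on the PATH
# (hypothetical helper for illustration)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: stop early if anything the script relies on is absent
# missing=$(missing_tools foremost convert pdftk)
# [ -z "$missing" ] || { echo "Please install: $missing" >&2; exit 1; }
```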

Unix: GNOME GUIfying extraction and viewability

I was wondering whether a registry hack (Windows hat on) or some other means could let the file manager in X Windows (which I later discovered is called GNOME Nautilus) send TIFF files to the above script, when I came across a solution for a separate but related problem: Mount and UnMount ISO images without burning them.

That led me to a rehacked version of the above TIFFtoPDF.sh that can be placed in your ~/.gnome2/nautilus-scripts/ folder.

Read Nautilus File Manager Scripts : Questions and Answers for more details on how to get the below script working properly with Nautilus.

#!/bin/bash
# Nautilus script: convert the selected MS Office Document Imaging TIFF
# to a multipage PDF. Assumes a single selected file and that the script
# runs in that file's directory.

BASENAME=$(basename "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

DOC_COUNT=0
INFILE=$BASENAME
OUTPUT=$INFILE.odir

if ! zenity --question --title "Convert MS TIFF file to Multipage PDF" \
        --text "Do you wish to Convert the MS TIFF $BASENAME to a Multipage PDF?"
then
        exit 0
fi

foremost -i "$INFILE" -o "$OUTPUT"
cd "$OUTPUT/jpg" || exit 1

# Convert every other extracted JPEG (assumed to be full pages, not thumbnails)
for i in *.jpg; do
   if [ $((DOC_COUNT % 2)) -eq 0 ] ; then
        convert "$i" "$i.pdf"
   fi
   DOC_COUNT=$((DOC_COUNT + 1))
done
pdftk *.pdf cat output merged.pdf
mv merged.pdf "../../$INFILE.pdf"
cd ../..
rm -Rf "$OUTPUT"

The bare essentials for getting the above script working in GNOME Nautilus are:

  1. Put the script in ~/.gnome2/nautilus-scripts/
  2. Make the script executable
  3. Visit the directory using GNOME Nautilus
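The first two steps can be sketched as a small shell function. The function name install_nautilus_script and the script file name in the usage line are hypothetical; the destination directory is the one given above.

```shell
# install_nautilus_script SCRIPT [DIR]: copy SCRIPT into DIR and make it
# executable. DIR defaults to the Nautilus scripts folder named above.
install_nautilus_script() {
    src=$1
    dest_dir=${2:-$HOME/.gnome2/nautilus-scripts}
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    chmod +x "$dest_dir/$(basename "$src")"
}

# Usage (hypothetical file name):
# install_nautilus_script TIFFtoPDF-nautilus.sh
```

After that, visiting a directory in Nautilus and right-clicking a TIFF file should offer the script under the Scripts menu.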

Conclusion

There is no moving these guys to Ubuntu/Linux, or any other variant of Unix/BSD, until this image viewing problem finds a simpler solution.

Funny how for the big ticket items we were eventually able to find good alternate solutions, but things fell over with this simple yet insurmountable problem.

Microsoft Outlook 2003 --> now using Thunderbird 2.0.x
Microsoft Word 2003 --> we have been testing OpenOffice.org 2.2 Writer
Microsoft Excel 2003 --> we have been testing OpenOffice.org 2.2 Calc
Microsoft Access 2003 --> not currently using, no need for an alternative
Microsoft Publisher 2003 --> infrequent use, although testing scribus
Printing --> CUPS with Vendor Linux Drivers
Scanning --> XSane with Vendor Linux Drivers

Accounting Software --> Not currently using one, but looking around

For my own desktop needs, I'm still an XP man and will probably go to Vista with my next machine, as that will definitely be a TabletPC. But there are plenty of cheap Pentium IVs on www.ebay.com.au, so I'm getting an X Windows (Gnome/KDE) box up for some of the kids' fun and gaming (defining anything they enjoy as play).

The sledge-hammer solution would be to run a mail server that parses incoming emails for TIFF files and automatically detects and converts multipage files from TIFF to PDF. If this were a do-or-die situation I would probably work on it; as it is, it will have to wait for another day/solution.
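A much smaller step in that direction, short of a mail server, would be to sweep a directory of saved attachments for TIFFs that have not been converted yet. This is only a sketch under assumptions: the function name pending_tiffs and the directory in the usage line are hypothetical, and it relies on TIFFtoPDF.sh naming its output FILE.tif.pdf as shown earlier.

```shell
# pending_tiffs DIR: print each .tif in DIR that has no matching .tif.pdf yet
# (hypothetical helper; how attachments reach DIR is left open)
pending_tiffs() {
    for f in "$1"/*.tif; do
        [ -f "$f" ] || continue
        [ -f "$f.pdf" ] || printf '%s\n' "$f"
    done
}

# Usage (hypothetical paths): convert whatever is still pending
# pending_tiffs "$HOME/attachments" | while IFS= read -r f; do
#     ( cd "$(dirname "$f")" && sh "$HOME/bin/TIFFtoPDF.sh" "$(basename "$f")" )
# done
```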

References

Michael R. Head's Handling Microsoft Office Document Scanning TNEF and TIFFs in Linux
typhoncore's Multipage TIFF to Multipage PDF script
DRAFT TIFF Technical Note #2
Adobe Photoshop TIFF Technical Notes (PDF)
