BookScanWizard

Homepage: http://sourceforge.net/projects/bookscanwizard/
Interface: Command Line
Platform: Linux, Windows, Mac
License: GPL-3

Description

Book Scan Wizard (BSW) is a book scanning post-processor. It takes the pictures from a book scanner’s cameras and converts them into TIFF images that are suitable for converting to PDF, DjVu, or other formats.

If you want to try out BookScanWizard, the easiest way is to start it using Java Web Start. It can be accessed from the following URL:
http://bookscanwizard.sourceforge.net/run/

From the website

A utility to help with book scanning using cameras as a scanner. It will automate things such as cropping, rotating, fixing keystoning, fixing the DPI, and outputting the result to TIFF files that can be changed into PDFs or ebooks.

Mac Support

It is written in Java with the JAI toolkit, so it will run on many platforms, including Windows, Linux, and Mac. Note that the Mac version will run slower, because the JAI toolkit doesn’t have a native library for the Mac.

Install

The easiest way to install is to use the Java Web Start version. It will install the application and all the prerequisites. To do that, simply click this link:

http://bookscanwizard.sourceforge.net/run/

The other way is a manual process. Download the current version, unzip it, and follow the directions in the INSTALL.txt file.

Resources

http://sourceforge.net/apps/mediawiki/bookscanwizard/

Instructions / Documentation for BSW

Post by JonEP » 23 Mar 2011, 16:22

BSW looks like it will be incredibly useful. However, the interface seems a bit daunting to me. As it has now undergone several revisions, and has numerous features and commands, and as there are numerous threads on the forum dedicated to various questions and issues, I am wondering if there exists a good starting point for the average end user to try to figure out what BSW is and how to use it. So far, I don't see a wiki or documentation available (or have I just missed it?). Is the forum the best bet, currently, for trying to figure out how to make use of BSW?

Thanks.

JonEP

Re: Instructions / Documentation for BSW

Post by daniel_reetz » 23 Mar 2011, 16:25

I agree that this is a need. Someone (perhaps me) really needs to make a BSW introduction video. Steve has made quite an effort toward support and documentation, but an intro video would still help a lot. Here's his main post on this topic: viewtopic.php?f=3&t=839

daniel_reetz

Location: City of Angels

Re: Instructions / Documentation for BSW

Post by steve1066d » 23 Mar 2011, 16:42
It makes sense for me to do the video. I think I'll have some time Friday to do it.

Steve Devore aka steve1066d

BookScanWizard, a flexible book post-processor.

Location: Minneapolis, MN

Re: Instructions / Documentation for BSW

Post by steve1066d » 28 Mar 2011, 21:12

In the meantime, you could try following the example on the wiki. It might be enough to get you started.

Steve Devore

steve1066d

Location: Minneapolis, MN

Newbie Questions

Post by seasalt » 01 May 2011, 06:10

Hello - I am new to this forum - and new to book scanning - new to Acrobat - not very technical and very new to posting in forums - so apologies ahead if I use the wrong language/wrong place/wrong style.

I do love books though. So any help anyone wishes to send my way - I am very appreciative of.

I have read both app sections on the 2 tools - BSW and ST - plus the wiki, and I have questions. I am not sure which is the best area, so I put this under instructions as I could not find the answers anywhere. I have got ScanTailor (ST) working but am struggling with BookScanWizard (BSW). I would like to use BSW as many of my images require the same processing before I OCR them and create a PDF.

So I am delighted to find 2 tools that can help simple people like me. I have about 500 non-fiction books to scan in and I want to create SMALL sized (less than 10 MB), magnificent, searchable, bookmarked PDFs - not average. The IMPRINT of the book is important to me - page numbering - table of contents etc…

workspace:
MacBook (Intel), Mac OS X 10.6.x Snow Leopard
Adobe Acrobat Professional 10.x and ABBYY Express for Mac for post-processing. Also new: GraphicConverter
The source of each book page is 1 of these 3 options:
- flatbed scanner, 300 DPI, which can scan to TIFF, PNG, JPEG, JPEG2000 or PDF
- DjVu extracted to a PDF or TIFF document (using the djvulibre/extract tool)
- Adobe Digital Editions (ADE), using the print function to produce a screen-image PDF, which I then extract to TIFF

I am stumped as I don't get some basics - COMPRESSION (what sets it/how to) - IMAGE TYPES (text layers vs images, and the different types) - the best workflow sequence.

Specific questions (tell me if these should be in separate posts)

1) I cannot get BSW to install on my MacBook. I can use the Java Web Start version, but I am not online often. I have downloaded it successfully and now I have a bunch of scripts. install.txt tells me to do stuff with Java, but Java already comes with Snow Leopard 10.6.x, as I found out (thanks to anoymous1's post directing me to http://javatester.org/version.html to determine the version). I also have the Firefox Java plugin.

1.1) Can you direct me to step by step instructions please to install - and show me how I test if I have these in my system per install.txt
Java Advanced Imaging (JAI) 1.1.3 http://java.sun.com/products/java-media … 1_1_3.html
JAI ImageIO 1.1: http://download.java.net/media/jai-imag … elease/1.1
How do I run "java -Xmx1024m -jar BookScanWizard.jar" in the Terminal window? What do I type?

1.2) What do I set in the parameter override DPI

1.3) What do I put into field destination DPI (is it 300dpi for text and 600dpi for illustrations+text content)?

2) I don't understand how to get a "lossless" compression PDF and how compression works. I understand ST uses uncompressed TIFF. I don't know what BSW exports out to.

2.1) I don't know when compression is important, e.g. in the final step or the entire way through the process? (E.g. is it better to scan to TIFF at high quality (600 DPI with illustrations, 300 DPI if text, black and white, with all compression parameters turned off) and then set the compression in the "create PDF" tool?)

2.2) Is there more than 1 type of compression, e.g. compression at the image and compression at the output document (e.g. PDF), or am I just plain muddled?

2.3) What are the best settings for lossless compression in Adobe Acrobat Professional 10.x?

2.4) Is there a better tool for Mac 10.6.x users to create compressed PDFs? (I annotate a lot in my work, so I still prefer PDF to DjVu.)

2.5) When people say "uncompressed TIFF", does that mean TIFF or TIF? (I have both options in my scanner.)
Also, lossless means no loss of quality and small in size, correct?

2.6) Compression types that appear in the different apps
e.g. in BSW its G4 and deflate
e.g. in acrobat it appears to be in optimized options tab - colour/grayscale - monochrome and size slider
e.g. in djvu export to TIFF has 3 options "force bitonal G4 compression - Allow lossy JPEG compression and then set jpeg quality number x - allow deflate compression"

2.7) If I scanned to PDF rather than to a TIFF image (as it is quicker in scanning mechanics, e.g. the scanner does not finish its routine and is ready for the next page turn), is there any loss in quality?

2.8) How does colour - greyscale - black white affect "image quality" and/or compression

3) Image Type

3.1) I could not see in ST or BSW that speaks to layering. I understood unless we layer the text and the image separately the compression is impacted. am I mistaken? or is layer nothing to do with image type?

3.2) Don't understand what is advantage or disadvantage of using TIFF v PNG
I get ST uses TIFF and why (per the post) but with flat bed scanners, TIFF files cause fuzzy as they scan 1 directional, so for big books, I scan to jpeg 300dpi and then extract to TIFF.
— does this cause any loss in quality?

3.3) I don't understand what the JPEG2000 image type is. It's in Acrobat under the optimised PDF option (colour/grayscale), and it is also in my scanner, but it is not in BSW or ST.

3.4) For covers and back of book what is the suggested image settings for a beautiful looking high gloss colour image?

4) In terms of workflow

4.1) What is "keystone"? BSW mentions it. I have no idea what it is.

4.2) Is it best to not load cover/back into ST or BSW and process these images separately as they are different to rest of book

4.3) Is it best to do the image detailed work first for all the content (e.g. 300-400 pages) e.g. lighten boxes with text in them, so the text can be read in ocr

4.3) Or is it best to split 2 page/rotate/crop first, then do the detailed image work

4.4) What is best step and tool to scrub out handwritten notes? (I was thinking if I use ST content border and cropped at that, then most of the handwritten notes would be gone - as an alternative to scrubbing)

4.5) In ST I could not see how to "sharpen the text". I saw make the text fatter or thinner but nothing about sharpening text

4.6) Is it best to add cover last (as colour and different resolution to rest of pages) - ST did not do a clean cut on edges - or probably me as the border function did not make sense to me. I find cropping much easier.

4.7) Is the only way to get correct page numbers and table of contents in the "created PDF" to set them in WORD first? or is there some other ideas out there??

4.8) I could not find option in ST or BSW to change background colour e.g. old books its a bit yellow

5) software problems - in my testing for OCR quality

5.1) If I create huge files for OCR, ABBYY FineReader Express for Mac keeps crashing, e.g. 600 DPI for 300 pages. But if I put it at 300 DPI, it processes. (Is this a known issue in the book scanning community?)

5.2) Is anyone aware of a link comparing OCR quality in Acrobat, in PDFMaker, and in ABBYY FineReader?

- hopefully such a long post is ok -
thank you BIG TIME, in advance, for any help
cheers

seasalt

Newbie Answered

Post by steve1066d » 01 May 2011, 16:53

You should still be able to use the Web Start version even if you are offline. I'm not exactly sure where it installs on a Mac, but on a PC, it creates a shortcut in the Programs list. That's going to be easier than installing it manually. The manual version is meant for those who want to run things from the command line.

The override DPI is used to specify the DPI of your source document. It's equivalent to setting the DPI in ScanTailor. However, there are other, automatic ways of setting the DPI in BSW (like using the focal length of the camera, or scanning a barcode page first). If you are using one of those, you don't need to specify the DPI. If you are using a scanner, just enter whatever DPI your scanner was set to.

The destination DPI depends on what you are doing with the scan. If you are keeping the output as grayscale or color, my opinion is that 300 DPI is sufficient. If you are converting to bitonal (black and white), you may find 450 or 600 will give you better quality. There's a tradeoff between file size and quality (a 600 DPI image has 4 times as many pixels as a 300 DPI image, so uncompressed it is 4 times as big).

Compression is tough to understand because there are tradeoffs and parameters to adjust, so it isn't a real simple topic to begin with. The reason we do compression is to reduce the file size.

There are two kinds of compression: lossless and lossy. Lossless will only modestly reduce the file size (say by 30%). Lossy can reduce things much more, but it can introduce changes (called artifacts) in the compressed image. If a reasonable compression value is chosen, the changes shouldn't be very noticeable. If you are dealing with bitonal text or line drawing images, lossless compression works quite well, and creates a small, lossless file. For color or grayscale, uncompressed images are quite large. (A letter-size 24-bit color page at 600 DPI is around 100 megs.)
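The uncompressed size of a scanned page follows directly from the page dimensions, the DPI, and the bytes per pixel. The sketch below works through that arithmetic; the function name and the choice of US letter (8.5 x 11 inches) with 3 bytes per pixel for 24-bit color are illustrative assumptions, not anything taken from BSW.

```python
def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Raw pixel-data size of a page scanned at the given DPI (no headers)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# US letter page, 24-bit color (3 bytes per pixel), at two resolutions.
letter_color_600 = uncompressed_bytes(8.5, 11, 600, 3)
letter_color_300 = uncompressed_bytes(8.5, 11, 300, 3)

print(f"600 DPI: {letter_color_600 / 1e6:.1f} MB")
print(f"300 DPI: {letter_color_300 / 1e6:.1f} MB")
print(f"ratio:   {letter_color_600 / letter_color_300:.0f}x")
```

Doubling the DPI doubles both the width and the height in pixels, which is why the size grows by a factor of four rather than two.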

Normally, you only want to bother with compression at the output, as you generally don't care if an intermediate file is large (as those will just get deleted when you are finished). Also, you want to avoid lossy-compressing things multiple times, as you will lose a bit of quality each time you do that (think of making a copy of a copy of a VHS tape).

When you create a PDF, you can choose the compression you want.
I don't use Adobe Acrobat, so I can't really say what to use. If you give me the options, I can help you choose.
If you have access to Adobe Acrobat, I think that is the best tool to use. It can read any format that BSW can create, and gives you many options. If you have specific questions on using it, create a new question in the software forum.

TIFF and TIF are the same format. The format is officially TIFF, but old MS-DOS filenames could only have 3-letter extensions, which led to the .tif name.

Lossless means no loss in quality and smaller in size. Lossy compression will compress much smaller, but you lose some quality. I think you are better off with lossy compression unless you are dealing with bitonal images.

Yep… there's different compressions, and different apps can handle different ones…
Here's a quick take:
G4: a good lossless compression but only works for bitonal images.
deflate: a lossless compression that uses the same compression that .ZIP uses.
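To make "lossless" concrete, here is a minimal sketch using Python's standard zlib module, which implements deflate. The byte pattern is only a made-up stand-in for a mostly white bitonal scan line; it is not real image data and this is not how BSW writes TIFF files.

```python
import zlib

# Made-up stand-in for raw scan data: long white runs with short black runs.
raw = bytes([255] * 2000 + [0] * 50) * 100

packed = zlib.compress(raw, level=9)   # deflate, the same family .ZIP uses
restored = zlib.decompress(packed)

print("original bytes:  ", len(raw))
print("compressed bytes:", len(packed))
print("bit-identical after round trip:", restored == raw)
```

The round trip gives back exactly the same bytes, which is the defining property of lossless compression; a lossy codec such as JPEG would only return an approximation.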

2.7: It is probably using lossy JPEG compression when creating the PDF directly from the scanner, so yes, I'd expect some loss of quality.
2.8: I think I've answered that above… if you have further questions, let me know.

3.1: Adobe Acrobat does some fancy stuff where it breaks apart the background from the text and handles each separately. That way, the background can be compressed more heavily (using lossy compression) without affecting the text. ST & BSW don't layer data. DjVu and Adobe Acrobat have options to do so.

3.2: There are many options with TIFF files (which means they can have different compression types, etc.). PNG files are simpler. If a program handles PNG, then it can handle pretty much any PNG file. However, there are so many options with TIFF files that it is common to have TIFF files that can't be read by certain programs. As far as which to use, it doesn't really matter, as long as the software you are using can read them. I'm not sure why your scanner behaves differently on TIFF vs. PNG and JPEG.

3.3: BSW can handle JPEG2000 images. It reads JPEG2000 files normally, though writing them is done using the SaveImage operation instead of the default .tiff output. JPEG2000 has the advantages that it provides better compression than JPEG and offers a lossless mode. It can also compress to a target size, so if you wanted your pages to each be 100K, JPEG2000 is the best way of doing that. The disadvantage is that JPEG2000 isn't universally supported. Browsers don't read JPEG2000 images natively.

3.4: Standard lossy compression works fine.

So in summary:
Only bother with compression on the final creation of the PDF. Use lossy compression for grayscale or color images.

Hope this helps..
Steve Devore

steve1066d

Location: Minneapolis, MN


Newbie recommendations

Post by Misty » 03 May 2011, 17:59

steve1066d wrote:
"Yep… there's different compressions, and different apps can handle different ones…
Here's a quick take:
G4: a good lossless compression but only works for bitonal images.
deflate: a lossless compression that uses the same compression that .ZIP uses."

In Acrobat, I'd recommend using the JBIG2 lossless or lossy option rather than G4. It's a much more efficient compression, which will ensure much smaller filesizes. Lossless is perfectly lossless (and so identical to G4 in terms of the image you get), while lossy tries to identify similar characters and replaces them. It reduces the filesize even further, but you do run a risk of it making a few mistakes and replacing some similar-looking characters that aren't supposed to be identical. You can find the settings for that in the "convert to PDF" section of Acrobat's preferences.

If I'm remembering right, Acrobat X now defaults to JBIG2 lossless for monochrome compression instead of G4 like in older versions. If you import TIFFs that are monochrome, like a TIFF G4 from BSW, it will use the compression settings from the "convert to PDF" option's "TIFF" section.

My process using BookScanWizard

Post by steve1066d » 04 Feb 2011, 05:01

I thought I'd publish my process, and how I think is the most effective way to use BookScanWizard:

I've got a pretty much standard "New Standard" scanner, with the cameras running CHDK.

I've got the two cameras connected to a powered USB hub, with a switch spliced into the 5 V power line to the hub. I've seen using a hub mentioned before on this site, but not too often. It's a lot easier to do that than messing with batteries and splicing USB cords together. Here's the CHDK script I use:

Code: Select all
@title Remote button
:loop
press "shoot_half"
do
until (is_key "remote")
click "shoot_full"
release "shoot_half"
goto "loop"

The speed of that script is equivalent to the SDM "Fast" mode… It takes the picture pretty much immediately when the remote is pressed. (I used CHDK because I was having problems getting SDM to override the focus).

I've got a process where I can scan multiple books at a time and then use BSW to split the images into a set of books, using barcodes (actually QR-codes).

On the cradle of my scanner, I've got barcodes on both sides that indicate "end of book".

That way when I'm scanning along and I come to the end of the book, I take a picture of the barcode as the last picture of the book. It then knows that it should start a new book.
Or if I know ahead of time the titles of the books I'm going to scan, I can create barcodes that encode the name of the book, which I scan before the start of the book.

I also have barcodes to do the following:

* Redo the page set: This indicates that the previous 2 pages will be replaced with
the set that follows.
* Adjust perspective: This will rotate and fix keystone distortion, using the technique Rob
pioneered with his qrpc utility. However, since it requires carefully lining the barcode up
straight with the book, I usually skip this step and do the correction interactively in the
program.
* Change to black and white or grayscale. If I want to set the page to black and white or
grayscale I can use these cards. While these can also be entered interactively, sometimes it is
easier to handle this while the book is open.
* Flag page: I use this to just put a note in the configuration file to indicate I should go back
and check that scan. I might use it if I think I might have missed or duplicated a page.
* Gray card: The next page that follows is a gray card, which I can use to adjust the exposure.
(Normally I don't bother with that, unless I'm trying to scan an art book or something like
that).
* Skip this page: I can put this on a blank page or other page I don't want in my finished output.

After scanning I copy the images to two directories, say c:\source\l for the left side of the pages, and c:\source\r for the right side. Next I run these two bsw commands:

Code: Select all
bsw -mergelr c:\source c:\merged
bsw -split -scale .25 c:\merged c:\sorted

That will merge the left and right files together and save them to the c:\merged directory. Next it will scan the files in c:\merged for barcodes, and move them to the c:\sorted directory, split into books. If I use the "Title" barcodes they are stored with the book name. If the break was just indicated by the "End book" barcode, it gets a generic name based on the date and time of the scan. At this stage the barcode information is written to barcodes.csv, which is a standard comma separated file that contains all of the found barcodes.
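As a rough illustration of what merging a left-camera and a right-camera image set involves, the sketch below interleaves two file lists into one page sequence. It is not BSW's implementation: the function name, the page_NNNN.jpg naming scheme, and the assumption that sorted filenames reflect shooting order are all hypothetical.

```python
def interleave_pages(left_files, right_files):
    """Return (source, new_name) pairs alternating left and right pages.

    Assumes both cameras shot the same number of frames and that sorting
    the filenames recovers the shooting order.
    """
    if len(left_files) != len(right_files):
        raise ValueError("left and right cameras produced different page counts")
    plan = []
    pairs = zip(sorted(left_files), sorted(right_files))
    for index, (left, right) in enumerate(pairs):
        plan.append((f"l/{left}", f"page_{2 * index + 1:04d}.jpg"))
        plan.append((f"r/{right}", f"page_{2 * index + 2:04d}.jpg"))
    return plan


# Hypothetical camera filenames; only a rename plan is built, no files move.
merge_plan = interleave_pages(
    ["IMG_0002.JPG", "IMG_0001.JPG"],
    ["IMG_0101.JPG", "IMG_0102.JPG"],
)
for source, target in merge_plan:
    print(source, "->", target)
```

Raising an error on mismatched counts is a deliberate choice here: a skipped frame on one camera would otherwise silently shift every following page.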

After this step I'm ready for the interactive portion of the processing. You can start by using the wizard within BSW to create a startup configuration. However, once you get used to it you'll find it is easier to come up with your own template and use that as a starting point instead.

So I copy my template in the sorted book directory. Here's the template I use:

Code: Select all
# BSW Configuration Template
#
# The source directory
LoadImages = .

# The Destination directory
SetDestination = c:\abbyy\input

# Sets the final DPI and compression
SetTiffOptions = 300 NONE

# Estimate the source DPI from the focal length setting
EstimateDPI = 5.8,135, 23.2,482

# Configure the left pages
Pages = left
Rotate = -90

# Configure the right pages
Pages = right
Rotate = 90

Pages = all
Barcodes =

########################################################################
### Insert commands to fix keystone, color, etc.
########################################################################

Pages=left

Pages = right

########################################################################

Pages=all
# This improves the contrast a bit more than the above levels command.
Levels = 0, 90

# Rescale the image to match the final DPI

ScaleToDPI=
PostCommand = c:\java\BookScanWizard\util\bsw.ahk
PostCommand = move c:\abbyy\output\output.pdf "%parentAsName%.pdf"
PostCommand = del /q c:\abbyy\input

Note: If you right click the "Barcodes =" configuration line, it will replace the "Barcodes =" line with the commands defined via barcodes.
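The EstimateDPI line in the template takes four numbers. One plausible reading, and it is only an assumption since the source does not spell it out, is that they are two calibration pairs of (focal length in mm, measured DPI), with the DPI for any other zoom setting interpolated linearly between them. A minimal sketch of that idea:

```python
def estimate_dpi(focal_length_mm, calibration):
    """Linearly interpolate DPI from two (focal length, DPI) calibration pairs.

    Assumption: DPI varies linearly with focal length between the two points.
    This illustrates the idea only and is not taken from BSW's source.
    """
    (f1, dpi1), (f2, dpi2) = calibration
    slope = (dpi2 - dpi1) / (f2 - f1)
    return dpi1 + slope * (focal_length_mm - f1)


# The four values from the template, read as two calibration pairs.
CALIBRATION = ((5.8, 135.0), (23.2, 482.0))

for focal in (5.8, 14.5, 23.2):
    print(f"{focal:5.1f} mm -> {estimate_dpi(focal, CALIBRATION):6.1f} DPI")
```

If that reading is right, the camera's recorded focal length lets the DPI be set per image without measuring each scan.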

Next I select a left page selection in the template, then choose a left page in the middle of the book and fix the perspective. For this step you just need four corners that will be straightened into a rectangle. I save that then define the crop by clicking two points to indicate the crop, then save that as a crop operation.
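Straightening four clicked corners into a rectangle is a perspective (projective) transform. The sketch below solves for the eight coefficients of such a transform in plain Python. It is a generic textbook formulation, not BSW's code, and the corner coordinates and function names are made up for illustration.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_coefficients(src, dst):
    """Coefficients a..h mapping each src corner (x, y) to its dst corner (u, v)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -x * u, -y * u])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -x * v, -y * v])
        rhs.append(v)
    return solve_linear(rows, rhs)


def warp_point(coeffs, x, y):
    """Apply the perspective transform to a single point."""
    a, b, c, d, e, f, g, h = coeffs
    w = g * x + h * y + 1.0
    return ((a * x + b * y + c) / w, (d * x + e * y + f) / w)


# Hypothetical clicked corners of a keystoned page (top-left, top-right,
# bottom-right, bottom-left) and the rectangle they should become.
CLICKED = [(120, 80), (980, 130), (1010, 1420), (90, 1380)]
RECTANGLE = [(0, 0), (900, 0), (900, 1300), (0, 1300)]
COEFFS = perspective_coefficients(CLICKED, RECTANGLE)

for corner in CLICKED:
    print(corner, "->", warp_point(COEFFS, *corner))
```

A real dekeystoning step would apply the inverse mapping to every output pixel and resample the image; only the coordinate math is shown here.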

I do the same thing for the right side, but when I get to the crop, I cut and paste the configuration from the left page, then hold down the shift key and move it to the same spot on the right side. That way the crop sizes are the same for the right and left pages.

I then make sure that the checkboxes are turned off, and examine a few other pages to ensure that the perspective and cropping works for all the pages. I'll either adjust the crops, or maybe create a crop for a subset of the pages, if necessary.

Next I use the right click option in the viewer, "autolevels", which will figure out a level command to improve the contrast of the image. Usually I use a separate correction for the left & right images because in my scanner at least, my lights don't illuminate both pages quite the same.

Once everything is looking right, I'll click at the end of the script and examine a few pages. Once I'm satisfied everything is looking good I'll save the configuration file. If I just have one book to do I'll press the "submit" button to run the batch. Otherwise I'll continue on with the following:

Repeat the above steps for the other books I've scanned.

I then use a script that runs this command for each book:

bsw -batch

That will run the command and save the TIFF files. I've set it up so the script will also call an AutoHotkey (http://www.autohotkey.com) script to create a PDF file using ABBYY FineReader Professional.

Here is a copy of that script:

Code: Select all
SetTitleMatchMode 2
run C:\Program Files\ABBYY FineReader 10\FineReader
WinWaitActive ABBYY FineReader
Send ^t
Send !n
WinWaitActive, A_Convert, Close
Send {Enter}
WinWaitNotActive, A_Convert, Close
Send !fx
WinWaitActive,, Do you Want to save
Send n
return
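The wrapper that runs bsw -batch once per book is not shown in the post. A minimal sketch of such a loop might look like the following. It assumes that bsw is on the PATH and that it should be started from inside each book directory (the template uses "LoadImages = ."); the function name and the injectable runner parameter are hypothetical and exist only to make the loop easy to test.

```python
import subprocess


def run_batches(book_dirs, runner=subprocess.run):
    """Run `bsw -batch` in each book directory; return the ones that failed.

    Assumption: bsw picks up the saved configuration from the directory it
    is started in. On Windows, a .bat wrapper may also need shell=True.
    """
    failed = []
    for book_dir in book_dirs:
        result = runner(["bsw", "-batch"], cwd=book_dir)
        if result.returncode != 0:
            failed.append(book_dir)
    return failed


# Example call (hypothetical directory names):
# failed = run_batches([r"c:\sorted\book_one", r"c:\sorted\book_two"])
```

Collecting the failures instead of stopping at the first one lets an unattended run finish the remaining books.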

It all may seem a bit complicated at first, but actually, for less than 5 minutes of my time per book I can go from raw images to dekeystoned, cropped, color adjusted, and OCR'd pdf file.


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License