OmniPage Capture CSDK Asian OCR Support
The OmniPage Capture SDK supports Simplified and Traditional Chinese, Japanese (Hiragana, Katakana, and Kanji), and Korean (Hangul and Hanja).
Asian OCR is a separate kit from the Professional OCR and Professional Recognition kits. It can be installed and run alongside them on the same developer machine, and the same end-user application can use both Western and Asian OCR. The Asian OCR API is the same as that of the Capture SDK 16 Western language kits.
Asian OCR Specifications
- OCR Engine System Software and Data
The SDK includes all OCR engine system software and data that will be required to use the OCR engine. This includes, but is not limited to, dictionaries and shape recognition tools and data.
- Supported Languages
The OCR engine supports the following languages and character sets:
- Japanese (Shift-JIS)
- Simplified Chinese (GB-2312 character set)
- Traditional Chinese (BIG5 character set)
- Korean (KSC)
- Image Modes
Black and white, grayscale, and color
- Image Input
Scanner, image file, and memory, in strips at a time for both grayscale and color
- Output File Formats
Single-page and multi-page text, XML, RTF, Excel, HTML, Office 2007 (DOCX, XLSX, PPTX), and an optional module for PDF and XPS.
- Font Information
- Simplified Chinese: Hei, Song, Kai, SimSun, SimHei
- Traditional Chinese: MingLiu, Gothic
- Korean: Batang, Myeongjo, Gothic
- Japanese: Mincho, Gothic
- Text Detection
- Horizontal and vertical text layout
- Full and half width spacing
- Japanese Ruby (Hiragana/Katakana (8pt), Kanji (9pt), Latin (7pt))
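Downstream code that consumes recognized text often needs to tell full-width from half-width characters. The sketch below is one way to do that with Python's standard `unicodedata` module; it is not part of the SDK, and the helper names `width_class` and `normalize_width` are hypothetical.

```python
import unicodedata


def width_class(ch: str) -> str:
    """Classify a single character as 'full', 'half', or 'other' width."""
    code = unicodedata.east_asian_width(ch)
    if code in ("F", "W"):    # Fullwidth or Wide (e.g. Kanji, full-width Latin)
        return "full"
    if code in ("H", "Na"):   # Halfwidth or Narrow (e.g. ASCII, half-width Katakana)
        return "half"
    return "other"            # Neutral or Ambiguous


def normalize_width(text: str) -> str:
    """Fold width variants with NFKC: full-width Latin and digits become ASCII,
    half-width Katakana becomes standard Katakana."""
    return unicodedata.normalize("NFKC", text)


for ch in "AＡカｶ漢":
    print(repr(ch), width_class(ch))
print(normalize_width("ＯＣＲ１２３"), normalize_width("ｶﾀｶﾅ"))
```

NFKC also folds other compatibility forms, so it suits search and comparison better than faithful reproduction of the page.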
Character and Document Structure
Each language also supports characters in the ASCII character set. The default output representation is Unicode, with conversion functions for SJIS, GB, Big5, and KSC.
The OCR engine automatically identifies the following components of a document without human intervention, and it can output this information for use by other software products.
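To illustrate what converting Unicode output to these legacy encodings involves, here is a minimal sketch that uses Python's built-in codecs. It does not call the SDK's own conversion functions, whose names are not given here; the `to_legacy` helper is hypothetical, and mapping KSC to the `euc_kr` codec is an assumption.

```python
# Illustration only: Python's standard codecs stand in for the SDK's
# conversion functions, which this document does not name.
LEGACY_CODECS = {
    "Japanese": "shift_jis",
    "Simplified Chinese": "gb2312",
    "Traditional Chinese": "big5",
    "Korean": "euc_kr",  # assumed encoding for the KSC character set
}


def to_legacy(text: str, language: str) -> bytes:
    """Encode Unicode text into the legacy encoding listed for a language."""
    return text.encode(LEGACY_CODECS[language])


samples = {
    "Japanese": "日本語 OCR",
    "Simplified Chinese": "简体中文 OCR",
    "Traditional Chinese": "繁體中文 OCR",
    "Korean": "한국어 OCR",
}

for language, text in samples.items():
    encoded = to_legacy(text, language)
    # Round-trip back to Unicode to confirm nothing was lost.
    assert encoded.decode(LEGACY_CODECS[language]) == text
    print(language, len(text), "chars ->", len(encoded), "bytes")
```

Characters outside the target character set raise `UnicodeEncodeError` in this sketch, so real code would need a policy for unmappable characters.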
Character Identities (including punctuation and special characters)
- Numbers
- Spaces
- Character Bounding Boxes
- Confidence Vectors
- Position Coordinates
- Font size for characters
Word Information (applies to English and Korean output only)
- Word bounding boxes
- Position Coordinates
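The character and word items listed above can be pictured as a small data model. The sketch below is a hypothetical representation, not the SDK's actual structures: the `Box` and `Char` types, the single 0.0 to 1.0 confidence value, and the `words_from_chars` helper are all assumptions made for illustration. It derives word bounding boxes as the union of their character boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class Char:
    text: str
    box: Box
    confidence: float  # hypothetical 0.0-1.0 score, not the SDK's format


def union(boxes):
    """Smallest box that encloses every box in the iterable."""
    boxes = list(boxes)
    return Box(
        min(b.left for b in boxes),
        min(b.top for b in boxes),
        max(b.right for b in boxes),
        max(b.bottom for b in boxes),
    )


def _make_word(chars):
    return "".join(c.text for c in chars), union(c.box for c in chars)


def words_from_chars(chars):
    """Group consecutive non-space characters into (text, Box) pairs."""
    words, current = [], []
    for ch in chars:
        if ch.text.isspace():
            if current:
                words.append(_make_word(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(_make_word(current))
    return words


chars = [
    Char("한", Box(10, 10, 30, 30), 0.98),
    Char("글", Box(32, 10, 52, 31), 0.95),
    Char(" ", Box(52, 10, 60, 30), 1.0),
    Char("O", Box(60, 12, 72, 30), 0.99),
    Char("K", Box(74, 12, 86, 30), 0.97),
]
print(words_from_chars(chars))
```

Splitting on spaces only makes sense for space-delimited output, which is consistent with word information applying to English and Korean only.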
Document Structure Information
- Position coordinates for the document structural elements
- Paragraph Boundaries
- Text Region Boundaries
- Vertical or Horizontal Text Identification
- Text Columns
- Headers and Footers
- Tables
- Read Order
- Automatic Page Segmentation
- Picture/Image Bounding Boxes
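As a rough picture of how read order relates to horizontal and vertical layout, the following sketch sorts region bounding boxes with a deliberately naive rule. It is not the engine's segmentation algorithm; the `Region` type and `read_order` function are hypothetical, and the rule assumes a single-column horizontal page or right-to-left vertical columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    left: int
    top: int
    right: int
    bottom: int


def read_order(regions, vertical=False):
    """Naive read order for text regions.

    Horizontal text: top to bottom, then left to right.
    Vertical text: columns from right to left, each read top to bottom.
    """
    if vertical:
        return sorted(regions, key=lambda r: (-r.right, r.top))
    return sorted(regions, key=lambda r: (r.top, r.left))


horizontal_page = [
    Region("body", 50, 200, 550, 700),
    Region("header", 50, 20, 550, 60),
    Region("footer", 50, 740, 550, 780),
]
vertical_page = [
    Region("column-2", 300, 40, 380, 760),
    Region("column-1", 420, 40, 500, 760),
]

print([r.name for r in read_order(horizontal_page)])
print([r.name for r in read_order(vertical_page, vertical=True)])
```

Multi-column horizontal pages and mixed layouts need real page segmentation; this rule only shows why the text direction has to be known before regions can be ordered.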