OCR with Microsoft® Office

By Martin Welker. Source: http://www.codeproject.com/csharp/modi.asp

Coming with MS Office 2003, the MODI library offers you an easy but effective way to integrate Optical Character Recognition (OCR) functionality into your own applications.

C#, Windows, .NET 1.1, Win32, VS.NET 2003. Posted: 15 Apr 2005. Updated: 5 Apr 2007. Views: 100,858.
Introduction

Optical Character Recognition (OCR) extracts text and layout information from document images. With the help of the Microsoft Office Document Imaging Library (MODI), which is contained in the Office 2003 package, you can easily integrate OCR functionality into your own applications. In combination with the MODI Document Viewer control, you will have complete OCR support with only a few lines of code.

Important note: MS Office XP does not contain MODI; MS Office 2003 is required!

Getting started

Adding the library

First of all, you need to add the library's reference to your project: Microsoft Office Document Imaging 11.0 Type Library (located in MDIVWCTL.DLL).

Create a document instance and assign an image file

Supported image formats are TIFF, multi-page TIFF, and BMP.

    _MODIDocument = new MODI.Document();
    _MODIDocument.Create(filename);

Call the OCR method

The OCR process is started by the OCR method:

    // The MODI call for OCR
    _MODIDocument.OCR(_MODIParameters.Language,
                      _MODIParameters.WithAutoRotation,
                      _MODIParameters.WithStraightenImage);

With the three arguments, you choose the recognition language and switch automatic rotation and image straightening on or off.
The use of these parameters depends on your specific imaging scenario.

Tracking the OCR progress

Since the whole recognition process can take a few seconds, you may want to keep an eye on the progress. Therefore, the OnOCRProgress event can be used:

    // add event handler for progress visualisation
    _MODIDocument.OnOCRProgress +=
        new MODI._IDocumentEvents_OnOCRProgressEventHandler(this.ShowProgress);

    public void ShowProgress(int progress, ref bool cancel)
    {
        statusBar1.Text = progress.ToString() + "% processed.";
    }

The document viewer

Together with the MODI document model comes the MODI viewer component. To display a document, assign it to the viewer:

    axMiDocView1.Document = _MODIDocument;

To make the component available in Visual Studio, go to the Toolbox Explorer, open the context menu, select Add/Delete Elements..., and choose the COM Controls tab. Then, search for Microsoft Office Document Imaging Viewer 11.0, and enable it.

Processing the recognition result

Working on the result structure is pretty straightforward. If you just want to use the full text, you simply need the image's Layout.Text property. The following example iterates through the document's structure (pages, words, character rectangles) and computes the average character height per page:

    private void Statistic()
    {
        // iterating through the document's structure doing some statistics.
        string statistic = "";
        for (int i = 0; i < _MODIDocument.Images.Count; i++)
        {
            int numOfCharacters = 0;
            int charactersHeights = 0;
            MODI.Image image = (MODI.Image)_MODIDocument.Images[i];
            MODI.Layout layout = image.Layout;

            // getting the page's words
            for (int j = 0; j < layout.Words.Count; j++)
            {
                MODI.Word word = (MODI.Word)layout.Words[j];

                // getting the word's characters
                for (int k = 0; k < word.Rects.Count; k++)
                {
                    MODI.MiRect rect = (MODI.MiRect)word.Rects[k];
                    charactersHeights += rect.Bottom - rect.Top;
                    numOfCharacters++;
                }
            }

            // float division: a page without characters yields NaN, not an exception
            float avHeight = (float)charactersHeights / numOfCharacters;
            statistic += "Page " + i + ": Average character height is: " +
                         avHeight.ToString("0.00") + " pixels!" + "\r\n";
        }
        MessageBox.Show("Document Statistic:\r\n" + statistic);
    }

Searching

MODI also offers a full-featured built-in search.
Since a document may contain several pages, you can use the search method to browse through the pages. MODI offers several arguments to customize your search.

    // convert our search dialog properties to corresponding MODI arguments
    object PageNum = _DialogSearch.Properties.PageNum;
    object WordIndex = _DialogSearch.Properties.WordIndex;
    object StartAfterIndex = _DialogSearch.Properties.StartAfterIndex;
    object Backward = _DialogSearch.Properties.Backward;
    bool MatchMinus = _DialogSearch.Properties.MatchMinus;
    bool MatchFullHalfWidthForm = _DialogSearch.Properties.MatchFullHalfWidthForm;
    bool MatchHiraganaKatakana = _DialogSearch.Properties.MatchHiraganaKatakana;
    bool IgnoreSpace = _DialogSearch.Properties.IgnoreSpace;

To use the search function, you need to create an instance of the type MiDocSearchClass and initialize it:

    // initialize MODI search
    MODI.MiDocSearchClass search = new MODI.MiDocSearchClass();
    search.Initialize(
        _MODIDocument,
        _DialogSearch.Properties.Pattern,
        ref PageNum,
        ref WordIndex,
        ref StartAfterIndex,
        ref Backward,
        MatchMinus,
        MatchFullHalfWidthForm,
        MatchHiraganaKatakana,
        IgnoreSpace);

After the initialization call of the search instance, the search call itself is simple.

    MODI.IMiSelectableItem SelectableItem = null;
    // the one and only search call
    search.Search(null, ref SelectableItem);

You will find the search results in the referenced SelectableItem.

MODI, Office 2007 and Vista

Good news: Office 2007 and Vista both support MODI! It is not installed by default, but you can easily add the package via the installation options of your Office 2007. You just need to rerun the setup.exe of your Office installation and select the MODI package.

About document processing

OCR is only one step in document processing. To gain better access to the information in your paper-based documents, usually several steps and techniques are required.

Before documents are available as images, they have to be digitized. This process is called 'scanning'.
There are two important standards used for interacting with the scanning hardware: TWAIN and WIA. There are (at least) two good articles in CodeProject on how to use these APIs.

Although scanning devices are getting better, a couple of methods can be used to increase the image quality. These pre-processing functions include, for instance, noise reduction and angle correction.

As a next step, OCR itself interprets pixel-based images as layout and text elements. OCR can be called the 'highest' bottom-up technology, where the system has little or no knowledge about the business context.

In most cases, you have certain target structures you want to fill with the document information. This is called 'Document Classification and Detail Extraction'. For instance, you might want to process invoices, or you have certain table structures to fill. In Document Processing Part II, you can see how this kind of content knowledge can be used.

After that, you might have an address database you want to match the document addresses with. Due to 'noisy' environments or disordered information, you need more sophisticated techniques than simple SQL.

In the last step, the extracted information is passed to the client application, where customized workflow activities are triggered.
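The address-matching step is one place where a small example may help. The article does not say which technique is meant by "more sophisticated than simple SQL"; the sketch below assumes Levenshtein (edit) distance, one common way to tolerate OCR misreads. The class AddressMatcher, its methods, and the maxDistance threshold are hypothetical names chosen for illustration and are not part of MODI.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of MODI): picks the known address that is
// closest to a noisy OCR result, measured by Levenshtein (edit) distance.
public static class AddressMatcher
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b. Uses two rolling rows instead of a full matrix.
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // the row just computed becomes the previous row of the next pass
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Returns the candidate with the smallest distance to the OCR text,
    // or null if no candidate is within maxDistance edits.
    // Comparison ignores case and surrounding whitespace.
    public static string BestMatch(string ocrText, string[] candidates, int maxDistance)
    {
        string best = null;
        int bestDistance = maxDistance + 1;
        string needle = ocrText.Trim().ToUpper(CultureInfo.InvariantCulture);

        foreach (string candidate in candidates)
        {
            string normalized = candidate.Trim().ToUpper(CultureInfo.InvariantCulture);
            int distance = Levenshtein(needle, normalized);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

For example, AddressMatcher.BestMatch("12 Ma1n Street", new string[] { "12 Main Street", "21 Mill Street" }, 3) selects "12 Main Street", because the OCR result differs from it by a single substituted character. In a real system the candidates would come from the address database, and the threshold would need tuning against actual scan quality.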