3-Heights™ PDF Extract API, User Manual
Contents
1. [Screenshot: Visual Basic "References" dialog listing registered 3-Heights™ components (PDF Optimizer API 1.60, PDF Printer API 1.60, PDF Renderer API 1.60, PDF Export API 1.60); the Location field of the selected entry points to bin\PDFParser.dll.]

ASP Script

The PDF Extract component can be accessed in an ASP script by calling Server.CreateObject with a class name as parameter. For example, to create a PDF Extract Document object, use a command like this:

    Set pdfDoc = Server.CreateObject("PDFParser.Document")

Here is a small ASP sample that creates a Document object and then retrieves the total number of pages in a PDF file. The path to the PDF "myfile.pdf" needs to be modified.

    <%@ Language=VBScript %>
    <%
    Option Explicit
    Dim pdfDoc
    Set pdfDoc = Server.CreateObject("PDFParser.Document")
    If Not pdfDoc.Open(Server.MapPath("myfile.pdf")) Then
        ' Report the failure instead of printing a page count of zero
        Response.Write "<p>"
        Response.Write "Could not open file" & "<br>"
    Else
        Response.Write "<p>"
        Response.Write "Number of pages: " & pdfDoc.PageCount & "<br>"
    End If
    Response.Write "<p>"
    %>

PDF Tools AG, Premium PDF Technology. 3-Heights™ PDF Extract API, Version 4.5, Page 23 of 80, July 9, 2015

3.3 .NET

There should be at least one .NET sample for MS Visual Studio 2005 available in the ZIP archive of the Windows version of the 3-Heights™ PDF
2. Table: Interfaces

    Interface   Programming Languages
    .NET        The MS software platform .NET can be used with any .NET-capable programming language such as C#, VB .NET, J#, and others.
    JNI         The Java Native Interface (JNI) is for use with Java.
    COM         The Component Object Model (COM) interface can be used with any COM-capable programming language, such as MS Visual Basic, MS Office products such as Access or Excel (VBA), C++, VBScript, and others.
    C           The native C interface is for use with C and C++.

Distributed Files

The software developer kit (SDK) contains all files that are used for developing the software. The role of each file with respect to the four different interfaces is shown in Table: Files for Development. The files are split into four categories:

    Req   This file is required for this interface.
    Opt   This file is optional (e.g. pdcjk.dll is used to support Asian languages; it is not used for other languages). See also Table: File Description to identify which files are required for your application.
    Doc   This file is for documentation only.

An empty field indicates that the file is not used at all for this particular interface.

Table: Files for Development

    Name                .NET   JNI   COM   C
    bin\PDFParser.dll   Req    Req   Req   Req
    bin\pdcjk.dll       Opt    Opt   Opt   Opt
    bin\*NET.dll        Req
    bin\*NET.xml        Doc
    b
3. 1.8 Uninstall, Install a new version
   1.9 Unix
       Installation on Unix Systems
       Installation on Mac OS X
   2 License Management
   2.1 Graphical License Manager Tool
       List all installed license keys
       Add and delete license keys
       Display the properties of a license
       Select between different license keys for a single product
   2.2 Command Line License Manager Tool
       List all installed license keys
       Add and delete license keys
       Select between different license keys for a single product
   2.3 License Key Storage
       Windows
       Mac OS X
4.  BeginOCM (OCM 4)    note that OCM blocks can be nested (typically used for hierarchical OCGs)
        Path
        Path            gray 64 square
    EndOCM
    BeginOCM (OCM 5)    OCG 5 is "Gray 128 swatch"
        Path            gray 128 square
    EndOCM
    EndOCM
5. PDF-TOOLS.COM, Premium PDF Technology

3-Heights™ PDF Extract API, Version 4.5
User Manual

Contact: pdfsupport@pdf-tools.com
Owner: PDF Tools AG, Kasernenstrasse 1, 8184 Bachenbülach, Switzerland
http://www.pdf-tools.com
Copyright © 2003-2015

Table of Contents

   1 Introduction
   1.1 Description
   1.2 Functions
       Formats
   1.3 [illegible]
   1.4 Operating Systems
   1.5 Installation (Software Developer Kit)
       Interfaces
       Distributed Files
       Color Profiles
   1.6 Deployment (Runtime Kit)
       Distributed Files
       Deploying the Application
       Example
   1.7 Interface-specific Installation Steps
       COM Interface
       Java Interface
       .NET Interface
       Native C Interface
6. and World and otherwise as Hello World

eTECPosMergeSingleSpace
Merge text tokens that are a single space width apart (displacement: insert a space). Do not set this option if you need the RawString property.
Example: If set, the text objects "Hello" and "World" are extracted as "Hello World" if they are approximately one space width apart.

eTECPosMergeMultiSpace
Merge text tokens that are one or more space widths apart (displacement: insert multiple spaces). Do not set this option if you need the RawString property.
Example: If set, the text objects "Hello" and "World" are extracted as "Hello   World", where spaces are inserted to represent the distance between the objects.

5 Interface Changes

5.1 Changes from 1.4 to 1.4.1

This is a list of interface changes from version 1.4 (1.4.0.21) to version 1.4.1 (1.4.1.24).

    Annotation Interface    New:      Property TextLabel
    ColorSpace Interface    New:      Property Colorant
    Content Interface       New:      Property Flags
    Destination Interface   New:      Property Zoom
    Font Interface          Removed:  Property FirstChar, Property LastChar
    Image Interface         New:      Method StoreInMemory, Method GetImage
    Page Interface          New:      Property BleedBox, Property TrimBox, Property ArtBox, Property Device
7. Subject
Property String Subject
Accessors: Get
Return the subject from the document's info object.

Title
Property String Title
Accessors: Get
Return the title from the document's info object.

Page Interface

ArtBox
Property Variant ArtBox
Accessors: Get
This property returns the art box rectangle given by the coordinates left, bottom, right, top. The values are returned as an array of four single-precision real numbers. The art box is optional; it defines the region that contains the meaningful content intended by the creator. If no art box is set, the crop box is returned.

BleedBox
Property Variant BleedBox
Accessors: Get
Return the bleed box rectangle given by the coordinates left, bottom, right, top. The values are returned as an array of four single-precision real numbers. The bleed box is optional; it defines the region to which the contents of the page should be clipped when output in a production environment. If no bleed box is set, the crop box is returned.

Content
Property IPDFContent Content
Accessors: Get
Return an interface to the content stream of the page (see Content Interface).

CropBox
Property Variant CropBox
Accessors: Get
Return the crop box rectangle given by the coordinates left, bottom, right, top. The values are returned as an array of four single pr
8. [Screenshot: Visual Studio project properties, "References" page. The references libpdfNET (C:\Program Files\pdf-tools\bin\libpdfNET.dll) and PdfExtractNET (C:\Program Files\pdf-tools\bin\PdfExtractNET.dll), both of type .NET, version 1.7.0.13, Copy Local = True, are listed next to the standard .NET Framework 2.0 references. The imported namespaces include Pdftools and Pdftools.PdfExtract.]

The .NET interface can now be used as shown
9. Property Integer ComponentsPerPixel
Accessors: Get
Return the number of components per pixel.

HighIndex
Property Integer HighIndex
Accessors: Get
Return the highest value of the indexed colors. It is 0 when no indexed color space is used.

IsColor
Property Boolean IsColor
Accessors: Get
Return true when the color space is color.

IsIndexed
Property Boolean IsIndexed
Accessors: Get
Return true when the image uses indexed colors.

IsMonochrome
Property Boolean IsMonochrome
Accessors: Get
Return true when the color space is monochrome.

Lookup
Property Variant Lookup
Accessors: Get
Return the lookup table.

Name
Property String Name
Accessors: Get
Return the name of the color space as a string, for example "DeviceGray", "DeviceRGB" or "Indexed".

4.9 TransformMatrix Interface

a, b, c, d, e, f
Property Single a
Property Single b
Property Single c
Property Single d
Property Single e
Property Single f
Accessors: Get
The transformation matrix in PDF is specified by six numbers. All information about orientation, rotation, scaling, skewing and translation can be calculated from these six numbers. However, PDF Extract also provides properties which compute these values.
The values e and f represent the translation. In a matrix [1 0 0 1 e f], e is the distance on the x axis fr
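To illustrate how the six numbers act on a coordinate, here is a minimal Python sketch. It assumes the usual PDF convention x' = a·x + c·y + e, y' = b·x + d·y + f; the function name apply_matrix is hypothetical and not part of the PDF Extract API.

```python
def apply_matrix(matrix, x, y):
    """Map the point (x, y) through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

# A pure translation: a = d = 1, b = c = 0, so only e and f move the point.
translation = (1, 0, 0, 1, 50, 20)
print(apply_matrix(translation, 0, 0))  # (50, 20)
```

With a = d = 1 and b = c = 0 the origin lands at (e, f), which matches the description of e and f as the translation.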
10. (name cut off)   Start of a sequence of objects whose visibility is defined by an optional content membership string
    (name cut off)   End of OCM sequence
    (name cut off)   No content object
    eText            Text object
    eImage           Image object
    ePath            Path object
    eSave            Save the current graphics state
    eRestore         Restore the current graphics state

TPDFErrorCode

All TPDFErrorCode enumerations start with PDF_, followed by a single letter which is one of S, E, W or I, an underscore and a descriptive text. The single letter gives an indication of the type of error. These are: Success, Error, Warning, Information.

With respect to corrupt PDF files: An error indicates a corruption in the PDF; the file may or may not be readable. A warning indicates that the file is readable but not valid.

A full list of all PDF Tools error codes is available in the header file pdferror.h. The error codes that are related to file access are listed here:

    PDF_S_SUCCESS      The operation was completed successfully.
    PDF_E_EVAL         This software is an evaluation version. Please contact www.pdf-tools.com.
    PDF_E_FILEOPEN     The
    PDF_E_FILECREATE
    PDF_E_PASSWORD

TPDFOrientation

    eOrientationUndef
    eOrientationTopLeft
    eOrientationTopRight
    eOrientationBottomRight
    eOrientationBottomLeft
    eOrientationLeftTop
    eOrientationRightTop
    eOrientationRightBottom
    eOrientationLeftBottom
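The naming scheme for TPDFErrorCode can be checked mechanically. The following Python sketch classifies an error code name by its single-letter category; the helper name severity is hypothetical, and the sketch works on the enumeration names as strings, not on the numeric values from pdferror.h.

```python
SEVERITY = {"S": "Success", "E": "Error", "W": "Warning", "I": "Information"}

def severity(code_name):
    """Return the category of a name such as 'PDF_E_FILEOPEN'."""
    parts = code_name.split("_", 2)
    if len(parts) != 3 or parts[0] != "PDF" or parts[1] not in SEVERITY:
        raise ValueError("not a TPDFErrorCode name: " + code_name)
    return SEVERITY[parts[1]]

print(severity("PDF_S_SUCCESS"))   # Success
print(severity("PDF_E_FILEOPEN"))  # Error
```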
11. 11i and later, ia64 (Itanium), 64-bit
• IBM AIX 5.1 and later (64-bit)
• Linux (32- and 64-bit)
• Mac OS X 10.4 and later (32- and 64-bit)
• Sun Solaris 2.8 and later, SPARC and Intel
• FreeBSD 4.7 and later (32-bit) or FreeBSD 9.3 and later (64-bit, on request)

1.5 Installation (Software Developer Kit)

The installation of the software requires the following steps:

1. Download the software, which is provided as a ZIP archive, from your download account.
2. Unzip the files, using a tool like WinZip, to a directory on your local hard disk where your program files reside. Check the appropriate option to preserve file paths (folder names). The files of the developer kit (SDK), including sub-directories, are listed in Table: Files for Development.
3. Identify which interface (.NET, JNI, COM, C) you are using and perform the specific installation steps for that interface. These steps are described in the following chapters.

Interfaces

The 3-Heights™ PDF Extract API provides four different interfaces. The installation and deployment of the software depend on the interface you are using. The table below shows the supported interfaces and the programming languages with which they can be used.
12. Accessors: Get
Return the destination of a link annotation. This entry is not permitted if an A (action) entry is present.

Flags
Property Long Flags
Accessors: Get
Return the flags of the annotation as a 32-bit integer.

    Bit   Flag
    1     Invisible
    2     Hidden (PDF 1.2)
    3     Print (PDF 1.2)
    4     NoZoom (PDF 1.3)
    5     NoRotate (PDF 1.3)
    6     NoView (PDF 1.3)
    7     ReadOnly (PDF 1.3)
    8     Locked (PDF 1.4)
    9     ToggleNoView (PDF 1.5)

IsMarkup
Property Boolean IsMarkup
Accessors: Get
Return whether the annotation is a markup annotation. The following annotations are considered markup annotations:
• Free Text annotations
• Annotations that have a pop-up window that may display text
• Sound annotations

Name
Property String Name
Accessors: Get
Return the name of the annotation as a string.

Rect
Property Variant Rect
Accessors: Get
Return the rectangle of the annotation as [x1, y1, x2, y2], where (x1, y1) is the lower left corner of the annotation and (x2, y2) the upper right corner. The coordinates are raw PDF coordinates. In order to calculate where the rectangle is positioned on the page as displayed by a viewer, the rectangle must be cropped using the page's CropBox and rotated using the Rotate attribute.

Subj
Property String Subj
Accessors: Get
Return the text representing a short description of the subject. This property is only a
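The Flags value can be decoded bit by bit. The Python sketch below assumes the bit numbering of the PDF specification, where bit position 1 is the lowest-order bit; the function name decode_flags is hypothetical and not part of the PDF Extract API.

```python
# Flag names in order of bit position 1..9 (bit 1 is the lowest-order bit).
ANNOTATION_FLAGS = [
    "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
    "NoView", "ReadOnly", "Locked", "ToggleNoView",
]

def decode_flags(flags):
    """Return the names of all flags that are set in the integer value."""
    return [name for bit, name in enumerate(ANNOTATION_FLAGS)
            if flags & (1 << bit)]

print(decode_flags(4))    # ['Print']
print(decode_flags(132))  # ['Print', 'Locked']
```

A value of 4 is the common case of an annotation that is printed and has no other flags set.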
13.     XPos, YPos
    4.6 GraphicsState Interface
        AlphaIsShape
        [illegible]
        CharSpacing
        [illegible]
        FillColorCMYK
        FillColorRGB
        FillColorSpace
        FillOverprintFlag
        FlatnessTolerance
        Font
        FontSize
        HorizontalScaling
        [illegible]
        LineJoin
        LineWidth
        [illegible]
        OverprintMode
        RenderingIntent
        SmoothnessTolerance
        SoftMask
        StrokeAdjustment
        SpaceWidth
        StrokeAlphaConstant
        StrokeColorCMYK
        StrokeColorRGB
        StrokeColorSpace
        StrokeOverprintFlag
        TextKnockout
        TextRenderingMode
        TextRise
        WordSpacing
    4.7 [illegible]
        [illegible]
        CapHeight
        [illegible]
        Descent
14. are italic
    17   AllCap      Font has no lowercase letters.
    18   SmallCap    Lowercase letters are small uppercase letters.
    19   ForceBold   If set, bold glyphs are painted bold even at very small text sizes.

FontBBox
Property Variant FontBBox
Accessors: Get
Return the font bounding box. The font bounding box is the rectangle in which all glyphs would fit if they were placed on top of each other with their origins at the same point.

FontFile
Property Variant FontFile
Accessors: Get
Return a stream that contains a Type1 font program.

FontFileType
Property Integer FontFileType
Accessors: Get
Return the type of the font. A value of 1 corresponds to a Type 1 font program. A FontFile2 contains a TrueType font program. In most cases a value of 1, 2 or 3 will be returned.

ItalicAngle
Property Single ItalicAngle
Accessors: Get
Return the counter-clockwise angle of the dominant vertical strokes of the font.

Leading
Property Single Leading
Accessors: Get
Return the desired spacing between baselines of consecutive lines of text.

MaxWidth
Property Single MaxWidth
Accessors: Get
Return the maximum width of the glyphs in the font.

MissingWidth
Property Single MissingWidth
Accessors: Get
Return the value of the width which is used for character codes for which the glyph is missing in the font directory's Width arr
15. are stored in the registry:
• "HKLM\Software\PDF Tools AG" (for all users)
• "HKCU\Software\PDF Tools AG" (for the current user)

Mac OS X
The license keys are stored in the file system:
• "/Library/Application Support/PDF Tools AG" (for all users)
• "~/Library/Application Support/PDF Tools AG" (for the current user)

Unix/Linux
The license keys are stored in the file system:
• "/etc/opt/pdf-tools" (for all users)
• "~/.pdf-tools" (for the current user)

Note: The user, group and permissions of those directories are set explicitly by the license manager tool. It may be necessary to change permissions to make the licenses readable for all users. Example:

    chmod -R go+rx /etc/opt/pdf-tools

Getting started

Visual Basic

In order to use the component in a Visual Basic 6 project, you have to add the component as a project reference as shown below. The version which is registered will show up.

[Screenshot: Visual Basic "References - Project1" dialog with the list of available references, including 3-Heights™ components such as Font To PDF Conversion API 1.60, Image to PDF Converter API 1.60 and PDF Annotation API 1.60.]
16. consists of .NET assemblies, which are added to the project, and a native DLL, which is called by the .NET assemblies. This has to be accounted for when installing and deploying the tool.

The .NET assemblies (*NET.dll) are to be added as references to the project. They are required at compilation time. See also chapter Getting Started.

PDFParser.dll is not a .NET assembly, but a native DLL. It is not to be added as a reference in the project.

The native DLL PDFParser.dll is called by the .NET assembly PdfExtractNET.dll. PDFParser.dll must be found at execution time by the Windows operating system. The common way to do this is to add PDFParser.dll as an existing item to the project and set its property "Copy to output directory" to "Copy if newer". Alternatively, the directory where PDFParser.dll resides can be added to the environment variable "PATH", or the DLL can simply be copied manually to the output directory.

Native C Interface
• The header file expa_c.h needs to be included in the C/C++ program.
• The Object File Library lib\PDFParser.lib needs to be linked to the project.
• PDFParser.dll should be on the environment variable "PATH" or, if using MS Visual Studio, in the directory for executable files.

1.8 Uninstall, Install a new version

In order to uninstall the product, undo all the steps d
17. constant

StrokeColorCMYK
Property Long StrokeColorCMYK
Accessors: Get
Return the CMYK color quad for stroking operations. The color value is obtained by converting the color values of the property StrokeColor by means of the StrokeColorSpace. The CMYK quads are encoded using the following formula:
    Quad = C + 256 * (M + 256 * (Y + 256 * K))

StrokeColorRGB
Property Long StrokeColorRGB
Accessors: Get
Return the RGB color triple for stroking operations. The color value is obtained by converting the color values of the property StrokeColor by means of the StrokeColorSpace. The RGB triples are encoded using the following formula:
    Triple = R + 256 * (G + 256 * B)

StrokeColorSpace
Property PDFColorSpace StrokeColorSpace
Accessors: Get
Return an interface to the current color space that is used for stroking operations (see ColorSpace Interface). The color space is used to interpret color values of the property StrokeColor.

StrokeOverprintFlag
Property Boolean StrokeOverprintFlag
Accessors: Get
This property returns the overprint flag for stroking painting operations.

TextKnockout
Property Boolean TextKnockout
Accessors: Get
Return the text knockout flag. This Boolean flag determines what text elements are considered elementary objects for purposes of color compositing in the transparent imaging mo
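The two encoding formulas can be written as a short Python sketch. It assumes that every component is an integer from 0 to 255 and that the first component occupies the lowest byte (C for the quad, R for the triple), which is how the formulas read once their operators are restored; the function names pack_rgb and pack_cmyk are hypothetical.

```python
def pack_rgb(r, g, b):
    """Triple = R + 256 * (G + 256 * B); R ends up in the lowest byte."""
    return r + 256 * (g + 256 * b)

def pack_cmyk(c, m, y, k):
    """Quad = C + 256 * (M + 256 * (Y + 256 * K)); assumed byte order."""
    return c + 256 * (m + 256 * (y + 256 * k))

print(pack_rgb(128, 0, 128))    # 8388736
print(pack_cmyk(255, 0, 0, 0))  # 255
```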
18. degrees while displaying. A positive number turns the page clockwise. The value must be a multiple of 90, i.e. valid values are -270, -180, -90, 0, 90, 180, 270.

TrimBox
Property Variant TrimBox
Accessors: Get
Return the trim box rectangle given by the coordinates left, bottom, right, top. The values are returned as an array of four single-precision real numbers. The trim box is optional; it defines the intended dimensions of the finished page after trimming. If no trim box is set, the crop box is returned.

Content Interface

BreakWords
Property Boolean BreakWords
Accessors: Get, Set
Default: True
This property is deprecated and superseded by the TextExtConfiguration property. In order to get the same behavior as with BreakWords, use the following options:
• BreakWords = true: Set the eTECBreakSpaceUnicode flag and clear the flags eTECPosMergeSingleSpace and eTECPosMergeMultiSpace.
• BreakWords = false: Clear the eTECBreakSpaceUnicode flag and set the flags eTECPosMergeSingleSpace and eTECPosMergeMultiSpace.

BoundingBox
Property Variant BoundingBox
Accessors: Get, Set
Default: CropBox of the page
The bounding box is a rectangle in user space units (1/72 inch). The rectangle is used when the Reset method is called with AccountForRotate set to TRUE, and it has an effect on the coordinate transform. The bounding box must be set before calling Reset.
19. embed Unicode mapping information for a symbolic font.

Width
Property Single Width
Accessors: Get
Return the width of the string in points.

XPos, YPos
Property Variant XPos
Property Variant YPos
Accessors: Get
Return the X and Y position of the characters. The return value is a one-dimensional array holding the positions of all characters. If the Text contains n characters, XPos(0) represents the first character, XPos(n-1) represents the last character, and XPos(n) is a calculated virtual position of where the next character would start. This position and the actual position of the next character can be compared to decide whether they belong to the same word or not.

4.6 GraphicsState Interface

Entries which have a complex structure, such as a function, are not retrievable with the 3-Heights™ PDF Extract Tool. These are, for example, black generation functions (BG), transfer functions (TR) or under color removal functions (UCR).

The extract tool has the ability to return colors in RGB or CMYK. If the requested color space is different from the actual color space in the PDF, the color conversion is done using color profiles.

AlphaIsShape
Property Boolean AlphaIsShape
Accessors: Get
Return the AlphaIsShape flag. It is true if the soft mask contains shape values; it returns false for opacity.

Blen
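The comparison between the virtual next position and the actual start of the following text can be sketched in Python as follows. The XPos values, the helper name same_word and the tolerance of 0.5 points are illustrative assumptions; a suitable threshold depends on the font size and the document.

```python
def same_word(xpos_current, x_next_start, tolerance=0.5):
    """Decide whether the next text starts where the current one ends.

    xpos_current holds n + 1 values for n characters; the last value is
    the virtual position where the next character would start.
    """
    virtual_next = xpos_current[-1]
    return abs(x_next_start - virtual_next) <= tolerance

first = [10.0, 16.0, 22.0, 28.0]  # three characters plus the virtual position
print(same_word(first, 28.2))     # True
print(same_word(first, 33.0))     # False
```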
20. license is selected in the license list, its properties are displayed in the right pane of the window.

Select between different license keys for a single product
More than one license key can be installed for a specific product. The checkbox on the left side in the license list marks the currently active license key.

2.2 Command Line License Manager Tool

The command line license manager tool licmgr is available in the bin directory for all platforms except Windows. A complete description of all commands and options can be obtained by running the program without parameters:

    licmgr

List all installed license keys

    licmgr list

The currently active license for a specific product is marked with a star on the left side.

Add and delete license keys
Install a new license key:

    licmgr store X-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX

Delete an old license key:

    licmgr delete X-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX

Both commands have the optional argument -s that defines the scope of the action:
• y: For all users
• u: Current user

Select between different license keys for a single product

    licmgr select X-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX-XXXXX

2.3 License Key Storage

Depending on the platform, the license management system uses different stores for the license keys.

Windows
The license keys
21. method opens a PDF memory block, i.e. makes the objects contained in the PDF document accessible. If the document is already open, it is closed first.
• Parameters:
    MemBlock: The memory block containing the PDF file, given as a one-dimensional byte array.
    Password (optional): The user or the owner password of the encrypted PDF document. If this parameter is left out, an empty string is used as a default.
• Return value:
    True: The document was opened successfully from memory.
    False: The document in memory is not readable.

Page
Property PDFPage Page
Accessors: Get
This property retrieves an interface to the currently selected page of a document.

PageCount
Property Long PageCount
Accessors: Get
Return the number of pages of an open document. If the document is closed, zero is returned. For collections (also known as PDF Portfolios) with no cover page, this property returns 0.

PageNo
Property Long PageNo
Accessors: Get, Set
This property sets and gets the currently selected page of an open document by its page number. The numbers are counted from 1 for the first page to the value of the PageCount attribute for the last page. If the document is closed, zero is returned.

Producer
Property String Producer
Accessors: Get
Return the name of the producer from the document's info object.
22. that is provided with the Windows operating system, located in C:\windows\system32. The following screenshot shows the registration of PDFParser.dll.

    C:\Program Files\pdf-tools\bin> regsvr32 pdfparser.dll

If the registration process succeeds, a message box with the following text is displayed: "DllRegisterServer in pdfparser.dll succeeded."

The registration can also be done silently (e.g. for deployment) using the switch /s.

Other Files
The other DLLs do not need to be registered, but for simplicity it is suggested that they are in the same directory as PDFParser.dll.

Java Interface

For compilation and execution: The Java Archive (jar) EXPA.jar needs to be on the class search path. This can be done either by adding it to the environment variable CLASSPATH or by specifying it using the switch -classpath:

    javac -classpath .;C:\pdf-tools\jar\EXPA.jar TextExt.java

For execution: Additionally, the library bin\PDFParser.dll needs to be on the library path. This can be achieved either by adding it to the environment variable PATH or by specifying it using the switch -Djava.library.path:

    java -classpath .;C:\pdf-tools\jar\EXPA.jar -Djava.library.path=C:\pdf-tools\bin TextExt input.pdf

.NET Interface

The 3-Heights™ PDF Extract API does not provide a pure .NET solution. Instead it
23. the current GraphicsState. The image space that is transformed by the CTM is the unit square [0, 0, 1, 1], i.e. the unit square is mapped to the rectangle or parallelogram in which the image is to be painted. For example, the coordinate on the page of the bottom right corner of the untransformed image is the transformation of the coordinate (1, 1).

Image Resolution
Images are resources in a PDF document. Every image can be referenced multiple times in the document. The image itself does not have a resolution; it only has a resolution when referenced on a page. The resolution depends on the ratio of the dimensions of the image and its size on the page, so it can be different every time.

Image Orientation
Images can be stored with an orientation other than TopLeft (the default). In order to display them visually correctly, a transformation matrix is applied to invert the orientation. To ensure that the images are saved with the same orientation as they are displayed in the PDF, use the method ChangeOrientation as shown in the sample.

Optional Content (Layers)
In order to associate content objects with the Optional Content Groups (OCG) that define their visibility, the following steps have to be taken. First, the IgnoreOCM property must be set to true. Second, use the Content interface's GetNextObject method to extract content objects. Whenever a BeginOCM operator is encountered, the OCM property contains the optional content membership string that
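The relation between pixel dimensions, size on the page and resolution can be sketched in Python as follows. The sketch assumes that the matrix (a, b, c, d, e, f) maps the unit square to the image area in default user space units of 1/72 inch; the function name image_resolution is hypothetical and not part of the PDF Extract API.

```python
import math

def image_resolution(width_px, height_px, ctm):
    """Return the (horizontal, vertical) resolution in dpi of a placed image."""
    a, b, c, d, _e, _f = ctm
    width_pt = math.hypot(a, b)   # length of the transformed unit x vector
    height_pt = math.hypot(c, d)  # length of the transformed unit y vector
    return (width_px * 72 / width_pt, height_px * 72 / height_pt)

# A 600 x 300 pixel image painted 144 x 72 points (2 x 1 inch) large.
print(image_resolution(600, 300, (144, 0, 0, 72, 100, 500)))  # (300.0, 300.0)
```

Placing the same image at half the size on another page would double both values, which is why the resolution belongs to the reference and not to the image.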
24.     Text Extraction of Text Marked as Symbolic
        Image Extraction
        Image Resolution
        Image Orientation
   5.14 Optional Content (Layers)

1 Introduction

1.1 Description

The 3-Heights™ PDF Extract Tool is a solution for extracting and querying various attributes and page content from a PDF document. This includes texts, images, graphic objects including paths, metadata and embedded fonts. It is also possible to query the properties of objects. Intelligent mechanisms significantly increase extraction rates, for instance when extracting text.

[Figure: a PDF document is passed, together with parameters, to the PDF Extract Tool, which outputs texts, fonts, pages, contents, document metadata, images (TIFF, JPEG) and outlines.]

1.2 Functions

The PDF Extract Tool is used to extract text, images and graphic objects including paths from PDF documents. Text is extractable as lines and as individual words. It is also possible to query information such as position, color, font and font size. Intelligent functions such as heuristics, word formation support and character set
25. 00 to 0xFF, GG is green, RR is red.

Decimal: To retrieve the values for blue, green and red, apply the following formulas (integer division "\" and bitwise and "And"):

    Triple = PDFPARSERLib.GraphicsState.FillColorRGB
    B = Triple \ 65536
    G = (Triple \ 256) And 255
    R = Triple And 255

Example: Triple = 8388736 (purple)

    B = 8388736 \ 65536 = 128
    G = (8388736 \ 256) And 255 = 0
    R = 8388736 And 255 = 128

There are also other ways to retrieve these values than using the above formulas.

FillColorSpace
Property PDFColorSpace FillColorSpace
Accessors: Get
Return an interface to the current color space that is used for filling operations (see ColorSpace Interface). The color space is used to interpret color values of the property FillColor.

FillOverprintFlag
Property Boolean FillOverprintFlag
Accessors: Get
Return the overprint flag for painting operations other than stroking.

FlatnessTolerance
Property Single FlatnessTolerance
Accessors: Get
Return the flatness tolerance. It must be a positive number. A small number means higher precision.

Font
Property IPDFFont Font
Accessors: Get
Return an interface to the text's font object, which describes the character encoding as well as the shape of the character glyphs.

FontSize
Property Single FontSize
Accessors: Get
Return the current font size for text s
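The same decoding, transcribed to Python as a minimal sketch (the function name unpack_rgb is hypothetical; "//" is integer division and "&" is bitwise and):

```python
def unpack_rgb(triple):
    """Split an RGB triple (blue in the highest byte) into (r, g, b)."""
    b = triple // 65536
    g = (triple // 256) & 255
    r = triple & 255
    return r, g, b

print(unpack_rgb(8388736))  # (128, 0, 128), i.e. purple
```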
26. A one bit signifies a transparent pixel and a zero bit signifies a pixel with the current fill color (see GraphicsState Interface).

SMask
Property: Variant SMask
With this property the soft mask of an image can be extracted.

Store
Method: Boolean Store(String FileName, TPDFCompression Compression)
Store the image as a file.
Parameters:
    FileName      The name of the disk file (include path, drive or Server string according to the operating system's naming rules). The type of the image is defined by its extension (.jpg or .tif).
    Compression   (optional) The compression type for TIFF images. The default value is eComprDefault.
Return values:
    True    The file has successfully been written.
    False   An error has occurred and the disk file is unusable.

StoreInMemory
Method: Boolean StoreInMemory(String Extension, TPDFCompression Compression)
Store the image in memory. The saved image can be retrieved using the method GetImage.
Parameters:
    Extension     The type of the image is defined by its extension (.jpg or .tif).
    Compression   (optional) The compression type for TIFF images. The default value is eComprDefault.
Return values:
    True    The image has successfully been saved.
    False   Otherwise.

Width
Property: Long Width
Accessors: Get
Return the width of the image in pixels (also called samp
27. Dest
Name
Subj
TextLabel
URI
Vertices
4.12 OutlineItem Interface
4.13 Destination Interface
PageNo
Zoom
4.14 Ocg Interface
Level
Name
Example 1
Example 2
4.15 PDFObject Interface
Begin, GetNext, End
Dispose, DestroyObject
GetElement
GetStream
28. Colorant

Text Interface
    New: Property BoundingBox, Property FontSize, Property Length, Property Rotation, Property Width, Property XPos, Property YPos
    Removed: Property RawString, Property TextMatrix, Property NextXPos, Property NextYPos
    The properties TextMatrix, NextXPos and NextYPos are marked as deprecated.

No changes in the following interfaces: AlternateImage, Document, GraphicsState, OutlineItem, TransformationMatrix.

5.2 Changes from 1.4.1 to 1.5

This is a list of interface changes from version 1.4.1 (1.4.1.24) to version 1.5 (1.5.0.40).

Annotation Interface
    New: Property Subj, Property Dest, Property URI
ColorSpace Interface
    New: Property Colorant
Content Interface
    New: Property SpaceFactor
Document Interface
    New: Method GetDestination, Property IsLinearized
Font Interface
    Removed: Property FirstChar, Property LastChar
Text Interface
    Removed: Property TextMatrix, Property NextXPos, Property NextYPos

5.3 Changes from 1.5 to 1.6

This is a list of interface changes from version 1.5 (1.5.0.40) to version 1.6 (1.6.0.41).

Annotation Interface
    New: Property Vertices
ColorSpace Interface
    New: Properties ColorantName, IsColor, IsMonochrome
Content Interface
    New: Property BreakWords
Document Interface
    New: Properties Creator, Producer
GraphicsState Interface
    New: Properties AlphaIsShape, BlendMode, FillAlphaConstant, Fi
29. IntegerValue
Name
ObjectNumber
RealValue
Size
StringValue
Type
4.16 EmbeddedFile Interface
CheckSum
CreationDate
FileName
ModDate
StoreInMemory
4.17 Enumerations
TPDFCompression
TPDFContentObject
TPDFErrorCode
TPDFOrientation
TPDFTextExtractConfiguration
5 Interface Changes
Changes from 1.4 to 1.4.1
Changes from 1.4.1 to 1.5
Changes from 1.5 to 1.6
Changes from 1.6 to 1.7
Changes from 1.7 to 1.8
Changes from 1.8 to 1.9
Changes from 1.9 to 1.91
Changes from 1.91 to 2.0
Changes from 2.0 to 2.1
Changes from 4.3 to 4.4
Samples & Background Information
Text Extraction
Undesired Missing Blanks
Extracted Text is Unreadable
Handling of Symbolic and Non-Symbolic Fonts
30. Extract API. Easiest for a quick start is to refer to this sample. In order to create a new project from scratch, do the following steps:

1. Start Visual Studio and create a new C# or VB project.
2. Add a reference to the .NET assemblies. To do so, in the Solution Explorer right-click your project and select "Add Reference". The "Add Reference" dialog will appear. In the tab "Browse", browse for the .NET assemblies libpdfNET.dll, RendererNET.dll and PdfExtractNET.dll and add them to the project as shown below.

[Screenshot: "Add Reference" dialog, tab "Browse", with the .NET assemblies selected]

3. Import namespaces. (Note: This step is optional, but useful.)
4. Write code.

Steps 3 and 4 are shown separately for C# and Visual Basic.

Visual Basic

3. Double-click "My Project" to view its properties. On the left-hand side, select the menu "References". The .NET assemblies you added before should show up in the upper window. In the lower window, import the namespaces Pdftools.Pdf and Pdftools.PdfExtractNET. You should now have settings similar to those in the screenshot below.

[Screenshot: project properties, "References" page]
31. Accessors: Get
Return true when the image is bi-tonal.

IsColor
Property: Boolean IsColor
Accessors: Get
Return true when the image is color.

IsMonochrome
Property: Boolean IsMonochrome
Accessors: Get
Return true when the image is monochrome.

ObjNumber
Property: Long ObjNumber
Accessors: Get
Returns a unique number of this image resource. If the number is 0, the image resource occurs only once in the document, i.e. it is an inline image. If the number is larger than 0, the image resource might be used multiple times.

Samples
Property: Variant Samples
Accessors: Get
Return the image's data samples in a byte array. The sample data is ordered by line from top to bottom and within a line from left to right. The lines are byte aligned. If the number of bits per component is less than one byte, then the samples are ordered beginning with the most significant bit first.
If the property ImageMask of the image is set to False, the interpretation of the sample data must be done according to the properties in the color space of the image.
If the property ImageMask of the image is set to True, the sample data represents a stencil mask. In this case the color space isn't meaningful and the data is organized one bit per pixel
32. Root Info Encrypt
    n: Page n
Path operators:
    name: Entry name of the dictionary
    i: Index i in the array
Examples:
    e.g. Root Pages Kids 0 Contents
    1 Resources Font TT2 FontDescriptor FontFamily

GetOcg
Method: Ocg GetOcg(Integer Count)
Return an interface to an optional content group item.
Parameters:
    Count   The number of the optional content group. Optional content groups are numbered from 0 to OcgCount - 1.
Return value:
    An interface to an optional content group item.

GetPageLabel
Method: String GetPageLabel(Long PageNo)
Return the label text associated with a specific page, given its number. Examples for page labels are "7" or "vii".
Parameters:
    PageNo   The page number.
Return value:
    A string holding the page label, if a page label exists. If no page label exists, the page number is converted to a string and returned.

GetXMPMetadata
Method: Boolean GetXMPMetadata(String FileName)
Extract the document's XMP metadata stream and write it to the specified file.
Parameters:
    FileName   The name of the output file.
Return value:
    True if the document contains XMP metadata and the stream was successfully written to the output file.

GetXMPMetadataMem
Method: Variant GetXMPMetadataMem()
Extract the document's XMP metadata stream as a byte array. If the
33. Text
Method: PDFText GetNextText()
This method reads the content stream objects until a text object can be returned or the end of the content stream is reached. If a text object can be found, an interface to the next read text object (see Text Interface) is returned. In contrast to the methods GetNextImage and GetNextPath, this method reads text objects and merges them until a major text property (font, line coordinate, etc.) changes or a word break occurs, if word breaking is enabled (see property BreakWords). The current graphics state can be retrieved through the current content object's interface.
Return value:
    An interface to the next text object, if there is any on this page. Nothing otherwise.

GraphicsState
Property: TPDFGraphicsState GraphicsState
Accessors: Get
Return an interface to the content's graphics state (see GraphicsState Interface). The graphics state is updated each time a method GetNextText, GetNextImage, GetNextPath or GetNextObject is called.

IgnoreOCM
Property: Boolean IgnoreOCM
Accessors: Get, Set
Option to ignore optional content membership and make all content visible. BeginOCM and EndOCM objects are extracted, but they have no effect on the extracted content. E.g. when true, hidden text is extracted as well. Set this property to true in order to extract all content.

Image
Pro
34. am is decompressed.

IntegerValue
Property: Long IntegerValue
Accessors: Get
Return the integer value of a numeric object.

Name
Property: String Name
Accessors: Get
Returns the character sequence of a name object. The string is null terminated.

ObjectNumber (applies to Indirect Objects)
Method: Long ObjectNumber()
Return the object number.

RealValue
Property: Double RealValue
Accessors: Get
Return the real value of a numeric object.

Size (applies to Arrays)
Property: Long Size
Accessors: Get
Returns the size of the array.

StringValue
Property: Variant StringValue
Accessors: Get
Return the content of a string object as byte array.

Type
Property: Type Type
Accessors: Get
Return the type of the object. Possible return values: eTypeBoolean, eTypeInteger, eTypeReal, eTypeString, eTypeName, eTypeArray, eTypeDictionary, eTypeIndirect.

4.16 EmbeddedFile Interface

CheckSum
Property: Variant CheckSum
Accessors: Get
Get the 16 byte MD5 check sum.

CreationDate
Property: String CreationDate
Accessors: Get
Get the creation date.

FileName
Property: String FileName
Accessors: Get
Get the embedded file's path. If the embedded file has no associated file stream, the func
35. ay

StemH, StemV
Property: Single StemH
Property: Single StemV
Accessors: Get
These properties return the vertical and horizontal thickness of the dominant vertical and horizontal stems of the glyphs in the font.

Type
Property: Single Type
Accessors: Get
Return the font type as string.

Widths
Property: Variant Widths
Accessors: Get
Return an array which contains the widths of the glyphs.

XHeight
Property: Single XHeight
Accessors: Get
Return the maximum height of flat, non-ascending lowercase letters (such as the letter x), measured from the baseline.

For further information about font descriptors see PDF Reference, chapter 5.7.

4.8 ColorSpace Interface

BaseColorSpace
Property: IPDFColorSpace BaseColorSpace
Accessors: Get
Return an IPDFColorSpace interface to the base color space, if one exists.

ColorantName
Property: Variant ColorantName
Accessors: Get
Return the name of the colorant.
Interface note:
    COM       A variant containing an array of strings is returned. These strings represent the names of the colorants of the color space. In an RGB color space these are Red, Green, Blue.
    C, .NET   An additional parameter is passed which defines the index of the colorant. Instead of an array containing all strings, a single string is returned, e.g. Red.

ComponentsPerPixel
36. below:

    Dim document As New Pdftools.PdfExtract.Document()
    document.Open(...)
    Dim content = document.Page.Content

Add the following namespaces:

    using Pdftools.Pdf;
    using Pdftools.PdfExtract;

The .NET interface can now be used as shown below:

    Document document = new Document();
    document.Open(...);
    Content content = document.Page.Content;

Trouble Shooting

The most common issue when using the .NET interface is that the native DLL is not found at execution time. This normally manifests when the constructor is called for the first time and an exception is thrown, normally of type System.TypeInitializationException. To resolve that, ensure the native DLL is found at execution time. For this, see sub-chapter ".NET Interface" in the chapter "Installation".

4 Reference Manual

Note: This manual describes the COM interface only. Other interfaces (C, Java, .NET) however work similarly, i.e. they have calls with similar names and the call sequence to be used is the same as with COM.

4.1 Document Interface

Author
Property: String Author
Accessors: Get
Return the author from the document's info object.

Close
Method: Void Close()
This method closes an open document. If the document is already closed, the met
37. d. When an encoding is missing or incorrect, the text may not be extractable. Even if the text is visually readable, if the meaning of the glyphs is not encoded, it cannot be extracted except by means of OCR. If text is not extractable using the text extraction of Adobe Acrobat 7 Professional, then it is most likely not extractable with the 3-Heights PDF Extract Tool, and vice versa.

Handling of Symbolic and Non-Symbolic Fonts

Fonts in PDF documents have so-called font descriptor flags (see PDF Reference Manual, chapter 5.7.1). These flags describe the font characteristics, such as fixed pitch, serif, symbolic, italic, etc. If a font is flagged symbolic, it means its glyphs are not part of the standard Latin character set. Typical symbolic glyphs are squares, stars or other small icons like cars or animals. Often there is no Unicode for these glyphs.

The 3-Heights PDF Extract Tool handles text extraction of symbolic as well as non-symbolic fonts as described below. If there is no encoding provided with the font, the intrinsic encoding is applied, which works as follows:

In case the font file is embedded:
    If there is a Unicode for the glyph, the corresponding Unicode is returned.
    If there is no Unicode, the font is flagged symbolic and part of the glyph names consist of a numerical value such as G1, G2, G100, the corresponding glyph number (and for TrueType fonts the Unicode Private Section prefix 0xF000) is returned.
    Oth
38. d by the PDF specification and our set of heuristics. These Unicodes might not be accurate. In some cases you might have prior knowledge about this specific font and know the mapping of character codes to Unicodes yourself, e.g. you know the creator used the EBCDIC encoding. For this reason the property RawString returns the string of character codes and allows you to apply your own mapping.
With RawString, do not use the TextExtConfiguration options eTECBreakSpaceUnicode, eTECPosMergeSingleSpace and eTECPosMergeMultiSpace, because the Unicode these options work with might not be accurate.

Rotation
Property: Single Rotation
Accessors: Get
Return the rotation of the string in radians (2 pi rad = 360 degrees).

StringLength
Property: Integer StringLength
Accessors: Get
Return the number of characters in the string.

UnicodeString
Property: String UnicodeString
Accessors: Get
Return the text as a Unicode (UTF-16) encoded string. The number of bytes per character is a multiple of two. For most languages, such as English, a character can be mapped to a single 16-bit Unicode value. Complex languages, such as Chinese, can return multiple 16-bit values per character.
Some text strings, however, cannot be correctly mapped or cannot be mapped at all. The former is the case if e.g. the PDF creator program didn't use correct names for the character in the font encoding (see Font Interface). The latter is the case if e.g. the PDF creator program didn't
39. dMode
Property: String BlendMode
Accessors: Get
Return the name of the blend mode. A blend mode can be Normal, Multiply, Screen, Overlay, etc.

CharSpacing
Property: Single CharSpacing
Accessors: Get
Return the current space between two characters of a text string as a single precision real number in text units.

CTM
Property: PDFTransformMatrix CTM
Accessors: Get
Return an interface to the current transform matrix. The transform describes the transformation of the graphic object's coordinates from user units to page units, including the effect of the page rotate attribute if requested (see method Reset of the Content Interface).

DashArray
Property: Variant DashArray
Accessors: Get
Return the dash array of a line dash pattern. The line dash pattern controls the pattern of dashes and gaps used to stroke paths.

DashPhase
Property: Single DashPhase
Accessors: Get
Return the dash phase of a line dash pattern. The dash phase is the offset of the pattern and can be larger than the pattern itself.

FillAlphaConstant
Property: Single FillAlphaConstant
Accessors: Get
Return the alpha constant for filling.

FillColorCMYK
Property: Long FillColorCMYK
Accessors: Get
Return the CMYK color quad for filling operations. The color value is obtained by converting the color values of the property FillColor by mea
40. default is 0.3. This means any distance between two characters that is larger than 0.3 times the width of the space character glyph in this font is interpreted as a new word. For text that is written very narrowly, this property should be decreased in order to avoid concatenation of words.

Text
Property: PDFText Text
Accessors: Get
Return an interface to the last read text object (see Text Interface). The text object is updated each time the method GetNextText or GetNextObject is called.

TextExtConfiguration
Property: Long TextExtConfiguration
Accessors: Get, Set
Default: 7 (eTECBreakTextState, eTECBreakGraphicsState, eTECBreakSpaceUnicode)
This property serves to control the way the text extraction algorithm works. Text extraction collects all text objects and merges them into a single text. This property controls which text objects are merged. See the enumeration TPDFTextExtractConfiguration for a list of all possible options.
Recommended settings for different use cases:
    Text search or indexing, i.e. text formatting is not important
        o Extract words individually: eTECBreakSpaceUnicode
        o Extract phrases: eTECPosMergeSingleSpace, eTECPosMergeMultiSpace
    Conversion of PDF content to another format, i.e. text formatting and exact positioning is crucial
        o Usage of RawString or
41. defines the visibility of subsequent content objects until the matching EndOCM operator is encountered. The respective OCG can be retrieved using the Document's GetOcg method.

As an example, look at the file www.pdf-tools.com/public/downloads/samples/layers.pdf. It contains six colored squares and six optional content groups. The visibility of the red, green and blue squares is controlled by the respective OCGs. The yellow square is only visible if both OCGs Green and Blue are ON. The OCGs Gray 64 and Gray 128 are child elements of the OCG Gray and control the visibility of the respective gray squares. These are visible only if both the child and the parent OCG are ON.

Extracting OCGs from Layers.pdf:

    id   name       level
    0    Red        0
    1    Green      0
    2    Blue       0
    3    Gray       0
    4    Gray 64    1
    5    Gray 128   1

Extracting objects from Layers.pdf:

    type       property   value     comment
    BeginOCM   OCM        0         the visibility of subsequent objects is defined by the state of OCG 0 (Red)
    Path       Path                 red square
    EndOCM                          end of OCM segment
    BeginOCM   OCM        1         OCG 1 is Green
    Path       Path                 green square
    EndOCM
    BeginOCM   OCM        2         OCG 2 is Blue
    Path       Path                 blue square
    EndOCM
    BeginOCM   OCM        1 && 2    subsequent objects are visible if OCG 1 and OCG 2 are ON
    Path       Path                 yellow square
    EndOCM
    BeginOCM   OCM        3         OCG 3 is Gray, parent OCG of 4 and 5
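The visibility rules in the Layers.pdf example can be traced with a small Python sketch. It does not call the Extract API: the helper names, the dictionary of OCG states, the parent table and the assumption that an OCM string is a list of OCG ids joined by "&&" (the only operator that appears in the example) are all illustrative.

```python
# Hypothetical ON/OFF states keyed by OCG id (ids as in the OCG table of the example).
ocg_on = {0: True, 1: True, 2: False, 3: True, 4: True, 5: False}

# Parent OCG of each child OCG: Gray 64 (4) and Gray 128 (5) are children of Gray (3).
parent = {4: 3, 5: 3}


def ocg_effective(ocg_id, states, parents):
    """An OCG counts as ON only if it and all of its parent OCGs are ON."""
    while ocg_id is not None:
        if not states[ocg_id]:
            return False
        ocg_id = parents.get(ocg_id)
    return True


def ocm_visible(ocm, states, parents):
    """Evaluate an OCM string such as "1 && 2": every listed OCG must be ON."""
    ids = [int(part) for part in ocm.split("&&")]
    return all(ocg_effective(i, states, parents) for i in ids)


print(ocm_visible("0", ocg_on, parent))       # red square: True with the states above
print(ocm_visible("1 && 2", ocg_on, parent))  # yellow square: False, because Blue is OFF
```

Setting the property IgnoreOCM to true corresponds to skipping this check entirely and treating every object as visible.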
42. del

TextRenderingMode
Property: Short TextRenderingMode
Accessors: Get
Return a value that indicates whether the text should be stroked, filled, used as a clip path, or some combination of the three. The meaning of the values in detail is:

    0   Fill text
    1   Stroke text
    2   Fill, then stroke text
    3   Neither fill nor stroke text (invisible)
    4   Fill text and add path for clipping
    5   Stroke text and add path for clipping
    6   Fill, then stroke text and add path for clipping
    7   Add path for clipping

TextRise
Property: Single TextRise
Accessors: Get
Return a single precision real number in unscaled text units that indicates by which amount the base line of the text is moved up or down. It is most commonly used to display subscripts and superscripts.

WordSpacing
Property: Single WordSpacing
Accessors: Get
Return the current space between two words of a text string as a single precision real number in text units.

For further information about the Graphic State see PDF Reference, chapter 4.3.

4.7 Font Interface

Ascent
Property: Single Ascent
Accessors: Get
Return the Ascent value. This value represents the maximum height above the baseline reached by the glyphs in the font, excluding the height of glyphs for accented characters.

AvgWidth
Property: Single AvgWidth
Accessors: Get
Return the average w
43. document does not contain XMP metadata, NULL is returned.

IsCollection
Property: Boolean IsCollection
Accessors: Get
Return true if the PDF document is a collection (aka PDF Portfolio).

IsEncrypted
Property: Boolean IsEncrypted
Accessors: Get
Return true if the PDF document has an encryption entry.

IsLinearized
Property: Boolean IsLinearized
Accessors: Get
Return true if the linearization flag is set in the PDF document. This property does not actually validate whether the linearization is correct. Linearization refers to optimizing the PDF for fast web access, i.e. support for random page access.

Keywords
Property: String Keywords
Accessors: Get
Return a string with the keywords of the document's info object.

LastError
Property: TPDFErrorCode LastError
Accessors: Get
This property can be accessed to receive the latest error code. Any return value other than PDF_S_SUCCESS (0) indicates that an error occurred. See enumeration TPDFErrorCode.

LastErrorMessage
Property: String LastErrorMessage
Accessors: Get
Return the error message text associated with the last error (see property LastError). Note that the property is NULL if no message is available.

MajorVersion
Property: Inte
44. ecision real numbers. The crop box is optional; it defines the range of the visible region of the page. If there is no crop box set, the media box is returned.

DeviceColorant
Property: String DeviceColorant
Accessors: Get
Return the device colorant.

Document
Property: PDFDocument Document
Accessors: Get
Return the interface to the page's document (see Document Interface).

GetFirstAnnotation
Method: Annotation GetFirstAnnotation()
Return an interface to the first annotation (see Annotation Interface).
Return value:
    An interface to the first annotation, if any annotations exist. Nothing otherwise.

GetNextAnnotation
Method: Annotation GetNextAnnotation()
Return an interface to the next annotation.
Return value:
    An interface to the next annotation, if any further annotations exist. Nothing otherwise.

MediaBox
Property: Variant MediaBox
Accessors: Get
Return the media box rectangle given by the coordinates left, bottom, right, top. The values are returned as an array of four single precision real numbers. The media box is required; it defines the physical boundaries of the medium on which the page is intended to be displayed or printed.

Rotate
Property: Integer Rotate
Accessors: Get
Return the rotation value of the page. This value is used by viewer programs to turn the page by the given number of
45. 3 Getting started
3.1 Visual Basic
3.2 ASP Script
3.3 .NET
Visual Basic
Trouble Shooting
4 Reference Manual
4.1 Document Interface
Author
Close
CreationDate
Creator
GetCurrentOutlineLevel
GetDestination
GetFirstColorSpaceResource
GetFirstEmbeddedFile
GetFirstFontResource
GetFirstImageResource
GetFirstOutlineItem
GetInfoEntry
GetNextColorSpaceResource
GetNextEmbeddedFile
GetNextFontResource
GetNextImageResource
GetNextOutlineItem
GetPageLabel
46. Encoding
FontBBox
FontFileType
ItalicAngle
MaxWidth
MissingWidth
StemH, StemV
Type
Widths
XHeight
4.8 ColorSpace Interface
BaseColorSpace
ColorantName
ComponentsPerPixel
IsColor
IsIndexed
IsMonochrome
Lookup
Name
4.9 TransformMatrix Interface
Orientation
Rotation
XScaling, YScaling
XSkew, YSkew
XTranslation, YTranslation
4.10 AlternateImage Interface
4.11 Annotation Interface
47. erwise the glyph index is returned. If the font is non-symbolic, the standard encoding is used.

In case the font file is not embedded:
    The standard encoding is applied.

Notes about the above algorithm:
When the standard encoding is applied, all control characters (< 31) are mapped to character 32 (blank).
The glyph numbers G1, G2, G100 are often created by Ghostscript-related PDF creators. In these cases the number in the glyph name corresponds to the encoding of the used code page. E.g. G65 is the character A in WinAnsi encoding.

Text Extraction of Text Marked as Symbolic

Sometimes text is marked as symbolic, but it actually is not. In certain cases PDF creators do this to prevent text extraction. Assume a PDF contains a TrueType font that is by mistake marked as symbolic. As a result, the returned characters contain the Unicode Private Range prefix (0xF000 to 0xF0FF). In this case the prefix needs to be removed again. This can be achieved by setting the property TranslateSymbolic to true.

5.13 Image Extraction

Image extraction samples in different programming languages are available online at http://www.pdf-tools.com/pdf/pdf-extract-content-metadata-text.aspx

An image is placed on the output page in any position, orientation and size as specified by the current transformation matrix (property CTM) of
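The effect of TranslateSymbolic on such text can be pictured with a short Python sketch. It only illustrates the mapping described in this manual (private-range code points 0xF000 to 0xF0FF back to the codes 0x00 to 0xFF, read as WinAnsi); it is not the tool's implementation, the function name is hypothetical, and Python's cp1252 codec is used here as a stand-in for WinAnsi.

```python
def translate_symbolic(text):
    """Map U+F000..U+F0FF back to WinAnsi characters; leave all other characters alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xF000 <= code <= 0xF0FF:
            # Drop the 0xF000 prefix and decode the remaining byte as WinAnsi (cp1252).
            out.append(bytes([code - 0xF000]).decode("cp1252", errors="replace"))
        else:
            out.append(ch)
    return "".join(out)


# Five private-range characters carrying the codes 0x48 0x65 0x6C 0x6C 0x6F.
print(translate_symbolic("\uf048\uf065\uf06c\uf06c\uf06f"))  # prints "Hello"
```

The same byte-to-character step also covers the G65 example: code 65 decodes to the character A.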
48. ext property.

eImage      An image object could be found and its interface can be retrieved through the content's Image property. The graphics state can be retrieved through the content's GraphicsState property.
ePath       A path object could be found and its string representation can be retrieved through the content's Path property. The graphics state can be retrieved through the content's GraphicsState property.
eSave       Save the current graphics state on the graphics state stack.
eRestore    Restore the graphics state by removing the most recently saved state from the stack and making it the current graphics state.
eBeginOCM   Start of a sequence of objects whose visibility is defined by an optional content membership string (property OCM). Sets the property OCM. OCM sequences can be nested.
eEndOCM     Marks the end of an OCM sequence.

GetNextPath
Method: String GetNextPath()
This method reads the content stream objects until a path object can be returned or the end of the content stream is reached. If a path object could be found, a string representation of the path object is returned. It can also be retrieved through the content's Path property. The graphics state can be retrieved through the content's GraphicsState property.
Return value:
    The next path on this page, if there is any. Nothing otherwise.

GetNext
49. extracted fonts: eTECBreakTextState, eTECBreakGraphicsState
        o Other: eTECBreakTextState, eTECBreakGraphicsState, eTECPosMergeSingleSpace

TranslateSymbolic
Property: Boolean TranslateSymbolic
Accessors: Get, Set
Default: False
Replace symbolic characters from the Unicode custom range 0xF000 to 0xF0FF with WinAnsi codes 0x00 to 0xFF.

4.4 Image Interface

Alternates
Property: Variant Alternates
Accessors: Get
Return an array of alternate images (see AlternateImage Interface). An image can have none, one or multiple alternate images.

BitsPerComponent
Property: Integer BitsPerComponent
Accessors: Get
Return the number of bits that are used to represent a single color component of an image sample. The number of color components per image data sample can be retrieved through the image's color space interface.

ChangeOrientation
Method: Boolean ChangeOrientation(TPDFOrientation Orientation)
Set the orientation of the image. This value has to be set prior to using the method Store. The orientation of the image can be retrieved from the property GraphicsState.ctm.Orientation.

ColorSpace
Property: IPDFColorSpace ColorSpace
Accessors: Get
Return an interface to the color space of the image (see ColorSpace Interface).

Compression
Property: IPDFCompression Compression
Accessors: Get
Return the compressio
50. file couldn't be opened.
The file couldn't be created.
The authentication failed due to a wrong password.

Undefined.
Pages appear in columns from bottom to top and right to left, relative to page orientation.
Pages appear in columns from bottom to top and left to right, relative to page orientation.
Pages appear in columns from top to bottom and left to right, relative to page orientation.
Pages appear in columns from top to bottom and right to left, relative to page orientation.
Pages appear in rows from right to left and bottom to top, relative to page orientation.
Pages appear in rows from left to right and bottom to top, relative to page orientation.
Pages appear in rows from left to right and top to bottom, relative to page orientation.
Pages appear in rows from right to left and top to bottom, relative to page orientation.

TPDFTextExtractConfiguration
Values: eTECBreakTextState, eTECBreakGraphicsState, eTECBreakSpaceUnicode, eTECPosMergeSingleSpace
eTECBreakTextState: Start a new text object if the text state changes (font, font size, horizontal scaling). Set this option if the text state is important to you.
eTECBreakGraphicsState: Start a new text object if the graphics state changes (color). Set this option if the color is important to you.
eTECBreakSpaceUnicode: Start a new text object if the extracted text contains a blank Unicode character (such as nbsp). Do not set this option if you need the RawString property. Example: If set, the text "Hello World" will be extracted as Hello
51. file of both the evaluation and the release version of the 3-Heights™ PDF Extract Tool API. Samples are also available at the website of PDF Tools for the 3-Heights™ PDF Extract Tool. Please find the latest samples online at http://www.pdf-tools.com/asp/products.asp?name=EXPA
Note: Code samples in this manual are not constantly updated and might not be 100% compatible with the latest version of the Extract API.

Text Extraction
For text extraction a page number must be set. The method GetNextText returns the text tokens in Z-order. This means the text token which is on top, i.e. is rendered last when the document is displayed, is retrieved last. Some PDF creators save the text in order from the upper left to the lower right corner, so extracting such documents yields a readable text sequence. This however is not true for all creators. It is equally possible to save every single character separately and in random order. Extracting text from such a document results in a random and therefore unreadable sequence of text tokens. The text tokens first need to be sorted by coordinate in order to make the text readable.

Undesired Missing Blanks
Using the property TextExtConfiguration, the text extraction algorithm can be configured. It is best to start with one of the settings recommended for your use case. Sometimes this can lead to undesired blanks within what visually looks like one word. For example if Text is
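The sorting step mentioned under Text Extraction could look roughly like the following C sketch. It is not taken from the product samples: the TextToken record and both function names are hypothetical, and the tokens are assumed to have been copied out of the API together with their page coordinates (origin at the bottom left, as in raw PDF coordinates). Tokens are ordered top to bottom, then left to right.

```c
#include <stdlib.h>

/* Hypothetical record for one extracted text token; the fields are
 * assumed to be filled from the token's position on the page. */
typedef struct {
    double x;           /* left edge in page units */
    double y;           /* vertical position in page units, origin at the bottom */
    const char *text;   /* extracted string */
} TextToken;

/* Reading order: larger y (higher on the page) first, then smaller x. */
static int compare_reading_order(const void *pa, const void *pb)
{
    const TextToken *a = pa;
    const TextToken *b = pb;
    if (a->y > b->y) return -1;
    if (a->y < b->y) return 1;
    if (a->x < b->x) return -1;
    if (a->x > b->x) return 1;
    return 0;
}

void sort_tokens_reading_order(TextToken *tokens, size_t count)
{
    qsort(tokens, count, sizeof tokens[0], compare_reading_order);
}
```

The comparison treats two tokens as being on the same line only if their y values are exactly equal. Real documents usually need a tolerance, for example by rounding y to a fraction of the font size before sorting.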
52. from all layers, the IgnoreOCM property can be set to true. For more background information, including a sample, see the section Optional Content Layers.

Label
Property: Boolean Label
Accessors: Get
Flag that indicates whether this is an OCG or a label. Labels are used to label groups of OCGs in the hierarchy. Setting their visibility has no effect.

Level
Property: Long Level
Accessors: Get
In user interfaces OCGs can be shown in a tree. The property Level indicates the hierarchy level of the OCG in that tree. An OCG with Level 0 is a top-level OCG. Level -1 means that the OCG is not part of the hierarchy; it should not be presented to the user. Parent elements in the OCG hierarchy can be labels or OCGs:
• If the level of a label b is higher than that of its predecessor a, b is the parent element of the following objects of the same level as b.
• If the level of an OCG b is higher than that of its predecessor OCG a, a is the parent of the following objects of the same level as b.
Note that the hierarchy reflects actual nesting of OCGs in the content. Setting the visibility of an OCG to true only has an effect if the visibilities of all its parents are set to true.

Name
Property: String Name
Accessors: Get
Return the name of the OCG.

Visible
Property: Boolean Visible
Accessors: Get, Set
Get or set if the OCG is visible. This property c
53. g path: W
The exact details of the path construction operators can be found in Adobe's PDF Reference Manual. The path object is updated each time the method GetNextPath or GetNextObject is called. This property cannot be set.

Reset
Method: Void Reset(Boolean AccountForRotate)
This method resets the content extraction process and sets the point of extraction to the beginning of the content stream.
• Parameters:
AccountForRotate (optional, default: false): This parameter defines the origin and orientation of the coordinate system of the coordinates of extracted content elements. The unit of the coordinate system is 1/72 inch.
  • False: The coordinates are extracted as raw coordinates as used in the PDF document.
  • True: Extracted coordinates are relative to the bottom left corner of the visible page as displayed by a viewer, i.e. the page is rotated by the page's Rotate attribute and cropped using a bounding box. For example, the coordinate (0, 0) denotes the bottom left corner of the page. The default bounding box used is the CropBox. This can be changed by setting the BoundingBox property before calling the Reset method.

SpaceFactor
Property: Single SpaceFactor
Accessors: Get, Set
This property can be used to get or set the distance between two characters that is required to insert a blank for text extraction. The
54. ger MajorVersion
Accessors: Get
Return the major version of the document. Example: PDF Version 1.5 corresponds to Adobe Acrobat 6; the major version is 1, the minor is 5.

MinorVersion
Property: Integer MinorVersion
Accessors: Get
Return the minor version of the document.

ModDate
Property: Date ModDate
Accessors: Get
Return the modification date of the info object of the document.

OcgCount
Property: Long OcgCount
Accessors: Get
Get the number of optional content groups (also known as layers) of the document.
• Return value: The number of optional content groups in this document.

Open
Method: Boolean Open(String FileName, String Password)
This method opens a PDF random access disk file, i.e. makes the objects contained in the PDF document accessible. If the document is already open, it is closed first.
• Parameters:
FileName: The file name and optionally the file path, drive or server string according to the operating system's file name specification rules.
Password (optional): The user or the owner password of the encrypted PDF document. If this parameter is left out, an empty string is used as a default.
• Return value:
True: The file was opened successfully.
False: The file does not exist, it is corrupt, or the password is invalid.

OpenMem
Method: Boolean OpenMem(Variant MemBlock, String Password)
This
55. hod does nothing.

Compliance
Property: TPDFCompliance Compliance
Get the claimed compliance of the document. For instance, this property can be used in order to detect if the document claims to be PDF/A.

CreationDate
Property: Date CreationDate
Accessors: Get
Return the creation date of the document's info object.

Creator
Property: String Creator
Accessors: Get
Return the name of the creator of the document's info object.

GetCurrentOutlineLevel
Method: Long GetCurrentOutlineLevel()
Return the level of the current outline (bookmark).
• Return value: The level of the current outline; 0 is equal to root level.

GetDestination
Method: PDFDestination GetDestination(String Destination)
Return an interface to the destination specified in the parameter.
• Parameters:
Destination: The named destination.
• Return value: An interface to the specified destination if it exists, Nothing otherwise.

GetFirstColorSpaceResource
Method: PDFColorSpace GetFirstColorSpaceResource()
Return an interface to the first color space resource (see ColorSpace Interface).
• Return value: An interface to the first color space resource if there is any, Nothing otherwise.

GetFirstEmbeddedFile
Method: PDFEmbeddedFile GetFirstEmbeddedFile()
Return an interface to the first embedded file (see EmbeddedFile Interface). Embedded files
56. idth

MiterLimit
Property: Single MiterLimit
Accessors: Get
Return the miter limit. The miter limit imposes a maximum on the ratio of the miter length to the line width, which can be fairly large when two line segments meet at a sharp angle. When the limit is exceeded, the join is converted from a miter to a bevel.

OverprintMode
Property: Integer OverprintMode
Return the overprint mode.

RenderingIntent
Property: String RenderingIntent
Return the name of the rendering intent.

SmoothnessTolerance
Property: Single SmoothnessTolerance
Accessors: Get
Return the smoothness tolerance. The values are in the range 0.0 to 1.0, where 1.0 corresponds to 100%.

SoftMask
Property: IPDFImage SoftMask
Accessors: Get
Return the soft mask as image.

StrokeAdjustment
Property: Boolean StrokeAdjustment
Accessors: Get
Return the flag for the automatic stroke adjustment.

SpaceWidth
Property: Float SpaceWidth
Accessors: Get
Get the width of the space character in text space. To get page (user) units, transform using the text's matrix. The SpaceWidth property can be used to implement your own word breaking algorithm. For more information about this, read the descriptions of the properties BreakWords and SpaceFactor.

StrokeAlphaConstant
Property: Single StrokeAlphaConstant
Accessors: Get
Return the current alpha stroke
57. idth of the glyphs in the font.

BaseName
Property: String BaseName
Accessors: Get
Return the font name.

CapHeight
Property: Single CapHeight
Accessors: Get
Return the height of the top of flat capital letters, measured from the baseline.

Charset
Property: String Charset
Accessors: Get
Return a string listing the character names defined in a font subset. This property is only useful for Type1 fonts.

Descent
Property: Single Descent
Accessors: Get
Return the Descent value. This negative number represents the maximum depth below the baseline reached by the glyphs in the font.

Encoding
Property: Variant Encoding
Accessors: Get
Return the glyph name of each character.

Flags
Property: Long Flags
Accessors: Get
Return the flags of the font. The flags are listed in the following table. Bit positions within the flag word are numbered from 1 (low-order) to 32 (high-order).

Bit Position  Name         Meaning
1             FixedPitch   All glyphs have the same width.
2             Serif        Glyphs have serifs.
3             Symbolic     The font contains characters outside the standard Latin character set.
4             Script       Glyphs resemble cursive handwriting.
6             NonSymbolic  Font uses standard Latin character set or a subset of it.
7             Italic       Glyphs
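Because the bit positions in the Flags table are numbered from 1, testing a flag means shifting by one position less than the table value. The following C sketch illustrates this under that reading. The enum and the function font_flag_is_set are hypothetical names introduced for illustration; they are not identifiers of the Extract API, and only the rows visible in the table above are listed.

```c
/* Bit positions as listed in the Flags table (numbered from 1).
 * Hypothetical names, not defined by the Extract API headers. */
enum {
    FONT_FLAG_FIXED_PITCH = 1,
    FONT_FLAG_SERIF       = 2,
    FONT_FLAG_SYMBOLIC    = 3,
    FONT_FLAG_SCRIPT      = 4,
    FONT_FLAG_NONSYMBOLIC = 6,
    FONT_FLAG_ITALIC      = 7
};

/* Returns 1 if the flag at the given 1-based bit position is set,
 * 0 otherwise (including for positions outside 1..32). */
int font_flag_is_set(unsigned long flags, int bit_position)
{
    if (bit_position < 1 || bit_position > 32) {
        return 0;
    }
    return (int)((flags >> (bit_position - 1)) & 1ul);
}
```

For a font whose Flags value is 66 (binary 1000010), font_flag_is_set reports the Serif and Italic positions as set and the FixedPitch position as clear.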
58. in\Icc             Opt Opt Opt Opt
doc\*.pdf           Doc Doc Doc Doc
doc\PDFParser.idl   Doc
doc\javadoc         Doc
include\expa_c.h    Req
include             Opt
jar\EXPA.jar        Req
lib\PDFParser.lib   Req
samples             Doc Doc Doc Doc

The purpose of the most important distributed files is described in Table: File Description.

Table: File Description
Name                Description
bin\PDFParser.dll   This is the DLL that contains the main functionality.
bin\pdcjk.dll       This DLL contains support for Asian languages. It is loaded from the module path.
bin\*NET.dll        The .NET assemblies are required when using the .NET interface. The files bin\*NET.xml contain the corresponding XML documentation for MS Studio.
bin\Icc             The two color profiles USWebCoatedSWOP.icc and sRGB Color Space Profile.icm are required to transform RGB to CMYK values and vice versa when extracting colors. The color profiles must not be renamed or they will not be found. Compatibility Note: In versions prior to 2.1.7 the color profiles had different names: CMYK.icc and sRGB.icm. These old names are no longer supported.
doc                 Various documentation.
include             Contains files to include in your C/C++ project.
jar\EXPA.jar        The Java wrapper.
lib\PDFParser.lib   The Object File Library needs to be linked t
59. ing libPDFPARSER.dylib to the DYLD_LIBRARY_PATH.
For Java:
• Rename the file libPDFPARSER.dylib to libPDFPARSER.jnilib, or create a file link for this purpose by using the following command:
  ln libPDFPARSER.dylib libPDFPARSER.jnilib
• Add the jar/EXPA.jar file to the CLASSPATH.

1.10 Samples
Samples for various programming languages are included in the Windows kits. They can also be downloaded at the PDF Tools AG web site: http://www.pdf-tools.com/asp/products.asp?name=EXPA

License Management
There are three possibilities to pass the license key to the application:
1. The license key is installed using the GUI tool (graphical user interface). This is the easiest way if the licenses are managed manually. It is only available on Windows.
2. The license key is installed using the shell tool. This is the preferred solution for all non-Windows systems and for automated license management.
3. The license key is passed to the application at runtime via the LicenseKey property. This is the preferred solution for OEM scenarios.

2.1 Graphical License Manager Tool
The GUI tool LicenseManager.exe is located in the bin directory of the product kit.
[Screenshot: PDF Tools License Manager window]
60. interpretation make it possible to restore text that is lacking essential information. The tool can also collect significant data such as position, color space and size when extracting images such as TIFF or JPEG. Querying document attributes such as PDF version, creator, author, title, subject and creation date is also possible. The tool also supports reading encrypted PDF files.

Features
• Extract text contained on a PDF page (line-wise and word-wise)
• Retrieve text attributes such as position and font
• Extract graphics objects (paths)
• Extract images
• Retrieve PDF image attributes such as format, position and transparency masks
• Retrieve PDF document attributes such as page count, version number and title
• Retrieve PDF page attributes such as the Crop Box and page rotation
• Retrieve detailed font information from PDF text
• Retrieve detailed graphics state information
• Retrieve detailed color space information
• Specify a password to decrypt PDF files

Formats
Input Formats: PDF 1.x (e.g. PDF 1.4, PDF 1.5)

Compliance
Standards: ISO 32000-1 (PDF 1.7)

Interfaces
The following interfaces are available: C, Java, .NET, COM.

1.4 Operating Systems
• Windows XP, Vista, 7, 8, 8.1 (32 and 64 bit)
• Windows Server 2003, 2008, 2008 R2, 2012, 2012 R2 (32 and 64 bit)
• HP-UX 11 and later, PA-RISC 2.0 (32 bit), or HP-UX
61. GraphicsState
TextExtConfiguration
TranslateSymbolic
4.4 Image Interface
Alternates
BitsPerComponent
ChangeOrientation
ColorSpace
Compression
ConvertToRGB
GetImage
GetResolution
Height
IsBitonal
IsColor
IsMonochrome
Width
4.5 Text Interface
BoundingBox
FontSize
Length
RawString
StringLength
UnicodeString
Width
62. [Screenshot: License Manager window with the toolbar buttons Add Key, Delete and Refresh, the license lists All Users and Current User, and the License Properties pane]

List all installed license keys
The license manager always shows a list of all installed license keys in the left pane of the window. This includes licenses of other PDF Tools products. The user can choose between:
• Licenses available for all users. Administrator rights are needed for modifications.
• Licenses available for the current user only.

Add and delete license keys
License keys can be added or deleted with the "Add Key" and "Delete" buttons in the toolbar.
• The "Add Key" button installs the license key into the currently selected list.
• The "Delete" button deletes the currently selected license keys.

Display the properties of a license
If a
63. java available that shows how to use this interface.

Begin, GetNext, End (applies to Dictionaries)
Property: Long Begin
Property: Long End
Method: Long GetNext(Long i)
Iterator: Property Begin, method GetNext and property End can be used to traverse a dictionary object. GetKey and GetValue return the key and value of an element.
C# Example:
  for (int i = dict.Begin; i != dict.End; i = dict.GetNext(i))
  {
      // do something
  }

BooleanValue
Property: Boolean BooleanValue
Accessors: Get
Return the Boolean value of a Boolean object.

Dispose, DestroyObject
.NET API: All objects retrieved from the API are destroyed when the document is closed. However, it is recommended to use Dispose as soon as possible in order to save memory.
Java and C/C++ API: The TPdfExpaPDFObject objects must always be deleted using ExpaPDFObjectDestroyObject.

GetElement (applies to Arrays)
Method: PDFObject GetElement(Long i)
Return the element at the index.

GetEntry (applies to Dictionaries)
Method: PDFObject GetEntry(String Name)
Return the entry of the dictionary.

GetStream (applies to Indirect Objects)
Method: PDFObject GetStream(String FileName)
Property: Variant StreamMem
Return the indirect object's stream if present. If the object is an image, the compressed stream is returned, otherwise the stre
64. les. The unit of pixels can be converted to a distance unit such as inch, millimeter, etc. using a resolution value, e.g. 72 dpi (dots per inch).

4.5 Text Interface

BoundingBox
Property: Variant BoundingBox
Accessors: Get
Return the smallest rectangle that encloses the text, as shown below.
[Figure: Text Bounding Box]
The text bounding box is a rectangle which encloses the four points Q1, Q2, Q3, Q4. The points Q1 and Q2 are 1/3 of the height below the baseline. The text bounding box is defined by four values which represent the coordinates of the lower left and the upper right corner.

FontSize
Property: Single FontSize
Accessors: Get
Return the size of the font in points. The size can also be interpreted as the height of the text.

Length
Deprecated, use StringLength instead.

RawString
Property: Variant RawString
Accessors: Get
For simple fonts, this property returns the raw character codes from the PDF as a byte array. For CID fonts, this property is NULL. If the ExpandLigatures property is not set, the length of the RawString is the same as the length of the UnicodeString, and the character position vector applies to the RawString character codes as well. The property UnicodeString always returns a string of Unicodes. These Unicodes are the result of the mapping of character codes to Unicodes define
65. llOverprintFlag, FlatnessTolerance, OverprintMode, RenderingIntent, SmoothnessTolerance, StrokeAdjustment, StrokeAlphaConstant, StrokeOverprintFlag
Font Interface
• Changed: Type of Flags from Long to int
Image Interface
• New Properties: IsBitonal, IsMonochrome, IsColor
Page Interface
• New Property: DeviceColorant
Text Interface
• New Property: TextMatrix

Changes from 1.6 to 1.7
This is a list of interface changes from version 1.6 (1.6.0.41) to version 1.7 (1.7.4.1).
Annotation Interface
• New Property: IsMarkup
Document Interface
• New Method: GetPageLabel

5.5 Changes from 1.7 to 1.8
This is a list of interface changes from version 1.7 (1.7.4.1) to version 1.8 (1.8.35.1).
Image Interface
• New Property: SMask

5.6 Changes from 1.8 to 1.9
This is a list of interface changes from version 1.8 (1.8.35.1) to version 1.9 (1.9.24.1).
Document Interface
• Deprecated Property: ErrorCode
• New Property: LastError
Content Interface
• New Property: ConvertPathToImage
Colorspace Interface
• Deprecated Property: Colorant
• New Property: ColorantName
• Deprecated Property: High
• New Property: HighIndex
Text Interface
• Deprecated Property: Length
• New Property: StringLength

5.7 Changes from 1.9 to 1.91
This is a list of interface changes from version 1.9 (1.9.24.1) to version 1.91 (1.91.28.0).
Content Interface
• New Properties: PathImageAn
66. GetXMPMetadata
GetXMPMetadataMem
IsCollection
IsEncrypted
IsLinearized
Keywords
LastError
LastErrorMessage
MajorVersion
MinorVersion
ModDate
OcgCount
Open
OpenMem
PageCount
Subject
Title
4.2 Page Interface
ArtBox
BleedBox
Content
CropBox
DeviceColorant
Document
GetFirstAnnotation
GetNextAnnotation
4.3 Content Interface
BreakWords
ExpandLigatures
GetNextImage
GetNextObject
GetNextPath
67. m a regular font to italic.

XTranslation, YTranslation
Property: Single XTranslation
Property: Single YTranslation
Accessors: Get
Return the X and Y translation. These are the same values as returned by the properties e and f.

4.10 Alternate Image Interface

DefaultForPrinting
Property: Boolean DefaultForPrinting
Accessors: Get
Return true if the alternate image is set as default for printing.

Image
Property: IPDFImage Image
Accessors: Get
Return an interface to the alternate image (see Image Interface).

4.11 Annotation Interface

AttachedFile
Property: IPDFEmbeddedFile AttachedFile
Accessors: Get
Return the embedded file attached to this annotation. This property is meaningful for FileAttachment annotations only. Note that the AttachedFile might not have an embedded file stream but reference an external file via the FileName property only.

Color
Property: Long Color
Accessors: Get
Return the color of the annotation.

Contents
Property: String Contents
Accessors: Get
Return the content of the annotation.

Date
Property: Date Date
Accessors: Get
Return the date of the annotation. The used format is dd mm yyyy hh mm ss.

Dest
Property: IPDFDestination Dest
68. n used for the image in the PDF.

ConvertToRGB
Method: Boolean ConvertToRGB()
Convert the image to an RGB image. The conversion uses the image's color space to interpret the sample data. Calibrated color spaces are converted to RGB values according to the sRGB color standard. Device color spaces are converted using pre-defined color profiles.
• Return value: True if the conversion was successful, False otherwise.

GetImage
Method: Variant GetImage()
Return the image from memory which was previously saved using the method StoreInMemory.
• Return value: The image as a 1-dimensional byte array.

GetResolution
Method: Single GetResolution(IPDFTransformMatrix Matrix)
Return the resolution of an image on the page in dpi (dots per inch).
• Parameters:
Matrix: The transformation matrix of the image. This parameter is required since the image itself has no resolution. The resolution is the ratio between the size of the image and the size it uses on the page.
• Return value: The calculated resolution in dpi.

Height
Property: Long Height
Accessors: Get
Return the height of the image in pixels, also called samples. The unit of pixels can be converted to a distance unit such as inch, millimeter, etc. using a resolution value, e.g. 72 dpi (dots per inch).

IsBitonal
Property: Boolean IsBitonal
Accessors:
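The ratio that GetResolution describes, and the pixel-to-distance conversion mentioned under Height, can be worked through in plain C. This is a sketch under stated assumptions, not the library's calculation: both function names are hypothetical, the size on the page is assumed to be known in units of 1/72 inch along the same axis as the pixel count, and rotated or skewed placements are not handled.

```c
/* Hypothetical helpers, not part of the Extract API. */

/* pixels:        image size in samples along one axis
 * extent_points: size the image occupies on the page along the same
 *                axis, in units of 1/72 inch
 * Returns dots per inch, or 0.0 if the extent is not positive. */
double image_resolution_dpi(long pixels, double extent_points)
{
    if (extent_points <= 0.0) {
        return 0.0;
    }
    return (double)pixels * 72.0 / extent_points;
}

/* Converts a pixel count to millimeters for a given resolution.
 * Returns 0.0 if the resolution is not positive. */
double pixels_to_millimeters(long pixels, double dpi)
{
    if (dpi <= 0.0) {
        return 0.0;
    }
    return (double)pixels / dpi * 25.4;   /* 1 inch = 25.4 mm */
}
```

For example, an image that is 600 samples wide and placed 144 units (2 inches) wide on the page has a resolution of 300 dpi along that axis.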
69. ns of the FillColorSpace. The CMYK quads are encoded using the following formula:
  Quad = C * 256^3 + M * 256^2 + Y * 256 + K
If a color doesn't exist (e.g. with an uncolored pattern), then -1 is now returned.
Hexadecimal: Quad = 0xCCMMYYKK, where CC is the byte for the cyan value in the range from 0x00 to 0xFF, MM is magenta, YY is yellow, KK is key (black).
Decimal: To retrieve the values for cyan, magenta, yellow and key, apply the following formulas (VB code, taking into account negative values, using integer division "\" and bitwise and "And"):
  Quad = PDFPARSERLib.GraphicsState.FillColorCMYK
  t = Quad And &H7FFFFFFF
  C = t \ 16777216
  M = (t \ 65536) And 255
  Y = (t \ 256) And 255
  K = t And 255
  If Quad < 0 Then C = C Or &H80
There are also other ways to retrieve these values than using the above formulas.

FillColorRGB
Property: Long FillColorRGB
Accessors: Get
Return the RGB color triple for filling operations. The color value is obtained by converting the color values of the property FillColor by means of the FillColorSpace. The RGB triples are encoded using the following formula:
  Triple = B * 256^2 + G * 256 + R
If a color does not exist (e.g. with an uncolored pattern), then -1 is now returned.
Hexadecimal: Triple = 0xBBGGRR, where BB is the byte for the blue value in the range from 0x
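The same decoding can be written with shifts and masks. The C sketch below follows the hexadecimal layouts 0xCCMMYYKK and 0xBBGGRR given above. The struct and function names are hypothetical, and the sketch assumes the value has already been read from the property into a 32-bit signed integer. Reinterpreting it as unsigned first avoids the special handling of negative values that the VB formulas need.

```c
#include <stdint.h>

/* Hypothetical result types, not part of the Extract API. */
typedef struct { unsigned c, m, y, k; } CmykColor;
typedef struct { unsigned r, g, b; } RgbColor;

/* Decode a quad laid out as 0xCCMMYYKK. */
CmykColor decode_cmyk_quad(int32_t quad)
{
    uint32_t u = (uint32_t)quad;   /* well defined, also for negative input */
    CmykColor out;
    out.c = (u >> 24) & 0xFFu;
    out.m = (u >> 16) & 0xFFu;
    out.y = (u >> 8) & 0xFFu;
    out.k = u & 0xFFu;
    return out;
}

/* Decode a triple laid out as 0xBBGGRR. */
RgbColor decode_rgb_triple(int32_t triple)
{
    uint32_t u = (uint32_t)triple;
    RgbColor out;
    out.b = (u >> 16) & 0xFFu;
    out.g = (u >> 8) & 0xFFu;
    out.r = u & 0xFFu;
    return out;
}
```

A caller would check for the "no color" return value before decoding, since the functions themselves do not treat any input specially.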
70. o the C/C++ project.
samples: Contains sample programs in different programming languages.

Color Profiles
The 3-Heights™ PDF Extract API uses color profiles to convert sRGB to CMYK colors and vice versa. If no color profiles are available, the conversion is done algorithmically. In order to convert using color profiles, two files are required: Icc\CMYK.icc and Icc\sRGB.icm, where the directory Icc must be a direct sub-directory of where PdfParser.dll resides. Color profiles can be downloaded from the links provided in the directory Icc. Download at least one CMYK color profile and one sRGB profile, or copy them from your local system. Most systems have pre-installed color profiles available at %SystemRoot%\system32\spool\drivers\color. Rename them to sRGB.icm and CMYK.icc.

Deployment
Runtime Kit Distributed Files
The runtime kit (RTK) contains all files that are used for deploying the software. This is a subset of the files contained in the SDK. Which files are required (Req), optional (Opt) or not used (empty field) for the four different interfaces is shown in the table below.

Table: Files for Deployment
Name               .NET  JNI  COM  C
bin\PDFParser.dll  Req   Req  Req  Req
bin\pdcjk.dll      Opt   Opt  Opt  Opt
bin\*NET.dll       Req
bin\Icc            Opt   Opt  Opt  Opt
jar\EXPA.jar             Req

Deploying
71. of both the document's collection (PDF Portfolio) and of FileAttachment annotations are returned.
• Return value: An interface to the first embedded file if there is any, Nothing otherwise.

GetFirstFontResource
Method: PDFFont GetFirstFontResource()
Return an interface to the first font resource (see Font Interface).
• Return value: An interface to the first font resource if there is any, Nothing otherwise.

GetFirstImageResource
Method: PDFImage GetFirstImageResource()
Return an interface to the first image resource (see Image Interface).
• Return value: An interface to the first image resource if there is any, Nothing otherwise.

GetFirstOutlineItem
Method: PDFOutlineItem GetFirstOutlineItem()
Return an interface to the first outline item (see Outline Interface).
• Return value: An interface to the first outline item if there is any, Nothing otherwise.

GetInfoEntry
Method: String GetInfoEntry(String szKey)
Return the value of a custom entry in the info object.
• Parameters:
szKey: The string defining the info object, such as "Author" or "Subject".
• Return value: The string corresponding to the info object if it exists, Nothing otherwise.

GetNextColorSpaceResource
Method: PDFColorSpace GetNextColorSpaceResource()
Return an interface to the next color space resource.
• Return value: An interface to the next colo
72. om the left side page border; f is the distance on the y axis from the bottom. (0, 0) is in the lower left corner; on a page with a size of A4 portrait, (595, 842) is in the upper right corner.
The scale factor in a matrix [a 0 0 d 0 0] can be obtained from the values a and d for x and y scaling respectively. With respect to fonts, d represents the font size of horizontal text. A rotation of the axes by an angle α counter-clockwise is produced by a matrix [cos α  sin α  -sin α  cos α  0  0]. More detailed information can be found in the PDF Reference manual, chapter 4.2.2.

Orientation
Property: TPDFOrientation Orientation
Accessors: Get
Return the orientation rounded to the next 90 degrees. The orientation is an enumeration with eight different values (rotation times flipping). See enumeration TPDFOrientation.

Rotation
Property: Single Rotation
Accessors: Get
Return the rotation angle of the matrix (counter-clockwise). This is equal to the minimum of XSkew and YSkew.

XScaling, YScaling
Property: Single XScaling
Property: Single YScaling
Accessors: Get
Return the x and y scaling factor.

XSkew, YSkew
Property: Single XSkew
Property: Single YSkew
Accessors: Get
Return the x and y axis skewing. The transformation matrix [1  tan α  tan β  1  0  0] skews the x axis by α and the y axis by β. Skewing sometimes is used to transfor
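How the six matrix values act on a coordinate can be shown with a short C sketch. It follows the general PDF convention for a matrix [a b c d e f], in which a point (x, y) is mapped to (a·x + c·y + e, b·x + d·y + f). The Matrix and Point types and the function matrix_apply are hypothetical names for illustration; they are not types of the Extract API.

```c
/* Hypothetical types, not part of the Extract API. */
typedef struct { double a, b, c, d, e, f; } Matrix;
typedef struct { double x, y; } Point;

/* Apply the matrix [a b c d e f] to a point, using the PDF convention
 * x' = a*x + c*y + e and y' = b*x + d*y + f. */
Point matrix_apply(Matrix m, Point p)
{
    Point q;
    q.x = m.a * p.x + m.c * p.y + m.e;
    q.y = m.b * p.x + m.d * p.y + m.f;
    return q;
}
```

With the scaling matrix [2 0 0 3 10 20], the point (1, 1) is mapped to (12, 23). With the matrix [0 1 -1 0 0 0], which is the rotation matrix above for α = 90 degrees, the point (1, 0) is mapped to (0, 1).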
73. one during installation, e.g. un-register using regsvr32 /u, delete all files, etc.
Note that an expired evaluation DLL cannot be unregistered. If you would like to un-register an expired evaluation DLL, download a new, non-expired evaluation version, overwrite the old version and un-register it.
Installing a new version does not require uninstalling the old version first. The files of the old version can directly be overwritten with the new version. If using the COM interface, the new DLL must be registered; un-registering the old version is not required.

Unix
Unpack the archive in an installation directory, e.g. /User/lib/pdf-tools.
• bin/libPDFPARSER.so: This is the library that contains the main functionality (required).
• doc/: Contains documentation files.
• include/: Contains files to include in your C/C++ project.
• jar/EXPA.jar: Contains the Java wrapper.

Installation on Unix Systems
1. Unpack the archive in an installation directory, e.g. /usr/pdftools.com
2. Copy or link the shared object into one of the standard library directories, e.g.
   ln -s /usr/pdftools.com/bin/libPDFPARSER.so /usr/lib
3. In case you have not yet installed the GNU shared libraries, get a copy of these from http://www.pdf-tools.com, extract the shared images and copy or link them into /usr/lib or /usr/local/lib.

Installation on Mac OS X
1. Unpack the archive in an installation directory, e.g. /User/lib/pdf-tools
2. Add the directory contain
74. ontrols the extraction of content objects. The default value is the one configured in the PDF document.
Note that though invisible paths generate no marks on the page, they still have an effect on the graphics state. For example, their effect on the current drawing position and the clipping region does not change. Therefore all paths are active and extracted regardless of their visibility. Invisible paths just use the end path operator "n" instead of a filling or stroking operator.
Example 1:
id   OCGs     Level        Hierarchy
0    OCG A    0            OCG A
1    OCG B    0            OCG B
2    OCG B1   1              OCG B1
3    OCG B2   1              OCG B2
4    OCG C    1 (hidden)   OCG C
Example 2:
id   OCGs / Labels   Level   Hierarchy
0    OCG A           0       OCG A
1    Label B         1       Label B
2    OCG B1          1         OCG B1
3    OCG B2          1         OCG B2
4    Label C         1       Label C
5    OCG C1          1         OCG C1
6    OCG D           0       OCG D
4.15 PDFObject Interface
This interface represents a basic PDF object. More information on these types of objects can be found in chapter 3.2 of the PDF Reference. The PDFObject interface represents an object which can be one of eight types. Depending on its type, different methods and properties should be used.
Note: If PDF objects are traversed recursively, it must be ensured that the program does not end up in an endless loop for cyclical structures. There is a Java sample PdfObjExt
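As a rough illustration of how a Level column can be turned back into a hierarchy, the following Python sketch derives the parent of each entry from the sequence of levels. It assumes that a level of 0 denotes a top-level entry and that each entry belongs to the nearest preceding entry with a smaller level; the function name is hypothetical and the hidden entry of Example 1 is left out because the source does not make its treatment clear.

```python
def parents_by_level(levels):
    """Return, for each entry, the index of its parent entry (None for top level).

    levels: list of non-negative nesting levels in listing order.
    """
    parents = []
    stack = []  # indexes of the currently open ancestors, outermost first
    for index, level in enumerate(levels):
        del stack[level:]                        # close entries at this depth or deeper
        parents.append(stack[-1] if stack else None)
        stack.append(index)
    return parents


# First four rows of Example 1: OCG A, OCG B, OCG B1, OCG B2
example_levels = [0, 0, 1, 1]
example_parents = parents_by_level(example_levels)
```

With these levels, OCG B1 and OCG B2 (ids 2 and 3) both get OCG B (id 1) as their parent.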
75. perty: PDFImage Image
Accessors: Get
Return an interface to the last read image object (see Image Interface). The image object is updated each time the method GetNextImage or GetNextObject is called.
OCM
Property: String OCM
Accessors: Get
Return the current optional content membership string, which defines the visibility as a Boolean function of OCGs in C syntax. OCGs are represented by Ids. Retrieve the respective OCG using the Document interface's GetOcg method. Supported operators include &&.
Example: "1 && 2" means that the following objects are visible only if OCG 1 and OCG 2 are visible.
Note: This property is valid only immediately after extraction of a BeginOCM object.
Path
Property: String Path
Accessors: Get
Return the last read path object in its string form. The path object describes a graphic drawing consisting of stroked lines and curves as well as filled shapes. The string contains the PDF path construction tokens, consisting of real value operands (in angle brackets) followed by operator mnemonics:
• Move current point to: <x> <y> m
• Line from current point to: <x> <y> l
• Rectangle: <x> <y> <w> <h> re
• Cubic Bezier curve from current point to: <x1> <y1> <x2> <y2> <x3> <y3> c
• Close figure (move to start of last sub-path): h
• Fill path: f
• Stroke path: s
• End path without filling and stroking: n
• Modify current clippin
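The visibility rule expressed by an OCM string such as "1 && 2" can be evaluated with a small parser. The Python sketch below is an illustration only: the function name is hypothetical, the visibility of each OCG is assumed to be supplied as a dictionary keyed by OCG Id, and support for "||", "!" and parentheses is an assumption based on the "C syntax" wording rather than something the excerpt confirms.

```python
import re


def ocm_visible(expression, visible):
    """Evaluate an OCM string like "1 && 2" against {ocg_id: bool}."""
    tokens = re.findall(r"\d+|&&|\|\||!|\(|\)", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()          # always parse the right side
            value = value or rhs
        return value

    def parse_and():
        value = parse_unary()
        while peek() == "&&":
            take()
            rhs = parse_unary()
            value = value and rhs
        return value

    def parse_unary():
        token = take()
        if token == "!":
            return not parse_unary()
        if token == "(":
            value = parse_or()
            take()                     # consume the closing ")"
            return value
        return visible[int(token)]     # an OCG Id

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token in OCM string: " + tokens[pos])
    return result
```

In a real application the dictionary would be filled from the Visible property of each OCG obtained through the Document interface.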
76. r space resource if there is any, Nothing otherwise.
GetNextEmbeddedFile
Method: PDFEmbeddedFile GetNextEmbeddedFile()
Return an interface to the next embedded file.
• Return value: An interface to the next embedded file if there is any, Nothing otherwise.
GetNextFontResource
Method: PDFFont GetNextFontResource()
Return an interface to the next font resource.
• Return value: An interface to the next font resource if there is any, Nothing otherwise.
GetNextImageResource
Method: PDFImage GetNextImageResource()
Return an interface to the next image resource.
• Return value: An interface to the next image resource if there is any, Nothing otherwise.
GetNextOutlineItem
Method: PDFOutlineItem GetNextOutlineItem(Long MaxLevel, Boolean ReturnOpenOnly)
Return an interface to the next outline item.
• Parameters:
  MaxLevel (optional, default 20): The maximum level of the depth of the outlines.
  ReturnOpenOnly (optional, default false): Return only outlines which are opened.
• Return value: An interface to the next outline item if there is any, Nothing otherwise.
GetObject
Method: PDFObject GetObject(String Path)
This method returns a PDF object specified by the path string. The path consists of a prefix and operators.
Prefix:
• Trailer dictionary (see chapter 3.4.4 of the PDF Reference); valid entries are
77. the Application
The deployment of an application works as described below:
1. Identify the required files from your developed application.
2. Identify all files from the RTK that are required by your developed application.
3. Include all these files into an installation routine such as an MSI file or a simple batch script.
4. Perform any interface-specific actions (e.g. registering when using the COM interface).
Example: This is a very simple example of how a COM application written in Visual Basic 6 could be deployed.
1. The developed and compiled application consists of the file TextExt.exe.
2. The application uses the COM interface and is distributed on Windows XP only.
   • The main DLL PDFParser.dll must be distributed.
   • Asian text should be supported, thus pdcjk.dll is distributed.
3. All files are copied to the target location using a batch script. This script contains the following commands:
   COPY TextExt.exe targetlocation
   COPY PDFParser.dll targetlocation
   COPY pdcjk.dll targetlocation
4. For COM, the main DLL needs to be registered in silent mode (/s) on the target system. This step requires PowerUser privileges and is added to the batch script:
   REGSVR32 /s targetlocation\PDFParser.dll
1.7 Interface-specific Installation Steps
COM Interface
Registration: Before you can use the 3-Heights PDF Extract API component in your COM application program, you have to register the component using the regsvr32.exe program
78. tiAlias, PathImageBGColor, PathImageResolution, ConvertPathToImage
5.8 Changes from 1.91 to 2.0
There are no interface changes from version 1.91 final to 2.0 final.
5.9 Changes from 2.0 to 2.1
The color profiles used to transform RGB to CMYK values and vice versa when extracting colors (in the directory bin/icc) have been renamed from CMYK.icc and sRGB.icm to USWebCoatedSWOP.icc and sRGB Color Space Profile.icm to reflect their real names. The abbreviated versions are no longer supported.
• Document Interface: New Methods: OcgCount, GetOcg, GetFirstEmbeddedFile, GetNextEmbeddedFile. New Property: LastErrorMessage.
• Ocg: New Interface. New Properties: Label, Level, Name, Visible.
• Content Interface: New Properties: OCG, IgnoreOCG.
• TPDFContentObject Enum: New Enumerations: eBeginOCM, eEndOCM.
• PDFObject: New Interface. New Methods: GetElement, GetEntry, GetNext, GetStream. New Properties: BooleanValue, IntegerValue, RealValue, StringValue, Name, Size, Begin, End, ObjectNumber, Type.
• EmbeddedFile: New Interface. New Methods: Store, StoreInMemory. New Properties: CheckSum, CreationDate, FileName, ModDate.
5.10 Changes from 4.3 to 4.4
• Content Interface: Removed Properties: PathImageBGColor, PathImageAntiAlias, PathImageResolution, ConvertPathToImage.
Samples & Background Information
There are various code samples in the ZIP
79. tions Store and StoreInMemory return false, the FileName property references an external file.
ModDate
Property: String ModDate
Accessors: Get
Get the modification date.
Store
Method: Boolean Store(String Path)
Store the embedded file to disk.
• Parameters:
  Path: The file name and path where the document shall be stored.
• Return values: True if the operation completed successfully, False otherwise.
StoreInMemory
Method: Variant StoreInMemory()
Store the embedded file in memory.
• Return values: The embedded file as a byte array.
4.17 Enumerations
Note: Depending on the interface, enumerations may have "TPDF" as prefix (COM, C), "PDF" as prefix (.NET), or no prefix at all (Java).
TPDFCompression
eComprRaw         No compression
eComprJPEG        Joint Photographic Expert Group
eComprFlate       Flate compression
eComprLZW         Lempel-Ziv-Welch
eComprGroup3      CCITT Fax Group 3
eComprGroup3_2D   CCITT Fax Group 3 2D
eComprGroup4      CCITT Fax Group 4
eComprJBIG2       Joint Bi-level Image Experts Group
eComprJPEG2000    JPEG2000
eComprUnknown     Unknown compression
eComprDefault     Apply a default compression which suits the color space of the image
Note that not all image formats and color depths support all compression types.
TPDFContentObject
See also function Content.GetNextObject.
eBeginOCM, eEndOCM, eNone
80. trings as a single precision real number in text units. It does not include any scaling factors from coordinate transforms such as the current transform matrix or the text matrix. In order to obtain the font size in page units, the values of the current text matrix have to be examined.
HorizontalScaling
Property: Single HorizontalScaling
Accessors: Get
Return the current horizontal scaling factor that describes the amount of horizontal stretching of a text string. A value greater than 1.0 stretches the string, whereas a value less than 1.0 makes the string appear condensed.
Leading
Property: Single Leading
Accessors: Get
Return the current leading (line spacing) of a text string as a single precision number in text units.
LineCap
Property: Integer LineCap
Accessors: Get
Return the line cap style. The line cap style specifies the shape to be used at the end of open sub-paths (and dashes) when they are stroked.
0 Butt cap
1 Round cap
2 Projecting square cap
LineJoin
Property: Integer LineJoin
Accessors: Get
This property returns the line join style. The line join style specifies the shape to be used at the corners of paths that are stroked.
0 Miter join
1 Round join
2 Bevel join
LineWidth
Property: Single LineWidth
Accessors: Get
Return a single precision real number in user units of the line w
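How a font size in text units relates to a size in page units can be sketched as follows. This Python fragment is an assumption-laden illustration, not part of the API: the function names are hypothetical, both matrices are taken as six numbers [a b c d e f], and the text matrix is concatenated with the current transform matrix in the row-vector convention used by the PDF Reference.

```python
import math


def multiply(m1, m2):
    """Concatenate two matrices [a b c d e f]: apply m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def font_size_in_page_units(font_size, text_matrix, ctm):
    """Scale a font size (text units) by the vertical scaling of text matrix x CTM."""
    _a, _b, c, d, _e, _f = multiply(text_matrix, ctm)
    return font_size * math.hypot(c, d)   # length of the transformed y unit vector
```

A font size of 1 combined with a text matrix of [12 0 0 12 72 700] and an identity transform matrix therefore corresponds to 12 page units.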
81. urn the Left value.
PageNo
Property: Long PageNo
Accessors: Get
Return the target page number.
Right
Property: Single Right
Accessors: Get
Return the Right value.
Top
Property: Single Top
Accessors: Get
Return the Top value.
Type
Property: Single Type
Accessors: Get
Return the type of the destination, such as XYZ, Fit, FitH, FitR, etc.
Zoom
Property: Single Zoom
Accessors: Get
Return the Zoom value of the destination. A value of 0 means the zoom level is left as is. It has the same meaning as a null value; the returned value will be 0 in both cases. A value of 1 means 100% magnification.
4.14 Ocg Interface
The optional content group (OCG) interface allows listing optional content groups (also known as Layers) and their properties.
Optional content groups (OCGs) in PDF differ substantially from the simple layer paradigm found e.g. in graphics editing programs. Graphics objects in PDF do not belong to an OCG. Instead, their visibility is calculated by a Boolean function dependent on the state of any number of OCGs. For example, a path could be visible only if OCG A is ON and OCG B is OFF.
The functionality of OCGs is described in depth in ISO 32000-1, chapter 8.11.4, or in the PDF Reference, chapter 4.10. OCG is supported in PDF 1.5 or later.
In order to extract content
82. vailable for mark-up annotations; requires PDF 1.5 or later.
Subtype
Property: String Subtype
Accessors: Get
Return the type of the annotation as string, such as Widget, Square, PopUp, FreeText, Ink, etc.
TextLabel
Property: String TextLabel
Accessors: Get
Return the text label of the annotation as string. This label is usually used for the name of the author.
URI
Property: String URI
Accessors: Get
Return the URI entry of the annotation as string, if present.
Vertices
Property: Variant Vertices
Accessors: Get
Return the vertices of a polygon annotation.
4.12 OutlineItem Interface
Count
Property: Long Count
Accessors: Get
Return the number of children of the current outline. A negative number means the child tree is not opened.
Dest
Property: IPDFDestination Dest
Accessors: Get
Return an interface to the destination (see Destination Interface).
Title
Property: String Title
Accessors: Get
Return the title of the outline.
4.13 Destination Interface
Note that the properties Bottom, Left, Right and Top of the destination interface have different meanings depending on the Type of the destination. The coordinates are raw PDF user space coordinates.
Bottom
Property: Single Bottom
Accessors: Get
Return the Bottom value.
Left
Property: Single Left
Accessors: Get
Ret
83. written with different subsets of the same font. Different subsets of a font are considered different fonts. Therefore, if the font changes within what visually looks like one word, it is separated.
• Text is not written on the same horizontal line. This can occur in some OCRed documents. There is a built-in tolerance to take this into account; however, if Y offsets are too large, a new word starts.
• Various possible errors in the font, such as incorrect or missing width values of the glyphs (in particular of the blank), incorrect encoding, etc.
In all of the above cases the coordinates need to be considered. Instead of inserting blanks after each word as in the sample, the coordinate and width of the previous text token need to be compared with the position of the next text token.
If text is concatenated (i.e. blanks are missing), decrease the property SpaceFactor, for example to the value 0.2. See also property SpaceFactor in the Content interface.
Extracted Text is Unreadable
Fonts contain a particular set of glyphs. A glyph is a specific graphical rendering of a character. The glyphs P, P and P are glyphs of the character "P". Fonts have an encoding, such as WinAnsi or MacRoman, or custom encodings. The encoding maps the glyphs to a character. If the encoding in a font is missing, it is assumed it is WinAnsi encode
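The coordinate comparison described above can be sketched as follows. This Python fragment is only an illustration under stated assumptions: each text token is represented by a hypothetical tuple (text, x, width, font_size) on a single line sorted by x, and the gap threshold (a fraction of the font size) is a stand-in chosen for the sketch, not the component's SpaceFactor logic.

```python
def join_tokens(tokens, gap_factor=0.2):
    """Join text tokens of one line, inserting a blank only where a gap exists.

    tokens: list of (text, x, width, font_size), sorted by x.
    gap_factor: minimum gap, as a fraction of the font size, treated as a blank.
    """
    parts = []
    previous_end = None
    for text, x, width, font_size in tokens:
        # Compare the end of the previous token with the start of this one.
        if previous_end is not None and x - previous_end > gap_factor * font_size:
            parts.append(" ")
        parts.append(text)
        previous_end = x + width
    return "".join(parts)


line = [
    ("Hel", 72.0, 15.0, 10.0),    # first part of a word split across two tokens
    ("lo", 87.2, 8.0, 10.0),      # starts 0.2 units after the previous token ends
    ("world", 100.0, 25.0, 10.0), # starts 4.8 units later: treated as a new word
]
joined = join_tokens(line)
```

Tokens that almost touch are merged into one word, while a larger horizontal gap produces a blank.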
84. ExpandLigatures
Property: Boolean ExpandLigatures
Accessors: Get, Set
Default: False
When ExpandLigatures is set to true, ligatures such as fi, ff, fl, etc. found during text extraction are converted to individual characters.
Flags
Property: Long Flags
Accessors: Get
Return 1 while content is parsed and the annotation flags when annotations are parsed (see also Property Flags in the Annotation interface).
GetNextImage
Method: PDFImage GetNextImage()
This method reads the content stream objects until an image object can be returned or the end of the content stream is reached. If an image object could be found, an interface to the image object (see Image Interface) is returned. Its interface can also be retrieved through the content's Image property. The graphics state can be retrieved through the content's GraphicsState property.
• Return value: An interface to the next image object on the current page if there is any, Nothing otherwise.
GetNextObject
Method: TPDFContentObject GetNextObject()
This method reads the content stream objects until a text, image or path object can be returned or the end of the content stream is reached.
• Return values:
  eNone: The end of the content stream has been reached and the content's Path property doesn't return a valid value.
  eText: A text object could be composed and its interface can be retrieved through the content's T