Blog : Metadata Extractor for PDF Forms broken


I think that PDFBox has a bug that prevents reading PDF Forms to populate metadata.
https://issues.apache.org/jira/browse/PDFBOX-1100

As a result I need to develop a way to read the values from the fields in a PDF Form.
It appears that Acrobat is capable of running JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), which got me thinking that perhaps Alfresco Webscripts could read the PDF Forms.

Has anyone taken this approach? Is it feasible?
Is there another way to read the data from PDF Forms to populate the metadata in Alfresco?

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken

devodl wrote:
It appears that Acrobat is capable of running JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), which got me thinking that perhaps Alfresco Webscripts could read the PDF Forms.


Answering his own question Steve states:
PDF Forms can execute JavaScript to copy the form field value to a custom document property. Custom document properties can be read by the PDFBox metadata extractor.

Here is what I prototyped to copy PDF Form field data to a custom property:
1 - Open the PDF Form document and create a custom property: File => Properties => Custom tab, then give it a name and an empty (null) value
2 - Edit the PDF Form, select a field and open its Properties then select the Actions tab
3 - Add Action (trigger: Mouse Up, action: Run a JavaScript)
4 - Edit the action "Run a JavaScript" and add the code to copy the data


function writeToProperty() {
 var fld = this.getField("dswf_clientName");
 this.info.kcms_clientName = fld.value;
}
writeToProperty(); // call my function
This is by no means a complete list of the steps required to extract PDF Form data into Alfresco metadata, but it should provide some direction for other developers.

FWIW,
Steve

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Open the PDF Form using Acrobat Pro X

Create the Custom Properties
File => Properties, Custom Tab: add name/value pairs

Add JavaScript to PDF Forms
Change to Forms Edit (Edit=>Tools=>Form then Edit)
Select a field, right-click, Properties
Select the Actions tab, trigger: Mouse Up, action: Run a JavaScript
Edit the JavaScript and enter:


function writeToProperty() {
 var fld = this.getField("sourceFieldName");
 this.info.targetPropertyName = fld.value;
}
writeToProperty(); // call my function
Save PDF Form using: Save As => Reader Extended PDF => Enable Additional Features...

Caveat: While this will copy data from fields to properties when using Acrobat Pro X it does not appear to work when using Acrobat Reader :(
reference: http://forums.adobe.com/thread/859315

We have not found a solution at this time.

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
After much learning I believe I understand the problem and have a version 1.0 solution

Problem
PDFBox 1.6.0 (Alfresco 4.x) is not parsing all the objects in the PDF Form. Specifically it is not parsing the form fields that have been filled out using Acrobat Reader. As a result the PDF Form fields (filled out using Acrobat Reader) cannot be extracted by Alfresco using Tika and PDFBox.

Analysis
The parser in PDFBox 1.7.0 is being improved to handle stream objects in a more complete manner
https://issues.apache.org/jira/browse/PDFBOX-1199
This new code is contained in Rev 1333582 ==> NonSequentialPDFParser.java
http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/
Testing shows that with the new PDDocument.loadNonSeq(File, RandomAccess) method, form field values created by Acrobat Reader are now readable. :D

Tika 1.1 currently calls org.apache.pdfbox.pdmodel.PDDocument.load() which correctly parses the metadata of the document but fails to parse the PDF Form fields. Furthermore the new org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq() method will correctly parse the PDF form fields but is not able to parse the metadata. Evidently you can't have it both ways. :o

Solution
- Checkout, build and deploy PDFBox-1.7.0-SNAPSHOT.jar containing the loadNonSeq() code
- Modify and deploy the Tika 1.1 package to use the new PDFBox code
The InputStream needs to be processed twice, once for metadata and once for form field data so a temp file is used instead.
The org.apache.tika.parser.pdf.PDFParser class was edited as follows:


  try {
      // New - Use a temp file so it can be parsed twice
      tstream = TikaInputStream.get(stream, tmp);
      tsFile = tstream.getFile();

      // PDFBox can process entirely in memory, or can use a temp file
      //  for unpacked / processed resources
      // Decide which to do based on if we're reading from a file or not already
      if (tstream != null && tstream.hasFile()) {
          // File based, take that as a cue to use a temporary file
          scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
          pdfDocument = PDDocument.load(tsFile, scratchFile);
      } else {
          // Go for the normal, stream based in-memory parsing
          pdfDocument = PDDocument.load(tsFile);
      }
      ...snip code to cope with encrypted files...
      metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
      extractMetadata(pdfDocument, metadata);

      // New - Now parse again but non-sequentially to retrieve any form field data
      pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);
      extractFormFieldData(pdfFormDoc, metadata);

      PDF2XHTML.process(pdfDocument, handler, metadata,
              extractAnnotationText, enableAutoSpace,
              suppressDuplicateOverlappingText, sortByPosition);
In addition to changing the parse() method above, a new method was added to process the AcroForm fields as follows:

  private void extractFormFieldData(PDDocument document, Metadata metadata)
      throws TikaException, IOException {
    PDDocumentCatalog docCatalog = document.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    if (acroForm != null) {
      List fldList = acroForm.getFields();
      Iterator fIter = fldList.iterator();
      while (fIter.hasNext()) {
        PDField field = (PDField) fIter.next();

        addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
        if (logger.isDebugEnabled()) {
          String logMsg = "extracting: " + field.getFullyQualifiedName();
          logMsg += "  value: " + field.getValue();
          logger.debug(logMsg);
        }
      }
    }
  }
I'm sure that there are better ways of doing this but I chose to use a temp file just to get it working.
Perhaps the Tika and PDFBox developers will consider this problem as the two projects evolve.
I hope that this posting helps others.
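
The temp-file idea can be tried in isolation with plain JDK classes. The following is a minimal sketch, not the Tika code: TwoPassSpool, spoolToTempFile and countBytes are hypothetical names, and counting bytes merely stands in for the two parsing passes (PDDocument.load() and PDDocument.loadNonSeq()).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika or PDFBox.
class TwoPassSpool {

    /** Copies the whole stream into a new temp file and returns its path. */
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("pdf-two-pass-", ".tmp");
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for one parsing pass: reads the file from the start and counts its bytes. */
    static long countBytes(Path file) {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // A stream like this can only be consumed once; the temp file can be reopened.
        InputStream once = new ByteArrayInputStream(
                "%PDF-1.4 dummy".getBytes(StandardCharsets.US_ASCII));
        Path tmp = spoolToTempFile(once);
        try {
            System.out.println("first pass:  " + countBytes(tmp));
            System.out.println("second pass: " + countBytes(tmp));
        } finally {
            tmp.toFile().delete();
        }
    }
}
```

Both passes see the full content because each one opens the file afresh, which is the property the modified parse() method relies on.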

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Hi Steve,

I'm really interested in applying this fix as we use Adobe forms heavily and would like to do more with them in Alfresco. I've done a little with JavaScript and PHP but not really Java all that much. Are you able to give a bit more detail on how to follow the steps you have there? I know how to use svn to checkout from the URL you gave, but I am not sure how to build and deploy the jar file, or where I would put it once I have. I am also not sure how to edit the PDFParser class.

Thanks very much in advance for your help, if you have the time and inclination to give it.
Re: Metadata Extractor for PDF Forms broken
Chris,
It's been a couple of months since I worked on this so my memory isn't too fresh.
But at a very high level here's how I built the jar files:
Using Eclipse Indigo (3.7) with the Subversion and Maven plugins:
- Checkout the Tika 1.1 project from: http://svn.apache.org/repos/asf/tika/tags/1.1
- Checkout the PDFBox 1.7.0 SNAPSHOT (or higher): http://svn.apache.org/repos/asf/pdfbox/tags/1.7.1

Resolve dependencies (Tika is dependent on PDFBox)
This is where I learned how to use Maven with Eclipse
- Use Maven to build PDFBox with the Maven goals "clean install" (Hint: Eclipse Run Configurations...)
- Modify the Tika code org.apache.tika.parser.pdf.PDFParser
- Using the Eclipse editor, modify the class as described earlier in this thread
- change the parse() method


  public void parse(
          InputStream stream, ContentHandler handler,
          Metadata metadata, ParseContext context)
          throws IOException, SAXException, TikaException {

      PDDocument pdfDocument = null;
      PDDocument pdfFormDoc = null;
      TikaInputStream tstream = null;
      File tsFile = null;
      TemporaryResources tmp = new TemporaryResources();
      RandomAccess scratchFile = null;

      try {
          // SMD - Use a temp file so it can be parsed twice
          tstream = TikaInputStream.get(stream, tmp);
          tsFile = tstream.getFile();

          // PDFBox can process entirely in memory, or can use a temp file
          //  for unpacked / processed resources
          // Decide which to do based on if we're reading from a file or not already
          if (tstream != null && tstream.hasFile()) {
              // File based, take that as a cue to use a temporary file
              scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
//            pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
              pdfDocument = PDDocument.load(tsFile, scratchFile);
          } else {
              // Go for the normal, stream based in-memory parsing
//            pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
              pdfDocument = PDDocument.load(tsFile);
          }

          if (pdfDocument.isEncrypted()) {
              String password = null;

              // Did they supply a new style Password Provider?
              PasswordProvider passwordProvider = context.get(PasswordProvider.class);
              if (passwordProvider != null) {
                  password = passwordProvider.getPassword(metadata);
              }

              // Fall back on the old style metadata if set
              if (password == null && metadata.get(PASSWORD) != null) {
                  password = metadata.get(PASSWORD);
              }

              // If no password is given, use an empty string as the default
              if (password == null) {
                  password = "";
              }

              try {
                  pdfDocument.decrypt(password);
              } catch (Exception e) {
                  // Ignore
              }
          }
          metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
          extractMetadata(pdfDocument, metadata);

          // SMD - Now parse non-sequentially to retrieve any form field data
          pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);
          extractFormFieldData(pdfFormDoc, metadata);

          PDF2XHTML.process(pdfDocument, handler, metadata,
                  extractAnnotationText, enableAutoSpace,
                  suppressDuplicateOverlappingText, sortByPosition);

      } finally {
          // Close each document on its own: pdfFormDoc is still null
          // if an exception was thrown before the second parse.
          if (pdfDocument != null) {
              pdfDocument.close();
          }
          if (pdfFormDoc != null) {
              pdfFormDoc.close();
          }
          tmp.dispose();
      }
  }
- add the extractFormFieldData() method


  /**
   * Steve Deal - Added to parse PDF Form fields
   *
   * @param document
   * @param metadata
   * @throws TikaException
   */
  private void extractFormFieldData(PDDocument document, Metadata metadata)
      throws TikaException, IOException {
    PDDocumentCatalog docCatalog = document.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    if (acroForm != null) {
      List fldList = acroForm.getFields();
      Iterator fIter = fldList.iterator();
      while (fIter.hasNext()) {
        PDField field = (PDField) fIter.next();

        addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
        if (logger.isDebugEnabled()) {
          String logMsg = "extracting: " + field.getFullyQualifiedName();
          logMsg += "  value: " + field.getValue();
          logger.debug(logMsg);
        }
      }
    }
  }
- Use Maven to build Tika with the Maven goals "clean install" (This was new to me, since I've been using Ant).

If you're new to these tools and the language it will require learning but that's the fun of it :)

I hope this helps.

Steve

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Well, that sure was an adventure. I've pretty much managed to follow your steps (woohoo!) but I just need a leeedle bit more help to get over the line. Hope it doesn't put you out at all, and thanks very much for the help thus far.

In case someone else sees this post looking for the same thing, I'll just go over my steps here:

- For starters, I tried to checkout the original projects using File>New>Project...>Maven>Checkout Maven Projects from SCM, but had some issues: none of my SVN connectors were showing up in the SCM type field. I spent almost an hour trying to troubleshoot this issue (which appears to be prevalent among Eclipse Indigo users) before I gave up, uninstalled Indigo and installed Eclipse Juno (which looks a little more fancy anyway :p).
After installing Juno, m2e and Subversive, as well as the Maven SCM Handler for Subversive, I was able to use File>New>Project...>Maven>Checkout Maven Projects from SCM normally to check out the two URLs you posted.

-I had some trouble resolving dependencies due to two issues. The first was a basic misunderstanding on my part of the difference between a dependency and a folder on the classpath. Once I figured out how to work with the POM.xml files this was sorted. The second was because eclipse took what I expected to be 2 projects and turned them into 14. For example, I had pdfbox-ant, pdfbox-app etc etc, as well as pdfbox and pdfbox-parent. The same was true in the tika files.

- Spent most of my day trying to figure out why I couldn't compile Tika. I compiled PDFBox with no issues and made the changes to the PDF parser class, however I ran into a bunch of "error: cannot find symbol" messages when trying to compile. Eventually I had to learn what an import statement was and add a few of these (still not sure why I had to import java.io.File, it seems like something that would have been in the file already if it was needed, but I suppose that's why I'm not a Java dev).

So now I have managed to build both of these with the changes. I have 5 jar files:


•pdfbox-1.7.1.jar
•tika-app-1.1.jar
•tika-bundle-1.1.jar
•tika-core-1.1.jar
•tika-parsers-1.1.jar

If I look on the Alfresco VM, in /opt/alfresco-4.0.d/tomcat/webapps/alfresco/WEB-INF/lib/ I have the pdfbox jar, as well as tika-core and tika-parsers. (no pdfbox or tika jars in the share lib). So I assume my next move from here is to either move the pdfbox jar, the tika-core jar and the tika-parsers jar into /opt/alfresco-4.0.d/tomcat/webapps/alfresco/WEB-INF/lib/ or /opt/alfresco-4.0.d/tomcat/shared/lib/. So that's my first question - will the shared lib work for this case?

Secondly, I know that the filenames are not exactly the same - the version numbers are different. So my second question is - should I delete/move the originals, or should I change the names of the jars I have built so that they will override/overwrite?
Re: Metadata Extractor for PDF Forms broken
Chris,
Excellent work! I can relate to the journey you took.

Deployment
You only need to deploy the modified tika-parsers and pdfbox jar files to supersede the original files. The actual names of the jar files are inconsequential; it is the specific packages and class names as well as method signatures that are critical.
package: org.apache.tika.parser.pdf
class: PDFParser
method: public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException
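
Since class names matter rather than jar names, one way to check which jar actually supplied a class is to ask the JVM for the class's code source. This is only a sketch using standard JDK calls; WhereLoaded and locationOf are hypothetical names, and the check is meaningful only when it runs with the same classloader as the web application (for example from code deployed inside it), not from an unrelated command line.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Alfresco, Tika or PDFBox.
class WhereLoaded {

    /** Returns the jar or directory a class was loaded from, or a note if unknown. */
    static String locationOf(String className) {
        try {
            // 'false' avoids running static initializers just to look the class up.
            Class<?> c = Class.forName(className, false, WhereLoaded.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself typically report no code source.
            return src == null
                    ? "(no code source, likely a JDK class)"
                    : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on the classpath)";
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.pdf.PDFParser";
        System.out.println(name + " -> " + locationOf(name));
    }
}
```

If the printed location still points at the original jar, the modified one is not the one being picked up.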

I see you read my other posting: https://forums.alfresco.com/en/viewtopic.php?f=9&t=44804 back in May where I stated:


Quote:
The only solution we have found is to rename the OOTB jar files and drop the modified jar files into the tomcat/webapps/alfresco/WEB-INF/lib.

I simply rename the OOTB jar files (e.g. tika-parsers-1.2-20120504.jar ==> tika-parsers-1.2-20120504.jar.original), copy my files to that same directory, and the class loader only loads files with the .jar extension.
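
The rename step can also be scripted. The sketch below uses only java.nio and runs against a throwaway temp directory; JarSwap, disable, activeJars and demoDir are hypothetical names, and the two jar file names are placeholders rather than the exact files in any given Alfresco install.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only.
class JarSwap {

    /** Renames foo.jar to foo.jar.original in the same directory and returns the new path. */
    static Path disable(Path jar) {
        try {
            return Files.move(jar, jar.resolveSibling(jar.getFileName() + ".original"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists, in sorted order, the file names in dir that still end in .jar. */
    static List<String> activeJars(Path dir) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path p : ds) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    /** Creates a temp directory holding two empty placeholder jar files. */
    static Path demoDir() {
        try {
            Path dir = Files.createTempDirectory("lib-demo-");
            Files.createFile(dir.resolve("pdfbox-1.6.0.jar"));
            Files.createFile(dir.resolve("tika-parsers-1.1.jar"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path lib = demoDir();
        System.out.println("before: " + activeJars(lib));
        disable(lib.resolve("tika-parsers-1.1.jar"));
        System.out.println("after:  " + activeJars(lib));
    }
}
```

Keeping the original under a different extension makes it easy to roll back by renaming it again.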

I hope that makes it clear.

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Whew, OK, so I renamed the two original jars to .orig, moved the two new jars into the lib, and restarted Alfresco. The server starts up fine, so I am left with just a few questions:

•When the server starts up I see in the logs the following:
INFO: Adding 'file:/opt/alfresco-4.0.d/alf_data/solr/lib/tika-parsers-1.1-20111128.jar' to classloader
07/08/2012 8:18:22 AM org.apache.solr.core.SolrResourceLoader replaceClassLoader
Which, I know, is in the solr lib not the alfresco lib. My question is - does this indicate a problem? Should SOLR be using the new jars too?

•Moving from here to extracting data from pdf forms: Do I need to define a new metadata extractor or extend the existing PDF one now?
For instance, say I have a test pdf form, created in Livecycle, with a field in it named "testData". If I also had defined in our content model an aspect: "my:testAspect" with a property "my:testData", would extracting common metadata on the pdf cause it to gain the my:testAspect aspect with the my:testData property set to whatever was entered in the form field? (without further modification)

Or would I need to first override the bean loading org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter and add a custom mapping for testData=my:testData?


Edit:

I've spent some time trying to get this to work, however I am still having trouble. My steps so far:
In /opt/alfresco-4.0.d/tomcat/shared/classes/alfresco/extension I added 2 files, custom-metadata-extractors-context.xml


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
  <bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter">
    <property name="inheritDefaultMapping">
      <value>true</value>
    </property>
    <property name="mappingProperties">
      <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
        <property name="location">
          <value>classpath:alfresco/extension/custom-pdfbox-extractor-mappings.properties</value>
        </property>
      </bean>
    </property>
  </bean>
</beans>
and custom-pdfbox-extractor-mappings.properties

# Namespace Definitions
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.my=my.companyName.root
 
#Mapping Definitions
testData=my:testData
I already have a custom model deployed, so I added to it:


  <aspect name="my:testAspect">
    <title>Test Aspect</title>
    <properties>
      <property name="my:testData">
        <type>d:text</type>
      </property>
    </properties>
  </aspect>
I have created a few forms in Livecycle, each with a single text field named testData . The first was a dynamic XML form and the second a static pdf. With each of these I tried the following:


•Filling the form in using Acrobat X
•Extending the form using Acrobat X, then filling in with Reader X
•Distributing the form, opening in Reader X, submitting
Once I had uploaded them to Alfresco, I tried extracting common metadata with and without adding the testData aspect first but got no joy.
Is there something extra you had done to get this to work? I saw the javascript solution you posted in the other thread, and I am hoping this isn't it, as these forms will be filled in using Reader almost universally.

Thanks again btw for the help you've provided so far, which has been invaluable
Re: Metadata Extractor for PDF Forms broken
The log message is an INFO message so that's okay.
I'm no expert but I suspect that SOLR uses Tika to extract metadata during the index process. So yes, it should use the new jars as well.

No override of the classes is required. The new jars enable Tika and PDFBox to extract form field values using the standard approach.

As you know each PDF Form field must have a name defined, then the extractor maps the form field name to a custom metadata field for that content type.
Custom Content Types: http://docs.alfresco.com/4.0/topic/com.alfresco.enterprise.doc/tasks/kb-define-custom-model.html
Metadata Extraction: http://docs.alfresco.com/4.0/topic/com.alfresco.enterprise.doc/tasks/metadata-config.html


Quote:
For instance, say I have a test pdf form, created in Livecycle, with a field in it named "testData". If I also had defined in our content model an aspect: "my:testAspect" with a property "my:testData", would extracting common metadata on the pdf cause it to gain the my:testAspect aspect with the my:testData property set to whatever was entered in the form field? (without further modification)

My prototype was developed using a content type with specific properties. I am pretty sure that the aspect will be added and the field mapped to the aspect property as you suggest.

Here's an example where I defined a namespace (myns) and used it both as a prefix for the form field names and as the namespace for the custom metadata.
PDF Form Field myns_projectName
Custom Metadata myns:projectName


  <!-- This adds in the extra mapping for the PDFBox extractor -->
  <bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter">
    <property name="inheritDefaultMapping">
      <value>true</value>
    </property>
    <property name="mappingProperties">
      <props>
        <!--  Metadata extraction  -->
        <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
        <prop key="namespace.prefix.myns">http://www.acme.com/model/content/1.0</prop>
        <!--  My Namespace Project Model -->
        <prop key="myns_projectName">myns:projectName</prop>
        <prop key="myns_organizationName">myns:organizationName</prop>
        <prop key="myns_organizationAddress">myns:organizationAddress</prop>
      </props>
    </property>
  </bean>
 
Just to emphasize, I developed this only so far as to prove the concept. I wasn't afforded the time to test it rigorously and it has not been put into production. My hope is that Tika gets updated to support PDF form field metadata extraction and Alfresco is updated to use that with PDFBox 1.7.x so that this level of customization is not necessary.

I hope this answers your questions.
I'm on the road visiting colleges with my son the next couple of days so it'll be the end of the week before I can follow up on this thread.

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Hi Steve,

So does that work for you with properties entered only in the form fields (which is to say, does it work without the javascript from your other post setting custom properties from the form fields)?

The only way I am able to get the test data out of the form (config as per my previous post) appears to be by setting a custom property. If I use liveCycle to add a custom property called testData, whatever I put in for the value of that property is used as metadata by Alfresco. So that, at least, works. However I do not get anything from form fields named testData. I tried to implement the javascript from your other post, I'm not sure if we use a different version or what, but as far as I can tell, the this.info object (from the scope of a field) doesn't exist. I tried form.info and xfa.info and a few others but to no avail.

If I read your post correctly, the javascript shouldn't matter with the changes made to the tika parser, it should get the info from form fields directly. If this is the case, I am not sure what I am doing wrong as the only difference I see between your config and mine is that you specified the mapping as part of the context whereas I offloaded mine to a properties file. I see absolutely no reason for that to matter a whit, but that's the next thing I'll be giving a try, just in case. If I have read this wrong and your solution only works with javascript updating the custom properties in line with the form fields, do you know a more absolute path than 'this' to whichever object should have the info property?

I hope your son finds a good college and enjoys his time there!
Re: Metadata Extractor for PDF Forms broken
Chris,


chrisokelly wrote:
So does that work for you with properties entered only in the form fields (which is to say, does it work without the javascript from your other post setting custom properties from the form fields)?

The only way I am able to get the test data out of the form (config as per my previous post) appears to be by setting a custom property. If I use liveCycle to add a custom property called testData, whatever I put in for the value of that property is used as metadata by Alfresco. So that, at least, works. However I do not get anything from form fields named testData. I tried to implement the javascript from your other post, I'm not sure if we use a different version or what, but as far as I can tell, the this.info object (from the scope of a field) doesn't exist. I tried form.info and xfa.info and a few others but to no avail.



Nope, no JavaScript in the PDF Form here.

When I created the PDF Form I used Adobe Acrobat Pro X (not LiveCycle). The process itself was clumsy thanks to the Adobe product. For each form field I set its Name property: http://help.adobe.com/en_US/acrobat/pro/using/WS75136AD2-894B-414e-B296-C590121A789B.w.html
For example if I had a field on the form called Project Name I would set the field name property to be: myns_projectName
Then in the metadata extractor config file I would map it to: myns:projectName
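
To make the mapping convention concrete, here is a small sketch of how a form field name could be resolved through such a mapping: look up the field name, split the target at the colon, then look up the prefix under namespace.prefix.*. This is not Alfresco's implementation; FieldMapping, parse and resolve are hypothetical names, and the {uri}localName output format is just a convenient way to show the expanded name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration of the mapping convention; not Alfresco code.
class FieldMapping {

    static final String PREFIX_KEY = "namespace.prefix.";

    /** Parses mapping text in java.util.Properties format. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    /** Resolves a form field name to "{namespaceUri}localName", or null if it is unmapped. */
    static String resolve(Properties mapping, String fieldName) {
        String target = mapping.getProperty(fieldName);   // e.g. "myns:projectName"
        if (target == null) {
            return null;
        }
        int colon = target.indexOf(':');
        if (colon < 0) {
            return null;                                   // no prefix given
        }
        String uri = mapping.getProperty(PREFIX_KEY + target.substring(0, colon));
        if (uri == null) {
            return null;                                   // prefix was never declared
        }
        return "{" + uri + "}" + target.substring(colon + 1);
    }

    public static void main(String[] args) {
        Properties mapping = parse(
                "namespace.prefix.myns=http://www.acme.com/model/content/1.0\n"
              + "myns_projectName=myns:projectName\n");
        System.out.println(resolve(mapping, "myns_projectName"));
        System.out.println(resolve(mapping, "someOtherField"));
    }
}
```

A field that has no entry in the mapping simply resolves to nothing, which is one reason to print the field names PDFBox actually reports before writing the mapping.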

Probably time to break it down.
I recommend that you create a fillable PDF form with a test field name set to: form_fieldName (to differentiate it from document metadata). Then use eclipse to run PDFBox and have it print out the values set for the field. That's how I diagnosed the problem originally and learned that PDFBox 1.6.0 wasn't parsing the form fields. Once you have a valid PDF Form field and PDFBox parses it correctly you can add more complexity by mixing it into Tika and Alfresco. I jumped directly from getting PDFBox to parse using eclipse to extracting metadata with Alfresco but YMMV.

Good luck.

--
Regards, Steve
Re: Metadata Extractor for PDF Forms broken
Hi,

Sorry if I am hijacking your thread here, but this seems like something that will be relevant to anyone else following these steps.

So I pretty much have it figured out now. The above problem was that I was making the forms with LiveCycle - this process doesn't help get data from XFA forms, only from Acrobat-created forms. I was able to extract metadata fine from Acrobat forms up until I tried one with a signed signature field in it. The signature isn't something I need to get into metadata, however when running the "extract common metadata" action I would get no metadata. In the UI I saw no response, but in the logs I saw:


WARN  [content.metadata.AbstractMappingMetadataExtracter] [http-8443-6] Metadata extraction failed (turn on DEBUG for full error):
  Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@ae747d3
  Content:  ContentAccessor[ contentUrl=store://2012/8/16/15/6/51db06bc-26e2-4871-a613-5e25e823ffac.bin, mimetype=application/pdf, size=50453, encoding=UTF-8, locale=en_US]
  Failure:  Can't get signature as String, use getSignature() instead.null
This seems to be related to a deprecated function in PDFBox. I am vaguely aware that there would be some way to override Tika to use the correct method; this is probably a far more elegant solution. My Java skills are scant however, and I know that we do not need the signature data brought into Alfresco metadata, so I just used a kludgy workaround. In org.apache.tika.parser.pdf.PDFParser.java, around line 325, I made the following change:


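       while (fIter.hasNext()) {
         PDField field = (PDField) fIter.next();
         String checkFieldType = field.getFieldType();
         // Skip signature fields: getValue() throws for them.
         // Use equals(); != on Strings compares references, not contents.
         if (!"Sig".equals(checkFieldType)) {
           addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
         }
       }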
I realize this could have been accomplished on a single line (if (!"Sig".equals(field.getFieldType())) ), I just did it this way because it was easier to debug with breakpoints.

Just to make this abundantly clear - with this change, no metadata will be extracted from signature fields. Ever. At all. All it does is prevent the parser from falling over when it hits a signature field.
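
A note on the field type check: in Java, == and != on String values compare object references, so such a check may happen to work when both sides are the same interned string but is not guaranteed to. Comparing with equals() is the safer form. The sketch below shows the difference with plain JDK code; FieldTypeCheck and isSignature are hypothetical names.

```java
// Hypothetical helper for illustration; not part of Tika or PDFBox.
class FieldTypeCheck {

    /** True when the field type string is the signature type "Sig". */
    static boolean isSignature(String fieldType) {
        // Constant-first equals() compares characters and is safe for null input.
        return "Sig".equals(fieldType);
    }

    public static void main(String[] args) {
        // Built at runtime, so it is a different object from the "Sig" literal.
        String parsed = new String(new char[] {'S', 'i', 'g'});
        System.out.println(parsed != "Sig");      // prints true: different objects
        System.out.println(isSignature(parsed));  // prints true: same characters
    }
}
```

With a reference comparison, the first line shows that a "Sig" value would not be recognised as a signature field, which is exactly the case the workaround is meant to catch.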