PDFA Embedded
Proposal under discussion:
Proposal for U.S. to submit a comment on Part 2 responding to the Editor's Note in section 6.8 which asks NBs to comment on whether they are OK with the addition which allows embedding of both PDF/A and XML files.
Relevant text from Section 6.8
A file specification dictionary, as defined in ISO 32000-1:2008, 7.11.3, may contain the EF key, provided that the embedded file is compliant with either part 1 or part 2 of ISO 19005 or is compliant with the Extensible Markup Language (XML) 1.0 specification.
A file’s name dictionary, as defined in ISO 32000-1:2008, 7.7.4, may contain the EmbeddedFiles key, provided that all of the embedded files are compliant with either part 1 or part 2 of ISO 19005 or are compliant with the Extensible Markup Language (XML) 1.0 specification.
NOTE The prohibition of non-PDF/A and non-XML compliant documents has the implicit effect of disallowing embedded files that can create external dependencies and complicate preservation efforts.
If the PDF contains any embedded files which are compliant with the Extensible Markup Language (XML) 1.0 specification, the conforming reader should not read their contents and instead shall treat them as a stream of data whose contents and format are private.
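As a rough illustration of the rule quoted above, the following Python sketch shows how a pre-flight check might decide whether a file may be embedded. It is only a sketch under stated assumptions: the function names are hypothetical, PDF/A conformance is taken as a flag supplied by some external validator, and "compliant with XML 1.0" is approximated by well-formedness, which is weaker than validity against any particular schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if `data` parses as well-formed XML (a proxy for XML 1.0 compliance)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def embedding_allowed(data: bytes, is_pdfa_1_or_2: bool = False) -> bool:
    """Apply the draft 6.8 rule: allow PDF/A-1, PDF/A-2, or XML 1.0 content only.

    `is_pdfa_1_or_2` is a hypothetical flag; this sketch does not validate PDF/A itself.
    """
    return is_pdfa_1_or_2 or is_well_formed_xml(data)
```

Under this reading, a word processor file or a spreadsheet would be rejected unless it was first converted to PDF/A.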
Action:
The U.S. TAG will use the PDF/A wiki to collaborate on the pros and cons of the above proposal. This discussion will contribute to a possible U.S. comment on Part 2 regarding embedding files. All discussion must be completed on the wiki by August 10 for the PDF/A TAG meeting on August 12. Post comments to the page or edit the page. If you edit the page, please select a color so we can see the comments more easily.
Background - Original German Comment on Part 2 CD
I would like to revisit this policy. I have talked to users from a number of organizations who always also archive original files (even if they are aware they might not be able to read them in any useful way in the far future). In some cases this happens for legal reasons (do not destroy evidence). In some (rare but important) cases it even makes sense to virtually reanimate some rusty mechanism to be able to read an old/proprietary/outdated format. The reasoning behind all this is simple: most probably nobody is ever going to need the original source file, but in case someone needs it anyway, it's better to have it than not.
Now, if such source files cannot be embedded in PDF/A they have to be maintained elsewhere, which opens up a real can of worms compared to a pure PDF/A-only world where the file and its metadata can be stored in a standalone fashion...
RECOMMENDATION (of Germany):
- revisit policy that prohibits embedding of non-PDF/A files
- introduce mechanism that specifically identifies embedded files as 'source files' - whether for the whole document or portions thereof (whatever applies has to be indicated appropriately)
- it is outside the scope of PDF/A to know what to do with such embedded files/data; for the reading of the PDF/A files as such, the embedded non-PDF/A files shall be completely ignored, except for the requirement that their presence must be indicated (including some name for the embedded file/data)
A more specific proposal on how to achieve this is expected to be presented at the next meeting.
Withdrawn in Hamburg
POSSIBLE U.S. Responses
Accept the current text in 6.8
Committee Member/Project Leader (Leonard Rosenthol):
Allow the embedding of XML files, which are not to be viewed by a conforming reader.
Details
When drafting the language for the spec, I specifically chose language that I believe is 100% in keeping with the language and scope of PDF/A. The paragraph in question reads:
If the PDF contains any embedded files which are compliant with the Extensible Markup Language (XML) 1.0 specification, the conforming reader should not read their contents and instead shall treat them as a stream of data whose contents and format are private.
This makes it quite clear that a conforming reader is not to consider this something to be rendered/displayed – but instead is to simply treat it as “private data” (that just happens to be at a known location in the PDF). Thus the presence of the XML will not, in any way, impact the “preservation of the static representation of page based electronic documents over time”.
It will, however, improve the ability of a PDF/A document to comply with the secondary and tertiary purposes listed in the introduction:
A secondary purpose of ISO 19005 is to define a framework for representing the logical structure and other semantic information of electronic documents within conforming files.
Another purpose of ISO 19005 is to provide a framework for recording the context and history of electronic documents in metadata within conforming files.
As the XML may contain the original structured content from which the presentation was derived, its presence in the PDF clearly would enable richer semantic gleaning of the contents. In addition, richer metadata or private metadata could be present in that format as well.
Reject the proposal
Committee Member: Susan Sullivan (NARA)
NARA is not OK with this addition. In Hamburg, the 182DE comment was withdrawn after substantial discussion in which the JWG determined that embedding XML or any non-PDF/A file format was outside the scope of Part 2. We determined that PDF/A as a "container standard" would be appropriate for future versions, and that Part 2 should be an upgrade of Part 1, as advertised.
As I mentioned in June, there's a root cause issue here. Due to the extended development time we're getting "new" user requirements all the time. Some of these requirements sound good, so instead of putting them in a "parking lot", we are reactively addressing them ad hoc, depending on who hears the requirements. Unfortunately, while well intended, this is a slippery slope and can compromise the standard's credibility. Our committee needs to be proactive, disciplined, and in total agreement that we can defend any change in scope. That said, I don't agree that allowing embedded PDF/As has already changed the scope from static visual appearance. We can defend embedding PDF/As in PDF/As from an archival and static visual appearance perspective.
Allowing "embedded files" that may rely on external software for viewing defeats the "self-contained" nature of PDF/A and can negatively impact its sustainability.
Committee Member: Butch Lazorchak, Library of Congress
The Library does not support this addition for version 2. The committee should concentrate on the original purpose of this version of the specification, which was to bring the specification into alignment with ISO 32000. The committee should put off any decisions about embedded files until the consideration of version 3.
The draft of 19005-2, as currently stated, privileges the embedding of XML data in addition to 19005-1 or -2 conforming objects in conforming 19005-2 documents. XML data, while arguably one of the more “open” data formats, remains merely one example of a multiplicity of arbitrary data formats that could conceivably be embedded in a compliant PDF/A document.
The question of embedding hinges on whether or not the embedded file has any effect on the rendering of the “static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.”
If there is the possibility for embedded files to affect rendering, then 19005-2 should explicitly prohibit the embedding of any arbitrary data beyond compliant 19005-1 or -2 objects. On the other hand, if it is not possible, either due to inherent features of PDF or by a normative requirement of 19005-2, for the embedded file to affect rendering, then the embedding of arbitrary files, not just XML files, is permissible on technical grounds.
Note that without implementing additional safeguards to ensure the long-term interpretability of documents and their embedded materials, it still may be a questionable practice to allow arbitrary embedding. However, that is a matter of local policy, and the standard itself should strive to be policy-neutral.
The committee should carefully consider the implications of embedding non-19005 compliant data into a 19005-2 document. Unless it is clear that this can be done without compromising the fundamental requirement of PDF/A to ensure the integrity of the static visual appearance of the document, further efforts to privilege formats beyond 19005-compliant data should be shelved and considered more thoroughly for 19005-3.
Real World Business Cases/Needs
The British patent office uses XML to collect patent data. They export the data and share it via PDF/A. They wish to embed the source XML in the PDF/A file as metadata.
Various Brazilian government agencies use an XML-based system for authoring documents and have also standardized on PDF/A as the format for distribution of those documents outside the system. They wish to have the original "source" incorporated into the PDF/A when distributed for future processing in the native system.
The CEN standards committee on eInvoicing has chosen an XML grammar as the standard for data exchange of electronic invoicing information and PDF/A as the standard for human readable presentation of the invoice.
Are there any business cases from our own membership?
Yes. John Iobst has frequently commented that he wishes to use this functionality to incorporate information from his newspaper publishing system into the archival version.
Technical Note
By allowing for the inclusion of XML-based information in the PDF/A in a known location (aka EmbeddedFiles), we make it possible for future readers of the document to have access to both it and the rich presentation that is PDF/A.
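To make the "known location" concrete, here is a minimal Python sketch that renders the PDF objects involved: the Catalog's Names dictionary, the EmbeddedFiles name tree, a file specification dictionary with an EF key, and the embedded file stream. The object numbers, the function name, and the file name are illustrative assumptions, other Catalog entries are omitted, and the output is a fragment rather than a complete PDF/A file.

```python
def embedded_xml_objects(name: str, xml_bytes: bytes) -> str:
    """Render illustrative PDF objects that attach `xml_bytes` under EmbeddedFiles.

    Assumes `name` contains no characters that need escaping in a PDF
    literal string and that the XML is UTF-8 encoded.
    """
    payload = xml_bytes.decode("utf-8")
    return "\n".join([
        # Catalog (other entries such as /Pages omitted for brevity)
        "1 0 obj",
        "<< /Type /Catalog",
        "   /Names << /EmbeddedFiles << /Names [(%s) 2 0 R] >> >>" % name,
        ">>",
        "endobj",
        # File specification dictionary carrying the EF key
        "2 0 obj",
        "<< /Type /Filespec /F (%s) /EF << /F 3 0 R >> >>" % name,
        "endobj",
        # Embedded file stream holding the XML itself
        "3 0 obj",
        "<< /Type /EmbeddedFile /Subtype /text#2Fxml /Length %d >>" % len(xml_bytes),
        "stream",
        payload,
        "endstream",
        "endobj",
    ])


print(embedded_xml_objects("invoice.xml", b"<invoice/>"))
```

Because every producer would use the same three-object pattern, a future tool can locate the XML without knowing anything about the system that created the file.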
Applicable Text from PDF/A-2
How do the above business needs apply to the purpose and scope below?
Introduction:
The primary purpose of ISO 19005 is to define a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.
A secondary purpose of ISO 19005 is to define a framework for representing the logical structure and other semantic information of electronic documents within conforming files.
Another purpose of ISO 19005 is to provide a framework for recording the context and history of electronic documents in metadata within conforming files.
These goals are accomplished by identifying the set of PDF components that may be used, and restrictions on the form of their use, within conforming PDF/A files.
Scope:
This part of ISO 19005 specifies how to use the Portable Document Format (PDF) 1.7, as formalized in ISO 32000-1, for preserving the static visual representation of page based electronic documents over time.
Questions Raised
Please define "Private" in an archival file where all file content is discoverable and available to researchers.
A PDF document is made up of various types of objects (sometimes referred to as "Cos" objects). The most common type of these is a dictionary - a collection type that uses key/value pairs. ISO 32000-1 documents all of the known uses of the various objects including all the standard dictionary uses and the keys & values that are included. For example, the root object of a PDF is a dictionary called the Catalog. It might look like this:
1 0 obj
<< /Type /Catalog
/Metadata 2 0 R
/Pages 3 0 R
>>
endobj
But there is nothing in either ISO 32000-1 or PDF/A (parts 1 or 2) that prohibits the inclusion of a custom key with whatever value the author wants. In fact, we specifically chose to ALLOW this during discussions as part of PDF/A-1 to enable various workflows. For example, I might add /MyXML and then point to a stream of XML data.
Since a researcher isn't going to be "spelunking" around inside the guts of a PDF - but instead simply looking at the known information of the document through a conforming reader based on the standard - this information is therefore "private" and "hidden". Rather than force users into doing this in incompatible/inconsistent ways, we need to standardize a location where such things can be placed.
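The sense of "private" used here can be sketched in a few lines of Python, with the Catalog modelled as a plain dict. This is an analogy rather than a real PDF parser: the set of known keys is a small illustrative subset, and the /MyXML key and object numbers are the hypothetical ones from the example above.

```python
# Illustrative subset of the Catalog keys that a reader built from the standard knows about.
KNOWN_CATALOG_KEYS = {"Type", "Metadata", "Pages"}


def visible_entries(catalog: dict) -> dict:
    """Return only the entries a standard-based reader would look at."""
    return {key: value for key, value in catalog.items() if key in KNOWN_CATALOG_KEYS}


catalog = {
    "Type": "/Catalog",
    "Metadata": "2 0 R",
    "Pages": "3 0 R",
    "MyXML": "4 0 R",  # custom key pointing at a stream of XML data
}

print(visible_entries(catalog))  # the MyXML entry is not included
```

The data is still in the file, but nothing in the standard tells a reader to go looking for it, which is why a standardized location is preferable to ad hoc keys.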
What is a "closed system"?
A closed system is defined as one where the creator and the consumer of the document are either one and the same or at least have some "out of band" communications to create an environment with features beyond the standard/norm.
Also, please explain what are the pros and cons of embedding XML versus including XML in the metadata?
XML metadata in PDF requires that the XML meet the syntax and language requirements for XMP, which is based on a specialized XML technology called RDF. While this is excellent for metadata, it is very heavyweight and difficult for general XML usage. It also would require that existing XML grammars (which may, themselves, be standards) be changed simply to conform to RDF/XMP.
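The difference can be seen with a small Python sketch that checks whether an XML document has the outer shape XMP expects, namely an rdf:RDF element, optionally wrapped in x:xmpmeta. The function name and the sample documents are assumptions for illustration, and the check looks only at the outermost elements, not at full XMP conformance.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XMP_META_NS = "adobe:ns:meta/"


def looks_like_xmp(xml_text: str) -> bool:
    """Return True if the root is rdf:RDF, or x:xmpmeta containing rdf:RDF."""
    root = ET.fromstring(xml_text)
    if root.tag == "{%s}RDF" % RDF_NS:
        return True
    if root.tag == "{%s}xmpmeta" % XMP_META_NS:
        return root.find("{%s}RDF" % RDF_NS) is not None
    return False


# A hypothetical invoice in its own grammar, and a bare XMP wrapper.
invoice = "<Invoice><Total>10</Total></Invoice>"
xmp = (
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    "</x:xmpmeta>"
)

print(looks_like_xmp(invoice), looks_like_xmp(xmp))
```

An existing grammar such as the invoice above would have to be re-expressed as RDF properties before it could live in the metadata stream, whereas as an embedded file it can be carried unchanged.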
