XML

 

 

 

 

The Extensible Markup Language (XML) is a specification based on a subset of the Standard Generalized Markup Language (SGML).  It defines a strict set of rules for encoding content of all types and languages as character strings in a formal, concise way that is both human-readable and machine-readable, generally self-describing, and reasonably clear.

 

XML Documents

 

XML is typically encoded into what is commonly known as a document, which contains the content itself along with the descriptive markup used to describe it.  Descriptive markup is typically referred to as a tag, and all other content being described is referred to as character content.

 

Tags are markup constructs that always begin with a < and end with a >.  Tags typically come either in pairs of start-tags and end-tags (e.g. <tag></tag>), which may contain other child tags, or as self-closing tags (e.g. <tag/>), which may not contain child tags.  Tag names are case-sensitive, may not contain spaces or any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, and may only start with an underscore or an alphabetic character.

 

Character content is the information stored in the document.  It is stored as (mostly) human-readable character strings, with the exception of certain characters used in defining the markup language itself, which are escaped with escape sequences that begin with an ampersand (&), continue with a name or numeric code, and end with a semicolon (;).  Common escape sequences include the ampersand (&amp;), single quote (&apos;), double quote (&quot;), greater-than symbol (&gt;), less-than symbol (&lt;), and a character set numeric value in either decimal or hexadecimal format (e.g. the letter "A" in Unicode is represented as &#65; in decimal and &#x41; in hexadecimal).  In rare cases you may encounter a CDATA section, which begins with <![CDATA[ and ends with ]]>; all content between the two is treated as raw character content.
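
The escaping rules above can be tried out with a short sketch.  Python and its standard library are used here purely for illustration (they are an assumption of this example and are not part of pasTransfer); the sample strings are made up.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

# Escape the characters that are reserved by the markup itself (&, <, >).
raw = "Quantity < 10 & price > 5"
escaped = escape(raw)
restored = unescape(escaped)

# A parser resolves named and numeric escape sequences and passes the
# content of a CDATA section through unchanged.
element = ET.fromstring("<v>&#65;&#x41;&amp;<![CDATA[ < & > ]]></v>")
parsed_text = element.text

print(escaped)      # Quantity &lt; 10 &amp; price &gt; 5
print(parsed_text)  # AA& < & >
```

Note that the parser hands back the decoded characters, so code that reads a document normally never sees the escape sequences themselves.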

 

An attribute is a form of descriptive markup used to encode name-value pairs directly within the markup of a tag (e.g. <tag attribute="value"/>).  An attribute name may only be specified once within a tag.  The value is wrapped in a pair of either single-quote (') or double-quote (") characters and may not contain the less-than (<) character, or the quote character used to wrap it, unless it is escaped.  Attribute names follow the same rules as tag names.
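
As a minimal sketch of the attribute rules, the following Python snippet (Python is an assumption of this example, not something pasTransfer requires) quotes a value that contains a less-than character and shows that a parser rejects a tag that repeats an attribute name.  The helper name parses is hypothetical.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# quoteattr escapes the value and wraps it in a suitable pair of quotes.
quoted = quoteattr("Quantity < 10")
tag = "<tag attribute=%s/>" % quoted
value = ET.fromstring(tag).get("attribute")


def parses(text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# An attribute name may appear only once within a tag.
duplicate_ok = parses('<tag a="1" a="2"/>')

print(tag)           # <tag attribute="Quantity &lt; 10"/>
print(duplicate_ok)  # False
```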

 

An element is a tag that logically groups related content and may contain child content and/or child elements.  Element names are case-sensitive and can contain any alphanumeric character, but the only punctuation allowed is the hyphen (-), underscore (_), and period (.).

 

A document may contain one (and only one) declaration tag, which communicates information to an XML parser for processing the document.  If present, it must be the first tag in the document and must start on the first line.  The declaration tag has the name xml, is delimited by <? and ?>, and must be entirely in lowercase.  The tag must begin with a version attribute that names the version of the XML standard used to parse the document.  It may optionally also contain an encoding attribute (default is utf-8) that names the encoding standard used to parse the document, and a standalone attribute that indicates whether the document can stand by itself or requires type definitions from an external source.  The following is an example of a typical declaration tag: <?xml version="1.0" encoding="utf-8" standalone="yes"?>
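
The placement rule for the declaration tag can be illustrated with a small Python sketch (Python 3.8 or later is assumed for the xml_declaration argument; the choice of Python and the helper name parses are assumptions of this example).  It writes a document with a declaration and then shows that a parser rejects a declaration that is not at the very start.

```python
import xml.etree.ElementTree as ET

# Serialize an empty root element with a declaration tag in front of it.
root = ET.Element("root")
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
first_line = data.split(b"\n")[0]


def parses(text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The declaration is accepted only as the very first thing in the document.
at_start = parses('<?xml version="1.0"?><root />')
not_at_start = parses(' <?xml version="1.0"?><root />')

print(first_line)              # b"<?xml version='1.0' encoding='utf-8'?>"
print(at_start, not_at_start)  # True False
```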

 

A document must contain one (and only one) root element, which contains all other content elements and markup.  This element must appear after the declaration tag (if present).  There is no defined limit to the number of elements or amount of content that may be contained in the root element.  A root element often contains many attributes used to reference content in external document type definitions or to import namespaces that organize element types from one or more libraries.

 

A comment tag may appear anywhere in a document as long as it is not embedded within other markup and does not appear before the declaration tag (if present).  Comment tags begin with <!-- and end with -->, and can contain any characters except double hyphens, which also means that comments cannot be nested.

 

For more information on XML and to review the full specification visit https://www.w3.org/TR/xml.

 

XML Samples

 

The following sample is the most basic XML document possible featuring only a root element and no actual content.

 

<root />

 

The following sample is a very compact document representing the kind of generic document that is commonly encountered.

 

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

<JournalLines>

  <JournalLine Account="62000" Amount="700.000" Period="2003007" TxDate="2004-06-10T00:00:00" Reference="GRN/002044"/>

  <JournalLine Account="84050" Amount="-700.000" Period="2003007" TxDate="2004-06-10T00:00:00" Reference="GRN/002044" Description="Quantity &lt; 10" />

</JournalLines>

 

The following sample is a compact document demonstrating most of the features of an XML document as described above.

 

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

<!--Document that represents a set of journal lines.-->

<Import xmlns:common="http://pastransfer.com/Common" xmlns:ledger="http://pastransfer.com/Ledger" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">

  <BusinessUnit>ACME</BusinessUnit>

  <Client i:nil="true" />

  <!--The root element contains a single transaction and is using types defined in an external namespace named ledger.-->

  <Transaction>

    <!--The transaction comprises three journal lines with one debit entry and two credit entries.-->

    <ledger:JournalLine Account="ABCDE" Amount="700.000" TxDate="2004-06-10T00:00:00">

      <!--Demonstrates a complex type imported from a second imported namespace named common.-->

      <common:Period>

        <Year>2020</Year>

        <Periodic>10</Periodic>

      </common:Period>

    </ledger:JournalLine>

    <ledger:JournalLine Account="12345" Amount="-350.000" TxDate="2004-06-10T00:00:00" Reference="GRN/002044" Description="Quantity &lt; 10">

       <!--This journal line demonstrates that it can contain content directly within the element as well as child elements.-->

          Sample data &amp; with a demonstration of how to escape an ampersand character.

          <common:Period><Year>2020</Year><Periodic>10</Periodic></common:Period>

    </ledger:JournalLine>

    <ledger:JournalLine Account="67890" Amount="-350.000" TxDate="2004-06-10T00:00:00" Reference="GRN/002044">

       <!--This journal line demonstrates that character content can also be supplied in a CDATA section alongside child elements.-->

       <![CDATA[Character content that is allowed to contain characters that would normally be escaped & may be easier on the eyes.]]>

          <common:Period><Year>2020</Year><Periodic>10</Periodic></common:Period>

    </ledger:JournalLine>

  </Transaction>

</Import>
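
As a rough sketch of how a consumer might read namespaced elements like the ones in the sample above, the following Python snippet parses a trimmed copy of the document.  Python's ElementTree module is used only for illustration; the prefix-to-URI mapping is passed explicitly because the parser identifies elements by namespace URI rather than by prefix.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: one transaction with one journal line.
DOC = """
<Import xmlns:common="http://pastransfer.com/Common" xmlns:ledger="http://pastransfer.com/Ledger">
  <Transaction>
    <ledger:JournalLine Account="ABCDE" Amount="700.000">
      <common:Period><Year>2020</Year><Periodic>10</Periodic></common:Period>
    </ledger:JournalLine>
  </Transaction>
</Import>
"""

# Prefixes used in the search paths below; they map to the same URIs
# that the document declares on its root element.
NS = {
    "common": "http://pastransfer.com/Common",
    "ledger": "http://pastransfer.com/Ledger",
}

root = ET.fromstring(DOC)
lines = root.findall("Transaction/ledger:JournalLine", NS)
line = lines[0]

# Year carries no prefix and there is no default namespace, so it is
# looked up without one.
year = line.find("common:Period/Year", NS).text

print(line.tag)  # {http://pastransfer.com/Ledger}JournalLine
print(year)      # 2020
```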

 

XML Correctness and Validation

 

In order to be parsed, an XML document must be well-formed, meaning that the document contains only properly encoded characters, has a single root element containing all content, uses start-tags and end-tags that match in case and are properly closed without any overlap, has all special characters properly escaped, and uses names that conform to the naming restrictions.
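
One simple way to test whether a document is well-formed is to hand it to a parser and see whether it is accepted.  The sketch below does this in Python (an assumption of this example; the function name is_well_formed is hypothetical) and exercises several of the rules listed above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "<root />": is_well_formed("<root />"),              # minimal document
    "overlap": is_well_formed("<a><b></a></b>"),         # overlapping tags
    "case": is_well_formed("<Root></root>"),             # names differ in case
    "unescaped": is_well_formed("<a>1 < 2</a>"),         # raw less-than sign
    "escaped": is_well_formed("<a>1 &lt; 2</a>"),        # escaped less-than sign
    "two roots": is_well_formed("<a /><b />"),           # more than one root
}

for name, result in checks.items():
    print(name, result)
```

This only checks well-formedness; it says nothing about whether the document is valid against a DTD or XSD.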

 

An XML document is considered valid if it is well-formed and the markup conforms to type declarations in external Document Type Definitions (DTD) or XML Schema Definitions (XSD) that describe data types and rules for how types may be used.

 

XML Data Querying with XPath

 

XPath is an expression language used to navigate and select data from an XML document.  XPath works by essentially converting an XML document into a tree structure consisting of nodes and then provides the ability to navigate those nodes relative to a given node context using XPath expressions.  XPath expressions fall into two categories: selection and predicates. 

 

XPath selection expressions are used to select nodes from the document tree.  Language elements include the forward slash (/), the period (.), and the pipe (|).  A forward slash at the beginning of the expression indicates the tree root; anywhere else in the expression it is used as a level separator.  A double forward slash (//) indicates that the node can be a descendant at any depth.  The period is used in value selection expressions to express the current node, and a double period (..) is used to express the parent node.  The pipe takes two selection expressions and combines their results into a single set.

 

XPath predicate expressions are used to find a specific node or a node with a specific value.  These expressions are embedded in square brackets and may include selection expressions, comparison operators, and functions.  Multiple criteria can be chained together with the keywords "and" and "or".  The full XPath language is documented on various Internet sites and is beyond the scope of this documentation.

 

The examples below will use the following XML sample document:

 

<Activity>

  <Ledger Name="A">

    <Transactions>

      <Transaction Entry="1" TransactionDate="2020-01-01T00:00:00" Reference="20200101A">

        <Period Year="2020" Month="1" />

        <Description>First transaction</Description>

        <Line Account="CASH" Amount="1000.00" LineNumber="1" />

        <Line Account="A1" Amount="-1000.00" LineNumber="2" />

      </Transaction>

      <Transaction Entry="2" TransactionDate="2020-01-01T00:00:00" Reference="20200101B">

        <Period Year="2020" Month="1" />

        <Description>Second transaction</Description>

        <Line Account="CASH" Amount="1000.00" LineNumber="1" />

        <Line Account="A2" Amount="-1000.00" LineNumber="2" />

      </Transaction>

    </Transactions>

  </Ledger>

  <Ledger Name="B">

    <Transactions>

      <Transaction Entry="1" TransactionDate="2020-02-01T00:00:00" Reference="20200201A">

        <Period Year="2020" Month="2" />

        <Description>Third transaction</Description>

        <Line Account="CASH" Amount="750.00" LineNumber="1" />

        <Line Account="A1" Amount="-750.00" LineNumber="2" />

      </Transaction>

      <Transaction Entry="2" TransactionDate="2020-02-01T00:00:00" Reference="20200201B">

        <Period Year="2020" Month="2" />

        <Description>Fourth transaction</Description>

        <Line Account="CASH" Amount="750.00" LineNumber="1" />

        <Line Account="A2" Amount="-750.00" LineNumber="2" />

      </Transaction>

    </Transactions>

  </Ledger>

</Activity>

 

The following are different examples for selecting Line elements with XPath selection expressions and predicate expressions against the example above for use in pasTransfer:

 

Select all Line elements regardless of where they appear:  //Line

Select all Line elements using an absolute path:  /Activity/Ledger/Transactions/Transaction/Line

Select all Line elements that are direct children of a Transaction element regardless of where it appears:  //Transaction/Line

Select all Line elements that are the first in their Transaction:  //Transaction/Line[1]

Select all Line elements from the last Transaction in each Transactions element:  //Transaction[last()]/Line

Select all Line elements that are not the first in their Transaction:  //Transaction/Line[position()>1]

Select all Line elements that are part of a Transaction in month 2:  //Transaction[Period/@Month='2']/Line

Select all Line elements in Ledger A:  //Ledger[@Name='A']//Line

Select all Line elements with account A1 or A2:  //Line[@Account='A1' or @Account='A2']

Select all Line elements with account CASH in Ledger A together with all Line elements with account A1 in Ledger B:  //Ledger[@Name='A']//Line[@Account='CASH'] | //Ledger[@Name='B']//Line[@Account='A1']
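
Several of the selections above can be tried outside pasTransfer with a short Python sketch.  This is only an illustration under stated assumptions: Python's ElementTree module supports a limited subset of XPath, paths are written relative to the root element (so a leading // becomes .//), and the predicates it does not support (or, position() comparisons, nested attribute paths, and the pipe) are emulated in plain Python.  The document is a trimmed copy of the sample with the Description elements omitted.

```python
import xml.etree.ElementTree as ET

DOC = """
<Activity>
  <Ledger Name="A"><Transactions>
    <Transaction Entry="1">
      <Period Year="2020" Month="1" />
      <Line Account="CASH" Amount="1000.00" LineNumber="1" />
      <Line Account="A1" Amount="-1000.00" LineNumber="2" />
    </Transaction>
    <Transaction Entry="2">
      <Period Year="2020" Month="1" />
      <Line Account="CASH" Amount="1000.00" LineNumber="1" />
      <Line Account="A2" Amount="-1000.00" LineNumber="2" />
    </Transaction>
  </Transactions></Ledger>
  <Ledger Name="B"><Transactions>
    <Transaction Entry="1">
      <Period Year="2020" Month="2" />
      <Line Account="CASH" Amount="750.00" LineNumber="1" />
      <Line Account="A1" Amount="-750.00" LineNumber="2" />
    </Transaction>
    <Transaction Entry="2">
      <Period Year="2020" Month="2" />
      <Line Account="CASH" Amount="750.00" LineNumber="1" />
      <Line Account="A2" Amount="-750.00" LineNumber="2" />
    </Transaction>
  </Transactions></Ledger>
</Activity>
"""

root = ET.fromstring(DOC)  # root is the Activity element

# Expressions that ElementTree supports directly.
all_lines = root.findall(".//Line")
absolute = root.findall("Ledger/Transactions/Transaction/Line")
first_lines = root.findall(".//Transaction/Line[1]")
last_tx_lines = root.findall(".//Transaction[last()]/Line")
ledger_a = root.findall(".//Ledger[@Name='A']//Line")

# The pipe is emulated by concatenating two result lists.
union = (root.findall(".//Ledger[@Name='A']//Line[@Account='CASH']")
         + root.findall(".//Ledger[@Name='B']//Line[@Account='A1']"))

# Predicates outside the supported subset, emulated in Python.
not_first = [line for tx in root.iter("Transaction")
             for line in tx.findall("Line")[1:]]
month_2 = [line for tx in root.iter("Transaction")
           if tx.find("Period").get("Month") == "2"
           for line in tx.findall("Line")]
a1_or_a2 = [line for line in root.iter("Line")
            if line.get("Account") in ("A1", "A2")]

print(len(all_lines), len(first_lines), len(union))  # 8 4 3
```

A full XPath engine evaluates the original expressions as written; the emulation here exists only because of the limits of the library chosen for the sketch.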

 

The following are different examples for selecting values relative to a Line element for use in pasTransfer:

 

Select the account name attribute value for the Line element: @Account

Select the transaction date attribute from the parent Transaction element of the Line element: ../@TransactionDate

Select the Ledger name attribute for the Line element: ../../../@Name

Select the year portion of the Period from the parent Transaction of the Line element: ../Period[1]/@Year
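
The relative expressions above all start from a single Line element and walk up or across the tree.  The following Python sketch mirrors that navigation under the assumption that ElementTree is used: its elements do not record their parent, so a parent lookup table (the hypothetical name parent_of) stands in for the double period.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: one ledger, one transaction, one line.
DOC = """
<Activity>
  <Ledger Name="A">
    <Transactions>
      <Transaction Entry="1" TransactionDate="2020-01-01T00:00:00" Reference="20200101A">
        <Period Year="2020" Month="1" />
        <Line Account="CASH" Amount="1000.00" LineNumber="1" />
      </Transaction>
    </Transactions>
  </Ledger>
</Activity>
"""

root = ET.fromstring(DOC)

# Map every element to its parent so that ".." can be emulated.
parent_of = {child: parent for parent in root.iter() for child in parent}

line = root.find(".//Line")                     # the context node
transaction = parent_of[line]                   # ..
ledger = parent_of[parent_of[transaction]]      # ../../..

account = line.get("Account")                            # @Account
transaction_date = transaction.get("TransactionDate")    # ../@TransactionDate
ledger_name = ledger.get("Name")                         # ../../../@Name
year = transaction.find("Period").get("Year")            # ../Period[1]/@Year

print(account, transaction_date, ledger_name, year)
# CASH 2020-01-01T00:00:00 A 2020
```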

 

The version of the Microsoft .NET Framework on which this application is based uses XPath version 2.0.

 

For more information on XPath and to review the full specification visit https://www.w3.org/TR/xpath.

 

XML with SQL Server

 

Several pasTransfer connections can send data to SQL Server for advanced processing.  It is important to note that when this happens, the XML documents sent to SQL Server have their declarations removed and their character set converted to Unicode.  Unicode is the default character set for an application built using the Microsoft .NET Framework and, more importantly, is what SQL Server expects by default when working with XML.
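
The effect described above can be pictured with a small Python sketch.  This is not how pasTransfer performs the conversion; it is only an assumed, minimal illustration of what "declaration removed and converted to Unicode" means for a document that arrives as UTF-8 bytes.  The sample content is made up.

```python
import xml.etree.ElementTree as ET

# A UTF-8 encoded document that starts with a declaration tag.
incoming = (
    '<?xml version="1.0" encoding="utf-8" standalone="yes"?>\n'
    '<JournalLines><JournalLine Account="62000" Description="Café" /></JournalLines>'
).encode("utf-8")

root = ET.fromstring(incoming)

# Serializing to a Unicode string drops the declaration, since a string
# of characters no longer has a byte encoding to declare.
text = ET.tostring(root, encoding="unicode")

# .NET strings are UTF-16 in memory; this shows the equivalent bytes.
utf16 = text.encode("utf-16-le")

print(text.startswith("<?xml"))  # False
```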

 


Copyright © 2024 pasUNITY, Inc.

 
