Validating XML in PHP

PHP developers commonly require the services of an Extensible Markup Language (XML) parser in their code. Along these lines, they frequently find it necessary to validate XML input. Fortunately, you can easily accomplish this in PHP. This article shows you how to validate XML documents within PHP and determine the cause of validation failures.

Why XML validation?

XML is a markup language that enables you, as a developer, to create your own custom language. This language is then used to carry, but not necessarily display, data in a platform-independent fashion. The language is defined with the use of markup tags, much like Hypertext Markup Language (HTML).

XML has gained in popularity in recent years because it represents the best of two worlds: It is easily readable by humans and computers alike. XML languages are expressed in tree-like structure with elements and attributes describing key data. The element and attribute names are usually written in plain English (so humans can read them). They are also highly structured (so computers can parse them).

Now, for example, suppose you create your own XML language, called LuresXML. LuresXML provides a way to describe the various types of lures that are offered on your Web site. First, you create an XML schema that defines what the XML document should look like, as in Listing 1.

Listing 1. lures.xsd

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
 <xs:element name="lures">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="lure" maxOccurs="unbounded">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="lureName" type="xs:string"/>
       <xs:element name="lureCompany" type="xs:string"/>
       <xs:element name="lureQuantity" type="xs:integer"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>

This is, quite intentionally, a fairly simple example. The root element is called lures. It is the parent element of one or more lure elements, each of which is the parent of three other elements. The first element is the lure name (lureName). The second element is the name of the company that manufactures the lure (lureCompany). And, finally, the last element is the quantity (lureQuantity), or how many lures your company has in inventory. The first two of these child elements are defined as strings, whereas the lureQuantity element is defined as an integer.

Now, say you want to create an XML document (sometimes called an instance) based on that schema. It might look something like Listing 2.

Listing 2. lures.xml

<lures>
 <lure>
  <lureName>Silver Spoon</lureName>
  <lureCompany>Clark</lureCompany>
  <lureQuantity>Seven</lureQuantity>
 </lure>
</lures>

This is a simple XML document instance of the schema from Listing 1. In this case, the document instance lists only one lure. The name of the lure is Silver Spoon. The manufacturing company is Clark. And the quantity on hand is Seven.

Here is the question: How do you know that the XML document in Listing 2 is a proper instance of the schema defined in Listing 1? In fact, it isn't (this is also intentional).

Note the lureQuantity element as defined in Listing 1. It is of type xs:integer. Yet in Listing 2 the lureQuantity element actually contains a word (Seven), not an integer.

The purpose of XML validation is to catch exactly those kinds of errors. Proper validation ensures that an XML document matches the rules defined in its schema.

Continuing with this example, when you attempt to validate the XML document in Listing 2, you get an error. You fix this error (by changing the Seven to a 7) before using the document within your software application.

XML validation is important because you want to catch errors as early as possible in the information interchange process. Otherwise, unpredictable results can occur when you attempt to parse an XML document and it contains invalid data types or an unexpected structure.
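To make the risk concrete, here is a minimal sketch of what can happen when the document from Listing 2 is consumed without validation. It assumes the XML is held in a string (the $xmlString variable is hypothetical) and that the application reads the quantity with a plain integer cast, which is only one of many ways real code might do it.

```php
<?php
// Hypothetical inline copy of Listing 2, so the sketch is self-contained.
$xmlString = <<<XML
<lures>
 <lure>
  <lureName>Silver Spoon</lureName>
  <lureCompany>Clark</lureCompany>
  <lureQuantity>Seven</lureQuantity>
 </lure>
</lures>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmlString);

// Read the quantity without validating the document first.
$quantityText = $doc->getElementsByTagName('lureQuantity')->item(0)->textContent;

// Casting a non-numeric string to int silently yields 0 in PHP.
$quantity = (int) $quantityText;

echo "Quantity on hand: $quantity\n";
```

The document is well formed, so nothing fails while loading it. The inventory count simply becomes 0, and the mistake surfaces much later, if at all.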

Simple XML parsing in PHP

It is beyond the scope of this article to provide an exhaustive overview of parsing XML documents in PHP. However, this section covers the basics of loading an XML document in PHP.

To keep things simple, continue using the schema from Listing 1 and the XML document from Listing 2. Listing 3 demonstrates some basic PHP code to load the XML document.

Listing 3. testxml.php

<?php

$xml = new DOMDocument();
$xml->load('./lures.xml');

?>

Nothing is complicated about this either. You are using the DOMDocument class to load the XML document, here called lures.xml. Note that for this code to work on your own PHP server, the lures.xml file must reside in the same directory as the PHP code.
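As a small extension of Listing 3, the following sketch checks the return value of load(), which is false when the file cannot be read or is not well-formed XML. The temporary file and its name (lures_example.xml) are assumptions made only so the example runs on its own; in the article's scenario the file would be lures.xml.

```php
<?php
// Hypothetical setup: write a small document to a temporary file.
$path = sys_get_temp_dir() . '/lures_example.xml';
file_put_contents($path, '<lures><lure><lureName>Silver Spoon</lureName></lure></lures>');

$xml = new DOMDocument();

// load() returns false when the file is missing or is not well-formed XML.
if (!$xml->load($path)) {
    exit("Could not load $path\n");
}

// Reading the root element name confirms the document was parsed.
echo 'Loaded root element: ' . $xml->documentElement->nodeName . "\n";

// Remove the temporary file again.
unlink($path);
```

A relative path such as './lures.xml' is resolved against the current working directory. If that might differ from the script's location, building the path from __DIR__ (for example, __DIR__ . '/lures.xml') is a common way to anchor it to the directory that contains the PHP file.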

At this point, it is tempting to start parsing the XML document. However, as you have seen, it is best to first validate the document to ensure that it matches the language specifications set forth in the schema.

Simple XML validation in PHP

Continue adding to the PHP code in Listing 3 by inserting some simple validation code, as in Listing 4.

Listing 4. Enhanced testxml.php
<?php

$xml = new DOMDocument();
$xml->load('./lures.xml');

if (!$xml->schemaValidate('./lures.xsd')) {
  echo "invalid<p/>";
}
else {
  echo "validated<p/>";
}

?>

Once again, note that the schema file from Listing 1 must be in the same directory where the PHP code is located. Otherwise, PHP returns an error.

This new code invokes the schemaValidate method against the DOMDocument object that loaded the XML. The method accepts one parameter: the location of the XML schema used to validate the XML document. The method returns a Boolean where true indicates a successful validation and false indicates an unsuccessful validation.
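The Boolean result is easy to see in isolation. The sketch below is not part of the article's listings: it uses the related schemaValidateSource() method, which takes the schema as a string instead of a file path, together with a cut-down, hypothetical schema that declares only the lureQuantity element. The call to libxml_use_internal_errors(), a function covered later in this article, only keeps PHP warnings out of the output.

```php
<?php
// Hypothetical inline schema: a cut-down version of Listing 1 with one element.
$xsd = <<<XSD
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="lureQuantity" type="xs:integer"/>
</xs:schema>
XSD;

// Keep libxml from emitting PHP warnings while this sketch runs.
libxml_use_internal_errors(true);

$good = new DOMDocument();
$good->loadXML('<lureQuantity>7</lureQuantity>');

$bad = new DOMDocument();
$bad->loadXML('<lureQuantity>Seven</lureQuantity>');

// schemaValidateSource() returns true on success and false on failure,
// just as schemaValidate() does.
$goodResult = $good->schemaValidateSource($xsd);
$badResult = $bad->schemaValidateSource($xsd);

var_dump($goodResult); // bool(true)
var_dump($badResult);  // bool(false)

libxml_clear_errors();
```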

Now, deploy the PHP code from Listing 4 to your own PHP server. Call it testxml.php because that is the name given in Listings 3 and 4. Ensure that the XML document (from Listing 2) and XML schema (from Listing 1) are both in the same directory. Once again, PHP reports an error if this is not the case.

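Point your browser to testxml.php. You should see one simple word on the screen: "invalid." Depending on how your server is configured to display errors, PHP warnings might appear alongside it.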

The good news is that the schema validation is working. It should return an error, and it did.

The bad news is that you have no idea where the error is located within the XML document. Okay, you might know because I mentioned the source of the error earlier in the article. But pretend that didn't happen, okay?

There is an error, but where?

It would be nice if the PHP code reported the location of the error, as well as the nature of the error, so that you can take corrective action. Something along the lines of "Hey! I can't accept a string for lureQuantity" would be nice.

By default, libxml reports each validation problem as a PHP warning. Unfortunately, although the warning describes the problem, the file and line number it reports refer to the PHP code that triggered the validation, not to the place in the XML document where the error occurred. Because that's of little help in finding the error, you look at another option.

There is another PHP function called libxml_use_internal_errors(). This function accepts a Boolean as its only parameter. If you set it to true, you disable the standard libxml error reporting and take responsibility for fetching the errors on your own with the libxml_get_errors() function. That's what you do.
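Before the full listing, a minimal sketch of the mechanism may help. It assumes a deliberately malformed, hypothetical input string rather than the article's files, and it prints each collected error in plain text.

```php
<?php
// Switch to internal error handling; the return value is the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Deliberately malformed input (hypothetical): the lure element is never closed.
$loaded = $doc->loadXML("<lures>\n <lure>\n</lures>");

// No PHP warning was raised; the problems are waiting in libxml's error buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError with level, code, column, message, file, and line.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Empty the buffer and restore whatever setting was active before.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

The line and column values refer to positions in the XML input, which is exactly the information the default warnings do not give you.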

Of course, that means that you have to write a bit more code. But the trade-off is more specific error reporting. In the long run, this saves a lot of time.

Listing 5 shows the finished product.

Listing 5. The final testxml.php
<?php
// Build an HTML-formatted message for a single LibXMLError object.
function libxml_display_error($error)
{
  $return = "<br/>\n";
  switch ($error->level) {
    case LIBXML_ERR_WARNING:
      $return .= "<b>Warning $error->code</b>: ";
      break;
    case LIBXML_ERR_ERROR:
      $return .= "<b>Error $error->code</b>: ";
      break;
    case LIBXML_ERR_FATAL:
      $return .= "<b>Fatal Error $error->code</b>: ";
      break;
  }
  $return .= trim($error->message);
  if ($error->file) {
    $return .= " in <b>$error->file</b>";
  }
  $return .= " on line <b>$error->line</b>\n";

  return $return;
}

// Print every error libxml has collected, then empty the error buffer.
function libxml_display_errors() {
  $errors = libxml_get_errors();
  foreach ($errors as $error) {
    print libxml_display_error($error);
  }
  libxml_clear_errors();
}

// Enable user error handling
libxml_use_internal_errors(true);

$xml = new DOMDocument();
$xml->load('./lures.xml');

if (!$xml->schemaValidate('./lures.xsd')) {
  print '<b>Errors Found!</b>';
  libxml_display_errors();
}
else {
  echo "validated<p/>";
}

?>

First, notice the function at the top of the code listing. It's called libxml_display_error() and accepts a LibXMLError object as its only parameter. It uses the all-too-familiar switch statement to determine the error level (warning, error, or fatal error) and starts the error message with a label appropriate to that level, followed by the error code.

Then, two more things happen. First, the error object is examined to determine whether or not a file property contains a value. If so, then that file value is appended to the error message so the location of the file is reported. Next, the line property is appended to the error message so the user can see exactly where in the XML file the error occurred. Needless to say, this is extremely important for debugging purposes.

It should also be noted that libxml_display_error() simply produces a string that describes the error. The actual printing of the error to the screen is left up to the caller, in this case libxml_display_errors().

The function below that is the previously mentioned libxml_display_errors(), which takes no parameters. The first thing this function does is call libxml_get_errors(). This returns an array of LibXMLError objects that represent all of the errors encountered when the schemaValidate() method was invoked on the XML document.

Next, you step through each of the errors you encountered and invoke the libxml_display_error() function for each error object. Whatever string is returned by that function is then printed to the screen. One great benefit of handling errors this way is that all of the errors are printed at once. This means that you only need to execute the code once to view all of the errors specific to that particular XML document.

Finally, libxml_clear_errors() clears out the errors recently encountered by the schemaValidate() method. This means that if schemaValidate() is executed again within the same code sequence, you will start with a clean slate, and only new errors will be reported. If you don't do this and you execute schemaValidate() again, then all of the errors from the first invocation of schemaValidate() remain in the array returned by libxml_get_errors(). Obviously, that presents problems if you're looking for a fresh set of errors.
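The effect of skipping libxml_clear_errors() can be observed with a short experiment. This is a sketch under stated assumptions: it validates a one-element, hypothetical document against an inline schema through schemaValidateSource() (the string-based counterpart of schemaValidate()) and counts the buffered errors after each run.

```php
<?php
libxml_use_internal_errors(true);
libxml_clear_errors(); // start from an empty buffer

// Hypothetical one-element schema, kept inline so the sketch is self-contained.
$xsd = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
     . '<xs:element name="lureQuantity" type="xs:integer"/>'
     . '</xs:schema>';

$doc = new DOMDocument();
$doc->loadXML('<lureQuantity>Seven</lureQuantity>');

// First run: the validation error is recorded in the buffer.
$doc->schemaValidateSource($xsd);
$afterFirst = count(libxml_get_errors());

// Second run without clearing: the earlier error is still in the buffer.
$doc->schemaValidateSource($xsd);
$afterSecond = count(libxml_get_errors());

// Clear, then validate again: only the fresh error remains.
libxml_clear_errors();
$doc->schemaValidateSource($xsd);
$afterClear = count(libxml_get_errors());
libxml_clear_errors();

echo "first: $afterFirst, second: $afterSecond, after clearing: $afterClear\n";
```

The second count is larger than the first because the buffer keeps growing, and the count after clearing matches the first run again.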

It's also important to note that I made a slight change to the if-else statement at the bottom of the code in Listing 5. If an error is encountered, it prints "Errors Found!" in bold and then invokes the aforementioned libxml_display_errors() function, which displays all of the errors encountered before clearing out the error array. I opted for this solution instead of just printing out "invalid" as I did in Listing 4.

Second test

Now, it's time to test again. Move the PHP file from Listing 5 to your PHP server. Keep the file name the same (testxml.php). As before, ensure that both the XML Schema Definition (XSD) file and the XML file are in the same directory as the PHP file. Point your browser to testxml.php once again, and now you should see something like this (the file path and name reflect the server on which the code runs):

Errors Found!
Error 1824: Element 'lureQuantity': 'Seven' is not a valid value of the atomic type 'xs:integer'. in /home/thehope1/public_html/example.xml on line 5

Well, that's fairly descriptive, isn't it? The error message tells you on what line the error occurred. It also tells you where the file is (as if you didn't know). And it tells you exactly why the error occurred. That's information you can use.
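Notice that the message quotes the offending value ('Seven') straight from the XML document. When the document comes from an untrusted source, it may be worth escaping the message before printing it as HTML. The sketch below shows one possible way to do that; the format_error_escaped() helper and the one-element schema are hypothetical and not part of Listing 5.

```php
<?php
// Hypothetical helper: format one LibXMLError as HTML, escaping the message
// because it can contain text taken from the XML document.
function format_error_escaped(LibXMLError $error): string
{
    $message = htmlspecialchars(trim($error->message), ENT_QUOTES, 'UTF-8');
    return "<b>Error {$error->code}</b>: $message on line <b>{$error->line}</b>";
}

libxml_use_internal_errors(true);
libxml_clear_errors();

$xsd = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
     . '<xs:element name="lureQuantity" type="xs:integer"/>'
     . '</xs:schema>';

// The element's text is "<b>Seven</b>", written with entities in the source.
$doc = new DOMDocument();
$doc->loadXML('<lureQuantity>&lt;b&gt;Seven&lt;/b&gt;</lureQuantity>');
$doc->schemaValidateSource($xsd);

$lines = array_map('format_error_escaped', libxml_get_errors());
libxml_clear_errors();

echo implode("<br/>\n", $lines), "\n";
```

Any markup inside the message is rendered as literal text, so the only HTML tags in the output are the ones the helper adds itself.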

Fixing the problem

You can now leave the PHP file alone and work on fixing the problem in your XML document.

Because the error reportedly occurred on line 5 of the XML document, it's a good idea to look at line 5 and see what's there. Unsurprisingly, line 5 is the location of the lureQuantity element. And, as you look at it carefully, you suddenly have an epiphany that Seven is a string, not a number. So you change the string Seven to the numeral 7. The final copy of the XML document should look like Listing 6.

Listing 6. Updated XML file

<lures>
 <lure>
  <lureName>Silver Spoon</lureName>
  <lureCompany>Clark</lureCompany>
  <lureQuantity>7</lureQuantity>
 </lure>
</lures>

Now, copy this new XML file to your PHP server. And, once again, point your browser to testxml.php. You should see just one word: "validated." This is excellent news for two reasons. First, it means that the validation code is working properly because the XML document is, in fact, valid. Second, you have probably just validated your first XML document in PHP. Congratulations!

As I always advise, now it is time to tinker. Modify lures.xsd to make it a more complex schema. Modify lures.xml to make it a more complex instance of that schema. Copy those files to the PHP server and, once again, execute testxml.php. See what happens. Intentionally make the document invalid in several different ways and see what errors are reported.

Also, note that when you tinker, you don't need to change the PHP code at all. Just make sure that the file names (lures.xml and lures.xsd) are the same and you can modify them to your heart's content.
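If you would rather experiment with several file pairs than keep overwriting lures.xml and lures.xsd, the logic of Listing 5 can be wrapped in a function that takes both paths. The following is a sketch, not the article's code: the validate_xml_file() name, the plain-text message format, and the temporary demo files are all assumptions.

```php
<?php
// Hypothetical helper: validate one XML file against one schema file and
// return the error messages (an empty array means the document is valid).
function validate_xml_file(string $xmlPath, string $xsdPath): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    // Only validate when the document could be loaded at all.
    if ($doc->load($xmlPath)) {
        $doc->schemaValidate($xsdPath);
    }

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Leave libxml in the state in which it was found.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Demo with temporary files (hypothetical names) so the sketch runs on its own.
$dir = sys_get_temp_dir();
$xsdPath = "$dir/quantity_example.xsd";
$xmlPath = "$dir/quantity_example.xml";
file_put_contents(
    $xsdPath,
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    . '<xs:element name="lureQuantity" type="xs:integer"/>'
    . '</xs:schema>'
);

file_put_contents($xmlPath, '<lureQuantity>Seven</lureQuantity>');
$invalid = validate_xml_file($xmlPath, $xsdPath);

file_put_contents($xmlPath, '<lureQuantity>7</lureQuantity>');
$valid = validate_xml_file($xmlPath, $xsdPath);

print_r($invalid);
var_dump($valid === []); // bool(true)

unlink($xmlPath);
unlink($xsdPath);
```

Returning the messages instead of printing them leaves the decision about presentation (HTML, log file, or something else) to the caller.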

Conclusion

PHP makes it easy for developers to validate XML documents. Using the DOMDocument class in conjunction with the schemaValidate() method, you can ensure that your XML documents comply with the specifications in their respective schemas. This is important to ensure data integrity in your software applications.