Handling ampersands when parsing XML

I was recently writing an import system in PHP. I was taking an XML data feed into a PHP script, converting it to a SimpleXMLElement object and manipulating it to save off the values that were needed.

There was just one problem.

About 10,000 records in, the importer just stopped. No error messages, it just didn’t want to do anything. I did my standard checks and the XML looked fine, with no extra or mismatched tags. It had to be something else. I ran the XML through an online formatter, and it came back with an error saying that ‘copy’ wasn’t a defined entity.

I looked at the XML code and I found it. There were a few instances of &copy; in the code, as it needed the standard copyright symbol. That was the problem!

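XML doesn’t allow a bare ampersand: it has to be written as &amp;. Beyond that, XML only predefines a handful of entities (&lt;, &gt;, &quot; and &apos;), so an HTML entity such as &copy; is undefined, and a document containing it won’t run through a formatter or give a valid XML element.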
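The silent failure is easier to track down if libxml's parse errors are collected and printed. The following is only a minimal sketch, assuming the feed is loaded with simplexml_load_string; the helper name parseOrReport and the sample XML are made up for illustration.

```php
<?php
// Hypothetical helper: parse a string and return the element
// together with any libxml error messages that were raised.
function parseOrReport(string $xml): array
{
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $element = simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries a line number and a message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$element, $messages];
}

// Made-up sample: &copy; is an HTML entity that XML does not predefine.
[$element, $messages] = parseOrReport('<feed><item>&copy; 2020 Example</item></feed>');

if ($element === false) {
    echo implode(PHP_EOL, $messages), PHP_EOL;
}
```

With errors collected like this, the importer can log the offending line instead of stopping without a word.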

To get around this I’ve used this regular expression:

$xml = preg_replace('/&(?!amp;)/', '&amp;', $xml);
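One caveat: a lookahead that only exempts &amp; will also rewrite the other entities XML does define (&lt;, &gt;, &quot;, &apos;) and numeric references such as &#169;. A possible variant that leaves those alone is sketched below. The function name escapeStrayAmpersands is hypothetical, and the sketch assumes the feed has no CDATA sections or DTD-declared entities, which a regular expression can't tell apart.

```php
<?php
// Hypothetical helper: escape every ampersand that does not start one of
// XML's five predefined entities or a numeric character reference.
function escapeStrayAmpersands(string $xml): string
{
    $pattern = '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)/';
    return preg_replace($pattern, '&amp;', $xml);
}

// Made-up sample mixing valid references, an HTML entity and a bare ampersand.
$raw = '<item>Fish &amp; Chips &lt;hot&gt; &copy; 2020 Smith & Sons &#169;</item>';
$clean = escapeStrayAmpersands($raw);

// &copy; becomes &amp;copy; and the bare ampersand becomes &amp;,
// while the valid references are left untouched.
$element = simplexml_load_string($clean);
echo (string) $element, PHP_EOL;
```

Either way, the parsed text ends up containing the literal string &copy; rather than the © symbol. If the symbol itself is wanted, one option is to run the extracted values through html_entity_decode afterwards.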

It works, and now I can import my XML correctly!
