- What Is SmartDOMDocument?
- What Is DOMDocument?
- So What Exactly Does SmartDOMDocument Do Then?
- saveHTMLExact()
- Encoding Fix
- SmartDOMDocument Object As String
- Example
- Requirements And Prerequisites
- Sounds Great – Where Do I Get It?
- Download
- Check out from SVN
- Use as "svn:externals"
- Version History
- References
- How To Report Bugs
What Is SmartDOMDocument?
- SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
- SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).
What Is DOMDocument?
- DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write HTML and XML.
- Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
- DOMDocument can actually validate and normalize your HTML/XML.
- DOMDocument supports namespaces.
So What Exactly Does SmartDOMDocument Do Then?
DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:
saveHTMLExact()
DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
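The general idea can be sketched with plain DOMDocument. This is not SmartDOMDocument's actual implementation. The helper name save_fragment_html() is made up for illustration, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper (not part of SmartDOMDocument): returns the HTML of a
// fragment without the DOCTYPE, <html>, and <body> wrappers DOMDocument adds.
function save_fragment_html($html) {
    $doc = new DOMDocument();
    // @ suppresses warnings about markup that is not well formed.
    @$doc->loadHTML($html);

    // DOMDocument wraps the fragment in <html><body>; locate that <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if (!$body) {
        return '';
    }

    // Serialize each child of <body> on its own, skipping the wrappers.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$fragment = save_fragment_html("<div class='class1'><p>Some Text</p></div>");
echo $fragment, "\n";
```

On PHP 5.4 and later, the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options to loadHTML() are another route, though they have quirks of their own with fragments that have more than one top-level element.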
Encoding Fix
DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
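For illustration, here is a minimal sketch of one common workaround with plain DOMDocument: declaring the charset up front so the parser reads the input as UTF-8. It is an assumption that this resembles what SmartDOMDocument does internally, and load_html_as_utf8() is a hypothetical name.

```php
<?php
// Hypothetical helper: one common workaround, not necessarily the method
// SmartDOMDocument uses internally.
function load_html_as_utf8($html) {
    $doc = new DOMDocument();
    // A charset declaration in front of the markup tells the parser to
    // treat the bytes that follow as UTF-8 instead of guessing.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    // @ suppresses warnings about markup that is not well formed.
    @$doc->loadHTML($meta . $html);
    return $doc;
}

$doc  = load_html_as_utf8('<p>русский</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Note that this sketch leaves the extra <meta> element in the parsed document, so a real implementation would probably want to remove it before saving.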
SmartDOMDocument Object As String
You can use a SmartDOMDocument object as a string which will print out its contents.
For example:
echo "Here is the HTML: $smart_dom_doc";
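String conversion like this is typically done with PHP's __toString() magic method. The sketch below shows the mechanism on a hypothetical subclass called PrintableDOMDocument. It uses plain saveHTML(), so unlike SmartDOMDocument it will include the extra wrapper tags.

```php
<?php
// Hypothetical subclass illustrating the mechanism; SmartDOMDocument's own
// __toString() may differ in what it returns.
class PrintableDOMDocument extends DOMDocument {
    // PHP calls this whenever the object is used in a string context.
    public function __toString() {
        // saveHTML() can return false on failure, so cast to string.
        return (string) $this->saveHTML();
    }
}

$doc = new PrintableDOMDocument();
@$doc->loadHTML('<p>Hello</p>');

// Interpolating the object triggers __toString().
$message = "Here is the HTML: $doc";
echo $message, "\n";
```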
Example
This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find the first <img> tag and removeChild() to remove it, then prints the content before and after the removal, followed by the HTML of the removed image.
// Adjust the path to wherever you saved the class file.
require_once 'SmartDOMDocument.class.php';

$content = <<<CONTENT
<div class='class1'>
<img src='http://www.google.com/favicon.ico' />
Some Text
<p>русский</p>
</div>
CONTENT;

print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

$content_doc = new SmartDOMDocument();
$content_doc->loadHTML($content);

$image = ''; // stays empty if the content has no <img> tag

try {
    // Find the first <img> element, if there is one.
    $first_image = $content_doc->getElementsByTagName("img")->item(0);
    if ($first_image) {
        // Detach the image from the document and save what is left.
        $first_image->parentNode->removeChild($first_image);
        $content = $content_doc->saveHTMLExact();

        // Import the detached node into a fresh document to render it on its own.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
    }
} catch (Exception $e) {
    // Errors are deliberately ignored in this example.
}

print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
print "The image is: " . htmlspecialchars($image);
Requirements And Prerequisites
- PHP 5.2+ is no longer a requirement. Any version of PHP 5 that has DOMDocument should work now.
- DOMDocument: this should be a built-in class, but I've seen instances of it missing for some reason. My guess is that 99.9% of the time you will already have it.
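If you are not sure whether DOMDocument is present on your server, a quick check along these lines should tell you. This is a minimal sketch, and the message text is only a suggestion.

```php
<?php
// DOMDocument is provided by PHP's "dom" extension; check that both the
// extension and the class are available before relying on them.
$has_dom = extension_loaded('dom') && class_exists('DOMDocument');

if (!$has_dom) {
    echo "DOMDocument is missing; install or enable PHP's DOM extension.\n";
}
```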
Sounds Great – Where Do I Get It?
Download
http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php
Check out from SVN
svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument
I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.
Use as "svn:externals"
If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.
svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.
You can read more about setting svn:externals here.
Here's how you would do this:
cd YOUR_PROJ_DIR
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up
Version History
0.4
- No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
- PHP 5.2+ is no longer a requirement, due to the change above.
0.3.2
- test/example function added
0.3.1
- suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other not well formed HTML warnings.
- add the standard trunk/tags/branches layout to the SVN repository.
0.3
- use a better, more portable method of dealing with encodings properly (thanks piopier).
0.2
- use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).
0.1
- initial release.
References
How To Report Bugs
You have a few options here:
- Leave a comment here.
- Use the contact form on the About page.
- Send an email to admin [{at}] this site's domain.
- IM me using the Digsby widget on the right.
beer planet is a blog about technology, programming, computers, and geek life. It is run by Artem Russakovskii - a local San Francisco geek who is currently pursuing his own projects and regularly enjoys hacking Android, PHP, CSS, Javascript, AJAX, Perl, and regular expressions, working on Wordpress plugins and tools, tweaking MySQL queries and server settings, administering Linux machines, blogging, learning new things, and other geeky stuff.

Hey man,
Thanks first off for sharing this!
I found a glitch when you use the saveHTMLExact() method with <br /> tags in the SmartDOMDocument: the <br /> tags get converted to <br></br> instead.
Also, if you use SmartDOMDocument to convert code in between <pre> tags (such as code pasted from a form textarea) with htmlentities(), and the code between the <pre> tags has a doctype in it, the doctype gets stripped as well by saveHTMLExact().
Thanks again.
Thanks for the report, David.
I actually just experienced the same problem in my own project and saw what you're talking about.
Just fixed it in 0.4 – it's no longer using the C14N() function that was causing this, which also removes the PHP 5.2 requirement.
Give that one a go?
Hey man,
I was doing some more tinkering with DOMDocument and generating XHTML content and I figured a few things out.
First off, there's an excellent tutorial on making complete XHTML documents with DOMDocument at http://www.ultramegatech.com/blog/2009/07/generating-xhtml-documents-using-domdocument-in-php/ and it was from that I discovered this:
saveHTML doesn't output valid XHTML tags, but saveXML does. Never thought of that, but it fixes the issue with the tags <br /> being turned into <br></br> for one. In regards to that, check out LIBXML_NOEMPTYTAG constant @ http://us3.php.net/manual/en/libxml.constants.php …, look familiar?
I am curious how difficult it would be to either implement a saveXMLExact() method in your SmartDOMDocument class or whether it would need to be a new class. Unfortunately saveXML() adds the XML declaration (the <?xml ... ?> tag) to the top, and that's a major inconvenience when using it to make only a certain part of a document. I was looking at using it for a form generator/validation class I had written to generate the whole form, but that tag is a problem.
Here's my example: http://pastebin.com/HhXMJnDm
Let me know what you think and whatever you come up with.
Thanks again.
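For what it's worth, plain DOMDocument can already do part of this: passing a node to saveXML() serializes only that node, without the XML declaration. The snippet below is a small sketch with made-up sample markup. It is not a saveXMLExact() implementation, and it also shows how LIBXML_NOEMPTYTAG expands empty tags.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<form><br/><input type="text"/></form>');

// Whole document: starts with the <?xml ... ?> declaration.
$full = $doc->saveXML();

// Only the root element: no declaration, self-closing tags preserved.
$fragment = $doc->saveXML($doc->documentElement);

// LIBXML_NOEMPTYTAG expands empty elements, e.g. <br/> becomes <br></br>.
$expanded = $doc->saveXML($doc->documentElement, LIBXML_NOEMPTYTAG);

echo $fragment, "\n", $expanded, "\n";
```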
You can't even imagine how grateful I am to you. Thank you so much, brother.
I wish to try it but your link seems to be broken….
yes – the link is broken. can you please fix it?
thanks
http and svn download not working… very interested to try your code ;D
Same problem here.
Guys, while I'm fixing the SVN server, I put a copy of the code here: http://tinypaste.com/8911ac
I'll try to fix it asap but I am currently working through recompiling svn trunk due to a bug I need fixed which is not backported to any released version.
Artem, this is pretty smart, thanks for it.
I've been looking into making a DOMDocument/DOMElement implementation ON STEROIDS and this gave me the bit of inspiration I needed. I'm now using SmartDOMDocument in conjunction with a new class, SmartDOMElement, (which is registered into SmartDOMDocument in __construct via $this->registerNodeClass('DOMElement', 'SmartDOMElement'); ) – both will have enhanced functionality over their respective base classes.
So: Thanks!
What I get for the Example is:
After removing the image, the content is: <div class="class1">
Some Text
<p>& #1088;& #1091;& #1089;& #1089;& #1082;& #1080;& #1081;</p>
</div>
The image is: <img src="http://www.google.com/favicon.ico">
Is this correct? All Russian characters are encoded.
PS. This comment editor is really bad for entering HTML code.
You might want to change the regex a bit. Current one doesn't really work for me.
preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), ", trim($this->saveHTML()));
Last one messed up. Let's try again.
preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), '', trim($this->saveHTML()));
thanks so much it really helped me
Thanx – I was stuck with improper UTF for ages …
Hey, does this happen to fix the issue where DOMDocument will fix a broken html page (IE unclosed tags, the like)?
I really like what you have done here.
I personally wrote a NewDOMElement class that helps me create markup quickly (http://code.google.com/p/chaton-cms/source/browse/trunk/classes/DOMElement.php).
I think I will use your class in my future projects.
Thanks a lot man, I had bad UTF-8 encoding, now the display is very good and all I had to modify in my code after including your PHP class file was "new DOMDocument()" turned into "new SmartDOMDocument()".
Super! This saved me countless hours of swearing at the UTF-8 encoding not working properly. This just worked out of the box! Question though, i'm getting the HTML entities instead of the UTF-8 characters directly for example, åäö becomes å ä ö. How would i fix that?
Solved all the problems in the world!
Cheers man, gr8 extension
Hey just wanted to say HUGE kudos for extending DOMDocument like this! It made my job that much easier!
You saved my ass man. Thank you!
I am using IntelliJ, does anyone know of a way to make it so you can see the child nodes of the SmartDomDocument Object? Right now I am getting an empty array though it is working correctly for me.
Thanks for this, it has been such a life saver!
I'm trying to parse some information for a remote webpage using your class and it seems to just strip the special chars (and in one case doesn't work right):
My source code:
http://beta.thebrews.us/test.phps
Example Urls that do not work: (use my script ?b=n)
http://www.ratebeer.com/beer/84264/ (you can see the "e" getting stripped in the output)
http://www.ratebeer.com/beer/150344/ (you can see the output doesn't work right)
Any input would be greatly appreciated.
Using this on my site and may use it for one of my open source projects. Thanks! By the way, you got some spam messages on here that need to be deleted
the svn is not found please upload again
Thanks!! You saved me hours with this.
I was having trouble with both the unwanted body tags and the character encoding being mishandled by the DOMDocument class in a Drupal module. Your class did a great job of cleaning it up quickly and easily. Nice work.