HTML/XHTML Validation

Posted by martin on 1 Oct 2001, last updated on 14 Sep 2002.

Unix shell and Python scripts for automatic HTML/XHTML validation of pages; they also work with generated pages.

If you want to check generated HTML with a shell script, you will need tidy and Lynx, or some other browser that supports dumping the source of the page to stdout.

#!/bin/bash
PATH=/bin/:/usr/bin
lynx="/usr/bin/lynx"
tidy="/usr/bin/tidy"
# Use only the script's base name, so the template stays valid when $0 contains a path.
TMPFILE=$(mktemp -q "/tmp/$(basename "$0").XXXXXX")
if [ $? -ne 0 ]; then
    echo "$0: Can't create temp file, exiting..."
    exit 1
fi
files="index.php foo.php \
bar.php"
for i in $files
do
    printf '\n%s\n' "$i" >> "$TMPFILE"
    $lynx -source "http://localhost/$i" | $tidy -eq 2>&1 | grep line >> "$TMPFILE"
done
less "$TMPFILE"
rm -f "$TMPFILE"

You'll have to change the lynx and tidy lines near the top to point to the location of the executables on your system. If your site isn't in the DocumentRoot, you also need to adjust the URL accordingly. The last thing in the pipe, grep line, keeps only those lines of tidy's output that contain the word "line", that is, the messages that carry a line number.

If you have XHTML or XML that you want to check for validity, you can use a validating XML parser, which should find some errors that tidy won't report at all. I use Xerces C++ and, again, a browser to dump the source.

#!/bin/bash
lynx="/usr/bin/lynx"
parser="/usr/bin/StdInParse"
files="index.php foo.php \
bar.php"
cd /var/www/html
for i in $files
do
    printf '\n%s\n' "$i"
    $lynx -source "http://localhost/$i" | $parser 2>&1
done

This one does not use a temporary file; to achieve the effect of the previous one you can use

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1-strict.dtd">

Xerces is a good XML parser, but it sometimes outputs really weird error messages. Fortunately there is a great parser by James Clark called nsgmls, included in SP and in openjade. It is an SGML parser, so you can also validate HTML with it. It is also used by the W3C Validation Service. The nice thing about the validator is that it can output your generated source, which is very useful.
The next one is a Python script that uses nsgmls. Because Python has module support for HTTP, you don't need an external application (a browser) to get the source.
#!/usr/bin/env python
# Python 2 script: httplib.HTTP, os.tempnam and the print statement do not exist in Python 3.
import httplib, os

catalogue = '/usr/share/sgml/dtd/xhtml.soc'
options = '-wxml -s -c' + catalogue
parser = '/usr/bin/nsgmls'
files = ['index.php',
         'cert.php',
         'javascript.php', 'slideshow.php', 'bounce.php', 'fading.php',
         'linux.php', 'valid.php']
errors_name = os.tempnam()

for file in files:
    # Fetch the generated page from the local web server.
    h = httplib.HTTP('localhost')
    h.putrequest('GET', '/' + file)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print file, errcode, errmsg
    f = h.getfile()
    data = f.read()
    f.close()
    # Feed the page to nsgmls; -f sends its error messages to errors_name.
    pipe = os.popen(parser + ' ' + options + ' -f ' + errors_name, 'w')
    pipe.write(data)
    pipe.close()
    errors = open(errors_name)
    err = errors.read()
    err = err.split(':')
    if len(err) > 1:
        data = data.split('\n')
        for i in range(1, len(err)):
            if i % 5 == 0:
                # Column number and error message.
                print 'column:', err[i-2], err[i].split(parser)[0]
            if i % 5 == 2:
                # The source line the error refers to.
                print data[int(err[i])-1]
        print
    errors.close()
    os.remove(errors_name)
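The script above recovers line, column and message by splitting the whole error file on ':' and counting fields in groups of five, which goes wrong as soon as a message itself contains a colon. The following is only a sketch of a more explicit parse, written for Python 3. It assumes that every error line has the shape program:source:line:column:type: message, which is inferred from the five-field split in the script and not checked against nsgmls itself. The names ERROR_RE, ParserError, parse_errors and format_errors are made up for this example.

```python
import re
from typing import List, NamedTuple

# Assumed line shape (inferred from the script's five-field split):
#   program:source:line:column:type: message
ERROR_RE = re.compile(
    r'^(?P<program>.+?):(?P<source>[^:]*):(?P<line>\d+):(?P<column>\d+):'
    r'(?P<type>[A-Z]):\s*(?P<message>.*)$'
)


class ParserError(NamedTuple):
    line: int
    column: int
    message: str


def parse_errors(report: str) -> List[ParserError]:
    """Return one ParserError for each report line that matches ERROR_RE."""
    found = []
    for raw in report.splitlines():
        match = ERROR_RE.match(raw)
        if match:
            found.append(ParserError(int(match.group('line')),
                                     int(match.group('column')),
                                     match.group('message')))
    return found


def format_errors(source: str, errors: List[ParserError]) -> str:
    """Pair each error with the source line it points at (lines are 1-based)."""
    lines = source.split('\n')
    out = []
    for err in errors:
        if 1 <= err.line <= len(lines):
            out.append(lines[err.line - 1])
        out.append('column: %d %s' % (err.column, err.message))
    return '\n'.join(out)
```

Lines that do not match the assumed shape are skipped silently, so a changed message format would show up as missing errors, not as a crash. Whether that trade-off is acceptable depends on how much you trust the format.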
The nsgmls script first tells the shell to use Python to execute the code, then imports the needed libraries and defines some constants (you may need to change some of them). The main loop iterates over all of the files: it connects to the server and sends a GET request for the file, prints the status, reads the page, and pipes it to nsgmls, which writes its error messages to the temporary file. The error output is then read back and split on colons.

Then we check whether the errors array has more than one element (each error line is parsed into 5 elements); if so, we create an array with each line of the source. We iterate over the elements and use the 2nd element of each group as an index for the line of the source, also printing the column number, which is the 3rd element in the error array, and the error message (each 5th element). An empty line is printed to make the output easier to read. We finish by closing the temporary file and deleting it.

A simple improvement (if you are using PHP's sessions with trans-sid) is to add a line to the headers that tells the PHP module that we already have a session, which prevents it from mangling links and forms. This can be done with one extra header line, added before h.endheaders().
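The header line itself is missing from this copy of the article, so the following is only a guess at the idea, sketched in Python 3 with http.client. It assumes the existing session is announced through a Cookie header. PHPSESSID is PHP's default session name, while the session id passed in is a placeholder you would have to supply. The helper names session_headers and fetch are made up for this example.

```python
import http.client
from typing import Dict


def session_headers(session_id: str, name: str = 'PHPSESSID') -> Dict[str, str]:
    """Build a Cookie header that announces an existing session (assumed approach)."""
    return {'Cookie': '%s=%s' % (name, session_id)}


def fetch(host: str, path: str, session_id: str) -> bytes:
    """GET a page while sending the session cookie; returns the response body."""
    conn = http.client.HTTPConnection(host)
    try:
        conn.request('GET', path, headers=session_headers(session_id))
        return conn.getresponse().read()
    finally:
        conn.close()
```

In the Python 2 script the equivalent would presumably be a single h.putheader('Cookie', ...) call between h.putrequest() and h.endheaders(). If your site uses a different session name, pass it as the name argument.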