Wednesday, March 9, 2011

Building a ColdFusion-based spider - Crash course

So you are interested in building a spider in ColdFusion. I will try to explain the basic principles of building your own. First we have to access data on the web. Below is a simple example of downloading data with the cfhttp tag.

<cfhttp method="get" timeout="30" redirect="no" getasbinary="yes" url="http://coldfusion9.blogger.com" charset="utf-8" userAgent="Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv: Gecko/20110303 Firefox/3.6.15">
 <!--- Ask the server for an uncompressed, uncached response --->
 <cfhttpparam type="header" name="Accept-Encoding" value="no-compression">
 <cfhttpparam type="header" name="Cache-Control" value="no-cache">
</cfhttp>

You might have noticed that I set the getasbinary attribute to yes. This gives us the raw bytes, so we can convert ISO-8859-1 (or any other character set) to UTF-8 ourselves. One of the many problems you will encounter is determining which character set the spidered page uses. In many cases the cfhttp tag does not return the proper character set. The code below finds the character set and, if necessary, converts the content to UTF-8.

<cfif cfhttp.errorDetail eq ''>
 <!--- Start from the charset reported in the response headers, if any --->
 <cfset mycharset = trim(cfhttp.charset)>
 <cftry>
  <cfif len(mycharset) eq 0>
   <!--- No charset in the headers: default to ISO-8859-1 and look for a META declaration --->
   <cfset mycharset = "ISO-8859-1">
   <!--- ISO-8859-1 maps every byte to a character, so it is safe for a first decode --->
   <cfset filec = CharsetEncode(cfhttp.fileContent, "ISO-8859-1")>
   <cfset pat = "<meta[^>]*charset\s*=\s*[""']?([^\s""'>;/]+)">
   <cfset local_re = refindnocase(pat, filec, 1, true)>
   <cfif local_re.pos[1] gt 0>
    <!--- Index 1 is the whole match, index 2 is the captured charset name --->
    <cfset mycharset = Mid(filec, local_re.pos[2], local_re.len[2])>
   </cfif>
  </cfif>
  <!--- Decode the raw bytes with the detected charset --->
  <cfset filec = CharsetEncode(cfhttp.fileContent, mycharset)>
  <cfcatch type="any">
   <!--- Unknown or misspelled charset: fall back to the default decoding --->
   <cfset filec = ToString(cfhttp.fileContent)>
  </cfcatch>
 </cftry>
</cfif>
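
To see why the character set matters, here is a minimal sketch that decodes the same bytes with two different character sets. The variable names (rawBytes, asLatin1, asUtf8) are made up for this illustration, and the sample text stands in for a downloaded page.

```cfml
<!--- "café" as ISO-8859-1 bytes: the é is the single byte E9 (Chr(233)) --->
<cfset original = "caf" & Chr(233)>
<cfset rawBytes = CharsetDecode(original, "iso-8859-1")>

<!--- Decode the same bytes twice: once with the right charset, once with the wrong one --->
<cfset asLatin1 = CharsetEncode(rawBytes, "iso-8859-1")>
<cfset asUtf8 = CharsetEncode(rawBytes, "utf-8")>

<!--- Compare both results with the original string --->
<cfset latin1Ok = (Compare(asLatin1, original) eq 0)>
<cfset utf8Ok = (Compare(asUtf8, original) eq 0)>
<cfoutput>ISO-8859-1 round trip ok: #latin1Ok#, UTF-8 round trip ok: #utf8Ok#</cfoutput>
```

Decoding with the matching character set should give back the original text, while decoding the lone E9 byte as UTF-8 should not, because that byte on its own is not a valid UTF-8 sequence. This is the kind of damage you get when a page is decoded with the wrong character set.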

Okay, we have downloaded content from a URL and converted it to UTF-8. The next step is applying patterns to the content. This is where the fun begins. As of ColdFusion version 8 you can use the REMatch function to find and extract patterns. However, I advise you not to use this function because it has some limitations; for example, it returns only the complete matches and gives no access to capture groups. Use Java's regular expression classes instead:

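<cfset objPattern = CreateObject("java","java.util.regex.Pattern").compile(JavaCast("string", yourpattern)) />
<cfset objMatcher = objPattern.matcher(filec) />
<cfloop condition="objMatcher.find()">
 <!--- group() returns the text of the current match --->
 <cfoutput>#HTMLEditFormat(objMatcher.group())#<br></cfoutput>
</cfloop>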
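
For a spider, the most useful pattern is usually one that pulls links out of a page. The following is a minimal sketch of that idea, not a complete link extractor: the pattern is a simplified assumption that only handles quoted href attributes, and the names html, linkPattern and links are made up for this example. In the spider itself you would pass filec instead of the sample string.

```cfml
<!--- Sample input; in the spider this would be the decoded page content (filec) --->
<cfset html = '<p><a href="http://example.com/a">A</a> and <a class="x" href="/b?id=1">B</a></p>'>

<!--- Assumed pattern: capture the quoted href value of an <a> tag in group 1 --->
<cfset linkPattern = "(?i)<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']">

<cfset objPattern = CreateObject("java", "java.util.regex.Pattern").compile(JavaCast("string", linkPattern))>
<cfset objMatcher = objPattern.matcher(html)>

<!--- Collect group 1 (the URL) of every match --->
<cfset links = ArrayNew(1)>
<cfloop condition="objMatcher.find()">
 <!--- JavaCast picks the group(int) overload explicitly --->
 <cfset ArrayAppend(links, objMatcher.group(JavaCast("int", 1)))>
</cfloop>

<cfdump var="#links#">
```

With the sample input above, the array should contain the two href values, http://example.com/a and /b?id=1. Relative links such as the second one still have to be resolved against the URL of the page before you can spider them.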

Now you have learned some basic principles of spidering and extracting data from a web page. If you like this information and would like to know more about building a web spider in ColdFusion, please leave a comment.
