Basic Data Scraping Using Curl and PHP

If you've read the two previous tutorials, you picked up some good basics on how to scrape. However, the PHP function file_get_contents() is rather limited, and I've received several questions about things it can't easily do.

To do all these things we need something much more versatile. This is where CURL comes into play: it's a library that lets you do all of them. I'm going to go over the exact same examples we already did, only this time we're going to use the CURL library rather than file_get_contents().

Getting a Page Using Curl and PHP

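Script -

<?php
$url = "http://oooff.com";                        // page to scrape
$ch = curl_init($url);                            // create a CURL handle for that URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page instead of printing it
$curl_scraped_page = curl_exec($ch);              // run the request
curl_close($ch);                                  // free the handle
echo $curl_scraped_page;
?>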

Script Explanation for Curl and PHP scraping -

We'll start at line 3. If line 2 is unfamiliar, go back to the first PHP scraping tutorial.

Line 3.
$ch = curl_init($url);
This initializes the CURL library in your PHP script. On initialization we set the URL to be scraped, which is $url in this case. The way CURL works is that we first set a bunch of options, such as the URL and what we want done with the returned data. You can set as many options as you need, and only at the end do you call the CURL library to do the work. You'll also see that we assign the initialized library to what's called a "handle", $ch. The handle stores this specific CURL instance so that we can work with it.
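The handle idea is easier to see when the initialization and the URL setting are split apart. Here is a minimal sketch, not part of the original script: make_handle() is a made-up helper name, and it assumes the CURL extension is loaded. Calling curl_init() with no argument creates an empty handle, and the URL is then set with CURLOPT_URL like any other option.

```php
<?php
// Hypothetical helper (not from the tutorial): builds a configured handle
// but does not fetch anything yet.
function make_handle($url)
{
    $ch = curl_init();                               // empty handle, no URL yet
    curl_setopt($ch, CURLOPT_URL, $url);             // same effect as curl_init($url)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page instead of printing it
    return $ch;                                      // nothing is requested until curl_exec($ch)
}
```

Calling make_handle("http://oooff.com") and then passing the result to curl_exec() should behave like lines 3 to 5 of the script.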

Line 4.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
Line 4 looks like a function call, and it is. The curl_setopt() function is how we set options on our CURL handle so it's ready to go once we decide to execute it in our PHP script. Here we're setting an option on $ch, the handle we initialized on line 3. The option is CURLOPT_RETURNTRANSFER, which tells the handle to return the transfer as the result of curl_exec(), which we store in $curl_scraped_page, rather than print it to the screen. We want it in a variable because then we can do something with the output: write it to a file, use regex to parse the data, insert it into a db, or just echo it to the screen as we do here. If we didn't set this option, the page would print straight to the screen and we couldn't do anything with it. If you want to see this, comment out the line by putting // in front of it and run the script again. The final value we pass to the function is true, which means yes, we want this option enabled. Putting false here would be the same as not setting the option at all.
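To make the return value concrete: with CURLOPT_RETURNTRANSFER set to true, curl_exec() gives back the page as a string, or false if the request failed. The sketch below wraps that in a small function. fetch_page() is a made-up name for illustration, and turning a failure into null is just one possible choice.

```php
<?php
// Hypothetical helper: returns the page body as a string, or null on failure.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);   // string on success, false on failure
    curl_close($ch);

    // Compare with === so that an empty page ("") is not mistaken for a failure.
    return ($result === false) ? null : $result;
}
```

You would use it as $page = fetch_page("http://oooff.com"); and check for null before doing anything with $page.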

Line 5.
$curl_scraped_page = curl_exec($ch);
As you might have guessed, here's where we actually say "run all the stuff we've set" in the preceding lines and return the scraped data to the variable $curl_scraped_page. Now we have the same thing we had in the previous examples using file_get_contents(). You're probably asking yourself, "Isn't this just a ton more code to do the same thing as $curl_scraped_page = file_get_contents("http://oooff.com");?" You're absolutely right. In this example it doesn't make a lot of sense, but it's going to allow us to do some other things we can't do with just file_get_contents().
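As a small taste of what the handle offers, it also carries details about the request once curl_exec() has run. The following is only a sketch under a few assumptions: fetch_with_info() is a made-up name, and the array layout is just for illustration. It uses curl_getinfo() to read the HTTP status code and curl_error() to read the error message, both before the handle is closed.

```php
<?php
// Hypothetical helper: fetches a URL and reports the body, HTTP status and any error.
function fetch_with_info($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 if no HTTP response was received
    $error  = curl_error($ch);                        // empty string when nothing went wrong
    curl_close($ch);

    return array(
        'body'   => ($body === false) ? null : $body,
        'status' => $status,
        'error'  => $error,
    );
}
```

For example, $r = fetch_with_info("http://oooff.com"); lets you check $r['status'] and $r['error'] before you try to parse $r['body'].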

Trying Things Out -

Go ahead and make a script on your server, or on your local machine if you're running PHP locally, paste in the script from above, and run it. What you should get is the http://www.oooff.com homepage without the styling. Is that what you see? If so, CURL is working on your server.

If you get an error like "Fatal error: Call to undefined function curl_init() in Filename on line 3", CURL isn't enabled on your server and you need to enable the CURL module in your PHP install. If you have hosting somewhere, the easiest fix is to ask your host to do it. If it's at home, find your php.ini file and search it for curl. You'll probably see a line with a semicolon in front of it. If you do, delete the semicolon and save the file. Restart Apache and try running your script again. Hopefully all is well; worst case, there's always Google.
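If you'd rather check for CURL before hitting that fatal error, a quick test script can tell you whether the extension is loaded and which php.ini file PHP is reading. This is a minimal sketch: curl_is_available() is a made-up helper name, while extension_loaded(), function_exists() and php_ini_loaded_file() are standard PHP functions.

```php
<?php
// Hypothetical helper: true when the CURL extension is loaded in this PHP install.
function curl_is_available()
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
    echo "CURL is enabled\n";
} else {
    echo "CURL is not enabled\n";
}

// php_ini_loaded_file() returns the path of the php.ini in use, or false if none is loaded.
$ini = php_ini_loaded_file();
echo "php.ini in use: " . ($ini === false ? "none" : $ini) . "\n";
```

Keep in mind that the command line and the web server can use different php.ini files, so run the check the same way you run your scraper.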

Other things to try -

  1. Comment out the curl_setopt() line and see what happens.
  2. Try other URLs and see how they work.
  3. Look up CURL on http://www.php.net and play with some of the other options.
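If you want a starting point for item 3, here is a sketch that sets a few options at once with curl_setopt_array(). The particular options, the 10 second timeout, the user agent string, and the name make_handle_with_options() are my own picks for illustration, not something the later lessons depend on.

```php
<?php
// Hypothetical helper: a handle with a few commonly used options set.
function make_handle_with_options($url)
{
    $ch = curl_init($url);
    $ok = curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,              // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,              // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,                // give up after 10 seconds
        CURLOPT_USERAGENT      => 'my-scraper/0.1',  // made-up user agent string
    ));

    // curl_setopt_array() returns false if any option could not be set.
    return $ok ? $ch : false;
}
```

Pass the returned handle to curl_exec() just like $ch in the script, then try changing one option at a time to see what it does.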

Conclusion -

Although it's a bit more code, the CURL library is going to let us do a lot of things going forward that file_get_contents() just won't. Stay tuned for the next lesson, where we'll talk about some of the CURL options, how to set them, and what they're good for.
