Very Basic Parsing of Returned Web Data - Tutorial
Alright, I'm sure you're saying to yourself, "OK, I have all this data (web page, file data, it's all the same to us), but I really want to extract some very specific data out of it." Does that sound like what you're looking for? Well, what we'll do is a basic PHP web scrape just like in the first tutorial, but this time we're going to pull some data out of it. For our example, we'd like to find out how many pages of our site are indexed by MSN and return just that scraped number. Sound like something useful? Hopefully this will give you the very basics of parsing out data. So let's go!
Most Basic Web Data Parsing Script
Whole script -
The whole script, minus the line numbers of course. Those are just there for our reference.
1. <?php
2. $data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
3. $regex = '/Page 1 of (.+?) results/';
4. preg_match($regex, $data, $match);
5. var_dump($match);
6. echo $match[1];
Script Explanation -
Ok here goes with the basic explanation...
$data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
Now, if you studied up on the first tutorial, you'll know that we're pulling data from MSN search using the file_get_contents() function and assigning it to the $data variable.
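Here's a minimal sketch of that same call with a safety check added. file_get_contents() returns false when it can't read anything, so it's worth testing for that before you parse. To keep the sketch runnable without a network connection, it reads a temporary local file instead of the MSN URL; the $sample string and the temp file are assumptions made up for illustration.

```php
<?php
// Stand-in for the MSN results page (made-up sample; the real script fetches a URL).
$sample = '<h5>Page 1 of 9,138 results</h5>';

// Write the sample to a temporary file so file_get_contents() has something to read.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, $sample);

// Same call as line 2 of the script, just pointed at a file instead of a URL.
$data = file_get_contents($path);

// file_get_contents() returns false on failure, so check before parsing.
if ($data === false) {
    echo "Could not read the data\n";
} else {
    echo strlen($data) . " bytes read\n";
}

unlink($path); // clean up the temporary file
```

Swap $path for the MSN URL and the check works the same way.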
However, we're also passing some data in the URL to get the specific page from MSN that we want to scrape. If you already know about passing variables in the URL, you can skip to Line 3.
You might be asking, what is all that stuff after the MSN URL? I'm sure you've seen it plenty of times but might not have been sure what it was. Basically, it's just like passing a variable in a PHP script, but you're doing it through a URL. Let's take a peek at the URL we're using here to get a better understanding. Our URL, if you don't remember, is "http://search.msn.com/results.aspx?q=site%3Afroogle.com".
Let's break it into two parts, split on the question mark. Why, you ask? That's where the URL ends and the data being passed begins. With it separated we have:
http://search.msn.com/results.aspx
q=site%3Afroogle.com
Now, I hope I don't need to go into an explanation of the first part, so I'm really only going to talk about the second. I'll also do some basic tutorials on accepting data later so you have an understanding of what happens to this URL on the other side. When you look at the second part of the URL you'll always see a field and a value for the field, although sometimes that value is blank. How do you know which is the field and which is the value, you ask? The field always comes before the equals sign (=) and the value comes after. Basically, think of it like assigning a value to a variable. In the data being passed by this URL our field is "q", if you didn't already guess, and our value is site%3Afroogle.com. The 'q' field that MSN takes stands for query. So passing data assigned to the 'q' field is telling MSN, "Hey, look this search/query up for me."
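If you'd like to see that split happen in code, here's a small sketch using PHP's built-in explode() and parse_str() functions. The variable names ($address, $query, $fields) are just picked for this example.

```php
<?php
$url = 'http://search.msn.com/results.aspx?q=site%3Afroogle.com';

// Split on the first question mark: address on the left, passed data on the right.
list($address, $query) = explode('?', $url, 2);
echo $address . "\n"; // http://search.msn.com/results.aspx
echo $query . "\n";   // q=site%3Afroogle.com

// parse_str() breaks the data into field => value pairs and decodes the values.
parse_str($query, $fields);
echo $fields['q'] . "\n"; // site:froogle.com
```

Notice that parse_str() hands the value back already decoded, so the %3A shows up as a plain colon.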
The value assigned to the field 'q' is site%3Afroogle.com. The first thing you're probably thinking is, what in the world is that %3A? I didn't type that. Well, to keep things very simple, certain characters can't be passed as-is through URLs, things like colons, quotes, semicolons, etc., because they're reserved and mean certain things to a web server when it sees them. So we need to use another form of formatting. In this case we're converting the ':' in site:froogle.com to an encoded value, %3A (more on that later). What we're asking with the site: command in MSN is how many pages from site X are in your search engine. So specifically, how many pages from froogle.com are indexed in MSN.
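You don't have to do that conversion by hand. As a quick sketch, PHP's urlencode() function will encode the value for you and urldecode() turns it back; the $search and $encoded variable names are only for illustration.

```php
<?php
$search = 'site:froogle.com';

// urlencode() swaps reserved characters for their encoded form.
$encoded = urlencode($search);
echo $encoded . "\n"; // site%3Afroogle.com

// Build the same URL the script uses.
$url = 'http://search.msn.com/results.aspx?q=' . $encoded;
echo $url . "\n";

// urldecode() goes the other way.
echo urldecode($encoded) . "\n"; // site:froogle.com
```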
$regex = '/Page 1 of (.+?) results/';
First things first: when we're scraping a page we're scraping the source code of the page, so that's always what we want to be looking at when we're picking out what to grab. If you don't know this, you'd better learn it or you're probably going to be lost. Go to View Source in your browser, then search for what you're looking to pull out. Here's a chunk of the source code we're going to pull our value out of.
<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5> <b>
Now that we have the data we want to get the result from, we can get into the meat of the parsing. I know that to most of you regex is a big scary thing with all those crazy symbols and patterns. And well, if you want to be a regex master then yes, it's pretty daunting. But don't let all those funny chars scare you, 'cause there's a real simple way to use regex. The regex gurus and preachers will mock you and say you're bastardizing it, but I say whatever works.
I'm not going to go into much detail on the assignment itself: we're just assigning a string to a variable in this statement. Anytime you see $varname = 'something here'; or $varname = "something here"; you know it's just a value being assigned to a variable. Also note that for a plain string like this one you can use single ' and double " quotes interchangeably; double quotes additionally expand variables and escape sequences, so single quotes are the safer habit for regex patterns.
(.+?) is our best friend when it comes to regex. It basically means match everything starting after the text at the beginning (I'll call that text an anchor too, so be prepared for me to use the terms interchangeably) and stopping at our end text/anchor. The parentheses mark the part we want handed back to us, and the ? tells the match to stop at the first closing anchor it finds rather than the last. Something like this:
opening anchor text here (.+?) closing anchor text here
Pretty easy, huh? Yeah, I thought so. The only other thing to note is the forward slashes in '/stuff/'; that's a regex thing. Just know that in PHP the pattern to match always has to sit between delimiters, and forward slashes are the usual choice.
Of course I can talk about regex all day and type 1000 pages on it. But for now I'm trying to keep it super simple.
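To see why the ? in (.+?) matters, here's a small sketch that runs the pattern with and without it. It uses preg_match(), which is explained in the next step. The $data string is the chunk of source from above with an extra, made-up "results" tacked on so the difference shows up.

```php
<?php
// Source chunk from above, plus a made-up second "results" further along.
$data = '<h5>Page 1 of 9,138 results</h5> <b>more results</b>';

// With the ?: (.+?) stops at the first closing anchor it can reach.
preg_match('/Page 1 of (.+?) results/', $data, $lazy);
echo $lazy[1] . "\n"; // 9,138

// Without the ?: (.+) keeps going to the last closing anchor.
preg_match('/Page 1 of (.+) results/', $data, $greedy);
echo $greedy[1] . "\n"; // 9,138 results</h5> <b>more
```

That's why the version with the ? is the one to reach for when you just want what sits between two anchors.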
preg_match($regex, $data, $match);
Ah, a new function's in town: preg_match(). preg_match() is the PHP function that runs a regex for a single match. So anytime we want to match one thing in our data, we're going to call the parsing function preg_match().
With preg_match() we're doing something called passing data to the function for it to work on. In this case we're passing $regex, $data, and $match. We know what both $regex (the parsing string we just made) and $data (the scraped page from MSN) are, but what is the $match variable? It's just the variable that our parsed data is going to be returned to. In plain English we're saying: take $data and apply the filter $regex to it, then dump whatever comes through that filter into $match. Make sense?
I sure hope you said yep, that's easy.
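One thing worth knowing: preg_match() also returns a number, 1 if the pattern matched and 0 if it didn't. Here's a sketch that uses that to avoid printing junk when nothing is found. To keep it runnable on its own, $data holds the source chunk from above instead of a live page.

```php
<?php
// The source chunk from above stands in for the scraped page.
$data  = '<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5> <b>';
$regex = '/Page 1 of (.+?) results/';

// preg_match() returns 1 when the pattern matches and 0 when it doesn't.
if (preg_match($regex, $data, $match) === 1) {
    echo "Found: " . $match[1] . "\n";
} else {
    echo "No match, maybe the page layout changed\n";
}
```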
var_dump($match);
The function var_dump() is your best friend as a programmer. It says: whatever is in this variable or array, dump it out onto the screen so I can see what's happening. So this line will output this onto the screen:
array(2) {
  [0]=>
  string(23) "Page 1 of 9,138 results"
  [1]=>
  string(5) "9,138"
}
Array? What's that? Well, this is as good a time as any to introduce what an array is. They're extremely useful tools for you to know. So let's back up a little. We know that a variable is something that holds one thing, right? Well, an array is just like a variable except it holds multiple things. I like to think of it like this. Stop and imagine a train for a second. It has all these cars on it that hold things, right? Well, a variable is a single car and can only hold a single thing, whereas an array is like a train that has multiple cars to hold things. In the output above we have a two (2) cell array, which is just like a 2 car train. In car 0 we have the string 'Page 1 of 9,138 results' and in car 1 we have the string '9,138', which is the result we want, right?
You might be asking why preg_match() returns an array rather than just a simple string. It does this to give you two options on how to match things. You'll notice car/cell 0 has the anchors included as well as the matched text, whereas car 1 only has the text inside the anchors.
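Here's the train idea as a tiny sketch. The array is built by hand with the same two values preg_match() put in $match, so you can poke at it without scraping anything.

```php
<?php
// A hand-built copy of the two-car train that preg_match() filled in above.
$match = array('Page 1 of 9,138 results', '9,138');

// count() tells us how many cars are on the train.
echo count($match) . "\n"; // 2

// Walk the train car by car; car numbers start at 0.
foreach ($match as $car => $contents) {
    echo "Car " . $car . " holds: " . $contents . "\n";
}
```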
echo $match[1];
What's with the new notation? If you hadn't already guessed, that's how we access the cars in our train. If we have an array and what we want is in car 1, we access it by 'referencing' that car, which is what the [1] means. We want to output only what's in the second cell because we don't want the anchors included. This will output to our screen:
9,138
Which is exactly what we aimed to do.
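One small extra, not part of the script above: the scraped value is still a string with a comma in it. If you want to do math with it, a sketch like this strips the comma and turns it into a real number. The $count variable name is just for this example.

```php
<?php
// Same values preg_match() gave us above, built by hand for the example.
$match = array('Page 1 of 9,138 results', '9,138');

// Remove the thousands separator, then cast the string to an integer.
$count = (int) str_replace(',', '', $match[1]);

echo $count . "\n";       // 9138
echo ($count + 1) . "\n"; // 9139, so it behaves like a number now
```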
Other things to try -
So fun stuff to try using our new skills.
1. Use the link: command in MSN and see if you can get the number of links for a domain. Don't forget that : = %3A
2. See if you can get the title of any web page. Hint: anchors are going to be <title> and </title>.
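If you get stuck on number 2, here's one possible sketch. It runs against a made-up page stored in a string, so swap in file_get_contents() with a real URL when you try it yourself. The one catch is that the forward slash in </title> has to be escaped with a backslash, since forward slashes are already wrapping the pattern.

```php
<?php
// Made-up page source standing in for a scraped web page.
$data = '<html><head><title>My Sample Page</title></head><body>Hello</body></html>';

// The / in </title> is escaped as \/ so it isn't read as the end of the pattern.
$regex = '/<title>(.+?)<\/title>/';

if (preg_match($regex, $data, $match) === 1) {
    echo $match[1] . "\n"; // My Sample Page
} else {
    echo "No title found\n";
}
```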
You can make some pretty cool tools with just the two very basic things I've shared with you so far: pulling data from somewhere using the file_get_contents() function, and parsing it with the preg_match() function. Have fun with it and I'll see you on the next data scraping tutorial.