Home arrow XML arrow Page 2 - Google Docs and Xpath Data Functions

Parsing Data with the Xpath ImportXML Function - XML

In this tutorial, we will teach you how to use Xpath data functions (such as ImportXML) to retrieve information from Google Docs and other outside sources.

TABLE OF CONTENTS:
  1. Google Docs and Xpath Data Functions
  2. Parsing Data with the Xpath ImportXML Function
By: Codex-M
Rating: starstarstarstarstar / 2
March 15, 2011

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement


But what if you are interested in creating a more complicated Xpath query to retrieve data from external sources?

Example2: Suppose you are interested in automatically retrieving the top 100 ranking URLs in the Yahoo search engine for the keyword “php developer,” as indicated in the screen shot shown below:



Of course, manually retrieving each of the URLs from position 1 to position 100 is a tedious and time-consuming activity. You can use the ImportXML function to automate this process neatly within the Google Docs environment and save time.

First, you need to define the URL from which to import the data. To do this, simply go to www.yahoo.com and type the keywords "php developer" in the Yahoo search box. Once you see the results, click “Options” (located beside the search button), and then click “Advanced Search.” Under “Number of Results,” change “10 results” to “100 results” and click “Yahoo Search.”.

The search results will then be updated to provide the Top 100 results. Now get the Yahoo search result URL you see in the browser address bar, which should look something like this:

http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=yfp-t-701&p=php+developer&vs=

This is value of the URL which you will be using in your ImportXML function.

Finally, you will need to define your Xpath query. As you may have suspected, this can be quite complex because the Yahoo search result page contains a lot of elements that you can grab and analyze.

To easily formulate your Xpath query for such complex pages as this, you can use your favorite browser developer tools, such as Firebug or the Inspect Element feature in Google Chrome. There are also developer tools in Internet Explorer which you can use as well.

Inspect/focus it on the area of interest until you find the div section of the search results. In this example the div region covering the search result is:

<div id="main">

<!--Yahoo search result within this div-->

</div>

Below is a screen shot of the affected code region (encircled in red):



The next thing you will want to take note of is under the element: <div id="main">; here you will see two hyperlinks which are a child or a descendant of the element <div id="main">. The first hyperlink element contains the correct URL you need to retrieve (enclosed within the bigger red circle), while the other hyperlink element is not the correct one because it is pointing to this URL: http://search.yahoo.com/r/_ylt, which is not part of the search result URLs pointing to external websites.

What's more interesting is that the correct hyperlink you need to retrieve contains a unique identifiable attribute:

class='yschttl spt'

If you are not going to specify this attribute in your Xpath query later, keep in mind that it will  retrieve the wrong hyperlink element.

Finally, to retrieve the URL of the hyperlink, you need to grab the value of the href attribute.

So what is the final Xpath query? You can find it below:

//div[@id='web']//a[@class='yschttl spt']/@href

This tells the ImportXML function to “extract all the values of the href attribute in the hyperlink element a where it has an attribute of  “class” with value='yschttl spt', which is also descendant of the div element with an attribute id whose value is equal to 'web'”

This is the exact Xpath query required to retrieve the URLs of the ranking pages in the Yahoo search result.

Implementation On Google Docs

1. Implement the function below by pasting it into the formula window in your Google Docs spreadsheet:

=importXml("http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=yfp-t-701&p=php+developer&vs="; "//div[@id='web']//a[@class='yschttl spt']/@href")


2. It will automatically parse the URL of the Top 100 URLs in Yahoo search result for the keyword “php developer.” You can then apply any formatting and labels to the Google Docs spreadsheet so that the report will be presentable and clear.



You can view the actual Google Docs spreadsheet here: https://spreadsheets.google.com/ccc?key=0Aihbrnddtq2EdGxlbFVOTTh5NURNTHJjVDVpYnlmSWc&hl=en&authkey=CLGFqYQI illustrating Example1 and Example2 as explained in this tutorial.

Example1: Uses the Basic and Simple Xpath query in ImportXML function to retrieve URLs.

Example2: Uses a more advanced Xpath query to retrieve search result URLS in Yahoo Search Engine for a specific keyword.

Where can I actually find the ImportXML function formulas implemented in the Google Docs sheet?:

In the Example1 sheet: Go to Cell A1

In the Example2 sheet: Go to Cell B6

Of course there are a lot of ways you can retrieve important information from web pages using the Xpath ImportXML function, not just the URLs in the hyperlinks.



 
 
>>> More XML Articles          >>> More By Codex-M
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

XML ARTICLES

- Google Docs and Xpath Data Functions
- Flex Array Collection Sort and Filtering
- The Flex Tree Control
- Flex List Controls
- Working with Flex and Datagrids
- How to Set Up Podcasting and Vodcasting
- Creating an RSS Reader Application
- Building an RSS File
- An Introduction to XUL Part 6
- An Introduction to XUL Part 5
- An Introduction to XUL Part 4
- An Introduction to XUL Part 3
- An Introduction to XUL Part 2
- An Introduction to XUL Part 1
- XML Matters: Practical XML Data Design and M...

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: