In this tutorial, we will teach you how to use Xpath data functions (such as ImportXML) to retrieve information from Google Docs and other outside sources.
Google docs has already proven itself to be useful to office workers highly dependent on the Internet for sharing information. It allows users to share spreadsheets, active collaboration, and create questionnaires/forms more easily than any other solution out there.
This tutorial will take a look at some of the rare and important uses of Google docs which many ordinary users may not know. These are called external data functions. These functions are meant to retrieve data from external sources outside of your Google docs spreadsheet. For example, suppose you are working on a Google docs spreadsheet and you need to retrieve the list of hyperlink URLs found on web page X, or you need to grab all of the top ranking URLs from the search result of keyword X.
This tutorial will focus on the implementation of the ImportXML function, one of the important external data functions available in Google Docs used to retrieve data from outside sources.
Basic Definition of ImportXML Function
The syntax of this function is:
importXML("URL";"query")
Definition:
URL- the URL of the website HTML page. It can also be the URL of an XML page. Query - the Xpath query you would like to run that defines which specific data you would like to retrieve in a URL or XML page.
If you are completely new to Xpath queries, I recommended you read the tutorial over at W3Schools first: http://www.w3schools.com/xpath/xpath_syntax.asp for reference.
Example 1: Suppose you want to grab the URL of all the hyperlinks on this page: http://www.php-developer.org/site-map/
Solution:
A hyperlink uses this format: <a href="http://www.example.com/">Anchor text</a>. To extract the URL of the hyperlinks, you should use this Xpath query:
Xpath query = //a/@href
Discussion: According to the Xpath documentation at: http://www.w3schools.com/xpath/xpath_syntax.asp , // instructs to select nodes in the document from the current node that match the selection no matter where they are. Since you are interested in picking up hyperlink URLs regardless of where they appear on the page, you should start the query with //.
This also implies that //a means to select all a elements no matter where they are on the page. Once the a element has been selected, the rest of the query /@href means to extract the value of the attribute href of the element.
Therefore the final instruction of the Xpath query: //a/@href means literally to extract all the values of the href attribute element regardless of where they appear on the page: http://www.php-developer.org/site-map/
Implementation in Google docs:
To implement the above function in your Google docs spreadsheet as well as to retrieve the data, follow the steps below:
1. Log in to http://docs.google.com/ with your Google account.
2. In the empty cell A1, paste the function below in the formula window:
3. Google docs will automatically populate the columns with all the URLs of the hyperlinks that are retrieved from the page. See the screen shot below: