But what if you want to create a more complicated XPath query to retrieve data from external sources?

Example2: Suppose you want to automatically retrieve the top 100 ranking URLs in the Yahoo search engine for the keyword “php developer,” as indicated in the screen shot shown below:

![]()

Of course, manually retrieving each of the URLs from position 1 to position 100 is tedious and time-consuming. You can use the ImportXML function to automate this process neatly within the Google Docs environment.

First, you need to define the URL from which to import the data. To do this, go to www.yahoo.com and type the keywords "php developer" in the Yahoo search box. Once you see the results, click “Options” (located beside the search button), and then click “Advanced Search.” Under “Number of Results,” change “10 results” to “100 results” and click “Yahoo Search.” The search results will then be updated to show the top 100 results.

Now copy the Yahoo search result URL from the browser address bar, which should look something like this:

http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=yfp-t-701&p=php+developer&vs=

This is the value of the URL that you will use in your ImportXML function.

Finally, you need to define your XPath query. As you may have suspected, this can be quite complex because the Yahoo search result page contains a lot of elements that you can grab and analyze. To formulate your XPath query for complex pages like this one, you can use your favorite browser developer tools, such as Firebug or the Inspect Element feature in Google Chrome. Internet Explorer also has developer tools you can use. Inspect the area of interest until you find the div section that contains the search results.
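As an aside on the search URL above, it can help to see which parts of it actually carry your settings. The following is a minimal Python sketch (not part of the Google Docs workflow) that splits the query string into its parameters. Reading `n` as the number of results and `p` as the keywords is an inference from the values in the URL, and `build_search_url` is a hypothetical helper; whether Yahoo accepts such a shortened URL is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The search URL copied from the browser address bar (as shown in the article).
SEARCH_URL = (
    "http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any"
    "&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=yfp-t-701"
    "&p=php+developer&vs="
)

# Split the query string into a dict of parameter name -> list of values.
params = parse_qs(urlsplit(SEARCH_URL).query)

# Assumption: "n" holds the results-per-page setting, "p" the search keywords.
results_per_page = params["n"][0]
keywords = params["p"][0]  # "+" in the URL decodes to a space


def build_search_url(keywords, count=100):
    """Hypothetical helper: rebuild a shorter search URL for other keywords."""
    query = urlencode({"n": count, "ei": "UTF-8", "p": keywords})
    return "http://search.yahoo.com/search?" + query


print(results_per_page, keywords)
print(build_search_url("php developer"))
```

If the inference about `n` and `p` holds, changing only those two values should be enough to reuse the same ImportXML setup for a different keyword.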
In this example the div region covering the search results is:

<div id="main"> <!--Yahoo search result within this div--> </div>

Below is a screen shot of the affected code region (encircled in red):

![]()

The next thing to note is that under the element <div id="main"> you will see two hyperlinks, each a child or descendant of that element. The first hyperlink element contains the correct URL you need to retrieve (enclosed within the bigger red circle). The other hyperlink element is not the correct one because it points to this URL: http://search.yahoo.com/r/_ylt, which is not one of the search result URLs pointing to external websites.

What's more interesting is that the correct hyperlink has a uniquely identifiable attribute:

class='yschttl spt'

If you do not specify this attribute in your XPath query, the query will also retrieve the wrong hyperlink element. The XPath query is therefore:

//div[@id='web']//a[@class='yschttl spt']/@href

This tells the ImportXML function to “extract the values of the href attribute of every hyperlink element a that has a class attribute with the value 'yschttl spt' and is also a descendant of the div element whose id attribute is equal to 'web'.”

Example2: Uses a more advanced XPath query to retrieve search result URLs in the Yahoo search engine for a specific keyword.

In the Example2 sheet: Go to Cell B6
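To see what the XPath query selects without opening a spreadsheet, here is a minimal Python sketch that runs it against a small hand-written sample. The markup in `SAMPLE_PAGE`, including the nesting of `div id="web"` inside `div id="main"` and the example.com URLs, is invented for illustration and is not Yahoo's actual page. Python's standard `xml.etree.ElementTree` supports only a subset of XPath, so the trailing `/@href` step is replaced by reading the attribute in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a search result page.
# Each result has the "real" link (class 'yschttl spt') and a second
# link that should NOT be picked up by the query.
SAMPLE_PAGE = """
<html><body>
  <div id="main">
    <div id="web">
      <ol>
        <li>
          <a class="yschttl spt" href="http://www.example.com/php-jobs">PHP jobs</a>
          <a href="http://search.yahoo.com/r/_ylt">Cached</a>
        </li>
        <li>
          <a class="yschttl spt" href="http://www.example.org/hire-php">Hire PHP developers</a>
          <a href="http://search.yahoo.com/r/_ylt">Cached</a>
        </li>
      </ol>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_PAGE)

# Same selection as //div[@id='web']//a[@class='yschttl spt'], written
# relative to the root element as ElementTree requires.
links = root.findall(".//div[@id='web']//a[@class='yschttl spt']")

# Equivalent of the final /@href step: read the attribute from each match.
urls = [a.get("href") for a in links]

# Dropping the class predicate also returns the unwanted links.
all_urls = [a.get("href") for a in root.findall(".//div[@id='web']//a")]

print(urls)           # 2 result URLs
print(len(all_urls))  # 4 links in total
```

The comparison between `urls` and `all_urls` illustrates why the class predicate matters: without it, the query returns every hyperlink under the div, not just the result links.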