In this chapter we will scrape the web site of Magic: The Gathering, and in particular its card database. (Yes, I play Magic. Not very well, but I have fun.)
Here is one example: http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430 as shown in Figure .
We will show you how we explored the HTML page using the excellent Pharo inspector: diving into the tree nodes and checking their attributes or children live is simply super cool.
Getting a tree
The first thing was to make sure that we could get a tree from the web page. For this task we used the XMLHTMLParser class and sent it the message parseURL:. How did we find this message? Simply by looking at the class-side methods of the class.
How did we find the class? By looking at the subclasses of XMLDOMParser, because HTML is close to XML, or the inverse :).
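As a minimal sketch of this first step, the snippet below parses the example card page and opens an inspector on the resulting tree. It assumes that the XMLParserHTML package is loaded in the image and that the page is still reachable at the URL given above.

```smalltalk
| doc |
"Fetch the page and parse it into a DOM tree.
Assumption: the XMLParserHTML package (which provides XMLHTMLParser) is loaded."
doc := XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'.

"Open an inspector on the document to dive into its nodes."
doc inspect.
```

From the inspector you can then navigate the children of each node without fetching the page again.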
Toying with the inspector, we came up with the following ugly expression to get the name of the JPEG:
Ugly, isn't it? This happens often when scraping HTML, but we can do better.
By the way, note that we started entering XPath commands directly in the XPath pane, using the do it and go facilities of the inspector.
This way we do not have to get the page from the internet all the time.
We could not really show you such ugly expressions, so we had to find a better one.
So first we looked at the img elements that have a src attribute, as shown below and in Figure .
Then as shown in Figure we inspected the right node.
Finally since we were on this exact node, we looked in its class to see if we could get an API to get the attribute in a nice way as shown in Figure .
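The steps above can be sketched as follows. This is only an illustration under assumptions: it supposes that the XPath package is loaded (it adds xpath: to XML nodes) and it simply takes the first matching img, whereas on the real page you have to inspect the result to pick the node that holds the card picture.

```smalltalk
| doc images |
doc := XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'.

"Select every img element that carries a src attribute."
images := doc xpath: '//img[@src]'.

"Inspect the node set to find the right node interactively."
images inspect.

"Once on an element, attributeAt: returns the value of the attribute.
Assumption: the first match is used here only for illustration."
images first attributeAt: 'src'.
```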
Now that we have the path to the image, we can use the HTTP client of Pharo to get the image as shown in Figure .
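A possible way to do this with Zinc, the HTTP client shipped with Pharo, is sketched below. The image URL is an assumption made for illustration: use the absolute form of the src value you found on the page. The sketch also assumes that the server answers with JPEG data.

```smalltalk
| imageUrl form |
"Hypothetical URL: replace it with the absolute URL built from the src attribute."
imageUrl := 'http://gatherer.wizards.com/Handlers/Image.ashx?multiverseid=389430&type=card'.

"ZnEasy getJpeg: downloads the resource and decodes it as a JPEG into a Form."
form := ZnEasy getJpeg: imageUrl.

"Inspecting the Form displays the picture."
form inspect.
```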
Since this web page is probably generated, we searched the source for, say, the string artist, and found the following matches:
This one is more interesting:
We can build queries to identify node elements having this id.
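For instance, a query on the id attribute could look like the sketch below. Since the exact id is not repeated here, the expression matches any element whose id contains the substring artist; that substring is an assumption, to be replaced by the id you found in the source.

```smalltalk
| doc nodes |
doc := XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'.

"Assumption: the ids of interest contain the substring 'artist'.
For an exact match, use a path of the form //*[@id=...] with the full id."
nodes := doc xpath: '//*[contains(@id, "artist")]'.

nodes inspect.
```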
To avoid performing an internet request each time, we typed XPath paths directly in the XPath pane of the inspector, as shown in Figure .
Now, trying to go faster, we looked at all the elements with class="row", as shown in Figure .
The following expression returns a label and value pair, for example the card name label and its value.
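One way to write such an expression is sketched here. It assumes that each row is a div with class row containing one child div with class label and one with class value; check this layout in the inspector before relying on it.

```smalltalk
| doc row label value |
doc := XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'.

"Take the first row; the relative paths below are evaluated from that node."
row := (doc xpath: '//div[@class="row"]') first.

"Assumption: label and value are child divs with these class names."
label := (row xpath: 'div[@class="label"]') first contentString trimBoth.
value := (row xpath: 'div[@class="value"]') first contentString trimBoth.

"Answer the pair as an association."
label -> value.
```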
So we can now query all the fields
Now we can convert this into a dictionary
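Querying all the fields and collecting them in a dictionary can be sketched as below, under the same assumption about the row, label and value class names. Rows that do not follow this layout are skipped.

```smalltalk
| doc fields |
doc := XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'.
fields := Dictionary new.

(doc xpath: '//div[@class="row"]') do: [ :row |
	| labels values |
	labels := row xpath: 'div[@class="label"]'.
	values := row xpath: 'div[@class="value"]'.
	"Keep only the rows that have both a label and a value."
	(labels notEmpty and: [ values notEmpty ]) ifTrue: [
		fields
			at: labels first contentString trimBoth
			put: values first contentString trimBoth ] ].

fields inspect.
```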
And convert it into JSON for fun
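The conversion itself is a one-liner. The sketch below uses STON, which is part of the standard Pharo image, and a stand-in dictionary with placeholder strings so that it does not depend on the network. In practice you would pass the dictionary built from the page.

```smalltalk
| fields |
"Stand-in dictionary: the key and the value are placeholders, not real card data."
fields := Dictionary new.
fields at: 'some label' put: 'some value'.

"STON can emit plain JSON for dictionaries with string keys."
STON toJsonString: fields.
```

If NeoJSON is loaded in your image, its writer can be used instead.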
We can now apply the same technique to all the cards, and to other pages, to extract the unique id of every card and query the database.
But this is left as an exercise.
We showed you how to access a page and navigate through it interactively using XPath and the live programming features of Pharo.
This chapter should show the great value of being able to tweak a document live and navigate it to find the information you really want.