In my previous post, I parsed data out of HTML tables so that I could glean some trivia from them. My real goal was to compile data from dozens of such tables, so I needed a way to do the whole process in Maple. Here, then, is how I used Sockets and StringTools to automate it.
First of all, getting Maple to download a webpage is relatively easy; take a look at Alec's SearchPi procedure, for example. One small problem with the SSA baby name pages, however, is that they use POST-style form submission instead of the easier-to-examine GET. So you either have to inspect the source of the <form> elements or use a Firefox plugin like Live HTTP Headers to figure out what data is being sent.
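For reference, the relevant part of the page source looks something like the sketch below. The field names (year, top, number) and the action URL come from the actual POST data; the rest of the markup is illustrative, not a copy of the SSA page:

```html
<form method="post" action="/cgi-bin/popularnames.cgi">
  <!-- year to look up -->
  <input type="text" name="year">
  <!-- how many ranks to show -->
  <select name="top"> ... </select>
  <!-- "n" asks for raw counts rather than percentages -->
  <input type="radio" name="number" value="n">
</form>
```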
In this case, there are just three fields: year, top, and number. So, asking for the top 20 names for the year 2000, with the number of babies given each name, means sending the form data:

year=2000&top=20&number=n
The complete POST request in HTTP looks like:
POST /cgi-bin/popularnames.cgi HTTP/1.0
Host: www.ssa.gov
Content-Type: application/x-www-form-urlencoded
Content-Length: 25

year=2000&top=20&number=n
Of course, we are going to replace parts of the request with different years.
So we will build a string for the request making sure the Content-Length
header matches the length of our form data:
year := "2006";
top := "1000";
number := "n";
postcontent := cat("year=", year, "&top=", top, "&number=", number);
q := cat("POST /cgi-bin/popularnames.cgi HTTP/1.0\n",
         "Host: www.ssa.gov\n",
         "Content-Type: application/x-www-form-urlencoded\n",
         "Content-Length: ", length(postcontent), "\n\n",
         postcontent);
Now that we have the request, we need to send it to the server using Sockets. First we open a connection to the web server, then we send our request with Sockets[Write]. The page is big, so it may take more than one call to Sockets[Read] to get the whole thing, so we use a while loop (Sockets[Read] returns false once the server closes the connection).
sock := Sockets[Open]("www.ssa.gov", 80);
Sockets[Write](sock, q);
s := "";
p := Sockets[Read](sock):
while p <> false do
    s := cat(s, p);
    p := Sockets[Read](sock):
end do:
Sockets[Close](sock);
Now "s" should contain the HTML source for the page with all the name data for the requested year. Unfortunately, we cannot just hand this string to XMLTools[ParseString] since, like many webpages, the SSA page is not completely valid XHTML.
To get around this, I did two things. First, I tore the page apart with StringTools[RegSplit], keeping just the table contents by splitting on the regular expression "<table[^>]*>|</table>". By inspection, the fifth piece is the table we want (you could also select the largest piece automatically). Second, we need to remove a badly formed bit from its end that starts with a "<tr>" tag.
pieces := [StringTools:-RegSplit("<table[^>]*>|</table>", s)];
thetable := cat("<table>", pieces[5]);
# trim the badly formed bit at the end, which begins with a bare <tr> tag
thetable := cat(thetable[1 ..
    StringTools[Search]("<tr>", thetable) - 1], "</table>");
Now, we should have something that we can parse with XMLTools and proceed like we did before with our hand-edited HTML:
xt := XMLTools[ParseString](thetable);
But now we can put the whole thing into a loop over all the years we want to look up, and process all the data together.
maletable := table(sparse);
femaletable := table(sparse);
for yy from 1977 to 2007 do
    # the code to retrieve and parse the page into
    # the listlist "thedata" goes here; each entry is assumed
    # to be [rank, malename, malecount, femalename, femalecount]
    for x in thedata do
        maletable[x[2]] := maletable[x[2]] + x[3];
        femaletable[x[4]] := femaletable[x[4]] + x[5];
    end do;
end do;
At this point it is a good idea to save this data so you don't have to download and parse it again:
save maletable, femaletable, "SSAData-1977-2007.mpl";
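In a later session, the tables can be restored without redoing the downloads:

```
read "SSAData-1977-2007.mpl";
```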
And, finally, the Top 20 unisex (no more than 67% male or 67% female) baby names in the US from the last 30 years:
# 1 - 150381 - Casey
# 2 - 103100 - Reilly, Riley
# 3 - 89101 - Payton, Peyton
# 4 - 76840 - Jaime
# 5 - 61732 - Avery
# 6 - 55989 - Jessie
# 7 - 50821 - Kendall
# 8 - 41180 - Skyler
# 9 - 28296 - Carey, Kerry
#10 - 25452 - Harley
#11 - 24502 - Sidney
#12 - 19206 - Justice
#13 - 17609 - Jackie
#14 - 17297 - Reese
#15 - 16382 - Jean
#16 - 15997 - Jody
#17 - 14413 - Jaiden
#18 - 14074 - Elisha
#19 - 13845 - Loren
#20 - 13572 - Sage
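For completeness, here is roughly how the ranking above could be produced from the two tables. This is a sketch, not the exact code I used: it groups spellings by their Metaphone codes (which is why entries like "Reilly, Riley" appear combined) and assumes maletable and femaletable are the sparse tables built above, indexed by name:

```
# collect every name seen in either table
allnames := {indices(maletable, 'nolist')} union {indices(femaletable, 'nolist')};

# group spellings (and their counts) by Metaphone code
groups := table();
for nm in allnames do
    key := StringTools:-Metaphone(nm);
    if assigned(groups[key]) then
        groups[key] := [groups[key][1] union {nm},
                        groups[key][2] + maletable[nm],
                        groups[key][3] + femaletable[nm]];
    else
        groups[key] := [{nm}, maletable[nm], femaletable[nm]];
    end if;
end do:

# keep only the unisex groups: no more than 67% of either sex
unisex := [];
for key in [indices(groups, 'nolist')] do
    m := groups[key][2]; f := groups[key][3]; t := m + f;
    if t > 0 and m <= 0.67*t and f <= 0.67*t then
        unisex := [op(unisex), [t, groups[key][1]]];
    end if;
end do:

# sort by total count, descending, and take the top 20
top20 := sort(unisex, (a, b) -> evalb(a[1] > b[1]))[1 .. 20];
```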
A few final notes:
- In the whole list "Baby" (#24), "Infant" (#29), and "Unknown" (#32) show up, presumably due to temporary birth certificate data sneaking into the database.
- The Metaphone algorithm combines names a little too aggressively in some cases. I would not have chosen to combine Carey and Kerry ("KR"). It will also combine Jan, Jane, Jean, John, and Joan ("JN"), for example, which is definitely not desirable.
- If you save information about year-on-year name frequency changes (instead of just the sums), you can make some cool graphs like the ones on this site.
- I want a good database of common nicknames (especially unisex nicknames) to catch all the Sams and Alexs, etc. But I don't want to have to make it by hand. Alas.
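The overly aggressive merges mentioned in the notes are easy to check directly, assuming your version of Maple has StringTools[Metaphone]:

```
map(StringTools:-Metaphone, ["Carey", "Kerry"]);
#                ["KR", "KR"]
map(StringTools:-Metaphone, ["Jan", "Jane", "Jean", "John", "Joan"]);
#                ["JN", "JN", "JN", "JN", "JN"]
```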
Attached is the file with my parsed tables: