Web Scraping Data

FileMaker can be such a wonderful desktop tool for harvesting and managing a lot of data. It’s because the user interface is baked right there into the backend database. You can whip up a powerful data parsing solution in no time. Can it handle the data? Yep! Can you build the interface right there? Yep! This sense of “data power” can be compounded even more with a little bit of know-how.

When you’re not afraid to step, just a bit, outside of FileMaker’s user interface and simply plug in to another technology or programming language, you’ll find BIG benefits when seeking the holy grail of code leverage.

In this video, I walk you through some serious insight into how the big boys and girls like to parse their web data. If there’s ever a source of content you simply can’t access in an importable format, then you have to know how to web scrape like a pro within FileMaker - Pro, that is.

Attachment: WebScrapingData.zip (383.91 KB)

Comments

Some confusion here...

I ran:

$ sudo easy_install pip
Password:

...and got:

Searching for pip
Reading https://pypi.python.org/simple/pip/
Best match: pip 9.0.1
Downloading https://pypi.python.org/packages/11/b6/abcb525026a4be042b486df43905d6893...
Processing pip-9.0.1.tar.gz
Writing /tmp/easy_install-N3Lx8L/pip-9.0.1/setup.cfg
Running pip-9.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-N3Lx8L/pip-9.0.1/egg-dist-tmp-RUncTU
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
warnings.warn(msg)
warning: no previously-included files found matching '.coveragerc'
warning: no previously-included files found matching '.mailmap'
warning: no previously-included files found matching '.travis.yml'
warning: no previously-included files found matching '.landscape.yml'
warning: no previously-included files found matching 'pip/_vendor/Makefile'
warning: no previously-included files found matching 'tox.ini'
warning: no previously-included files found matching 'dev-requirements.txt'
warning: no previously-included files found matching 'appveyor.yml'
no previously-included directories found matching '.github'
no previously-included directories found matching '.travis'
no previously-included directories found matching 'docs/_build'
no previously-included directories found matching 'contrib'
no previously-included directories found matching 'tasks'
no previously-included directories found matching 'tests'
Adding pip 9.0.1 to easy-install.pth file
Installing pip script to /usr/local/bin
Installing pip2.7 script to /usr/local/bin
Installing pip2 script to /usr/local/bin

Installed /Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg
Processing dependencies for pip
Finished processing dependencies for pip
---end of snip

Then I ran:
sudo pip install beautifulsoup4

And got:

The directory '/Users/<myUserHomeName>/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/<myUserHomeName>/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied: beautifulsoup4 in /Library/Python/2.7/site-packages/beautifulsoup4-4.5.3-py2.7.egg
---end of snip

So I used the -H flag...

and ran:

$ sudo -H pip install beautifulsoup4

Requirement already satisfied: beautifulsoup4 in /Library/Python/2.7/site-packages/beautifulsoup4-4.5.3-py2.7.egg
---end of snip

So at this point I am not sure if I am good to go or if I should have used a different directory?

I am running OS X 10.11.6 (El Capitan). Any ideas?

If anyone else has any ideas, please chime in. Matt is probably really busy.

Ah, more challenges...
The script in the third tab is a web check for which modules are installed:
<snip>
# This script is for determining which python modules are installed.
import pip
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
    for i in installed_packages])
print(installed_packages_list)
</snip>

So I replaced it with:
<snip>
#beautifulsoup webscraper

from bs4 import beautiful soup
import urllib,string,csv,sys,os,unicodedata

#get html data
url = 'https://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes'
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")

parsed = soup.findAll('td')

for td parsed:
row = td.b
prefix = row.contents(0).string.replace(' ','').encode('ascii',"ignore')")
state = row.a.string.encode('ascii','ignore') if row.a != None else ''
print(prefix + ',' + state)

""""
created: 3/30/16 by Matt Petrowsky
</snip>
and I get:
"6
PythonExecute: No compiled script object."

I guess I am done with this until I get some feedback from somebody.
:-/

Ok, I just realized I missed that you said the verify script for python was in record one and the webscraping example was in record two. So I replaced the original text and moved to record two and ran your version of the typed script and I got:
<snip>
6
Python exception: <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'contents'
</snip>
Should I be getting the correct scraped text or is the exercise for me to debug this on my own?

Same issue that MacDevGuy was having. Has this been clarified?
When running the extended script, I receive the following error:
Traceback (most recent call last):
  File "Untitled.py", line 30, in <module>
    if row.a == None:
AttributeError: 'NoneType' object has no attribute 'a'

Dwight

So glad it wasn't just me Dwight. Since there's been no response on this from Matt I am sending him an email on another topic and will ask him to take a look at it.
John

Matt,
Trying to move forward in spite of the errors. I started from the beginning and got the following:
<snip>
MacHD:~ MacDev$ sudo easy_install pip
Password:
Searching for pip
Best match: pip 9.0.1
Processing pip-9.0.1-py2.7.egg
pip 9.0.1 is already the active version in easy-install.pth
Installing pip script to /usr/local/bin
Installing pip2.7 script to /usr/local/bin
Installing pip2 script to /usr/local/bin

Using /Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg
Processing dependencies for pip
Finished processing dependencies for pip
MacHD:~ MacDev$ sudo pip install beautifulsoup4
The directory '/Users/MacDev/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/MacDev/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied: beautifulsoup4 in /Library/Python/2.7/site-packages/beautifulsoup4-4.5.3-py2.7.egg
MacHD:~ MacDev$ sudo -H pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in /Library/Python/2.7/site-packages/beautifulsoup4-4.5.3-py2.7.egg
MacHD:~ MacDev$ from bs4 import BeautifulSoup
from: can't read /var/mail/bs4
MacHD:~ MacDev$
<end snip>
So I continued by pasting your script into a new record and got the following error after running evaluate:

<snip>
6
Python exception: <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'contents'
<end snip>

I sure wish you could comment on this or one of the other posts so we can move forward. Seems there are other people who have also experienced the same error.

Even just a comment that you are looking into it would be helpful.

Here is the long version of what you need to know.

It's possible to have multiple installations of Python on a given computer. To that end, the FileMaker plugin bbox from Beezwax can't possibly know about any other installations than the default one supplied by Apple.

When it comes to extensions made available to Python these are stored in a specific location.

/Library/Python/2.7/site-packages

You can always visually see what is in a folder within the Finder if you type this in the Terminal.

open /Library/Python/2.7/site-packages

Within that folder you should see at least something that looks like this. But the version numbers may be different.

beautifulsoup4-4.6.0.dist-info
bs4
pip-9.0.1-py2.7.egg
easy-install.pth
Extras.pth
README

Here are some commands I ran on my machine to show how I also have Python3 installed.

-> % type python
python is /usr/bin/python
matt@iMac5K [22:16:12] [/Library/Python/2.7]
-> % type python3
python3 is /usr/local/bin/python3
matt@iMac5K [22:16:21] [/Library/Python/2.7]

Notice that I have an install of Python3 at /usr/local/bin.

In my case the python modules end up going into another folder.

In particular here.

/usr/local/lib/python3.6/site-packages

Essentially, the modules need to go into the site-packages for the Apple installed 2.7 version of Python. This is what the bbox plugin uses. If you have any other installations of Python you WILL experience issues - unless you know how to deal with them.

pip is what is known as a package manager. It manages the "extras" that you install for Python. So, you need to make sure the extras go into the right location.

Again, if you have any other installs of Python then you'll likely have issues. When a module is installed it is "saved" (for lack of a better word) for the version of Python you are using. In order for the FileMaker plugin to access and use these, they MUST be in the right location.
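
One quick way to check which interpreter and which site-packages folder are actually in play is to ask Python itself. The following is a minimal sketch, not part of the demo file; the helper names describe_interpreter and locate_module are made up for illustration. Running it with each Python you have installed and comparing the paths should show where bs4 really lives.

```python
from __future__ import print_function

import importlib
import sys


def describe_interpreter():
    """Return the path and version of the Python running this script."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        # Folders on the import path whose name ends in site-packages.
        "site_packages": [p for p in sys.path if p.endswith("site-packages")],
    }


def locate_module(name):
    """Return the file a module loads from, or None if it cannot be
    imported (built-in modules without a file also give None)."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    info = describe_interpreter()
    print("Interpreter:", info["executable"])
    print("Version:", info["version"])
    for folder in info["site_packages"]:
        print("site-packages:", folder)
    # bs4 is the package name that beautifulsoup4 installs.
    print("bs4 loads from:", locate_module("bs4"))
```

If the last line prints None under /usr/bin/python but a path under python3, the module went to the wrong installation.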

When installing pip, the package manager, you can always look at the help by doing this.

easy_install --help

So, for example, I ran this to see what easy_install was doing. The -v means verbose mode.

sudo easy_install -v pip

After I install pip and then beautifulsoup4, the scripts will work within FileMaker - using the bbox plugin. However...

A key thing to note is that the script was written against HTML. ANYTHING on the web can change AT ANY TIME and ESPECIALLY Wikipedia pages. When I ran the script while composing this reply, it looks like they had changed the page which is causing the script to fail. It simply needed a slight modification in order to work again.

It needed a condition like this.

if row != None:
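
To see that guard in context without depending on the live Wikipedia page, here is a small sketch of the same pattern. It is an illustration, not the revised script from the download: the sample markup is invented, and the standard library's xml.etree.ElementTree stands in for BeautifulSoup so that it runs with no extra modules. td.find("b") plays the role of td.b here; both give back None when the cell has no bold tag.

```python
from __future__ import print_function

import xml.etree.ElementTree as ET

# Invented sample markup: one complete cell, one with no link,
# and one with no <b> tag at all (the case that raised the error).
SAMPLE = """
<table>
  <tr>
    <td><b>005 <a href="/wiki/New_York">NY</a></b></td>
    <td><b>006</b></td>
    <td>not in use</td>
  </tr>
</table>
"""


def extract_rows(markup):
    """Return 'prefix,state' strings, skipping cells that lack a <b> tag."""
    results = []
    for td in ET.fromstring(markup.strip()).iter("td"):
        row = td.find("b")
        if row is None:
            # Nothing to read in this cell, so move on instead of failing.
            continue
        prefix = (row.text or "").replace(" ", "")
        link = row.find("a")
        state = link.text if link is not None else ""
        results.append(prefix + "," + state)
    return results


if __name__ == "__main__":
    for line in extract_rows(SAMPLE):
        print(line)
```

Writing "row is None" or "row is not None" is the conventional Python spelling of the same check as "row != None".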

The key takeaway here is that the FileMaker bbox plugin returns an all-or-nothing result: if it hits an error, all it can do is return that error. However, running the older version of the script within something like CodeRunner will reveal where the error is actually happening. This type of know-how is critical for debugging purposes if you're going to venture into the world of Python scripting. The bbox plugin may have a way to report the errors, but I've not looked into doing so since I compose my scripts within CodeRunner.
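
If CodeRunner isn't handy, Python can report the failing line number itself. The sketch below is only one possible approach and rests on an assumption not verified here: that whatever the script prints or returns makes it back to FileMaker through the plugin. The names run_and_report and broken_scrape are made up for illustration.

```python
from __future__ import print_function

import traceback


def run_and_report(task):
    """Call task(); return its result, or the full traceback text on failure."""
    try:
        return task()
    except Exception:
        # format_exc() names the file and line number where the error occurred.
        return traceback.format_exc()


def broken_scrape():
    """Stand-in for a scrape that hits a cell with no <b> tag."""
    row = None  # what td.b gives back when the tag is missing
    return row.contents[0]


if __name__ == "__main__":
    print(run_and_report(broken_scrape))
```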

I've uploaded a revised version of the file to this article.

I hope this helps out those who are having issues!

-- Matt Petrowsky - ISO FileMaker Magazine Editor

The python script stored in record 1 (for listing installed packages) doesn't work if the installed version of pip is too new. From the command line (terminal) determine your pip version:

$ pip --version

You should get something like:

pip 6.1.1 from /Library/Python/2.7/site-packages

My pip version was newer:

pip 10.0.0b2 from /Library/Python/2.7/site-packages/pip-10.0.0b2-py2.7.egg/pip (python 2.7)

To fix things, first uninstall pip:

$ sudo pip uninstall pip

Now, install the older version of pip:

$ sudo python -m ensurepip
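
An alternative to downgrading is to avoid pip's internals altogether, since pip.get_installed_distributions() is no longer available from pip 10 onward. The sketch below is not the script from the demo file, and list_installed is a made-up name. It tries importlib.metadata (standard library in Python 3.8 and later) and falls back to pkg_resources, which ships with setuptools; whether setuptools is present for a given Apple Python 2.7 is an assumption you would need to check.

```python
from __future__ import print_function


def list_installed():
    """Return a sorted list of 'name==version' strings for installed packages."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import pkg_resources  # provided by setuptools on older Pythons
        pairs = [(d.project_name, d.version) for d in pkg_resources.working_set]
    else:
        pairs = [(d.metadata.get("Name"), d.version)
                 for d in metadata.distributions()]
    # Skip any entry whose metadata is incomplete.
    return sorted("%s==%s" % (name, version)
                  for name, version in pairs if name and version)


if __name__ == "__main__":
    print(list_installed())
```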

I got some of the funky printouts also, but was successful in running the python script. Great help, have been struggling with large data sets while using FM... been -slowly- learning python. Thanks, big help! Onward/upward.

t, honolulu- data dabbler

I'm using the basic webscraping example and am running into an error when there are two lines of code before the data I need. Example:

<td class="datalabel">Address:</td>
<td colspan="3" class="datavalue">
628 South Illinois Street, Litchfield, IL 62056
</td>

I know to put a \ before the quotation marks, but I think the extra line is throwing my calc off. Any help is much appreciated!

Brad Schopp

If possible, please demo how to do webscraping for an Amazon web page, including pictures.

Vincent Lu

Hey guys,

I've been trying to solve this problem in Python, but I can't seem to crack it. I keep getting this error.

AttributeError: 'NoneType' object has no attribute 'a'

It's in regards to this section of the code starting at line 28:

for td in parsed:
    row = td.b
    if row.a == None:    <----- the error points to this line
        state = ''
    else:
        state = row.a.string

I'm wondering if it has to do with python being version 3. I'm thinking maybe I'm doing something illegal with a NoneType.
Sorry for asking something that might be really simple, but I've been pulling my hair out for a couple of days with this issue. I've been surfing the web to find something to explain what this problem is, but still nothing.

Could anyone please enlighten me on what is going on?