Meet pup
Today, while browsing the IndieWeb chat logs, I found a link to a tool called pup. Pup parses HTML and writes the result to STDOUT. More than this, it can also output the result as JSON. This means you can do clever things, like extract attributes, without resorting to an FSM such as a regex, or writing your own parser.
Quick usage
It is really simple to use pup. In this example, I pair it with jq and curl.
curl -s https://www.lewiscowles.co.uk | pup --color '#content a json{}' | jq 'map({text, href})'
You do not have to use curl or jq with pup. You could, for example, use the command-line utility cat, as you may notice if you visit the GitHub issue I raised below.
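As a minimal sketch of the cat route, assuming pup is installed and on your PATH: the file name sample.html and its markup are made up for illustration, and the attr{href} and text{} display functions print just the attribute or just the text, so jq is not needed here.

```shell
# Hypothetical local file, so no network access is needed.
cat > sample.html <<'EOF'
<!DOCTYPE html>
<html>
  <body>
    <div id="content">
      <a href="/about">About</a>
      <a href="/blog">Blog</a>
    </div>
  </body>
</html>
EOF

# Only run the queries when pup is actually available.
if command -v pup >/dev/null 2>&1; then
  # Print only the href attribute of each matching link.
  cat sample.html | pup '#content a attr{href}'
  # Print only the text content of each matching link.
  cat sample.html | pup '#content a text{}'
fi
```

Piping through cat keeps the shape of the curl example, so you can swap a saved page in for a live one without changing the rest of the pipeline.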
Why?
I love HTML. I love staring at it, storing it, using it, authoring it, and generally having it in the world. I do not like parsing HTML, and I expect I am not alone in this.
- What about when I just want the information?
- What about when I am not present?
HTML is a total pain in unpleasant places for the first case. Sure, I can use command-line text browsers like lynx to hide the nonsense surrounding the content I want to get to. I actually do this sometimes to check that the content I am putting out into the world works in text mode, at least to some extent, as a guideline towards being accessible and available.
The second case, where I am not present… I am not aware of a WebDriver or automation layer for the lynx browser, nor do I really want one, or the syntax that would invariably upset me. It is a bit of a gap in the market.
Lots of people want to write web browsers, but the tech they run on seems quite tightly coupled, and does not seem to have spread much past the isolated tribes of browser vendors. I will not harp on about diverse browser markets, nor the need for software to be decomposable. Markets should be diverse, and software should be decomposable. That is all.
There must be a catch?
I did find a surprising behaviour when trying pup out on my own site. You may have noticed that in the example I gave, I use an ID selector. This is no mistake. If I target main, the generic element for primary content, pup does not do what I expect.
Peculiarly, my limited testing seems to suggest this is either isolated to the main tag, or just does not affect div tags. Ironically, as far as that testing goes, div span soup works better for the program at the time of writing.
When I do use main, the program returns an empty array.
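If you want to check this yourself, here is a rough sketch, assuming pup and jq are both installed. The file main-test.html is a made-up test case, not my real markup, and whether the two counts differ will depend on your version of pup and on the page you feed it.

```shell
# Hypothetical test page: a <main> element that also carries an ID,
# so the same links can be reached by tag name or by ID.
cat > main-test.html <<'EOF'
<!DOCTYPE html>
<html>
  <body>
    <main id="content">
      <a href="/one">One</a>
      <a href="/two">Two</a>
    </main>
  </body>
</html>
EOF

# Only run the comparison when both tools are available.
if command -v pup >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
  # Run the same query twice: once by tag name, once by ID.
  for selector in 'main a' '#content a'; do
    printf '%s => ' "$selector"
    # json{} emits an array, and jq reports how many elements matched.
    pup "$selector json{}" < main-test.html | jq 'length'
  done
fi
```

A small file like this also makes a handy replicable test case to attach to an issue.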
I suspect the Go HTML parser may be at fault in some way there. I read through the entire software repository and saw nothing glaringly obvious.
Writing a ticket for open-source software
I wrote about this on LinkedIn.
I am particularly proud of how far my issue writing has come.
- Start with a thank you
- Describe problem, providing a replicable test case
- Be kind to non-technical folks (collapse TMI if possible)
- Use headings, to aid skimming & linking
- Do not demand, insist on, or label a thing 'BUG'. Scope is the creator's decision, not yours.
This about wraps up my short ramble about pup. I hope you learned something. Maybe more than one thing. I also hope you give pup a go.