Are websites embassies of foreign soil inside your own hardware?

superkuh on blog at 2023-09-06

This one is a bit roundabout but stick with me. I've exchanged a few emails with the science.org website technical support people about the RSS feed for the excellent chemistry/pharmacology blog "In the Pipeline". It's not Derek's fault at all, but his science.org/AAAS hosts have basically blocked native RSS feed readers and only allow corporate service websites that do the feed reader part for you, like Feedly.org. They consider a native feed reader application to be a scraper of their website and ban it.

Hello $superkuhrealname,

I wanted to follow up on your inquiry regarding RSS readers being blocked on science.org. We allow most traditional RSS readers (like Feedly) but this one in particular (QuiteRSS) we do not support. It behaves differently than most readers by using a browser to scrape content similar to a bot. We encourage you to try another RSS feed reader.

Let me know if you have any questions. Thank you.

Jessica Redacted
Publishing Platform Manager
American Association for the Advancement of Science
1200 New York Ave NW, Washington, DC 20005
jredacted@aaas.org

All QuiteRSS does is literally an HTTP HEAD or GET for the feed URL.

10.13.37.1 - - [06/Sep/2023:15:45:53 -0500] "HEAD /blog/rss.xml HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.21 (KHTML, like Gecko) QuiteRSS/0.18.12 Safari/537.21"
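To make it concrete how little is in that request, here is a minimal sketch that splits the log line above into its fields. It assumes the webserver writes the common "combined" access log format; the names `LOG_PATTERN` and `parse_log_line` are made up for this illustration.

```python
import re

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# The log line quoted above, copied verbatim.
line = ('10.13.37.1 - - [06/Sep/2023:15:45:53 -0500] "HEAD /blog/rss.xml HTTP/1.1" 200 0 '
        '"-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.21 (KHTML, like Gecko) '
        'QuiteRSS/0.18.12 Safari/537.21"')

entry = parse_log_line(line)
print(entry["method"], entry["path"], entry["status"])
print(entry["agent"])
```

One method, one path, one user agent string that names the reader. There is no page rendering and no link crawling anywhere in it.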

It is the most normal of normal RSS readers. So I'm a bit taken aback at how a professional organization can hold such obviously ignorant and dangerous views about what an RSS feed is. I brought it up on a cyberpunk IRC channel and it was pointed out that this reflects a more fundamental division in how computing is perceived these days.

this whole "scraper" equals the boogieman to people now. You're presenting data to an external client, what said client does with the data is none of your business.

The people who saw the internet before it was commercial, or who came later but know how the meat is made, perceive it that way. Then you have commercial/institutional/government entities and people who were presented the web as a fait accompli, who see it as a black box where interference is against the law; "interference" being a POV word choice. I don't think changing a CSS rule is interference but nowadays it'd be like vandalizing someone's building wall.

It's as if visiting a website and downloading the publicly available contents is a nation setting up an embassy of "foreign soil" on your hardware.

Their cultural expectation is that you cannot do what you want with that data. Modifying it or how it's displayed is, to them, like walking into their business location and moving around the displays. So obviously the only legal interface is the one they provide "at their location" or via another incorporated entity they associate with. But of course they aren't at *their location*; they're at my location, on my property, in my PC. And slowly this commercial norm is working its way into legislation to become our reality as web attestation.

What they see, and what they want, is a situation equal to you going to their business premises and sitting down at one of their machines. They want to own your computer in just the same way simply by you visiting a website. That shit's fucked.

Digging deeper into the situation I noticed the real problem: it's Cloudflare. Of course. They're applying the Cloudflare policies to the entire domain as a whole, and the invasive browser-internals checks they have for bots are blocking everything other than major browsers and other corporations like Feedly that they add to whitelists. It was silly of me to expect their support email address to connect me with a person who wouldn't ignorantly lie to me. The problem isn't DNS anymore. It's always Cloudflare.

[comment on this post] Append "/@say/your message here" to the URL in the location bar and hit enter.

A sensible definition for what static HTML and static web sites are? No.

superkuh on blog at 2023-02-19

I've seen, and been part of, multiple heated conversations about what the phrase "static web site" means and what types of things it covers. For any communication I think we need to take as a premise that static HTML and static web sites are not the same category of thing. Any particular page on the web can be totally static or totally dynamic or a mix. There are static HTML pages on static web sites. There are dynamic HTML pages on static web sites. There are static HTML pages on dynamic web sites. And there are dynamic HTML pages on dynamic web sites.

The meaning of static HTML is the least contentious. A static HTML page is just an .html (or .htm, or anything else if the MIME type is set right) hypertext markup language document that is stored on a file system as a file and sent to the end user when they request the URL that maps to that file. The HTML encodes what the web site user will see and does not change.

When the static .html file includes "static" (not really, since it is executed code) JavaScript (or other executing language embeds) that changes the page to something other than what the HTML in the file on disk displays, it becomes a dynamic HTML page (for a while called "DHTML").

The only place where static HTML becomes unclear is the case where some webserver-linked program generates the HTML on demand, without the HTML ever being stored as a file on the file system before it is sent to the site user. In this case, even though the user sees only static HTML, there's crucially no file ever created on the webserver, so it's dynamic HTML.
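The distinction drawn so far can be sketched in a few lines. This is only an illustration under assumptions: `serve_static`, `serve_generated`, and the temporary document root are hypothetical names and not any real webserver's API, and a real server would also need to guard against path traversal.

```python
import tempfile
from pathlib import Path

def serve_static(document_root, url_path):
    """Static HTML: map the URL to a file on disk and return its bytes unchanged."""
    file_path = Path(document_root) / url_path.lstrip("/")
    return file_path.read_bytes()

def serve_generated(visitor_name):
    """Generated on demand: the HTML exists only in memory, never as a file."""
    page = f"<html><body><p>Hello, {visitor_name}.</p></body></html>"
    return page.encode("utf-8")

with tempfile.TemporaryDirectory() as root:
    # Hypothetical page written once, ahead of time, by the site author.
    Path(root, "index.html").write_bytes(b"<html><body><p>Hello.</p></body></html>")

    static_body = serve_static(root, "/index.html")

    files_before = sorted(p.name for p in Path(root).iterdir())
    generated_body = serve_generated("reader")
    files_after = sorted(p.name for p in Path(root).iterdir())

print(static_body)
print(generated_body)
print(files_before == files_after)  # generating a page left the file system untouched
```

Both responses are plain HTML from the browser's side. Only the first one ever existed as a file.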

The meaning of static web site is even less clear, and it compounds the fuzziness of what a static HTML page is. Generally there are the same two points of view as above but with a tweak.

There's the web site user's point of view, where a web site is static if the pages are just HTML and do not require executing any code to view. If you (or your browser) look at the source you can read the text and see the image URLs. It does not have to be generated by the browser's execution of some client side code.

Then there's the developer's point of view, where a web site is static if the code required to generate the web site is stored in a static file on the webserver. In this framing you can deploy a self-contained .html file which includes the JavaScript code for a client side dynamic web application. This web application can completely change the text shown and even draw in outside information not in the file. But since it can be put on a CDN as a static asset it is a static web site.

I have to admit after writing this to clarify my thoughts I'm more confused than ever. The situation in which the end user only sees actual HTML in the browser but that HTML was generated without ever being a file on disk is definitely the case of a static HTML web page from the user POV. But it is also the most extreme case of a dynamic web site and everything bad about dynamic sites. Luckily this is the only exception to the rules that spoils the categorization. Maybe it's like the old saying says re: single exceptions.

The HTML of this post was written in gedit, then concatenated with a bunch of other .html files by a single line of shell and redirected to a file on the file system for the webserver to serve and scripts to process. The RSS updates are generated by a single call, which I type manually, to a Perl script that scans the file system to generate the .rss file for the webserver to serve. Comments on this post will not be noticed by the webserver. But a Perl script that tails the webserver access.log file on the file system will see them and then append the comment to an existing .html file on disk.
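The comment mechanism is the least obvious part of that pipeline, so here is a rough sketch of the idea in Python. It is not the actual Perl script: the log line, the IP address, the page path, the 404 status, and every function name below are assumptions made up for illustration. The only thing taken from the post is the "/@say/your message here" URL convention.

```python
import html
import re
from urllib.parse import unquote

# Matches the quoted request in an access-log line when the path contains "/@say/".
SAY_PATTERN = re.compile(r'"GET (?P<page>\S*?)/@say/(?P<message>\S*) HTTP/[\d.]+"')

def extract_comment(log_line):
    """Return (page, message) for a comment request, or None for any other line."""
    match = SAY_PATTERN.search(log_line)
    if match is None:
        return None
    # Browsers percent-encode spaces and other characters in the URL.
    return match.group("page"), unquote(match.group("message"))

def comment_as_html(message):
    """Escape the message so it is appended as text, not as markup."""
    return f"<p>{html.escape(message)}</p>\n"

def append_comment(html_path, message):
    """Append the escaped comment to an existing .html file on disk."""
    with open(html_path, "a", encoding="utf-8") as page:
        page.write(comment_as_html(message))

# Hypothetical log line; the webserver itself may well just answer with an error.
line = ('203.0.113.5 - - [19/Feb/2023:12:00:00 -0600] '
        '"GET /blog/example-post.html/@say/nice%20post%20%3Cb%3E HTTP/1.1" '
        '404 0 "-" "ExampleBrowser/1.0"')

page, message = extract_comment(line)
print(page)
print(comment_as_html(message), end="")
```

The webserver never runs any of this. It only writes its log, and whatever reads the log afterwards turns the request into a change to a file on disk.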

Is this page a static HTML page? Yes. Is my site a static web site? I think so. Others would disagree.

[comment on this post] Append "/@say/your message here" to the URL in the location bar and hit enter.