Tag: programming

  • POST Mortem: How Azure Application Gateway’s Missing 308 Killed Our Linked Data API

    In the Linked Data world, cool URIs don’t change. That means in the RDF world you’re coining URIs that should be resolvable, and you pick the easiest scheme. Most people in the linked data world use http:// when coining URIs, even though today’s internet lives on https://, with upgrades handled by the service’s web stack.

    The Water Quality service launching on the environment.data.gov.uk portal has a RESTful Hydra API, and it supports a combination of GET and POST methods to retrieve data. The most useful endpoints, living at /data, are POST: they can receive GeoJSON bounding boxes to query both geographic and observation data, though some uses don’t require a body.

    In our testing we discovered that Python clients break when navigating the pagination of our service, but JavaScript works. WTF?

    The HTTP Redirect Status Code Landscape

    The 300 series of HTTP status codes, defined in RFC 7231 (with 308 added in RFC 7538), helps clients navigate the internet automatically when resources move or protocols change – and they’re all quite useful:

    • 301 (Moved Permanently): The old guard – allows method changes
    • 302 (Found): Temporary and method-flexible
    • 303 (See Other): Forces GET (useful for POST-Redirect-GET pattern)
    • 307 (Temporary Redirect): Preserves method but temporary semantics
    • 308 (Permanent Redirect): The hero we need – permanent + method preservation
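    These rules condense into a tiny sketch of how a typical client rewrites the request method after each redirect. This models common client behaviour in general, not any specific library’s implementation:

```python
def redirected_method(status: int, method: str) -> str:
    """Method a typical HTTP client uses after following a redirect."""
    if status == 303:
        return "GET"  # 303 always forces GET
    if status in (301, 302) and method == "POST":
        return "GET"  # historical behaviour: POST gets demoted to GET
    return method  # 307 and 308 preserve the original method
```

    A POST through a 308 stays a POST; through a 301 it does not.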

    The issue: it’s not a resource move (different URI), it’s a protocol upgrade (same resource, different scheme).

    Linked Data APIs need 308

    Our canonical URIs often use the http:// scheme as protocol-agnostic identifiers; however, transport security requires HTTPS. Content negotiation and RDF payloads reference http:// URIs, and we don’t select the protocol on the fly in our responses. Both the Link headers and the Hydra pagination links in our endpoint use the same URIs to help people navigate our pagination setup.

    So you see how this is going to go? POST gets redirected to GET, and things fall over, because we erroneously get a 301 from Microsoft’s Application Gateway.

    The Azure Application Gateway Gap

    The currently available responses for an HTTP to HTTPS upgrade in Azure’s Application Gateway service are 301, 302, 303, and 307. It’s missing the semantically accurate and method-preserving 308. Not only that, we can’t target specific paths or entry points in the service. We are forced to choose between wrong semantics (i.e. temporary redirects) or broken clients (POST gets converted to GET).

    Real-World Impact: Client Behaviour Broken

    Let’s be honest, the problem here is that Python is full of pedants (see: Pydantic), and the authors of its requests library have correctly implemented RFC 7231 redirect semantics in their post() method. When a 301 redirect is encountered, requests converts the POST to a GET – which our /data endpoint doesn’t support, returning a 405 Method Not Allowed error.

    What should be a simple loop navigating the Link headers to collect a paginated dataset now requires custom redirect handling. What should be the simple contents of the while next_url: loop:

    # What breaks with 301:
    response = requests.post(next_url, headers=headers, data="")
    # requests converts POST → GET on 301 redirect
    # Server responds: 405 Method Not Allowed
    # Pagination fails immediately

    Becomes the more convoluted:

    # Manual redirect handling (inside the while next_url: loop) to preserve POST:
    response = requests.post(
        next_url, 
        headers=headers, 
        auth=auth, 
        data="", 
        allow_redirects=False  # Disable automatic redirect
    )
    
    # Handle redirect manually to keep POST
    if response.status_code in (301, 302, 307, 308):
        next_url = response.headers['Location']
        continue  # Re-POST to new URL
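    Wrapped up as a helper, the manual handling looks something like this – a minimal sketch (the function name is mine) that takes any requests.post-style callable rather than a live session, and follows redirects itself so the method is never rewritten:

```python
def post_following_redirects(post, url, max_redirects=5, **kwargs):
    """POST to url, re-POSTing to the Location target on any redirect.

    `post` is a requests.post-style callable returning an object with
    .status_code and .headers; kwargs pass straight through to it.
    """
    for _ in range(max_redirects):
        response = post(url, allow_redirects=False, **kwargs)
        if response.status_code in (301, 302, 307, 308):
            url = response.headers["Location"]  # follow, keeping POST
            continue
        return response
    raise RuntimeError(f"exceeded {max_redirects} redirects for {url}")
```

    With requests itself you’d call post_following_redirects(requests.post, next_url, headers=headers, data="").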

    Now I have to build my own redirect handling in Python because Microsoft has the semantics of their response codes wrong. I’m fine with it, but I want people to be able to use our endpoint easily.

    Our front-end developers didn’t experience the same problem, which means JavaScript’s fetch doesn’t behave the same way. This gives us an inconsistent API experience, and even with documentation that clearly explains what’s going wrong with their code, I’m still going to get support tickets saying the thing is broken.

    Microsoft: Fix Your Shit

    Your Application Gateway redirect options aren’t complete. Give us a 308 code; allow us to be the pedants I want us to be. It would make a massive impact for the semantic web, improve our RESTful APIs, and follow modern HTTP patterns without breaking things for everyone else.

    Standards exist for a reason, and this isn’t a niche concern. As LLMs and agentic AI usage become more and more common, having modern ways of accessing knowledge graphs and FAIR data requires getting the semantics right everywhere – including in our HTTP response codes.

    @Azure: gimme the response code 308.


    Note: I have a support request asking for this behaviour. I expect Microsoft to change nothing.

  • PostgreSQL is the best triplestore

    When folks want data they want it on a subject basis, which makes providing linked data that much easier. I have been exploring using Postgres with FastAPI and pydantic to serialize JSON-LD directly from SQL, giving users a familiar JSON RESTful API with content-negotiated RDF baked in.
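    The core trick is just shaping SQL rows into JSON-LD keys at serialization time. A minimal sketch of the idea – the row shape, base URI pattern, and property choices here are illustrative, not the actual service’s schema:

```python
def row_to_jsonld(row: dict, base: str) -> dict:
    """Map a plain SQL row onto a JSON-LD resource."""
    return {
        "@id": f"{base}{row['notation']}",
        "@type": "sosa:FeatureOfInterest",
        "rdfs:label": row["label"],
        "skos:notation": row["notation"],
    }

# A row straight out of a SELECT becomes a resolvable resource:
doc = row_to_jsonld(
    {"notation": "53130070", "label": "EXAMPLE SAMPLING POINT"},
    base="http://environment.data.gov.uk/id/sampling-point/",
)
```

    Pydantic models with field aliases like "@id" do the same job with validation on top, which is the shape the FastAPI responses take.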

    Compared to the existing Jena-based API, its throughput is two orders of magnitude higher, it’s more reliable, and the data ingress doesn’t make me reconsider being in this domain.

    I hinted about this in February with a post about JSON-LD and prefixes. The API should be finished by the end of the month.

    (I’m on my second of hopefully two refactors.)

  • The Art of Semantic Procrastination: Why I Use Blank Nodes for Concepts That Aren’t Mine

    In the linked data world, there is always a temptation to boil the ocean. When building out a new API or even just a new dataset, there are so many undefined and uncoined concepts (skos:Concept and otherwise) that provide human context, and you feel the pressure to define them in your RDF – at the risk of taking on too much and straying outside of your authority. I faced that in the past while building out a linked data service at the Office for National Statistics, and having been burnt by the numerous kettles we had going to define everything semantically, I’ve been determined not to make that mistake again.

    The new API I’ve been developing for DEFRA is a Hydra/SOSA vocabulary-based RESTful, content-negotiated API for observational water quality data in England. The architecture of the service is FastAPI + PostGIS with a Next.JS frontend: the API doesn’t know anything about RDF; however, it responds with JSON-LD by default, with the JSON written in a way that people unfamiliar with RDF would appreciate.

    The main payload of the API is sampling points (sosa:FeatureOfInterest), which have samples & samplings (sosa:Sample, sosa:Sampling), which in turn have observations (sosa:Observation). Each of these levels has domain-specific types, classifications, and annotations which are necessary for the interpretation and discovery of these data; however, no authoritative, public resource of these concepts currently exists.

    As someone who lives by FAIR, linked data, but knows most consumers of data neither understand nor care about it, what should I do? The answer isn’t to avoid these concepts – it’s to represent them responsibly until someone with actual authority shows up.

    Procrastination by way of blank nodes

    My solution is deterministic blank nodes. Instead of coining URIs for concepts I don’t own, I generate consistent blank nodes that can be reconciled later when authoritative sources emerge. This keeps my API stable while avoiding coining URIs I may eventually regret. Let me explain.

    Previously I would have attempted to coin URIs for all my concepts, either at the dataset or higher level scope. For example, capturing the concept of running surface water from a river. In the source data for the API I have a table with a key and a label, the key acts as a notation.

    // You have no authority here, Jackie Weaver
    {
      "@id": "http://environment.data.gov.uk/id/sample-material/2AZZ",
      "@type": ["skos:Concept", "sosa:FeatureOfInterest"],
      "skos:prefLabel": "RIVER / RUNNING SURFACE WATER",
      "skos:notation": "2AZZ"
    }

    The issue is I currently don’t have responsibility for the concept scheme for sample materials, and it’s also not online. I know all the values, and I have a copy of the scheme to make the service work, but it’s not within the scope of delivery for the water quality API. So instead of speaking with authority, I’ve shifted to getting it down in code first and serving it via the API. How about as a blank node?

    // Procrastinating via blank nodes
    {
      "@id": "_:sampleMaterial-2AZZ",
      "@type": ["skos:Concept", "sosa:FeatureOfInterest"],
      "skos:prefLabel": "RIVER / RUNNING SURFACE WATER",
      "skos:notation": "2AZZ"
    }

    The key here isn’t just using any blank node – it’s using a deterministic blank node identifier. By concatenating the concept scheme name with the notation (_:sampleMaterial-2AZZ), I ensure that every time this concept appears in my API responses, it gets the same blank node identifier.
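    In code the generation is trivial – a sketch of the pattern (the helper names are mine, not the service’s):

```python
def bnode_id(scheme: str, notation: str) -> str:
    """Deterministic blank-node identifier: scheme name + notation."""
    return f"_:{scheme}-{notation}"

def concept(scheme: str, notation: str, label: str) -> dict:
    """A skos:Concept keyed by its deterministic blank node."""
    return {
        "@id": bnode_id(scheme, notation),
        "@type": ["skos:Concept"],
        "skos:prefLabel": label,
        "skos:notation": notation,
    }
```

    Every response that mentions sample material 2AZZ emits the same _:sampleMaterial-2AZZ, so downloads merged later reconcile to one node.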

    Note: This isn’t standard RDF blank node syntax – it’s my deterministic generation pattern from my source data. When serialized to actual RDF formats, these become proper blank nodes, but the consistent string ensures they all resolve to the same node across serializations. This isn’t just semantic pedantry – it has real practical benefits.

    When someone downloads multiple API responses and converts them to Turtle or N-Triples, all instances of _:sampleMaterial-2AZZ will be recognized as the same entity. Without this deterministic approach, you’d end up with multiple disconnected blank nodes for what should be the same concept, creating an unforgivable mess.

    Here’s what this looks like in practice – a real API response converted to Turtle:

    curl -sSL --fail 'http://localhost:8000/sampling-point/53130070/sample?skip=0&limit=3&sampleMaterialType=2AZZ&complianceOnly=false' | rdfpipe -i json-ld -o ttl -
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix hydra: <http://www.w3.org/ns/hydra/core#> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix sosa1: <http://www.w3.org/ns/sosa#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    
    <http://localhost:8000/sampling-point/53130070/sampling/1506412> a sosa1:Sampling ;
        dcterms:type _:samplingPurpose-CA ;
        sosa1:hasFeatureOfInterest <http://localhost:8000/sampling-point/53130070> ;
        sosa1:hasResult <http://localhost:8000/sampling-point/53130070/sample/1506412> ;
        sosa1:resultTime "2001-08-08"^^xsd:date ;
        sosa1:startTime "2000-08-18T12:20:00"^^xsd:dateTime .
    
    <http://localhost:8000/sampling-point/53130070/sampling/1510110> a sosa1:Sampling ;
        dcterms:type _:samplingPurpose-CA ;
        sosa1:hasFeatureOfInterest <http://localhost:8000/sampling-point/53130070> ;
        sosa1:hasResult <http://localhost:8000/sampling-point/53130070/sample/1510110> ;
        sosa1:resultTime "2000-10-05"^^xsd:date ;
        sosa1:startTime "2000-09-20T12:00:00"^^xsd:dateTime .
    
    <http://localhost:8000/sampling-point/53130070/sampling/2303318> a sosa1:Sampling ;
        dcterms:type _:samplingPurpose-CA ;
        sosa1:hasFeatureOfInterest <http://localhost:8000/sampling-point/53130070> ;
        sosa1:hasResult <http://localhost:8000/sampling-point/53130070/sample/2303318> ;
        sosa1:resultTime "2001-06-07"^^xsd:date ;
        sosa1:startTime "2000-11-29T00:01:00"^^xsd:dateTime .
    
    <http://localhost:8000/sampling-point/53130070/sample/1506412> a sosa1:Sample ;
        sosa1:isResultOf <http://localhost:8000/sampling-point/53130070/sampling/1506412> ;
        sosa1:isSampleOf _:sampleMaterial-2AZZ,
            <http://localhost:8000/sampling-point/53130070> .
    
    <http://localhost:8000/sampling-point/53130070/sample/1510110> a sosa1:Sample ;
        sosa1:isResultOf <http://localhost:8000/sampling-point/53130070/sampling/1510110> ;
        sosa1:isSampleOf _:sampleMaterial-2AZZ,
            <http://localhost:8000/sampling-point/53130070> .
    
    <http://localhost:8000/sampling-point/53130070/sample/2303318> a sosa1:Sample ;
        sosa1:isResultOf <http://localhost:8000/sampling-point/53130070/sampling/2303318> ;
        sosa1:isSampleOf _:sampleMaterial-2AZZ,
            <http://localhost:8000/sampling-point/53130070> .
    
    [] a hydra:Collection ;
        hydra:member <http://localhost:8000/sampling-point/53130070/sample/1506412>,
            <http://localhost:8000/sampling-point/53130070/sample/1510110>,
            <http://localhost:8000/sampling-point/53130070/sample/2303318> ;
        hydra:totalItems 129 ;
        hydra:view [ hydra:first <http://localhost:8000/sampling-point/53130070/sample?skip=0&limit=3&sampleMaterialType=2AZZ&complianceOnly=false> ;
                hydra:last <http://localhost:8000/sampling-point/53130070/sample?skip=126&limit=3&sampleMaterialType=2AZZ&complianceOnly=false> ;
                hydra:next <http://localhost:8000/sampling-point/53130070/sample?skip=3&limit=3&sampleMaterialType=2AZZ&complianceOnly=false> ] .
    
    _:sampleMaterial-2AZZ a skos:Concept,
            sosa1:FeatureOfInterest ;
        skos:notation "2AZZ" ;
        skos:prefLabel "RIVER / RUNNING SURFACE WATER" .
    
    _:samplingPurpose-CA a skos:Concept ;
        skos:notation "CA" ;
        skos:prefLabel "COMPLIANCE AUDIT (PERMIT)" .

    Notice how _:sampleMaterial-2AZZ appears once in the graph but is referenced by multiple samples – exactly what we want.

    When the kettles come out: reconciliation without regret

    The beauty of this approach is that when the authoritative concept scheme eventually goes online (and it will, because I’m also building that service), I can simply add reconciliation triples without breaking anything. This is where semantic versioning becomes your friend – adding triples is a patch-level change at most. It neither changes the shape of the API’s JSON, nor previously coined URIs.

    // Future state - same identifier, now with authority
    {
      "@id": "_:sampleMaterial-2AZZ",
      "@type": ["skos:Concept", "sosa:FeatureOfInterest"],
      "skos:prefLabel": "RIVER / RUNNING SURFACE WATER",
      "skos:notation": "2AZZ",
      "skos:exactMatch": "http://environment.data.gov.uk/def/sample-material/2AZZ",
      "rdfs:isDefinedBy": "http://environment.data.gov.uk/def/sample-material/"
    }

    Now I can fire up those kettles I avoided earlier. The blank node stays the same, existing API consumers continue to work, but new consumers can follow the skos:exactMatch to the authoritative source. Cool URIs don’t change, and neither will these deterministic blank nodes.

    This approach scales beautifully across different concept schemes. Whether it’s determinands that eventually align with QUDT vocabularies, geographic regions that get proper Ordnance Survey URIs, or measurement units that find their way into authoritative registries – the pattern remains the same. Add the reconciliation triples when you have them, leave the blank nodes as stable anchors within the service.

    // And it even supports multiple reconciliation targets
    {
      "@id": "_:sampleMaterial-2AZZ",
      "@type": ["skos:Concept", "sosa:FeatureOfInterest"],
      "skos:prefLabel": "RIVER / RUNNING SURFACE WATER",
      "skos:notation": "2AZZ",
      "skos:exactMatch": "http://environment.data.gov.uk/def/sample-material/2AZZ",
      "rdfs:isDefinedBy": "http://environment.data.gov.uk/def/sample-material/",
      "skos:closeMatch": "http://purl.obolibrary.org/obo/ENVO_00000022"
    }

    In a perfect world, every concept would have an authoritative URI from day one. In the real world, sometimes the most responsible thing you can do is admit you’re not the authority – yet. Deterministic blank nodes let you build useful services today while keeping the door open for proper reconciliation tomorrow. It’s procrastination with a purpose.

  • csvcubed, a personal retrospective

    csvcubed is a tool for building CSV-W files. If you’re wondering what the hell CSV-W is, it’s basically CSV files with extra metadata that provides context and makes them play nice with linked data. It was born out of necessity when I was working on the ONS’s Integrated Data Service’s Dissemination service. Our end product was 5-Star Linked Data, and we needed a way to convert CSV files into RDF. I joined the project as a data engineer at the tail end of 2020, during lockdown, and the pipeline for creating CSV-W was a bit of a mess.

    My onboarding at ONS was great – I was quickly indoctrinated into the power of linked data and the associated standards. My actual job though? Unfucking presentational spreadsheets that locked away most of ONS’s statistical publications. Who wants to unpivot data just to do analysis? Not me, and honestly not the analysts producing them either – nobody wants to do analysis on pivoted data.

    The tool I initially learned for generating CSV-W was databaker, which Sensible Code knocked together during a hackathon. It did the job of creating tidy data, but that was about it. Our pipeline was ultimately this Airflow-orchestrated mess: scrape a publication’s latest spreadsheet, use databaker to unpivot it, describe the data using something called gss-utils (to which I will not link, for CVE and archived-repo related reasons), build a CSV-W, use Swirrl’s csv2rdf tool to convert the CSV-W to RDF, and then publish the RDF to the ONS’s linked data platform (now defunct, but it was called IDS Data Explorer). This was a lot of steps, and the pipeline was brittle. Kicking Airflow was a regular occurrence.

    I’m a bit of a diva, and sometimes divas are good for getting shit done. The first thing I started to change was the unpivoting process. The databaker tool needed to go – it was slow, unpythonic, and didn’t provide any transferable skills. Dead-end tools are a horrible career investment, so I switched to pandas and dragged the other data engineers with me. This was a good first step, but the reproducibility was still a mess. It was time to build a tool that standardized the production of CSV-W files.

    gssutils was probably my biggest bugbear – while it technically did the job of producing CSV-W files, it was about as transparent as a brick wall. Extending it was a pain in the ass, and adding new predicates to our data was even worse. Since our target was RDF Cube Vocabulary, I conspired with a good work-friend (who went by robons on github) to build a tool that would actually make sense of this CSV-W building process. We originally called it csvwlib but ultimately named it csvcubed.

    Here’s the thing about generating linked observational data – it’s a massive problem space. The RDF Cube Vocabulary is a solid standard, but when you throw in the requirement for harmonization before publication, it’s daunting. RDF Cubes split tabular data into three parts: dimensions (what you slice and dice by), attributes (context for your observations), and the actual observations themselves. In our idealistic world, each dimension needed a code list (basically a SKOS concept scheme), and ideally, you’d just reuse one that already existed in our service. This meant that in the old way of building a cube, you either had to reconcile definitions between datasets to reuse them, or manually write a new concept scheme as a CSV-W. Fun times.

    To write an RDF Cube-bound CSV-W, you had to write at least one other CSV-W, or worse, reconcile concept definitions across multiple datasets. This was a massive headache for my fellow data engineers – we weren’t statistical subject matter experts, we were data engineers who just wanted to build pipelines that actually worked and could scale. That’s where csvcubed came in.

    The idea behind csvcubed was simple: you give it a tidy data CSV, and it figures out the rest. Using keywords in the column headers, it works out the dimensions, attributes, and observations of the cube. It automatically creates code lists and concept schemes for dimensions. Suddenly, building a cube wasn’t such a pain in the ass, and the pipeline actually made sense. The tool was a hit – we went from pushing out 1 publication per data engineer per week to smashing out 10 publications per data engineer per week at our peak.
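    The header-keyword idea can be sketched like this – a deliberately simplified illustration of the approach, not csvcubed’s actual column-classification rules:

```python
def classify_column(header: str) -> str:
    """Guess a cube component type from a tidy-data column header."""
    h = header.strip().lower()
    if h in {"value", "observation"}:
        return "observation"  # the measured numbers themselves
    if h in {"marker", "unit"}:
        return "attribute"  # context for the observations
    return "dimension"  # everything else slices the cube

# Classify a toy tidy-data table's columns:
columns = ["Period", "Region", "Value", "Marker"]
components = {c: classify_column(c) for c in columns}
```

    Each column classified as a dimension then gets a code list generated from its distinct values, which is the step that removed the hand-written concept scheme CSV-Ws.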

    I’ve moved on since then – these days I’m virtualizing RDF data using ontop in my new gig providing linked data services for DEFRA. But I hope csvcubed keeps being useful for people in the linked data world. I’ve used it a few times in my new role, so I’m still eating my own dog food.

    I’m now not only a diva but fully a linked data partisan. ONS turned me into a true believer, and I’m not looking back. You can claim to do linked data with a black box tool, but let’s be real – if you can’t see how it works, you can’t claim it’s FAIR or 5-Star Linked Data.