jb… a weblog by Jonathan Buys

Searching for Non-Printable Characters in Text

August 14, 2025

One of the systems at work accepts data in csv format, which is essentially plain text with columns separated by commas. Occasionally a client will upload a file with mistakes in it, and while our applications are fairly robust and can handle most issues, sometimes one slips by that we weren’t expecting. When this happens, as it has twice in as many days, I’m called in to find out why.

The first issue was with a Python application that was pulling a file from S3, parsing it, and turning it into a tab-separated values file before uploading it back to S3 to be further processed by another system. The error given by Python’s csv package was:

_csv.Error: need to escape, but no escapechar set

Which was odd, because we process many, many files during a day and this same code has never needed an escape character set before. After a bit of faffing around replicating the error locally, and isolating where in the code the error was occurring, I finally got it in my head to just grep the file for special characters. Sure enough, I found that the file contained tabs inside of the values, which, given the logic of our program, was causing it to have a bad time of it.
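For the curious, the failure mode is easy to reproduce. This isn’t our actual code, just a minimal sketch: write tab-separated output with quoting turned off and no escapechar set, then hand it a value with a tab embedded in it.

import csv
import sys

rows = [
    ["Acme Widgets", "active"],
    ["Bad\tValue", "active"],   # hypothetical field with an embedded tab
]

# Roughly the situation in our converter: TSV output, no quoting, no escapechar.
# The embedded tab can't be represented, so the csv module raises
# "_csv.Error: need to escape, but no escapechar set" on the second row.
writer = csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE)
for row in rows:
    writer.writerow(row)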

The command grep '\t' in.csv gave me the lines containing the offending tabs, and grep -n '\t' in.csv | cut -d : -f 1 gave me just the line numbers, which is what I was asked for. The cut command lets me select specific parts of a line: -d : sets the colon as the delimiter, and -f 1 asks for just the first field.

This morning I was asked to look at another task that had failed, this time using a custom Go binary that, again, parsed a csv file from S3. Thinking I might get lucky I ran the same search for tabs in the file but came up empty. After some looking around I found a Stack Overflow question that pointed me in the right direction, but I first had to install the GNU version of grep.

When Mac OS was merged with NeXTSTEP to create Mac OS X, the NeXT OS brought with it pure Unix underpinnings thanks to BSD [1]. Thanks to that lineage, the Mac contains all the Unix tools we’d expect, but not the GNU versions you’d find on Linux. There are sometimes subtle differences between the tools, and I’ve found the (ugh…) GNU/Linux version of grep to be more flexible. Luckily, Homebrew makes it trivially easy to install the standard GNU tools, and running brew install grep provided me with the ggrep binary.

Equipped with the right tools for the job, I ran this command:

LC_ALL=C  ggrep --color='auto' -P -n "[\x80-\xFF]" in.csv

I’ve written about setting LC_ALL=C before, so I’ll skip that here. The rest of the command I’ll cover below.

  • --color='auto': makes it much easier to spot the matching characters in a long string.
  • -P: tells grep to use Perl-compatible regular expressions.
  • -n: prints the line numbers.
  • "[\x80-\xFF]": searches the text for the extended ASCII characters ranging from hexadecimal code 80, or €, to code FF, the ÿ, or “Latin small letter y with diaeresis”, according to the ASCII table hosted at ascii-code.com.
  • in.csv: just the file name.
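If you’d rather not install GNU grep, a few lines of Python will do roughly the same job. A quick sketch: read the file as raw bytes and print the line numbers that contain anything in that 0x80–0xFF range.

# find_nonascii.py - rough equivalent of the ggrep command above
with open("in.csv", "rb") as f:
    for lineno, line in enumerate(f, start=1):
        if any(b >= 0x80 for b in line):   # any byte outside 7-bit ASCII
            print(lineno, line)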

After editing the file, re-uploading, and kicking off the job again, it completed successfully. How in the world is it 2025 and we still have text encoding issues?

  1. I thought for years that NeXTSTEP was based on FreeBSD, but Wikipedia tells me that it was actually built initially on the older 4.3BSD-Tahoe. Sometimes I forget how long ago that really was, and how fast Steve Jobs was pushed out of Apple after the announcement of the Mac in 1984. 


Bright Gold in Dark Times

August 10, 2025

On August 6th, in the White House, Tim Cook announced that Apple is committing an additional $100 billion to American manufacturing. If only he’d left it at that. If only Cook had simply announced the additional investment, bringing Apple’s commitment up to $600 billion over the next four years, and then politely thanked the president and walked out of the room, he might have held on to his integrity. But that’s not what happened. Instead, Cook kissed Trump’s ass and gave him a gold participation trophy.

The trophy is a glass plaque, “made in America”, and “designed by a former Marine” (probably while a bald eagle flew overhead singing “God Bless the U.S.A.”, wearing the American flag as a cape, with a gun in one claw and the Constitution in the other) that sits in a 24 karat gold base. The trophy itself is, honestly, kind of ugly. It doesn’t give off the “Designed by Apple in California” vibes I would expect from something that they actually put a lot of effort into. It looks like a high-school shop class project. However mediocre the actual object, Cook’s gift represented his, and Apple’s, subjugation. Bending the knee and kissing the ring. I nearly threw my iMac out the window.

But I didn’t, because my mind grinds slow and fine. I thought on it for days before reading Gruber’s take on Daring Fireball, where he says that Cook:

…is keenly aware that trust and reputation are only accrued slowly, but are always at risk of being squandered quickly, and that this applies both to how he is perceived personally and how Apple is perceived as an institution — a pillar of American ingenuity and industry. His life’s work. And that despite all of that, Cook concluded that debasing himself, selling some shares of his own dignity, was the best course of action — for Apple, for Apple’s customers (and, yes, shareholders), and perhaps even for the country. That ruthless practicality is necessary merely to stay afloat in a sea of abject graft, extortion, and cronyism. That’s dark. That requires considering that the problem isn’t the greed of a few billionaires and executives who ought to resist burgeoning corruption, but that Trump and his sycophants in the Republican Party have already succeeded in corrupting the system. That the corruption isn’t happening, but happened. The United States isn’t heading for existential trouble. We’re in it — and a pathway out is not yet clear. That’s not to say all is forever lost, but that we are, in our current political moment, beyond the point where the game can be played successfully on the level. You can choose to play a crooked game straight, but you can’t win. Business is competition. A loser who played above reproach is still a loser. You need to choose your battles. US manufacturing is Cook’s choice.

Once again, Gruber is faster to put into words what I’ve been mulling over.

To be clear, I fully support Apple bringing manufacturing back to the United States. In rural Iowa, small towns have been decimated over the past few decades by big manufacturers leaving the country and building overseas or in Mexico. Newton’s loss of Maytag is the poster child for how lives can be upended and entire ways of life lost due to the greed of corporations chasing the cheapest labor. The loss of identity and self-respect associated with losing a career is one of the reasons we’re in this mess with Trump and his ilk to begin with. But one can both agree with bringing manufacturing back to the US and despise the bootlicking at the same time. As with many things with the Trump administration, it’s not just what is being done, but how they are going about doing it.

Yes, bringing manufacturing back to the US is a good and noble goal. I hope we can actually do it, but I also hope that we can do it in an environmentally clean way, leveraging renewable energy and sustainable materials harvesting whenever possible. The Republicans want to tout new investments in the US and new jobs, but they don’t believe in climate change, and their oil-drilling overlords won’t let them invest in clean energy alternatives. Since the MAGA cult is in charge of everything for at least the next year and a half, if not till 2029, the United States is in a downward spiral that I don’t see a way out of.

The government is now blatantly, openly corrupt. Gifts like Cook’s are now just the way things are done to win favor. The US is doing everything it can to roll back environmental protections and dissuade people from buying electric vehicles or investing at all in renewable energy. The health department is making it harder to get vaccinated. The Department of Education is being dismantled, which will make it easier to send public funds to private religious schools. There are literally masked gangs of thugs kidnapping people off the street in broad daylight and putting them in camps. And just to really own the libs, they give the camps fun names like “Alligator Alcatraz”.

Cook’s display of fealty has given me pause when considering future Apple purchases, but changing my computing platform of choice would be like cutting off my nose to spite my face. The answer isn’t in punishing individual companies (which would also be punishing myself and my family) that are playing by the current rules; the answer is to change the rules. To try to rebuild a fair, just, and civil society, free from the corruption of the MAGA movement. The only way we are going to do that is by making sure we vote out every Republican currently in office, so that not a single one of those sycophantic cowards ever gets close to holding power again. Is that likely? Probably not. Not anytime soon anyway. The bad guys won. We’re living in their world now.

I hope Trump scratches the base of his little trophy someday to discover that it is only “gold plated”, and as fake as his orange tan.


ChatGPT Designed Programming Language

July 2, 2025

On a whim, I asked ChatGPT today:

If you were to design a new programming language intended to be written by AI large language models like yourself, and understood by humans who need to check your work, what would the language look like?

I found its answer fascinating. Enough that I’d like to preserve it here. It even came up with a nice name for the language, “Clarion”.

Clarion

Below is a high‑level design sketch for “Clarion”, a hypothetical general‑purpose language optimized for two simultaneous audiences:

  • Primary author: an AI large‑language model (LLM) that can generate vast amounts of code quickly.
  • Primary reviewer/maintainer: a human engineer who must be able to audit, reason about and safely evolve that code.

1. Design Goals

  • Machine‑writeable: Syntax must be regular, unambiguous and easily producible by generative models (no “smart” whitespace, few context‑sensitive rules).
  • Human‑verifiable: Semantics must be obvious, explicit and compact; defaults that hide complexity are disfavoured.
  • Formally checkable: Every construct should have a precise, machine‑readable specification enabling static verification, model checking and proof generation.
  • Audit‑friendly: Traceability from requirement → spec → test → implementation is first‑class, not external paperwork.
  • Gradual adoption: Can interoperate with C/Python/Rust so that teams can introduce AI‑generated modules incrementally.

2. Core Language Shape

2.1 Surface Syntax

  • Indentation‑agnostic but terminator‑rich: every block ends with an end keyword carrying the block’s opening token, e.g. end func. This sacrifices a little brevity for total clarity.
  • YAML‑like metadata headers precede each file and major declaration, containing:

    ### spec:
    # requirement: REQ‑TR‑017
    # safety‑critical: true
    # proof‑obligation: VerifiedBy="SPARK"
    ###
    
  • UTF‑8 by default but the grammar restricts identifiers to ASCII letters, digits and _ to minimise visual confusion.

2.2 Types and Contracts

  • Algebraic data types & pattern matching: Like in ML/Rust; deterministic exhaustiveness‑checked.
  • Refinement types: type Positive = Int where value > 0 for lightweight correctness specs.
  • Dependent‑like “ghost” expressions: Allowed only in contracts; erased before runtime.
  • Units‑of‑measure: Prevent dimension errors (Speed = m/s).

Contracts use Dijkstra‑style syntax:

func divide(n: Int, d: NonZeroInt)
    -> (q: Int, r: Int)
pre  d != 0
post n == q*d + r  &&  0 <= r < abs(d)
end func

2.3 Concurrency & Effects

  • Capability‑based effects: Every function’s signature declares which side effects it may perform (io, net, fs, db). The compiler refuses calls across mismatched capabilities.
  • Structured concurrency with hierarchical cancellation scopes (task / end task).
  • Deterministic by default: uncontrolled data races are a compile‑time error; shared mutable state must sit behind sync‑annotated types with statically provable hand‑off patterns.

3. Interaction with LLMs

3.1 Generation‑Guiding Pragmas

LLMs can embed generation hints that do not affect semantics but shape future completions:

pragma lm_hint:
    style = "prefer_recursive"
    complexity_budget = "O(log n)"
end pragma

Humans can delete or override these without touching logic.

3.2 “Round‑trip Delimiters”

Every top‑level entity is bracketed by stable markers so tooling can surgically update just one region:

###<Clarion:func id=sortVersions v=1.0>###
    ... implementation ...
###</Clarion:func id=sortVersions>###

An LLM patching the function must preserve the outer marker and increment the v field; diff reviews become trivial.


4. Tooling & Verification Pipeline

  1. clc (Clarion compiler)

    • Single pass, deterministic.
    • Emits LLVM IR or WebAssembly.
  2. clarion‑lint

    • Style, naming, cyclomatic complexity budgets.
  3. clarion‑prove

    • Generates VCs (verification conditions) from contracts.
    • Delegates to back‑ends (Z3, Coq, Daphne).
    • Embeds proof outlines back into source for audit.
  4. clarion‑trace

    • Traces run‑time values against contracts during testing; produces counter‑example captures for LLM fine‑tuning.
  5. clarion‑doc

    • Extracts spec headers + contracts into human‑readable HTML, hyperlinked to source, tests and proofs.

5. Example Module

### spec:
# module: Payments.Core
# requirement: PAY‑CLR‑002
# author: GPT‑4o
# reviewer: Alice.T
###

import std/crypto.Hash
import std/time.Utc

type Cents = Int where value >= 0

record Invoice:
    id:        Hash256
    issued_at: Utc.Timestamp
    amount:    Cents
end record

enum SettlementStatus:
    | Pending
    | Settled(at: Utc.Timestamp, ref: Hash256)
    | Failed(reason: String)
end enum

func settle(invoice: Invoice, tx_ref: Hash256) 
    -> (status: SettlementStatus)
pre   invoice.amount > 0
post  match status:
          | Settled(at, _) => at >= invoice.issued_at
          | Failed(_)      => true
          | Pending        => false
      end match
cap   net, db
begin
    db.begin_tx()
    let ok = BankApi.sendFunds(invoice.id, invoice.amount, tx_ref)
    if ok then
        db.recordSettlement(invoice.id, tx_ref)
        return Settled(now(), tx_ref)
    else
        db.rollback()
        return Failed("transfer rejected")
    end if
end func

A human can audit:

  • Preconditions/postconditions are explicit.
  • External side effects (net, db) are called out.
  • Failure modes are exhaustively enumerated.

6. Why Humans Can Trust It

  • Structural regularity: Easy to diff; little “action at a distance.”
  • Contracts and capability lists: Immediate visibility into safety & security boundaries.
  • Machine‑checked proofs: Eliminates gap between design and implementation.
  • Round‑trip delimiters & version tags: Changes are traceable; no silent regeneration.
  • Self‑describing metadata: Links each line of code to higher‑level requirement IDs.

7. Why LLMs Can Write It Well

  • Grammar fits context‑free rules → simpler token‑level generation.
  • No semantic dependence on whitespace → fewer accidental errors.
  • Explicit block terminators → models need not balance indentation.
  • Generation pragmas let prompts steer style and complexity targets.
  • Stable markers allow fine‑grained edits without reconstructing the file.

8. Relationship to Existing Languages

  • Ada/SPARK inspired contracts and determinism, but Clarion is more ergonomic for “everyday” coding.
  • Rust influenced ownership; Clarion lifts capability control to the type system.
  • F★, Dafny supply the verification lineage.
  • Go lent the structured‑concurrency philosophy.
  • Python readability goals remain, but ambiguity is removed.

Final Thoughts

Clarion deliberately sacrifices some brevity and syntactic sugar in exchange for:

  • Audit clarity – humans must see exactly what the AI meant.
  • Mechanical soundness – every compile target, proof obligation and capability is machine‑checkable.
  • Collaborative workflows – the language is equally hospitable to deterministic tooling, continuous integration and regenerative AI agents.

This balance aims to turn AI‑generated code from a black box into a transparently engineered artifact that real teams can rely on—while still letting language models move fast.

Opinions

That func -> end func syntax is ugly, but ChatGPT is up front that the language isn’t designed for “brevity and syntactic sugar”. I’m almost curious enough to see how long it would take the AI to create a complete language spec and compiler. Almost.


Random Strings on the Command Line

December 19, 2024

Browsing through some old files today I came across this note:

To Generate A Random String:

tr -dc A-Za-z0-9_ < /dev/urandom | head -c 8 | xargs

Curious if it still worked, I pasted it into my terminal and, unsurprisingly, was met with an error:

tr: Illegal byte sequence

The tr utility is one of the many old-school Unix programs with history reaching way back to System V. It stands for “translate characters”, and with the -dc flags it should have deleted everything from the input except the letters A-Z, both upper and lower case, the digits 0 through 9, and the underscore character. The “Illegal byte sequence” error means it was really not happy with the input it was getting from /dev/urandom.

On macOS, the pseudo-device /dev/urandom is, according to the man page, “a compatibility nod to Linux”. The device generates random bytes, so if we read it we’ll get back raw binary data that looks like:

00010101 01011001 10111101

The reason the command is not working like it used to is that most modern computing systems expect text to be encoded as UTF-8. When tr gets the stream of random bytes from /dev/urandom, it expects those bytes to form valid UTF-8 sequences that it can translate into printable characters on the screen. Since we are intentionally generating random bytes, we might get a few that decode properly, but eventually we’ll hit the “illegal byte sequence” error above.
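The same thing is easy to demonstrate in Python: grab some random bytes and try to decode them as UTF-8, and the decode almost always blows up. This is just an illustration of the point, not part of the original command.

import os

# Random bytes are almost never a valid UTF-8 sequence, which is the same
# reason tr chokes on /dev/urandom under a UTF-8 locale.
chunk = os.urandom(32)
try:
    print(chunk.decode("utf-8"))
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)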

To fix the problem, all we need to do is set LC_ALL=C before running tr:

LC_ALL=C tr -dc A-Za-z0-9_ < /dev/urandom | head -c 14 | xargs

Setting LC_ALL=C switches the locale back to POSIX, the original C ASCII encoding for text. That means when tr is fed a random string of bytes, it interprets each byte as a single character according to the ASCII table, which looks something like this:

Character    ASCII Decimal    ASCII Hexadecimal    Binary Representation
A            65               41                   01000001
B            66               42                   01000010
C            67               43                   01000011

Now each byte is interpreted as a character that matches the list passed as an argument to tr.

➜ LC_ALL=C tr -dc A-Za-z0-9_ < /dev/urandom | head -c 14 | xargs

Rhac_WGis7tHzS

So, to break down each command in the pipeline:

  • tr: filters out all characters except those in the sets A-Z, a-z, 0-9, and _
  • head -c 14: displays the first 14 characters of the input from tr
  • xargs: adds a nice newline character at the end of the string, so it’s easy to copy.

This command could easily be adapted to use base64 instead of tr, without setting LC_ALL=C, if you wanted more random characters in the string:

base64 < /dev/random | head -c 14 | xargs

Expanding head -c to 34 or so makes for a nice password generator.

In fact, I’ve wrapped this up as pgen in my .zshrc:

pgen()
{
    base64 < /dev/random | head -c 32 | xargs
}

There are almost certainly easier ways to generate a random string in the shell, but I like this, and it works for me.


Update: The good Dr. Drang suggested ending the pipeline and running echo instead of xargs for clarity, which makes a lot of sense to me. I updated pgen to base64 < /dev/random | head -c 32; echo.


The Tao of Cal

December 4, 2024

The Tao of Cal - Cal Newport

With the end of year rapidly approaching, and people finding themselves with some spare thinking time as work winds down for the holidays, I thought it might be fun to try to summarize essentially every major idea I discuss in one short primer.

I’m a big fan of Cal Newport’s work. I’d like to quote the entire post, but I’ll just post this one sentence. Read the rest of this article, then buy his books and read those too. Follow his advice and live a better life.


Godot Isn't Making it

December 4, 2024

Godot Isn’t Making it

Outside of a miracle, we are about to enter an era of desperation in the generative AI space. We’re two years in, and we have no killer apps — no industry-defining products — other than ChatGPT, a product that burns billions of dollars and nobody can really describe. Neither Microsoft, nor Meta, nor Google or Amazon seem to be able to come up with a profitable use case, let alone one their users actually like, nor have any of the people that have raised billions of dollars in venture capital for anything with “AI” taped to the side — and investor interest in AI is cooling.

Edward Zitron seems to be a rare voice of frustrated reason in the tech industry. He’s very critical of AI, and, more and more, I’m thinking rightfully so. OpenAI is spending over $2 to make $1, burning through billions with no path to profitability.

Couple that with the environmental cost of AI (and its just plain awful cousin, crypto currency) and the unreliability of the generated answers, and I’m wondering just where all of this goes in the next year or so.


Gross Apple Marketing

October 29, 2024

I’m not sure what’s going on over in Cupertino for them to think that any of the recent Apple Intelligence ads they’ve been running are a good idea. They’re cringy at best, and honestly just flat out insulting.

In one, a schlub writes an email to his boss and uses AI to make it sound ‘more professional’; in another, a young woman uses it to lie about remembering an acquaintance’s name. In another, the same young woman again uses it to lie about reading an email from a colleague, to her face, while she’s sitting with her. In yet another, linked to recently by Scott McNulty, a woman uses AI to lie to her husband about getting him something for his birthday.

If this is what Apple thinks their AI is for, I honestly don’t know that I want any part of it.

Compare and contrast with the video I posted yesterday, and with this beautiful animation from Canonical.

I’ve watched that little animation several times, and they tell a better story in a minute twenty-five than all of Apple’s AI commercials combined.


Scout

October 28, 2024

I’m rooting for these guys. If they can pull off this truck at the $50-$60k mark, I think they are going to have a winner. I’ve been looking at electric trucks for a while, and I’m excited to see another entry in the market. And what a fantastic video:

A proper body-on-frame truck, 10,000 lb towing capacity, all electric, made in the USA. Count me in.


The Manual with Tim Walz

September 21, 2024

Love this guy. Patrick Rhone calls him folksy, I agree.


Loading and Indexing SQLite

October 19, 2023

What a difference a couple of lines of code can make.

I recognize that databases have always been a weak point for me, so I’ve been trying to correct that lately. I have a lot of experience with management of the database engines, failover, filesystems, and networking, but too little working with the internals of the databases themselves. Early this morning I decided I didn’t know enough about how database indexes worked. So I did some reading, got to the point where I had a good mental model for them, and decided I’d like to do some testing myself. I figured 40 million records was a nice round number, so I used fakedata to generate 40 million SQL inserts that looked something like this:

INSERT INTO contacts (name,email,country) VALUES ("Milo Morris","pmeissner@test.tienda","Italy");
INSERT INTO contacts (name,email,country) VALUES ("Hosea Burgess","kolage@example.walmart","Dominica");
INSERT INTO contacts (name,email,country) VALUES ("Adaline Frank","shaneIxD@example.talk","Slovenia");

I saved this as fakedata.sql, piped it into sqlite3, and figured I’d just let it run in the background. After about six hours I realized this was taking a ridiculously long time, and I estimated I’d only loaded about a quarter of the data. I believe that’s because SQLite was treating each INSERT as a separate transaction.

A transaction in SQLite is a unit of work. SQLite ensures that the write to the database is Atomic, Consistent, Isolated, and Durable, which means that for each of the 40 million lines I was piping into sqlite3, the engine was ensuring that every line was fully committed to the database before moving on to the next line. That’s a lot of work for a very, very small amount of data. So, I did some more reading and found a recommendation to explicitly wrap the entire load in a single transaction, so my file now looked like:

BEGIN TRANSACTION;

INSERT INTO contacts (name,email,country) VALUES ("Milo Morris","pmeissner@test.tienda","Italy");
INSERT INTO contacts (name,email,country) VALUES ("Hosea Burgess","kolage@example.walmart","Dominica");
INSERT INTO contacts (name,email,country) VALUES ("Adaline Frank","shaneIxD@example.talk","Slovenia");

COMMIT;

I set a timer and ran the import again:

➜  var time cat fakedata.sql| sqlite3 test.db
cat fakedata.sql  0.07s user 0.90s system 1% cpu 1:13.66 total
sqlite3 test.db  70.81s user 2.19s system 98% cpu 1:13.79 total

So, that went from 6+ hours to about 71 seconds. And I imagine if I did some more optimization (possibly using the Write Ahead Log?) I might be able to get that import faster still. But a little over a minute is good enough for some local curiosity testing.
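If I were loading the data from Python instead of piping SQL into the sqlite3 shell, the same trick applies: do the whole batch in one transaction, and maybe experiment with the journal mode. A rough, untested sketch using the same fake rows:

import sqlite3

conn = sqlite3.connect("test.db")
conn.execute("PRAGMA journal_mode=WAL")  # optional; the thing I'd experiment with next
conn.execute("CREATE TABLE IF NOT EXISTS contacts (name TEXT, email TEXT, country TEXT)")

rows = [
    ("Milo Morris", "pmeissner@test.tienda", "Italy"),
    ("Hosea Burgess", "kolage@example.walmart", "Dominica"),
    ("Adaline Frank", "shaneIxD@example.talk", "Slovenia"),
]

# One transaction for the whole batch instead of one per INSERT
with conn:
    conn.executemany("INSERT INTO contacts (name, email, country) VALUES (?, ?, ?)", rows)

conn.close()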

Indexes

So… back to indexes.

Indexing is a way of sorting a number of records on one or more fields. Creating an index on a field in a table creates another data structure that holds the field value and a pointer to the record it relates to. That index structure is kept sorted, which allows binary searches to be performed on it.

One good analogy is the index of a physical book. Imagine that a book has ten chapters and each chapter has 100 pages. Now imagine you’d like to find all instances of the word “continuum” in the book. If the book doesn’t have an index, you’d have to read through every page in every chapter to find the word.

However, if the book is already indexed, you can find the word in the alphabetical list, which will then have a pointer to the page numbers where the word can be found.

The downside to the index is that it takes additional space. In the book analogy, while the book itself is 1000 pages, we’d need another ten or so for the index, bringing the total size up to 1010 pages. The same goes for a database: the additional index data structure needs room for both a copy of the field being indexed and a small (4-byte, for example) pointer to the record.
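The same idea in miniature, as a toy Python sketch (nothing to do with SQLite’s actual B-tree internals, and the rows are made up): keep a sorted list of (value, row number) pairs and binary-search it with bisect.

import bisect

table = [("Milo Morris", "Italy"), ("Hosea Burgess", "Dominica"),
         ("Adaline Frank", "Slovenia"), ("Hank Perry", "Iowa")]

# The "index": (name, rowid) pairs, kept sorted by name
name_index = sorted((name, rowid) for rowid, (name, _) in enumerate(table))

def lookup(name):
    # Binary search the index instead of scanning every row
    i = bisect.bisect_left(name_index, (name,))
    if i < len(name_index) and name_index[i][0] == name:
        return table[name_index[i][1]]
    return None

print(lookup("Hank Perry"))   # ('Hank Perry', 'Iowa')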

Oh, and the results of creating the index are below.

SELECT * from contacts WHERE name is 'Hank Perry';
Run Time: real 2.124 user 1.771679 sys 0.322396


CREATE INDEX IF NOT EXISTS name_index on contacts (name);
Run Time: real 22.129 user 16.048308 sys 2.274184


SELECT * from contacts WHERE name is 'Hank Perry';
Run Time: real 0.003 user 0.001287 sys 0.001598

That’s a massive improvement. And now I know a little more than I did.