-
“Live Search / ELK on the Lake”:
Same ELK tools, but the scalability, cost effectiveness & durability of the lake, powered by ChaosSearch.
Recommended for log search by Corey Quinn; pricing looks reasonable too.
Tags: search elk kibana chaossearch logs data-lake ops via:cquinn
Justin's Linklog Posts
-
via Ben Schaechter: “a new microsite we’ve launched for the AWS community that helps with understanding billing codes present in either Cost Explorer or the CUR. We profiled the number of distinct billing codes across our customer base and have about ~60k unique billing codes. We hear all the time that FinOps practitioners and engineers are confused about the billing codes present in Cost Explorer or the Cost and Usage Report. Think of these as being things like “Requests-Tier1” for S3 or “CW:GMWI-Metrics” for CloudWatch. There is usually really limited resources for determining what these billing codes are even when you Google around for them.”
Words from an ex-Zizian-adjacent person
It seems there’s now a full-on Mansonesque death cult emerging from the LessWrong/rationalist/effective-altruism community: https://www.sfgate.com/bayarea/article/bay-area-death-cult-zizian-murders-20064333.php
This HN comment was very interesting for background:
[Former member of that world, roommates with one of Ziz’s friends for a while, so I feel reasonably qualified to speak on this.] The problem with rationalists/EA as a group has never been the rationality, but the people practicing it and the cultural norms they endorse as a community.
As relevant here:
1) While following logical threads to their conclusions is a useful exercise, each logical step often involves some degree of rounding or unknown-unknowns. A -> B and B -> C means A -> C in a formal sense, but A -almostcertainly-> B and B -almostcertainly-> C does not mean A -almostcertainly-> C. Rationalists, by tending to overly formalist approaches, tend to lose the thread of the messiness of the real world and follow these lossy implications as though they are lossless. That leads to…
2) Precision errors in utility calculations that are numerically-unstable. Any small chance of harm times infinity equals infinity. This framing shows up a lot in the context of AI risk, but it works in other settings too: infinity times a speck of dust in your eye >>> 1 times murder, so murder is “justified” to prevent a speck of dust in the eye of eternity. When the thing you’re trying to create is infinitely good or the thing you’re trying to prevent is infinitely bad, anything is justified to bring it about/prevent it respectively.
3) Its leadership – or some of it, anyway – is extremely egotistical and borderline cult-like to begin with. I think even people who like e.g. Eliezer [Yudkowsky] would agree that he is not a humble man by any stretch of the imagination (the guy makes Neil deGrasse Tyson look like a monk). They have, in the past, responded to criticism with statements to the effect of “anyone who would criticize us for any reason is a bad person who is lying to cause us harm”. That kind of framing can’t help but get culty.
4) The nature of being a “freethinker” is that you’re at the mercy of your own neural circuitry. If there is a feedback loop in your brain, you’ll get stuck in it, because there’s no external “drag” or forcing functions to pull you back to reality. That can lead you to be a genius who sees what others cannot. It can also lead you into schizophrenia really easily. So you’ve got a culty environment that is particularly susceptible to internally-consistent madness, and finally:
5) It’s a bunch of very weird people who have nowhere else they feel at home. I totally get this. I’d never felt like I was in a room with people so like me, and ripping myself away from that world was not easy. (There’s some folks down the thread wondering why trans people are overrepresented in this particular group: well, take your standard weird nerd, and then make two-thirds of the world hate your guts more than anything else, you might be pretty vulnerable to whoever will give you the time of day, too.)
TLDR: isolation, very strong in-group defenses, logical “doctrine” that is formally valid and leaks in hard-to-notice ways, apocalyptic utility-scale, and being a very appealing environment for the kind of person who goes super nuts -> pretty much perfect conditions for a cult. Or multiple cults, really. Ziz’s group is only one of several.
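Point 1 in that comment is easy to put numbers on. A minimal sketch, assuming every step in a chain of reasoning holds with the same probability and that the steps are independent (both are my simplifying assumptions, not the commenter's):

```python
def chain_confidence(step_confidence: float, steps: int) -> float:
    """Probability that every step in a chain holds.

    Assumes each step holds independently with the same probability,
    which is a simplification for illustration only.
    """
    return step_confidence ** steps


if __name__ == "__main__":
    for steps in (1, 2, 5, 10, 14):
        print(steps, round(chain_confidence(0.95, steps), 4))
```

Under those assumptions, a chain of "95% sure" steps drops below a coin flip once it is 14 steps long.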
Tags: zizians cults extropianism tescreal effective-altruism rationalism lesswrong death-cults
-
The Burrows-Wheeler transform: an algorithm used to prepare data for use with data compression techniques such as bzip2. It permutes the order of characters in a string (S), sorting all the circular shifts of the text in lexicographic order, then extracting the last column and the index of the original string in the set of sorted permutations of S.
Some day when I have lots of free time to spare, I’ll spend a while getting my head around this deep magic, because it’s just amazing that this works.
(via John Regehr)
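For future reference, here is a naive sketch of the Burrows-Wheeler transform as described above: sort the circular shifts, keep the last column and the index of the original string. It is quadratic and only meant for small inputs; real compressors such as bzip2 use far more efficient constructions, and the function names here are my own.

```python
def bwt(s: str) -> tuple[str, int]:
    """Return (last column, index of s among its sorted circular shifts)."""
    if not s:
        return "", 0
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(s)


def inverse_bwt(last_column: str, index: int) -> str:
    """Rebuild the original string from the last column and the index."""
    n = len(last_column)
    if n == 0:
        return ""
    table = [""] * n
    # Repeatedly prepend the last column and re-sort; after n rounds the
    # table holds every circular shift in sorted order.
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]


if __name__ == "__main__":
    transformed, idx = bwt("banana")
    print(transformed, idx)
    print(inverse_bwt(transformed, idx))
```

The useful property for compression is that the last column tends to group identical characters together, as in "nnbaaa" for "banana".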
Tags: compression algorithms burrows-wheeler-transform bzip2 via:john-regehr magic text
-
This is fantastic — a newly-discovered species of fungus does the same trick as Ophiocordyceps in Brazil; it infects the brains of orb-weaving cave spiders in Ireland, and induces them to leave their lairs or webs, and migrate to die in an exposed situation, in order to favor dispersal of the fungal spores.
Ophiocordyceps is, of course, the inspiration for the zombie-forming fungus in The Last Of Us.
Tags: cordyceps fungi ireland spiders zombies fungus nature gross
The Billion Docs JSON Challenge: ClickHouse vs. MongoDB, Elasticsearch, and more
This buries the lede somewhat, but here’s the key bit:
We built a new powerful JSON data type for ClickHouse with true column-oriented storage, support for dynamically changing data structures without type unification and the ability to query individual JSON paths really fast. […] ClickHouse stores the values of each unique JSON path as native columns, allowing high data compression and, as we are demonstrating in this blog, maintaining the same high query performance seen on classic types.
The performance results are very impressive, and the new type is notably efficient in disk space usage too.
Tags: clickhouse benchmarks performance json querying columnar-storage mongodb elasticsearch databases storage
-
The moon may have a timezone of its own soon, Coordinated Lunar Time (LTC):
Due to the moon’s lower gravity and its motion relative to Earth, moon time passes 56 microseconds faster each earth day. As a result, an atomic clock on Earth would run at a different rate than an atomic clock on the moon.
Similar to how UTC is determined, the memo suggests “an ensemble of clocks” deployed to the moon might be used to set the new time standard.
(via David Cuthbert)
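To get a feel for the scale, here is a quick back-of-the-envelope calculation using only the 56 microseconds per Earth day figure quoted above. The function name and the 365.25-day year are my own choices.

```python
def lunar_drift_seconds(earth_days: float, drift_us_per_day: float = 56.0) -> float:
    """Seconds a lunar clock gains over an Earth clock after `earth_days`."""
    return earth_days * drift_us_per_day * 1e-6


if __name__ == "__main__":
    print(lunar_drift_seconds(365.25))       # gain over one year, in seconds
    print(lunar_drift_seconds(365.25 * 50))  # gain over fifty years, in seconds
```

That works out to roughly 20 milliseconds per year, small for humans but large for anything doing navigation-grade timing.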
Tags: via:david-cuthbert moon time timezones ltc
Understanding the BM25 full text search algorithm
“BM25, or Best Match 25, is a widely used algorithm for full text search. It is the default in Lucene/Elasticsearch and SQLite, among others.” At its heart, it’s an interesting probabilistic ranking scheme, involving the Inverse Document Frequency of a term, term frequency in a single document, and the document length. (Via Tony Finch)
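A minimal sketch of how those three ingredients combine, using the textbook Okapi BM25 formula with a commonly used non-negative IDF variant. The tokenizer (lowercase, split on whitespace) and the defaults k1=1.2, b=0.75 are assumptions for illustration; this is not the exact Lucene or SQLite implementation.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with textbook BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat around the garden",
        "full text search with ranking",
    ]
    print(bm25_scores("cat", corpus))
```

With equal term frequency, the shorter document scores higher, which is the document-length effect mentioned above.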
Tags: via:fanf lucene elasticsearch search text algorithms sqlite full-text bm25
LLM-Driven Code Completion in JetBrains IDEs
JetBrains have come up with a new relatively-lightweight LLM-driven code generation option, constrained to producing single line suggestions:
The length of the completion suggestions is a trade-off. While longer suggestions do tend to reduce how many keystrokes you have to make, which is good, they also increase the number of reviews required on your end. Taking the above into account, we decided that completing a single line of code would be a fair compromise.
Some key features:
- It works locally and is available offline. This means you can take advantage of the feature even if you aren’t connected to the internet.
- It doesn’t send any data from your machine over the internet. The language models that power full line code completion run locally, which is great for two reasons. First, your code remains safe, as it never leaves your machine. Second, there are no additional cloud-related expenses – that’s why this feature comes at no additional cost.
Also, customer code is never used for training.
I’ve used this (in RubyMine), and found it fairly useful; it’s good for generating the obvious next line, but is easily ignored when that’s not what’s needed. Not bad at all.
Tags: coding code-completion jetbrains ides java ruby llms ai code-generation rubymine intellij
-
Crazy stuff. Elite, ported to the Commodore VIC 20 (albeit with a 32K expansion):
VIC 20 Elite is based on the C-64 source. VIC 20 specific graphics, text, keyboard & joystick input, and sound routines were written from scratch to replace the corresponding C-64 code.
Of course, the complete enhanced Elite won’t fit within the VIC 20’s limited memory, so some features had to be left out. Following the original 1984 BBC Cassette and Acorn Electron version, the VIC 20 version omits extended planet descriptions, planetary details (craters and meridians), and the missions that appear further on in the game. The pause mode options are dropped, and there is no Find Planet option in Galactic Chart (that would be only really useful during missions).
(via Sleepy from FP)
Tags: retrogaming commodore emulation gaming history elite vic-20
-
“a Go heap object reference analysis tool based on delve: It can display the space and object count distribution of Go memory references, which is helpful for efficiently locating memory leak issues or viewing persistent heap objects to optimize the garbage collector (GC) overhead.”
Nice to see Go supporting similar debugging/optimisation tools to those offered by the JVM.
Tags: go heap memory gc memory-leaks
Artsy’s Technology Choices evaluation process
This is a nice way to evaluate new technology options, from Artsy:
We want to accomplish a lot with a lean team, which means we must choose stable technologies. However, we also want to adopt best-of-breed technologies or best-suited tools, which may need work or still be evolving. We’ve borrowed from ThoughtWorks’ Radar to define the following stages for evaluating, adopting, and retiring technologies:
- Adopt: Reasonable defaults for most work. These choices have been exercised successfully in production at Artsy and there is a critical mass of engineers comfortable working with them.
- Trial: These technologies are being evaluated in limited production circumstances. We don’t have enough production experience to recommend them for high-risk or business-critical use cases, but they may be worth consideration if your project seems like a fit.
- Assess: Technologies we are interested in and maybe even built proofs-of-concept for, but haven’t yet trialed in production.
- Hold: Based on our experience, these technologies should be avoided. We’ve found them to be flawed, immature, or simply supplanted by better alternatives. In some cases these remain in legacy production uses, but we should take every opportunity to retire or migrate away.
(Via Lar Van Der Jagt on the Last Week In AWS slack instance)
Tags: via:lwia tech technology radar choices evaluation process architecture planning tools
-
Some good thoughts from a SlateDB dev, regarding initial principles for errors in SlateDB, derived from experience with Kafka:
- Keep public errors separate from internal errors. The set of public errors should be kept minimal and new errors should be highly scrutinized. For internal errors, we can go to town since they can be refactored and consolidated over time without affecting the user.
- Public errors should be prescriptive. Can an operation be retried? Is the database left in an inconsistent state? Can a transaction be aborted? What should the user actually do when the error is encountered? The error should have clear guidance.
- Prefer coarse error types with rich error messages. There are probably hundreds of cases where the database can enter an invalid state. We don’t need a separate type for each of them. We can use a single FatalError and pack as much information into the error message as is necessary to diagnose the root cause.
(via Chris Riccomini)
Tags: errors api design slatedb api-design error-handling exceptions architecture
7 Lessons from building a small-scale AI application
These are good. tl;dr:
- AI programming is stochastic;
- Data quality is real work;
- Models are only as good as the evaluation;
- Trust/Quality is the #1 issue;
- Your training pipeline is your core IP;
- AI is yet another distributed system;
- Don’t buy the AI library hype
(via Niall Murphy)
Optimizing Java Apps on Kubernetes
“Optimizing Java Applications on Kubernetes: beyond the Basics”: Bruno Borges, at the InfoQ Dev Summit Boston, discusses the strategies for enhancing Java application performance on Kubernetes, focusing on leveraging JVM ergonomics, and managing garbage collection processes. Some interesting tips here.
Tags: kubernetes java eks resources ops scaling scalability gc optimization jvm
-
Bookmarking this in case I need it: I have a blog-related use case, and I don’t want LLM scrapers killing my blog because of it.
Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don’t support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
The most hilarious part about how Anubis is implemented is that it triggers challenges for every request with a User-Agent containing “Mozilla”. Nearly all AI scrapers (and browsers) use a User-Agent string that includes “Mozilla” in it. This means that Anubis is able to block nearly all AI scrapers without any configuration.
Tags: throttling robots scraping ops llms bots hashcash tarpits
Cost-optimized archival in S3 using s3tar
“s3tar” is new to me, and looks like a perfect tool for this common use case: aggregation and archival of existing data on S3, which often requires aggregation into larger files to take advantage of S3 Glacier storage classes (which have a minimum file size of 128 KB).
s3tar optimizes for cost and performance on the steps involved in downloading the objects, aggregating them into a tar, and putting the final tar in a specified Amazon S3 storage class using a configurable “--concat-in-memory” flag. … The tool also offers the flexibility to upload directly to a user’s preferred storage class or store the tar object in S3 Standard storage and seamlessly transition it to specific archival classes using S3 Lifecycle policies.
The only downside of s3tar is that it doesn’t support recompression, which is also a common enough requirement — especially after aggregation of multiple small input files into a larger, more compressible archive. But hey, can’t have everything.
s3tar: https://github.com/awslabs/amazon-s3-tar-tool
Tags: s3tar amazon s3 compression storage archival architecture aggregation logs glacier via:lwia
-
It’s great to see pushback against React, Angular, and other SPA architectures for web app delivery. I never got my head around the applicability of these for many web app use cases so this is just confirming my biases :)
Related Mastodon thread: https://toot.cafe/@slightlyoff/113868445222841008
Tags: react angular spa web-apps webdev javascript html apps
-
Since 2019 (!), the AWS load balancer controller component doesn’t safely handle pod shutdowns when the ALB target-type is set to “ip”. This is the bug report, still open…
Tags: aws load-balancing alb eks kubernetes ops bugs
Cryptocurrency “market caps” and notional value
Excellent explainer from Molly White on the risk of quoting “market caps” for memecoins:
The “market cap” measurement has become ubiquitous within and outside of crypto, and it is almost always taken at face value. Thoughtful readers might see such headlines and ask questions like “how did a ‘$2 trillion market’ tumble without impacting traditional finance?”, but I suspect most accept the number.
When crypto projects are hacked, there are headlines about hackers stealing “$166 million worth” of tokens, when in reality the hackers only could cash out 2% of that amount (around $3 million) because their attempts to sell illiquid tokens caused the price to crash.
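The gap between notional and realisable value is easy to reproduce with toy numbers. The sketch below assumes the tokens are sold into a constant-product (x * y = k) liquidity pool with no fees; that mechanism and every figure in it are hypothetical and chosen for illustration, not taken from the article.

```python
def notional_value(tokens: float, spot_price: float) -> float:
    """The headline figure: holdings multiplied by the last quoted price."""
    return tokens * spot_price


def amm_sale_proceeds(tokens_sold: float, pool_tokens: float, pool_usd: float) -> float:
    """USD received for selling into a constant-product pool, ignoring fees."""
    k = pool_tokens * pool_usd
    new_pool_usd = k / (pool_tokens + tokens_sold)
    return pool_usd - new_pool_usd


if __name__ == "__main__":
    # Hypothetical pool: 1M tokens against $1M, so the spot price is $1.
    pool_tokens, pool_usd = 1_000_000.0, 1_000_000.0
    holdings = 49_000_000.0  # hypothetical stolen tokens

    notional = notional_value(holdings, pool_usd / pool_tokens)
    proceeds = amm_sale_proceeds(holdings, pool_tokens, pool_usd)
    print(notional, proceeds, proceeds / notional)
```

In this toy setup, “$49 million worth” of tokens cashes out for under $1 million, because each sale pushes the price down for the next one.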
Tags: molly-white memecoins bitcoin rug-pulls scams liquidity market-caps cryptocurrency
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
A discussion of the popular byte pair encoding (BPE) tokenization algorithm, which is used in large language models like GPT-2 to GPT-4, Llama 3, etc. to tokenize text. The BPE algorithm was originally described in 1994: “A New Algorithm for Data Compression” by Philip Gage.
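The core training loop is small enough to sketch here: start from raw UTF-8 bytes, repeatedly merge the most frequent adjacent pair into a new token id, and record the merges so the text can be decoded again. This is a simplified byte-level variant with names of my own choosing; it is not the article's code, and production tokenizers add pre-tokenization and special tokens on top.

```python
from collections import Counter


def merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict[tuple[int, int], int]]:
    """Learn up to `num_merges` merges; return the encoded text and the merge table."""
    tokens = list(text.encode("utf-8"))  # ids 0..255 are the raw bytes
    merges: dict[tuple[int, int], int] = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing left worth merging
            break
        merges[pair] = next_id
        tokens = merge(tokens, pair, next_id)
        next_id += 1
    return tokens, merges


def decode(tokens: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand token ids back to bytes using the merges in creation order."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


if __name__ == "__main__":
    text = "low lower lowest low low"
    tokens, merges = train_bpe(text, 5)
    print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(decode(tokens, merges) == text)
```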
Tags: encoding text bpe llms algorithms tokenization parsing
-
“A federated microblogging software for single users. ActivityPub-enabled, Mastodon-compatible API, supports CommonMark and Misskey-style quotes. Hollo is designed for single-users, so you can own your instance and have full control over your data. It’s perfect for personal microblogs, notes, and journals.”
Seems fairly heavyweight, however, so I probably won’t be running it, but it’s a nice take on the single-user-server Fediverse use case.
-
The Irish National Transport Authority have an open data API for realtime public transport information; very cool. “The GTFS-R API contains real-time updates for services provided by Dublin Bus, Bus Éireann, and Go-Ahead Ireland.”
The specification currently supports the following types of information:
- Trip updates: delays, cancellations, changed routes
- Service alerts: stop moved, unforeseen events affecting a station, route or the entire network
- Vehicle positions: information about the vehicles including location and congestion level
Registration is required.
Tags: public-transport buses trains transit nta gtfs apis open-data dublin ireland
Five things privacy experts know about AI
Damien Desfontaines writes some really interesting stuff about Differential Privacy in AI training, and how bad the current situation is with large language models.
Tags: llms ai differential-privacy damien-desfontaines privacy training anonymisation memorization
Why the British government is so into AI
Interesting Bluesky thread on the topic:
The UK Government believes several things:
1) The AI genie is out of the bottle and cannot be put back in
2) Embracing AI would definitely be good for the British economy
3) Enforcing copyright on AI training would put Britain out of step with rest of the world and subsequently…
4) Enforcing copyright would be ineffective as AI would just be trained elsewhere, cutting out Brit creatives entirely
5) Govt’s preferred option is permissive enough to be attractive to AI firms but demands transparency so at least rights holders have some recourse; the alternative is bleaker.
Obviously, I contest all of these beliefs to one degree or another, but this is where the govt is, and it’s useful to understand that. The real crux of the debate, as they see it, is how Britain’s laws can practically deal with the global inevitability of AI. They believe it’s untenable to make Britain a legislative pariah state for AI, and that this would not lead to good outcomes for British creatives anyway. This is a point worth considering when replying to the consultation.
However, the govt says it’s not going to implement policy before it has a technical solution for rights holders to opt-out and chase down infringements. My view is that this is difficult to the point of being pure fantasy, and either means that the govt is not serious about finding a real, effective technical solution, or this policy will be kicked indefinitely down the road. My dinner partner was optimistic a solution could be achieved within the timespan of a year or two. I just don’t buy it.
Government says it has not sided with AI firms over creative industries. However, its understanding of “not taking a side” creates a false equality between massive companies whose business relies on crime and individuals whose livelihoods will be destroyed.
I got the sense that there is no political will whatsoever to seriously challenge firms who offer to spend big in Britain, and that any thought of holding them to account for actual crime is simply considered naive. But we do have a bit of time while govt attempts to confect their magical, easy to use, opt-out solution—time during which one or several of these AI firms might implode, making the true cost more apparent.
Tags: uk government ai policy copyright ip britain economy future
The people should own the town square
Ah, this is welcome news from Mastodon:
We are going to transfer ownership of key Mastodon ecosystem and platform components to a new non-profit organization, affirming the intent that Mastodon should not be owned or controlled by a single individual. […] Taking the first tentative steps almost a year ago, there are already multiple organizations involved with shepherding the Mastodon code and platform. The next 6 months will see the transformation of the Mastodon structures, shifting away from the early days’ single-person ownership and enshrining the envisioned independence in a dedicated European not-for-profit entity.
-
As a modern option for observability through service metrics, ClickHouse seems to be a decent self-hosted choice, integrating with Grafana as described here and collecting data from OpenTelemetry instrumentation in service code. (By many accounts, this avoids some not-great design decisions made in Prometheus.) Bookmarking for reference…
Tags: telemetry metrics service-metrics clickhouse sql grafana observability opentelemetry
-
Nice to see an important public need being met here:
The [Watch Duty] app gives users the latest alerts about fires in their area [in California] and has become a vital service for millions of users in the western U.S. struggling with the seemingly constant threat of deadly wildfires—one major reason it had over 360,000 unique visits from 8:00-8:30 a.m. local time Wednesday. And the man behind Watch Duty promises that as a nonprofit, his organization has no plans to pull an OpenAI and become a profit-seeking enterprise.
-
This is a great Steve Jobs story, from the engineer who wrote v1 of the Mac OS X Dock:
At one point during a trip over, Steve was talking to Bas and asked how things were coming along with the Dock. He replied something along the lines of “going well, the engineer is over from Ireland right now, etc”. Steve left, and then visited my manager’s manager’s manager and said the fateful words (as reported to me by people who were in the room where it happened).
“It has come to my attention that the engineer working on the Dock is in FUCKING IRELAND”.
I was told that I had to move to Cupertino. Immediately. Or else.
I did not wish to move to the States. I liked being in Europe. Ultimately, after much consideration, many late night conversations with my wife, and even buying a guide to moving, I said no.
They said ok then. We’ll just tell Steve you did move.
(via Niall Murphy)
Light Bars for Zoom / Video Conference
Recommended by someone on ITC Slack; improves videoconference lighting nicely.
Tags: video slack videoconferences lighting work
Court docs allege Meta trained LLM models using pirated book trove
This is pretty massive:
The [court] document claims that Meta decided to download documents from Library Genesis — aka. “LibGen” — to train its models. LibGen is the subject of a lawsuit brought by textbook publishers who believe it happily hosts and distributes [pirated] works [….]
The filing from plaintiffs in the Kadrey case claims that documents produced by Meta […] describe internal debate about accessing LibGen, a little squeamishness about using BitTorrent in the office to do so, and eventual escalation to “MZ” [Mark Zuckerberg himself], who approved use of the contentious resource. […]
Another filing claims that a Meta document describes how it removed copyright notifications from material downloaded from LibGen, and suggests the company did so because it realized including such text could mean a model’s output would reveal it was trained on copyrighted material.
US District Court Judge Vince Chhabria also noted that in one of the documents Meta wants to seal, an employee wrote the following:
“If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.”
No shit.
Tags: piracy meta copyright mark-zuckerberg law llama training libgen books
-
A handy tool to test your internet connection for “bufferbloat”, the error condition involving “undesirable high latency caused by other traffic on your network. It happens when a flow uses more than its fair share of the bottleneck. Bufferbloat is the primary cause of bad performance for real-time Internet applications like VoIP calls, video games, and videoconferencing.”
(My home internet connection is currently rating a C: “your latency increased considerably under load”, jumping from a min/mean/p95/max of 10.7, 16.9, 23.7, 30.1ms to 35.3, 98.4, 121.0, 286.0ms under load, yikes, so looks like I need to do some optimising.)
Tags: bufferbloat internet networking optimisation performance testing tools
Waymos don’t stop for pedestrians
Ah here.
“Waymo (aka Google) admits that it trains its robotaxis to break the law. When a Washington Post reporter finds robotaxis fail to stop for pedestrians in marked crosswalk 70% of the time, Waymo says it follows “social norms” rather than laws.
Expert explains: When robotaxis obey law, they don’t go fast enough to compete successfully with Uber, so Google execs ordered engineers to ignore laws.”
Tags: google waymo laws pedestrians safety crosswalks crossings road-safety self-driving-cars
Garbage Day on Meta’s moderation plans
This is 100% spot on, I suspect, regarding Meta’s recently-announced plans to give up on content moderation:
After 2021, the major tech platforms we’ve relied on since the 2010s could no longer pretend that they would ever be able to properly manage the amount of users, the amount of content, the amount of influence they “need” to exist at the size they “need” to exist at to make the amount of money they “need” to exist.
And after sleepwalking through the Biden administration and doing the bare minimum to avoid any fingers pointed their direction about election interference last year, the companies are now fully giving up. Knowing the incoming Trump administration will not only not care, but will even reward them for it.
The question now is, what will the EU do about it? This is a flagrant raised finger in the face of the Digital Services Act.
Tags: moderation content ugc meta future dsa eu garbage-day
-
Via Susie Dent, Word of the Day is ‘uhtcearu’ [ucht-kay-aru, with the ‘ch’ as in the Scottish ‘loch’]: Old English for ‘the sorrow before dawn’, when you lie awake in the darkness and worries crowd your mind.
It’s amazing to realise that this unpleasant phenomenon of neurochemistry is a thing that’s been around for thousands of years.
See also https://www.reddit.com/r/OldEnglish/comments/e7su8n/what_is_the_proper_form_of_uhtceare/
Tags: brains worry words uhtcearu uhtceare dawn morning neurochemistry