121 points | by thunderbong | 2 days ago
https://github.com/sequinstream/sequin
Any reason we're seeing so many CDC tools pop up?
Overall the Confluent ecosystem feels targeted at “data engineer” use cases, so if you want to build a reactive product it’s not a great fit. I’m not sure what performance target the Debezium Postgres connector maintainers have, but I get the sense it isn’t ambitious, because there’s so little documentation about performance optimization; that data ecosystem still feels contemporary with the “nightly batch job” era, whereas product people today want 0ms latency.
If you look at backend infrastructure there’s a clear trope of “good idea implemented in Java becomes the standard, but re-implementing it in $AOT_COMPILED_LANGUAGE gives a big advantage”:
- Cassandra -> ScyllaDB
- Kafka -> RedPanda, …
- Zookeeper -> ClickHouse Keeper, Consul, etcd, …
- Debezium -> All these thingies
There’s also a lot of hype around Postgres right now, so there’s a bit of a VC-funded Cambrian explosion going on, and I think a lot of these will die off as a clear winner emerges.
When I looked for something ~1 year ago to dump to S3 (object storage) they all sucked in some way.
I'm also of the opinion Postgres gives you a pretty "raw" interface with logical replication so a decent amount of building is needed and each person is going to have slightly different requirements/goals.
I haven't looked recently, but hopefully these do a better job handling edge cases like TOASTed values, schema changes, and ideally initial full loads.
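To illustrate how raw the interface is: with the built-in `test_decoding` output plugin, each change arrives as a plain text line you parse yourself, and an unchanged TOASTed column shows up as the literal placeholder `unchanged-toast-datum` instead of its value (unless you set `REPLICA IDENTITY FULL`). A minimal sketch of what consumers end up writing; the regexes and the per-(table, column) cache are illustrative assumptions, not a production decoder (a real one would key the cache by primary key):

```python
import re

# test_decoding emits lines like:
#   table public.users: UPDATE: id[integer]:1 bio[text]:unchanged-toast-datum
TABLE_RE = re.compile(r"^table (\S+): (INSERT|UPDATE|DELETE): (.*)$")
COL_RE = re.compile(r"(\w+)\[[^\]]+\]:('(?:[^']|'')*'|\S+)")

def parse_change(line, toast_cache):
    """Parse one test_decoding line; patch unchanged TOAST columns from
    the last value seen (cache keyed per table/column for brevity only)."""
    m = TABLE_RE.match(line)
    if not m:
        return None
    table, op, cols = m.groups()
    row = {}
    for name, raw in COL_RE.findall(cols):
        if raw == "unchanged-toast-datum":
            # Value was not re-sent in WAL; fall back to a local copy.
            row[name] = toast_cache.get((table, name))
        else:
            row[name] = raw.strip("'")
            toast_cache[(table, name)] = row[name]
    return {"table": table, "op": op, "row": row}
```

Every tool in this space is some variation of this loop plus schema tracking, which is why they all differ at the edges.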
* pgstream: https://github.com/xataio/pgstream
* pg_flo: https://github.com/pgflo/pg_flo
Are there others? Each of them has slightly different angles and messaging, but it is interesting to see.
I’ve tried many things, including excluding those tables from replication with a publication filter, but this still causes “events”.
Does anyone have pointers on CDC for busy tables?
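One pattern that helps regardless of tool: even when a publication filters tables out, the replication slot still advances through their WAL, so the consumer has to keep acknowledging the LSN of messages it skips, or WAL piles up on the server. A hedged sketch of that "ack everything, process selectively" loop; the `(lsn, table, payload)` stream shape and the `WATCHED` set are stand-ins for whatever client library you use:

```python
# Hypothetical consumer loop: process only watched tables, but
# acknowledge every LSN so the slot never retains WAL for the
# busy tables we are ignoring.
WATCHED = {"public.orders"}

def consume(stream, handler):
    """stream yields (lsn, table, payload); handler receives only
    watched changes. Returns the last LSN acknowledged."""
    last_acked = None
    for lsn, table, payload in stream:
        if table in WATCHED:
            handler(table, payload)
        # Ack unconditionally -- skipped events still advance the slot.
        last_acked = lsn
    return last_acked
```

In a real consumer the "ack" would be the library's flush/feedback call rather than a return value.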
That said, their implementation is kinda poor, since it allows overlapping date ranges when a row gets updated multiple times per day. When you SQL-join to that kind of SCD2 table by a given date, you can easily get duplicates.
This can be avoided by A) updating old rows to end-date yesterday rather than today, and B) when a row begins and ends on the same day, setting the start or end date to NULL or a hardcoded sentinel, e.g. turning "2023-01-01 to 2023-01-01" into "2023-01-01 to 0001-01-01". Those rows won't show up in joins, but the change remains visible/auditable, and you get exactly one row for every given date.
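The scheme in A) and B) can be sketched as a small close-out routine; this is a pure-Python illustration of the date logic only, with the column names and the 0001-01-01 sentinel taken from the example above:

```python
from datetime import date, timedelta

SENTINEL = date(1, 1, 1)  # "ancient" end-date for same-day lifetimes

def close_out(row, today):
    """End-date an SCD2 row that is being superseded today.

    A) Normal case: end-date yesterday, so the replacement row can
       start today without overlapping.
    B) Same-day case: the row also began today, so park it on the
       sentinel end-date; it stays auditable but never matches a
       date-range join.
    """
    if row["start"] == today:
        row["end"] = SENTINEL
    else:
        row["end"] = today - timedelta(days=1)
    return row
```

With this rule, a join on `start <= d AND d <= end` returns at most one version per key for any date `d`.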
Currently evaluating https://sequinstream.com/ which claims sub-200ms latency, but has a lot of extras that I don’t need and a lighter weight alternative would be nice.
I don’t love Debezium but I also don’t love Erlang, plenty of us have scars from RabbitMQ’s weird sharp edges…
The main thing we want to communicate is that we're able to keep up with workloads Debezium can handle.
(And, re: RabbitMQ, I wouldn't write off a platform based on a single application built on that platform :) )
I can't remember the name, but I saw a Ruby-based tool on Hacker News a few months ago that'd automate logical replication setup and failover for you.
I believe that's a pretty standard way to provide "HA" postgres. (We use Patroni for our HA setup)
Clients will have to retry in-flight transactions, but that's a basic fault-tolerance requirement anyway.
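That retry requirement is the standard pattern: on failover the connection drops mid-transaction, the server rolls the transaction back, and the client re-runs the whole thing against the new primary. A generic sketch, assuming the transaction is a callable that's safe to repeat from scratch; the `ConnectionError` type and backoff values are placeholders for your driver's actual connection-failure exception:

```python
import time

def run_with_retry(txn, attempts=3, backoff_s=0.1):
    """Re-run a transaction callable after connection failures, as
    needed when the primary fails over. `txn` must be idempotent or
    fully re-runnable (the disconnect rolled it back server-side)."""
    for attempt in range(attempts):
        try:
            return txn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
```

Tools like Patroni handle the server-side promotion; this client-side loop is the other half of "HA".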
I am not entirely sure, but I think CloudNativePG (a Kubernetes operator) can also be used for that.
However, this only does postgres-postgres, so it's a lot more limited compared to Debezium.