43 points | by vortex_ape2 天前
Context: When adding Clickhouse to Langfuse [1] to store all observability data, the migration DX was a big challenge. Some migrations need to copy/reorder data between tables which can take a long time and thus need be performed in the background/async. Many teams run Langfuse self-hosted and reducing downtime of an upgrade is important, thus, we added some custom logic to run these “background migrations” for now [2]. For regular migrations we use golang-migrate, works decently although DX isn’t as good as eg Prisma for Postgres.
[1] OSS LLM Observability, https://github.com/langfuse/langfuse
My pain comes from needing to modify the schema in such a way it requires a data migration, reprocessing data to some degree, and managing the migration of that data in a sane way.
You can't just run a simple `INSERT INTO table SELECT * FROM old_table`, or anything like that because if the data is large, it takes forever and a failure in the middle could be fairly difficult to recover from.
So what I do is I split the migration into time-based chunks, because nearly every table has a time component that is immutable after writing, but I really want a migration tool that can figure out what that column is, what those chunks are, and incrementally apply a data migration in batches so that if one batch fails I can go in there to investigate and know exactly how much progress on the data migration has been made.
For materialized columns, it's a "easy" (not really because you still need to monitor the backfill) because we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization that can bring the load up on clickhouse because it still creates a lot of background mutations that we need to monitor, because they can fail due to all sorts of reasons (memory limit or disk space errors), in which case we need to jump in and fix things by hand.
For materialized tables, it's a bit harder because we need to write custom scripts to load data in day-wise (or month, depending on the data size) chunks like you mentioned, because an `INSERT INTO table SELECT * FROM another_table` will for sure run into memory limit errors depending on the data size.
I would love to think more about this to see if it's something that would make sense to handle within Houseplant.
Would love to chat more here or there if you're keen!
On the other, I've definitely had trouble with getting an existing migrations tool to work with Clickhouse.. currently using golang-migrate but it's not painless. I think Atlas might have the right idea but I'm hesitant to lock my company into commercial tooling. Sqlmesh may also kind of do what's needed, but its focus is more on transforming tables of data, managing your ddl happens as a happy accident along the way.
As to what else, I continually fail to see what’s wrong with SQL. It’s incredibly easy to read, in plain English, and understand what is being done. ALTER TABLE foo ADD COLUMN bar INTEGER NOT NULL DEFAULT 0… hmm, wonder what that’s doing?
For people who say you get too many .sql files doing it this way: a. Get your data models right the first time b. Build a tool to parse and combine statements, if you must.
This could be done with SQL comments, but having it in a structured format makes it more reliable to parse and validate programmatically. I do see why YAML's quirks could become a problem in a tool that's meant to help you make sure your database is in order, we didn't run into issues like country codes or numbers being interpreted as sexagesimal (yet).
Perhaps a middle ground would be to keep the actual migrations in .sql files for readability, while using a separate metadata file (in JSON or TOML) for the orchestration details. What do you think?
I guess my stance is that if you need some kind of config file to create the migration, then fine, but as a DBRE, it’s much easier for me to reason about the current state by seeing the declarative output than by mentally summing migration files. As a bonus, this lets you see quite easily if you’ve done something silly like creating an additional index on a column that has a UNIQUE constraint (though you’d still have to know why that’s unnecessary, I suppose).