281 points | by gernest 5 days ago
https://github.com/vinceanalytics/vince/blob/f0c2c3cc38cbd8c...
My dream for a business is practically dead now. That snippet is a relic of the early days of vince and I will remove it.
I am currently looking for work, and will be maintaining vince as usual (I do a lot of open source stuff) since I also use it with my hobby projects.
I'm struggling to find remote roles now, since remote now means Remote US or Remote EU, and I'm stuck here in Tanzania.
So, don't worry, I also use vince so I will keep hacking on it.
Also why are you using pebble exactly? I was interested in seeing how you're managing your geo databases because that's usually the most mind numbing part of handling analytics if your cloud provider doesn't add that information into the request header already. However, I can't understand why you'd use pebble over something like sqlite.
> Why protocol buffers ?
They are very good for defining API boundaries. In vince we only use them for configuration and admin structures. We use Roaring Bitmap based storage, so the fundamental units persisted are bitmap containers.
> Also why are you using pebble exactly?
Well, vince is write heavy, so any LSM based key value store would have been nice. It happens that pebble is the best option for us.
Also, we don't use transactions (we batch writes and use snapshots for reads), and we rely on pebble's batch Merge API.
The Merge API allows us to do efficient updates. Since we only store bitmap containers, an update is just a container union of the observed values for a key.
Bitmap unions are pretty fast and efficient.
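To make the merge-as-union idea concrete, here is a minimal sketch. It is not vince's code and does not use pebble or a Roaring library: plain Python integers stand in for bitmap containers, and `ToyStore`, `merge`, `to_bitmap` and `from_bitmap` are hypothetical names made up for illustration.

```python
from collections import defaultdict


def merge(existing: int, incoming: int) -> int:
    """Toy merge operator: bitwise OR is the union of two bitsets."""
    return existing | incoming


def to_bitmap(values):
    """Set one bit per observed value (stand-in for a bitmap container)."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits


def from_bitmap(bits):
    """List the positions of the set bits, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


class ToyStore:
    """Hypothetical key-value store whose only write is a merge."""

    def __init__(self):
        self._data = defaultdict(int)

    def apply_batch(self, batch):
        # Writers never read-modify-write; they only hand over new bits,
        # and the store folds them into whatever is already there.
        for key, bitmap in batch:
            self._data[key] = merge(self._data[key], bitmap)

    def get(self, key):
        return self._data[key]


store = ToyStore()
store.apply_batch([
    ("page:/home", to_bitmap([1, 5])),
    ("page:/home", to_bitmap([5, 9])),
])
store.apply_batch([("page:/home", to_bitmap([2]))])
visitors = from_bitmap(store.get("page:/home"))
```

Because union is commutative and idempotent, batches can be applied in any order and replayed without changing the result, which is roughly why a merge-style write suits this kind of data.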
I hope I covered all your questions.
Edit: Just noticed the feature comparison in the readme.
404 page not found
Plausible gets crazy expensive on their hosted option and it is complex to set up (needs Elixir plus high memory requirements)
If Vince gets 1:1 parity with plausible and has the option to use clickhouse, I'll consider moving a few servers and people I know over.
Love that Vince is also a single binary as well.
Twitter, Threads, Mastodon, Bluesky all look the same. Project management apps all reuse the same UI patterns. The "AI" logo looked pretty much the same for all companies for a while. Video sharing websites all use YouTube's layout. Forums like Reddit and HN share quite a lot in their looks.
If you want to display website analytics, you will want to show the most important metrics at a glance, you'll need graphs showing visitors over time, top sources and pages... There is only so much you can do to display those and have users understand what's going on on your website.
What kind of database is this using though? I don't know enough Go to figure it out from the source.
> WARNING: Pebble may silently corrupt data or behave incorrectly if used with a RocksDB database that uses a feature Pebble doesn't support. Caveat emptor!
Slightly worrying to run this in prod for now if there is a risk of silent data corruption, but hopefully in a few years Vince will have drivers for Postgres / ClickHouse.
For reference, the demo is hosted on a $6 Vultr instance. Over the last 3 days it handled about 11.9K pageviews with 4.3K unique visitors.
I have just checked the vultr dashboard.
Bandwidth = 3.37 GB, vCPU usage = 1% (yep, one percent), current charges = $1.06.
Majority of the bandwidth is for outgoing data serving the dashboard.
I carefully designed vince to be extremely efficient for web analytics workloads.
Please give vince a try.
There are many web analytics providers with surprisingly high prices.
We are cheaper and are even planning to create a free tier by making smart use of resources and avoiding overpriced cloud providers.
Do the better paying ad networks all reject you solely because of the language?
"Add the following line to you page source to send data to Vince"
> vince started as a Go port of plausible with a focus on self hosting.
I deployed this on our cloud (excloud.in) in less than 2 mins.
Anyone can use the k8s manifest below to deploy it to their k8s cluster. Just change the admin password before doing so.
https://gist.github.com/lomkju/90fe7500d8cf854bf3b7c2f26aa58...
Does it always pull the latest vince image?
Just FYI, we also have simple helm charts, and the repository is hosted on https://vinceanalytics.com/charts
Oh cool, didn't see that in the docs.
> Does it always pull the latest vince image?
Yes, I haven't specified any tag, so it should default to latest.
How do you deal with location data? Do you purchase a MaxMind DB license or use their free versions?
Both maxmind and db-ip free versions of city data miss city geo id values, rendering city data useless for many cases.
With vince, I had to index and embed the whole city data from the geonames database to work around this.
> Both maxmind and db-ip free versions of city data miss city geo id values, rendering city data useless for many cases.
I work for IPinfo.
I think you might find my conversation with Goatcounter's dev interesting: https://github.com/arp242/goatcounter/issues/765
I pitched him to use our free country database because of MaxMind's EULA issues. MaxMind does not permit distribution of the database and requires end users to use their own token. Moreover, they actually charge thousands of dollars when you distribute the "free" database with a commercial intent.
Now, we have a free IP to Country database that we offer under a straight CC-BY-SA 4.0 license without an EULA. It is free, comes with daily updates, has full accuracy, and you can even commercially redistribute the database (via providing us an attribution).
I understand we do not have a free city database to offer, nor is our database lightweight because we have full accuracy. But you can check it out if you are interested. We do have a version with ASN (ISP) information as well.
Also I am pretty sure Plausible CE doesn't limit number of sites / events, unlike what's listed in "Comparison with Plausible Analytics".
RAM : 1GB
STORAGE: 25 GB
so far bandwidth used is 3.6GB
So, you can successfully deploy vince on low spec servers, depending on your expected traffic.
I found a small bug: if you click Expand in the Top Pages section, the Time on Page column has NaNs.
Dark mode for the dashboard and showing realtime current visitors in the <title> would be great.
> Instead of tagging users with cookies, we count the number of unique IP addresses that accessed your website. Counting IP addresses is an old-school method that was used before the modern age of JavaScript snippets and tracking cookies.
Since IP addresses are considered personal data under GDPR, we anonymize them using a one-way cryptographic hash function. This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. We never store IP addresses in our database or logs.
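The scheme quoted above can be sketched roughly as follows. The quote does not name the hash function or say what else goes into the hash input, so SHA-256 over `salt + ip` is an assumption, and `visitor_id` and `count_unique` are hypothetical helper names.

```python
import hashlib
import secrets


def visitor_id(daily_salt: bytes, ip: str) -> str:
    """Hash the IP with today's salt; only this digest would be kept."""
    return hashlib.sha256(daily_salt + ip.encode()).hexdigest()


def count_unique(daily_salt: bytes, ips) -> int:
    """Unique visitors for the day = number of distinct digests."""
    return len({visitor_id(daily_salt, ip) for ip in ips})


# A fresh random salt per day; the old one is thrown away.
salt_today = secrets.token_bytes(16)
salt_tomorrow = secrets.token_bytes(16)

hits = ["203.0.113.7", "203.0.113.7", "198.51.100.23"]
unique_today = count_unique(salt_today, hits)

# The same IP maps to unrelated digests on different days.
same_across_days = (
    visitor_id(salt_today, hits[0]) == visitor_id(salt_tomorrow, hits[0])
)
```

Within one day the same IP always maps to the same digest, which is what makes counting possible. Rotating the salt is what prevents linking digests from one day to the next.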
Um... hashing IPv4 addresses, even with salt, does literally nothing to anonymise (assuming the output space is at least ~32 bits, which I think is safe to assume): they’ll still be PII. IPv6 addresses I’m not so confident about; maybe it would be sufficient for some parts, but it’s definitely inadequate for some concerns.
(For IPv4, enumerating all four billion inputs is so completely practical that “one-way” is nonsense.)
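A small sketch of that enumeration, under stated assumptions: the attacker holds both the stored digest and the salt, the hash is SHA-256 over `salt + ip` (the hash actually used is not stated in the thread), and the search is limited to a /24 so it runs instantly. `hash_ip` and `recover_ip` are hypothetical names.

```python
import hashlib
import ipaddress


def hash_ip(salt: bytes, ip: str) -> str:
    """Assumed scheme: SHA-256 over salt followed by the dotted IP."""
    return hashlib.sha256(salt + ip.encode()).hexdigest()


def recover_ip(salt: bytes, target_hash: str, network: str):
    """Try every address in the network until the digest matches."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(salt, str(addr)) == target_hash:
            return str(addr)
    return None


salt = b"known-daily-salt"  # assumption: the salt leaked with the data
leaked = hash_ip(salt, "192.0.2.200")

# 256 candidates here; the full IPv4 space is 2**32 candidates, which
# is the same loop run for longer, not a different kind of problem.
recovered = recover_ip(salt, leaked, "192.0.2.0/24")
```

Checking a single suspected address is even cheaper: hash it once and compare.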
I’m almost certain this is legal theatre.
That said, the whole IP thing is weird to me. Not only are we allowed to log IPs directly for security reasons, we even *have* to log IPs in certain cases (newsletter subscriptions).
The point of designating something as PII isn't that we then _never_ store or use it, it's to carefully consider if we actually need it or not (and what protections we can add for the values we do need to store/use).
We're meant to stop the practice of just collecting and storing all data, without consideration for the harms that causes.
Information-theory-wise, this is no different to just storing the actual IP addresses (and deleting them daily after tallying, as before). It does mean that you need to obtain two things instead of just one, but if you get access to it all, it’s straightforward to reverse the lot (though computationally a little expensive), and easy to check a single value for a match.
The technique may be considered reasonable effort at protecting against casual abuse, but it’s not technically effective of itself, and it doesn’t stop the data from being PII. The important aspect is that the PII is deleted within 24 hours. My personal opinion is that the hashing part should probably be considered snake oil and whitewash, at least for what they’re claiming—I don’t say it’s useless, but it definitely doesn’t do what they’re touting it for.
Unless they’re actually keeping the hashed values for some reason after one day, and associating them with other records? In which case, disregard part of what I say, it’s obviously better than persisting IP addresses long-term! But also it’s extremely dubious to call that anonymisation as they do, because you can so often tie things together, behavioural patterns and such, to deanonymise. It’s frighteningly effective.
Lossless techniques do nothing to dilute that taint.
Lossy techniques are necessary to get anywhere, such as disregarding certain bits of the address, or Bloom filters.
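As one example of a lossy technique, here is a minimal sketch of discarding the low bits of an address before it is stored or hashed. The prefix lengths (/24 for IPv4, /48 for IPv6) are arbitrary illustrative choices, not a claim about what is legally sufficient, and `truncate_ip` is a hypothetical helper.

```python
import ipaddress


def truncate_ip(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero out the host bits so many addresses share one stored value."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets ip_network accept an address with host bits set
    # and mask them off for us.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


a = truncate_ip("203.0.113.7")
b = truncate_ip("203.0.113.250")
c = truncate_ip("2001:db8:1234:5678::1")
```

The cost is accuracy: `a` and `b` are now indistinguishable, so two visitors in the same /24 count as one.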
If I tell you the value is either 1 or 2, but I hashed it with sha256 to make it secure, that's bullshit, right? You can just hash both and see which it is.
The same concept applies regardless of the hash algo, and still applies if you have more than 2 possible values. The 4 billion or so possible IPv4 addresses are _not_ that many values to a computer.
This problem also occurs with any other restricted set of values, e.g. phone numbers and email addresses (most are at like 5 domains and are easy to guess/know).
When I researched this topic it was strange to me that no one seems to agree. Is it just armchair internet answers? Or is the letter of the law actually ambiguous? What are the real world consequences of using this when it's possible it violates GDPR? Or, what are the chances there would be consequences?
Shynet is similarly self hostable, and has a tiny footprint.
Minor bug: the "See Live Demo Dashboard" URL points to the wrong place.
Since they usually offer software via cPanel and the like, that seems unlikely unless you give the project lots of time, first to get popular enough to be on the mind of the "admin panels", and second for them to integrate it.
Besides, do people really pay 10 USD/month for shared hosting? Sounds really expensive when you can grab VPSes for half that price and run whatever software you want, not just what they've packaged for you. I guess ongoing maintenance is included in that price, but it still sounds kind of expensive for what you get.
I mean, it is quite nice to have a binary installation hosted on a single VPS, but will you support it?