this post was submitted on 20 Jun 2023
45 points (97.9% liked)

Selfhosted


Are there any Discord servers or somewhere in the Matrix to chat about hosting a Lemmy instance? I've got Lemmy running, but I think there are several of us in the same boat struggling with federation performance issues, and it might be good to have some place to chat in real time.

top 45 comments
[–] [email protected] 10 points 1 year ago (3 children)

My server is struggling with federation. Pretty much everything I see in the logs with debug turned on is this:

2023-06-20T01:55:28.018419Z WARN Error encountered while processing the incoming HTTP request: lemmy_server::root_span_builder: Header is expired

[–] xebix 9 points 1 year ago (2 children)

This is exactly what I am seeing. I just tried upping federation_worker_count in the postgres database. I saw someone in another thread mention trying that, so we'll see.
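
Roughly, the change looks like this. This is only a sketch: it assumes the setting lives in the local_site table, that the compose service is named postgres and the database user is lemmy (check your own schema and compose file first), and that the backend is restarted afterwards so it picks up the change.

# Hypothetical sketch: raise the federation worker count directly in Postgres
docker compose exec postgres psql -U lemmy -c "UPDATE local_site SET federation_worker_count = 1024;"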

[–] [email protected] 1 points 1 year ago

That guy was me, and it seems to have worked. Those errors were flooding in, and after I raised the workers to 1024+, they practically stopped, except for one every few seconds, which may not even be my server's fault.

[–] [email protected] 1 points 1 year ago

Check that your server time is synced to an NTP server and accurate. Federation requires the correct time.
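
A quick way to verify this on a systemd host (assuming timedatectl is present, and chronyc if you run chrony):

# Look for "System clock synchronized: yes"
timedatectl status
# If chrony is in use, this shows the current offset from the NTP source
chronyc tracking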

[–] [email protected] 1 points 1 year ago

Upping the worker count significantly reduced those in my case. If Lemmy happens to be maxing out your CPU as well, though, you may need to upgrade.

[–] [email protected] 1 points 1 year ago (1 children)

There is an nginx setting you can tune as well. I believe it was worker threads? Can’t remember the exact one and too tired to ssh into my instance to check.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

This post says that the worker threads only affect outbound federation. I'm struggling with my instance not receiving anything inbound.

[–] [email protected] 5 points 1 year ago (2 children)

Honestly, a lot of the performance issues aren't due to OUR servers, but the upstream servers.

beehaw.org and lemmy.world, for example: I think their servers are completely overloaded and are having issues keeping up.

I don't have sync issues with the smaller/other servers at all. Just the big ones.

I have 128G of RAM and 32 cores dedicated. I have the federation worker count set at 256. There is NO shortage of resources, and my server sits more or less idle.

Since this really only impacts the larger instances, I believe the blame may lie there.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago)

Agree on this. When I run docker-compose up and don't detach it, my instance is just constantly asking for updates, and I only get warnings from the biguns.

Honestly I am not sure how scalable this is. It would almost make more sense to federate authentication and just send you to the destination instance/community when you click a link, so you interact with it directly, rather than relying on async replication at the server level.

[–] [email protected] 3 points 1 year ago (1 children)

I think it is less about pointing fingers at who's to blame, and more about seeing if there are things we can do to resolve/alleviate it.

I recall reading somewhere that @[email protected] mentioned the server is already scaled all the way up to a fairly beefy dedicated server, so perhaps it is soon time to scale the service horizontally across multiple servers. If nothing else, I think a lot of value could be gained by moving the database to a separate server from the UI/backend server as a first step, which shouldn't take too much effort (other than recurring $ and a bit of admin time) even with the current Lemmy code base/deployment workflow...

[–] [email protected] 1 points 1 year ago (1 children)

Well, I do know most of the components do scale.

The UI/frontend, for example, can easily run as multiple instances.

The API/middle tier, I don't know if it supports horizontal scaling. But a beefy server can push a TON of traffic.

The database/backend, being Postgres, does support some horizontal scaling.

Regarding the app itself, it would scale much better if EVERYONE didn't just flock to lemmy.ml, lemmy.world, and beehaw.org. I think that is one of the huge issues.... everyone wanted to join the "big" instances.

[–] [email protected] 5 points 1 year ago (2 children)

If you look here: https://lemmy.world/comment/65982

At least specs- and capacity-wise, it doesn't suggest it is hitting a wall.

The more I dug into things, the more I think the limitation comes from an age-old issue: if your service is expected to connect to a lot of flakey destinations, you're not in for a good time. I think the big instances' backends are trying to send federation event messages while a bunch of smaller federated destinations have shuttered (because they weren't getting all the messages, their admins just went and signed up on the big instances to see everything), which leaves the big instances' outgoing connections waiting on timeouts and/or discovering the recipient is no longer available, which results in a backed-up queue of messages to send out.

When I posted a reply to myself on lemmy.world, it took 17 seconds to reach my instance (hosted in a data centre with sub-200ms ping to lemmy.world itself, so this is not a network latency issue), which exceeds the 10-second limit defined by Lemmy. Increasing it at the application protocol level won't help, because as more small instances come up, they too will want to subscribe to the big hubs, which will just further exacerbate the lag.

I think the current implementation is fairly naive and can scale a bit, but it will likely be insufficient as the fediverse grows, not just as an individual instance's user count grows. That is, the bottleneck will not be so much "this can support an instance of up to 100K users" but rather "now that there are 100K users, we also have 50K servers trying to federate with us". And to work around that, you're going to need a lot more than Postgres horizontal scaling... you'd need message buses and workers that can ensure jobs (i.e. outbound federation) are sent effectively.

[–] [email protected] 1 points 1 year ago

I agree here. I don't see federation scaling without major architecture changes. I can't see a server making 50k outbound connections (one per subscribed server) for every upvote, comment, etc.

Q: How many federated actions per user per community per day, on average? Probably a low number, say 5. But 5 * users * servers is a huge number of connections once users and servers get moderately large. 500k users and 5k servers is 12.5 billion connections, just for one community.
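
Quick sanity check on that figure, using the same assumed numbers (5 actions, 500k users, 5k servers):

# 5 actions/user/day * 500,000 users * 5,000 federated servers
echo $((5 * 500000 * 5000))   # prints 12500000000, i.e. 12.5 billion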

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

That is a VERY small server....

MY server has 32 cores, 64 threads, 256G of RAM, and 130T of storage (4T of which is NVMe).

Sheesh, that is probably why that instance is dragging!!

https://lemmy.world/post/56228

[–] [email protected] 1 points 1 year ago (1 children)

They've bumped the server well beyond the originally posted VM. I was pointing to the Zabbix charts and actual usage. Notice that CPU is sub-20% and network usage is sub-200 Mbit. There's plenty of headroom.

[–] [email protected] 1 points 1 year ago (1 children)

I found the newest link- https://lemmy.world/comment/379405

Ok, that is a pretty sizable chunk of hardware.

[–] [email protected] 1 points 1 year ago (1 children)

I care less about what it is running on than about what is actually consumed. At sub-20% usage, it really doesn't matter what the hardware is, because the overall spec is not the bottleneck.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

Your original link is from 9 days ago, before the massive surge hit.

https://lemmy.world/post/56228 came 8 days ago, with reports of it being pretty well saturated.

Remember, the big surge is in the last 3-4 days.

Fediverse stats: https://fediverse.observer/dailystats

In the last 4 days, they have gone up over 400% in size.

[–] [email protected] 1 points 1 year ago (1 children)

I don't know if you're totally missing it... here is the CPU usage from 3 hours ago: https://lemmy.world/comment/377946

Even if you 4x the usage to account for the alleged 400% growth, the spec of the server itself is not the bottleneck. They've also significantly increased the federation workers, to 10000, based on my private chat... so something is not scaling to its full potential.

I think the focus should be on why it is not using all the resources available, rather than 'that server is weak'. We're about to see a much larger influx come July 1st that's going to make the 400% growth look like a joke, and if the current larger instances aren't able to handle federation now, the current smaller instances will buckle hard when the big move comes.

[–] [email protected] 1 points 1 year ago (1 children)

Ok- sorry- I 100% missed that.

I am onboard with you now.

Hopefully the upcoming 0.18 release I keep hearing about helps compensate for a few of these issues...

[–] [email protected] 1 points 1 year ago (2 children)

From a quick skim through the commits on the master branch, I don't see many changes pertaining to federation. This one looks interesting/related, but I think it in itself only tells server admins when to increase worker counts: https://github.com/LemmyNet/lemmy/commit/25275b79eed0fb1fe90d27c197725f510f9965bb

[–] [email protected] 2 points 1 year ago (1 children)

Don't know if they changed it or not-

But I am busy making a few Kubernetes manifests for deploying Lemmy, and I am noticing a ton of extra debugging/logging that doesn't need to be there for production use.

I seriously doubt it's going to fix the issue, but reducing some of the debugging enabled by default wouldn't hurt.
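
As a rough example of what turning the logging down could look like, assuming the containers honour the standard Rust RUST_LOG filter that Lemmy's reference docker-compose appears to set (the image name here is just a placeholder):

# Hypothetical: run the backend with quieter, production-oriented logging
docker run -e RUST_LOG="warn,lemmy_server=info" your-lemmy-image:tag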

[–] [email protected] 1 points 1 year ago (1 children)

100% with you. A lot of the current deployment defaults are very development-centric... Let's pretend the default docker-compose.yml isn't opening up the Postgres DB to the internet with a generic password...
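
A quick way to sanity-check that on a host, assuming the stock compose file maps Postgres to its default port 5432:

# Is Postgres listening on a public interface?
ss -tlnp | grep 5432
# If docker-compose.yml publishes "5432:5432", switching it to
# "127.0.0.1:5432:5432" (or removing the ports mapping) keeps the DB
# off the public internet; a strong password helps either way.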

The pace at which the entire system must mature for the platform to handle the hoped-for upcoming growth is... interesting, to say the least...

[–] [email protected] 1 points 1 year ago

Well, if I can knock out a decent Helm chart for these manifests, it might actually help a bit.

Most of the components can scale quite easily on k8s. The only piece I am unsure of currently is lemmy itself.

lemmy-ui scales; it appears mostly stateless. pictrs scales. postgres scales.
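
For the stateless pieces, scaling is about as plain as it gets; assuming the manifests create a Deployment named lemmy-ui in a lemmy namespace (both names are placeholders):

# Hypothetical: add replicas of the stateless frontend
kubectl -n lemmy scale deployment lemmy-ui --replicas=3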

[–] [email protected] 4 points 1 year ago

Yeah, I’ve been selfhosting for nearly a decade and setting up Lemmy was, surprisingly, a challenge: not because it was all that difficult, but because the documentation was contradictory, out of date, or non-existent in key areas. Federation is my current hurdle, too. It would be nice to have a place to compare notes. Maybe here?

[–] [email protected] 4 points 1 year ago (2 children)

From the docs / troubleshooting:

"Also ensure that the time is accurately set on your server. Activities are signed with a timestamp, and will be discarded if it is off by more than 10 seconds."

[–] [email protected] 1 points 1 year ago

Interesting. What if the big community servers' times are off instead?

[–] [email protected] 1 points 1 year ago

Thanks for pointing this out. I got hopeful that it may be a simple fix, but unfortunately NTP is set up and synchronized.

[–] [email protected] 2 points 1 year ago (1 children)

The Matrix space has multiple rooms, one specifically for instance admins:

https://matrix.to/#/#lemmy-space:matrix.org

[–] [email protected] 1 points 1 year ago

This is perfect. Thank you!

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

There is a Matrix chat that is pretty active: https://matrix.to/#/!OwmdVYiZSXrXbtCNLw:matrix.org

[–] useful_idiot 2 points 1 year ago* (last edited 1 year ago) (1 children)

I was able to adapt the docker compose manifest into a Nomad job (yay high availability), but I am really struggling with federation. I have a domain and a proper SSL certificate, it's accessible remotely, and everything seems OK, but when I try to subscribe to other instances I get an initial load of posts and then it's just stuck in subscribe pending. Any time I try to subscribe I see this log message, which isn't exactly helpful about what to do:

2023-06-19T20:11:18.426743Z INFO Worker{worker.id=06aa9ebe-1cab-42fb-ac4b-54bbe7954ba2 worker.queue=default worker.operation.id=fe75d47d-f50d-43d6-921f-795aa50a1b68 worker.operation.name=process}:Job{execution_id=83235752-79dd-4e42-a6f5-d6e32c2e95a9 job.id=ed8bcdbd-4e78-464e-9ae0-871f3d79fd92 job.name=SendActivityTask}: activitypub_federation::core::activity_queue: Target server https://lemmy.ca/inbox rejected https://lemmy.my-domain-redacted.ca/activities/follow/c4b74591-767e-42a0-a160-5023e67c77aa, aborting

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

FWIW I see that too on several instances. I don't think it affects anything, but syncing with busier instances is a struggle, and the destination instance is not acknowledging you following it.

For example, if you look at my subscribed list you will see the following

But when I go to that page I get the following

And when I click through to the instance page I get a 500 error, then it works on refresh. And my comments are clearly struggling to stay in sync.

TL;DR: I think some of the larger instances are overloaded. It's not just user counts and traffic, but also all the backend requests to sync with other instances, which may be silently failing/timing out.

[–] useful_idiot 1 points 1 year ago* (last edited 1 year ago)

OK, I was able to resolve this issue on my end. I had set up an internal Consul-based URL during setup, and after changing the config the old URL was still lingering in the config/DB and showing up in a bunch of the JSON responses from the test URLs in the documentation. After dropping everything from Postgres and re-initializing, I seem to be as far along as everyone else (some subscriptions work, others are stuck pending, and no comments :D).

[–] [email protected] 1 points 1 year ago

I would be up for something like this. I host my own instance as well. I'm having issues updating communities though; every time I try, I get the button spinner of death. I think in the end the software is buggy and needs some time to get the bugs worked out, but it is frustrating.

[–] [email protected] 1 points 1 year ago

There is

[email protected]

... at least in theory.
