So I’ve been troubleshooting the federation issues with some other admins:

(Thanks for the help)

So what we see is that when many federation workers run at the same time, they become so slow that they time out and fail.

I had the federation worker count set to 200000. I’ve now lowered that to 8192, and set the activitypub logging level to debug to get queue stats:

RUST_LOG="warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug"
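
For anyone who wants to do the same on their own instance: the logging level is just an environment variable on the lemmy container. A rough docker-compose sketch is below (the worker count itself is a lemmy.hjson setting as far as I know, and its exact key name has changed between releases, so check the defaults file that ships with your version):

```yaml
# docker-compose.yml excerpt (sketch only, adapt to your own deployment)
services:
  lemmy:
    image: dessalines/lemmy:0.17.4   # use whatever version you actually run
    restart: always
    environment:
      # shortened here; use the full filter list from above
      - RUST_LOG=warn,activitypub_federation=debug
```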

Also, I saw that many workers were busy retrying deliveries to servers that are unreachable. So I’ve blocked some of these servers:

commallama.social,mayheminc.win,lemmy.name,lm.runnerd.net,frostbyrne.io,be-lemmy.org,lemmonade.marbledfennec.net,lemmy.sarcasticdeveloper.com,lemmy.kosapps.com,pawb.social,kbin.wageoffsite.com,lemmy.iswhereits.at,lemmy.easfrq.live,lemmy.friheter.com,lmy.rndmm.us,kbin.korgen.xyz

This gave good results: far fewer active workers, and therefore fewer timeouts. (I see that timeouts start once there are more than about 3000 active workers.)

(If you own one of these servers, let me know once it’s back up so I can unblock it.)
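
If you want to check which of these hosts still respond at all, something like this throwaway sketch works (it’s not part of Lemmy; it just probes each host’s nodeinfo endpoint once with a timeout, so "unreachable" here only means the probe failed):

```rust
// Throwaway sketch, not part of Lemmy: probe each blocked host once and report
// whether it answers at all. Needs the reqwest crate with the "blocking" feature.
use std::time::Duration;

fn main() {
    let hosts = [
        "commallama.social",
        "mayheminc.win",
        "lemmy.name",
        // ...rest of the list above
    ];
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(10))
        .build()
        .expect("failed to build HTTP client");
    for host in hosts {
        let url = format!("https://{host}/.well-known/nodeinfo");
        match client.get(&url).send() {
            Ok(resp) => println!("{host}: HTTP {}", resp.status()),
            Err(err) => println!("{host}: unreachable ({err})"),
        }
    }
}
```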

Now it’s after midnight, so I’m going to bed. More troubleshooting will surely follow tomorrow and over the weekend.

Please let me know if you see improvements, or if you’re still having a lot of issues.

  • phiresky@lemmy.world · 1 year ago

    I want to say that with 0.18 the definition of federation_workers has changed massively due to the improved queue. As in, whatever is good in 0.17 is not necessarily good for 0.18.

    On 0.18, it probably makes sense to have it around 100 to 10’000. Setting it to 0 is also an option (unlimited; that’s the default). Anything much higher is probably a bad idea.

    On 0.18, retry tasks are also split into a separate queue which should improve things in general.

    0 might cause perf issues, since every federation task becomes one async task with the same scheduling priority as any other async task (like UI / user API requests). So if 10k federation tasks and 100 API requests are running, tokio will schedule the API requests with probability 100 / (10k + 100), assuming everything is CPU-limited. (I think; I’m not 100% sure how tokio scheduling works.)
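
    (For scale: 100 / (10,000 + 100) ≈ 0.01, so under that model user-facing API requests would get only about 1% of the scheduler’s attention.)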

  • chaos@lemmy.world · 1 year ago

    Holy fuck 200k workers!? I’m not familiar with lemmy internals but I’ve literally never seen any program run anything close to well at levels that high. Want some help from someone who is a DevOps engineer by day? I think I remember you said you were a psql dba professionally so maybe my experience could help out?

  • Magiwarriorx@lemmy.world · 1 year ago

    Posted this last night, but reposting for visibility:

    To those experiencing federation issues with communities that aren’t local: make sure to properly set your languages in your profile! I thought my off-instance communities were having extremely slow federation, but the issue was that I didn’t have English as one of my profile languages.

    • ptrknvk@lemmy.world · 1 year ago

      I’ve read about it but I don’t understand. I have a list of the languages in the settings, but it shows all of them and I cannot remove or add any languages.

      • MentalEdge@sopuli.xyz · 1 year ago

        The dialog is very badly designed.

        To select the languages you want, hold Ctrl and click each one to highlight it.

        With the languages you want highlighted, save the settings.

      • Magiwarriorx@lemmy.world · 1 year ago

        The list is what you can select from, not what you have selected. If you click a language, it should be highlighted in blue. Then hit Save at the bottom of the profile page.

  • dragontamer@lemmy.world · 1 year ago

    So as of right now, https://lemmy.ca still seems bugged.

    These two show 0 comments here on lemmy.world, while comments clearly exist over at lemmy.ca.

    The opposite happens here: I made a test post at microcontroller@lemmy.ca, so lemmy.world thinks there is +1 comment. But the home instance at http://lemmy.ca/c/microcontroller sees 0 comments, so my comment failed to make it across the federation to lemmy.ca.

    So both inbound and outbound federation for these lemmy.ca communities seem bugged.


    EDIT: I should note that I was under “Subscription Pending” for days. I decided to unsubscribe (erm… stop pending?) and then re-subscribe. I’m now stuck on “Subscription Pending” again.

  • BitOneZero @ .world@lemmy.world · 1 year ago

    From what I’ve seen, there is a hard-coded 10-second timeout for HTTP requests, which seems too low for the kind of load going on, especially if the server is opening tons of connections to the same peer server.
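
    For illustration only (this is not Lemmy’s actual code, just a sketch of what a hard-coded client timeout looks like with reqwest):

    ```rust
    use std::time::Duration;

    // Sketch only: a client built like this applies the same 10-second cap to
    // every outgoing request, however overloaded the sending or receiving side is.
    fn build_client() -> Result<reqwest::Client, reqwest::Error> {
        reqwest::Client::builder()
            .timeout(Duration::from_secs(10))
            .build()
    }
    ```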

  • sunaurus@lemm.ee · 1 year ago

    Ideally, we can fix this in the software eventually (most likely it has already been improved a lot in 0.18.1 - we’ll find out for sure when lemmy.world upgrades), but for now it really does seem that defederating offline servers will massively improve the success rate of federated posts and comments reaching other instances.

      • sunaurus@lemmy.world · 1 year ago

        Yep, anything that will get your instance to stop sending activities to an unresponsive instance will help (at least for sure on 0.17.4)

  • 🌈 Denuath@lemmy.world · 1 year ago

    My feeling is that the update has improved the situation. For example, when I filter by Hot, significantly more recent posts appear than before.