Commit Graph

46 Commits

Author SHA1 Message Date
Mark Felder 9a83301ff8 Credo 2024-05-07 22:11:19 -04:00
Mark Felder df0734fcbf Increase the :max_body for Rich Media to 5MB
Websites are increasingly getting more bloated with tricks like inlining content (e.g., CNN.com) which puts pages at or above 5MB. This value may still be too low.
2024-05-07 19:54:56 -04:00
Mark Felder ede414094f RichMedia refactor
Rich Media parsing was previously handled on-demand with a 2 second HTTP request timeout and retained only in Cachex. Every time a Pleroma instance is restarted it will have to request and parse the data for each status with a URL detected. When fetching a batch of statuses they were processed in parallel to attempt to keep the maximum latency at 2 seconds, but often resulted in a timeline appearing to hang during loading due to a URL that could not be successfully reached. URLs which had images links that expire (Amazon AWS) were parsed and inserted with a TTL to ensure the image link would not break.

Rich Media data is now cached in the database and fetched asynchronously. Cachex is used as a read-through cache. When the data becomes available we stream an update to the clients. If the result is returned quickly the experience is almost seamless. Activities were already processed for their Rich Media data during ingestion to warm the cache, so users should not normally encounter the asynchronous loading of the Rich Media data.

Implementation notes:

- The async worker is a Task with a globally unique process name to prevent duplicate processing of the same URL
- The Task will attempt to fetch the data 3 times with increasing sleep time between attempts
- The HTTP request obeys the default HTTP request timeout value instead of 2 seconds
- URLs that cannot be successfully parsed due to an unexpected error receives a negative cache entry for 15 minutes
- URLs that fail with an expected error will receive a negative cache with no TTL
- Activities that have no detected URLs insert a nil value in the Cachex :scrubber_cache so we do not repeat parsing the object content with Floki every time the activity is rendered
- Expiring image URLs are handled with an Oban job
- There is no automatic cleanup of the Rich Media data in the database, but it is safe to delete at any time
- The post draft/preview feature makes the URL processing synchronous so the rendered post preview will have an accurate rendering

Overall performance of timelines and creating new posts which contain URLs is greatly improved.
2024-05-07 19:54:56 -04:00
Mark Felder 9f2319e50d RichMedia.Helpers: move the validate_page_url/1 function to the Parser module
This will ensure that the page validation happens in Parser.parse/1 so it can be called from anywhere and still filter invalid URLs.
2024-02-06 18:34:02 -05:00
Mark Felder 0cc038b67c Ensure URLs with IP addresses for the host do not generate previews 2024-02-05 00:09:37 -05:00
Mark Felder 579561e97b URI.authority is deprecated 2024-02-04 23:49:07 -05:00
Mark Felder 04fc4eddaa Fix Rich Media Previews for updated activities
The Rich Media Previews were not regenerated when a post was updated due to a cache invalidation issue. They are now cached by the activity id so they can be evicted with the other activity cache objects in the :scrubber_cache.
2024-02-04 23:47:04 -05:00
Lain Soykaf 00def0875b RichMediaTest: Use mocked config 2023-12-12 13:28:11 +04:00
lain e853cfe7c3 Revert "Merge branch 'copyright-bump' into 'develop'"
This reverts merge request !3825
2023-01-02 20:38:50 +00:00
marcin mikołajczak 10886eeaa2 Bump copyright year
Signed-off-by: marcin mikołajczak <git@mkljczk.pl>
2023-01-01 12:13:06 +01:00
Sean King 17aa3644be
Copyright bump for 2022 2022-02-25 23:11:42 -07:00
lain e1e7e4d379 Object: Rework how Object.normalize works
Now it defaults to not fetching, and the option is named.
2021-01-04 13:38:31 +01:00
Alexander Strizhakov 8d218ebaf5
Moving some background jobs into simple tasks
- fetching activity data
- attachment prefetching
- using limiter to prevent overload
2020-11-11 13:39:49 +03:00
Mark Felder ba7f9459b4 Revert Rich Media censorship for sensitive statuses
The #NSFW hashtag test was broken anyway.
2020-09-28 18:22:59 -05:00
rinpatch f70335002d RichMedia: Do a HEAD request to check content type/length
This shouldn't be too expensive, since the connections are pooled,
but it should save us some bandwidth since we won't fetch non-html
files and files that are too large for us to process (especially
since you can't cancel a request without closing the connection
with HTTP1).
2020-09-14 14:45:58 +03:00
Alexander Strizhakov 696bf09433
passing adapter options directly without adapter key 2020-09-07 19:59:17 +03:00
Alexander Strizhakov a83916fdac
adapter options unification
not needed options deletion
2020-09-07 19:59:17 +03:00
rinpatch e198ba492e Rich Media: Do not cache URLs for preview statuses
Closes #1987
2020-09-05 20:53:46 +03:00
Alexander Strizhakov 79f65b4374
correct pool and uniform headers format 2020-09-02 09:16:51 +03:00
Mark Felder 016d8d6c56 Consolidate construction of Rich Media Parser HTTP requests 2020-08-03 12:37:31 -05:00
lain 781b270863 ChatMessageReferenceView: Display preview cards. 2020-07-30 19:57:26 +02:00
lain 5b1eeb06d8 Revert "Merge branch 'revert-2b5d9eb1' into 'develop'"
This reverts merge request !2784
2020-07-21 22:18:17 +00:00
lain 696c13ce54 Revert "Merge branch 'linkify' into 'develop'"
This reverts merge request !2677
2020-07-21 22:17:34 +00:00
Alex Gleason 8daacc9114
AutoLinker --> Linkify, update to latest version
https://git.pleroma.social/pleroma/elixir-libraries/linkify
2020-06-30 16:39:15 -05:00
Egor Kislitsyn 520367d6fd Fix atom leak in Rich Media Parser 2020-06-13 12:08:46 +03:00
Mark Felder 3bf78f2be7 Fix Oban not receiving :ok from RichMediaHelper job 2020-04-14 11:43:53 -05:00
Mark Felder 05da5f5cca Update Copyrights 2020-03-03 16:44:49 -06:00
Maksim Pechnikov 5c0f646cef fix validate_page_url 2019-06-26 06:27:17 +03:00
Maksim Pechnikov 4ad15ad2a9 add ignore hosts and TLDs for rich_media 2019-06-25 22:25:37 +03:00
Maksim Pechnikov 0276cf5a02 fix validate_url for private ip 2019-06-25 17:44:24 +03:00
rinpatch f30a3241d2 Deps: Update auto_linker 2019-06-18 16:08:18 +03:00
Egor Kislitsyn bf22ed5fbd Update `auto_linker` dependency 2019-06-12 15:53:33 +07:00
William Pitcock 0da1233e8e rich media: suppress link previews if post is marked as sensitive 2019-05-17 18:49:43 +00:00
William Pitcock 57d11ac9db activitypub: move post rich media fetching to job queue 2019-05-13 19:36:00 +00:00
William Pitcock c62220c500 rich media: helpers: only crawl Create activities 2019-03-23 02:28:59 +00:00
William Pitcock b3bf523c09 rich media: use optimized Object.normalize() 2019-03-23 00:22:57 +00:00
Haelwenn (lanodan) Monnier a3a9cec483
[Credo] fix Credo.Check.Readability.AliasOrder 2019-03-13 04:26:54 +01:00
William Pitcock b7aa1ea9e6 rich media: helpers: rework validate_page_url() 2019-03-04 18:39:13 +00:00
William Pitcock 9f3cb38012 helpers: use AutoLinker to validate URIs as well as the other tests 2019-03-04 18:31:49 +00:00
William Pitcock d38d537bee rich media: don't crawl bogus URIs 2019-03-04 18:31:49 +00:00
Haelwenn (lanodan) Monnier 6a6a5b3251
de-group alias/es 2019-02-09 16:31:17 +01:00
lain b19b4f8537 Remove default value for rich media.
Setting it to true will actually override a 'false' set before.
2019-01-31 20:02:08 +01:00
rinpatch 7057891db6 Make rich media support toggleable 2019-01-31 18:18:20 +03:00
William Pitcock ddb5545202 rich media: kill some testsuite noise 2019-01-28 20:55:33 +00:00
William Pitcock ebeabdcc72 rich media: helpers: clean up unused aliases 2019-01-28 06:10:25 +00:00
William Pitcock 8e42251e06 rich media: add helpers module, use instead of MastodonAPI module 2019-01-28 06:04:54 +00:00