spc-pleroma/lib/pleroma/html.ex

# Pleroma: A lightweight social networking server
# Copyright © 2017-2022 Pleroma Authors <https://pleroma.social/>
# SPDX-License-Identifier: AGPL-3.0-only

defmodule Pleroma.HTML do
  # Scrubbers are compiled on boot so they can be configured in OTP releases
  #  @on_load :compile_scrubbers

  def compile_scrubbers do
    dir = Path.join(:code.priv_dir(:pleroma), "scrubbers")

    dir
    |> Pleroma.Utils.compile_dir()
    |> case do
      {:error, _errors, _warnings} ->
        raise "Compiling scrubbers failed"

      {:ok, _modules, _warnings} ->
        :ok
    end
  end

  defp get_scrubbers(scrubber) when is_atom(scrubber), do: [scrubber]
  defp get_scrubbers(scrubbers) when is_list(scrubbers), do: scrubbers
  defp get_scrubbers(_), do: [Pleroma.HTML.Scrubber.Default]

  def get_scrubbers do
    Pleroma.Config.get([:markup, :scrub_policy])
    |> get_scrubbers()
  end

  def filter_tags(html, nil) do
    filter_tags(html, get_scrubbers())
  end

  def filter_tags(html, scrubbers) when is_list(scrubbers) do
    Enum.reduce(scrubbers, html, fn scrubber, html ->
      filter_tags(html, scrubber)
    end)
  end

  def filter_tags(html, scrubber) do
    # Raises MatchError if the sanitizer does not return {:ok, content}
    {:ok, content} = FastSanitize.Sanitizer.scrub(html, scrubber)
    content
  end

  def filter_tags(html), do: filter_tags(html, nil)
  def strip_tags(html), do: filter_tags(html, FastSanitize.Sanitizer.StripTags)

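  # Scrubs `content`, passes the result through `callback`, and tags it:
  # `{:ignore, content}` when `fake` is truthy, `{:commit, content}` otherwise.
  def ensure_scrubbed_html(content, scrubbers, fake, callback) do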
    content =
      content
      |> filter_tags(scrubbers)
      |> callback.()

    if fake do
      {:ignore, content}
    else
      {:commit, content}
    end
  end

  @spec extract_first_external_url_from_object(Pleroma.Object.t()) :: String.t() | nil
  def extract_first_external_url_from_object(%{data: %{"content" => content}})
      when is_binary(content) do
    content
    |> Floki.parse_fragment!()
    |> Floki.find("a:not(.mention,.hashtag,.attachment,[rel~=\"tag\"])")
    |> Enum.take(1)
    |> Floki.attribute("href")
    |> Enum.at(0)
  end

  def extract_first_external_url_from_object(_), do: nil
end