I swear I'll do a better write-up sometime soon.
Duplicate feeds are ignored. Domains that block scraping via robots.txt will be skipped.
Any page will work; it doesn't need to be the home page.
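The urls.txt seed file used below is assumed to be plain text, one site URL per line (the URLs here are placeholders):

```shell
# create a hypothetical seed file: one page per site, any page works
cat > urls.txt <<'EOF'
https://example.com/about
https://blog.example.org
EOF
wc -l urls.txt
```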
Use the ExternalUrlsSpider crawler. The output is a .jsonl file with the
rss_url and domain of each website listed.
# on the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl

Import the resulting .jsonl file into the backend's database using
the flask import-feeds command.
# on the backend/ directory
uv run flask import-feeds rss.jsonl

If you have a plain-text list of RSS URLs instead (e.g. rss_urls.txt),
format it to .jsonl before importing with flask import-feeds:
# on the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl

Always use the full list of imported feed_urls. Order them randomly to
reduce the chances of hammering a small provider.
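To sanity-check the jq conversion above, here is what it does to a two-line sample (assuming jq is installed; the URLs are placeholders):

```shell
printf 'https://example.com/feed.xml\nhttps://example.org/rss\n' > rss_urls.txt
# -R reads raw lines, -n with [inputs] collects them all,
# and map({rss_url: .}) | .[] emits one compact JSON object per line
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
cat rss.jsonl
```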
# on the backend/ directory
sqlite3 brcrawl.sqlite3
.output ../website/feeds.txt
SELECT feed_url FROM feeds WHERE status_id = 1 ORDER BY RANDOM();
.output stdout
.quit

Now use the generated feeds.txt file to run the build.sh command from the
website directory.
# on the website/ directory
./build.sh feeds.txt

The resulting .html files can be deployed (e.g. via GitHub Pages or a VPS
with nginx).
Filter by verified status and apply an additional date filter, e.g.:
SELECT feed_url FROM feeds WHERE status_id = 1 AND created_at > '2026-02-12 10:00:00';
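The interactive session above can also be run as a single non-interactive sqlite3 command. A sketch on a scratch database (the feeds table schema is inferred from the queries above; the real database is brcrawl.sqlite3 in backend/, and the real output path is ../website/feeds.txt):

```shell
# build a scratch database with the inferred schema and two sample rows
sqlite3 demo.sqlite3 "CREATE TABLE feeds (feed_url TEXT, status_id INTEGER, created_at TEXT);
INSERT INTO feeds VALUES
  ('https://a.example/feed', 1, '2026-03-01 09:00:00'),
  ('https://b.example/feed', 2, '2026-03-01 09:00:00');"

# one-shot export, equivalent to the .output session above
sqlite3 demo.sqlite3 \
  "SELECT feed_url FROM feeds WHERE status_id = 1 AND created_at > '2026-02-12 10:00:00' ORDER BY RANDOM();" \
  > feeds.txt
cat feeds.txt
```

Only the verified (status_id = 1), recently created feed survives the filter.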