For anyone who’s ever wanted to keep their own copies of the mp3 audio files hosted on a website, this question is for you.
Is there a way to download every audio file on a website? This would include subdirectories.
I used the DownThemAll extension to do this, and it works except for one flaw: when the site has a link that goes to a separate page before you can open the audio file, it won’t work.
So I need a tool that can detect media in the subdirectories of a site and download it.
I think I understand. You can try the --no-parent flag, but I think you need to toy with the --level and --cut-dirs flags as well: wget -r --level=1 -nH --cut-dirs=1 --no-parent -A '*.mp3' <URL>
Wget doesn’t seem ideal here, as the site’s navigation requires HTML parsing to get to the files.
I’m using HTTrack, which works great, as well as Cyotek WebCopy. Both let you filter and download specific file extensions. They do download a chunk of the actual site, because they are website copy tools, but removing those parts is easy once everything is downloaded.
Internet Download Manager works best: it scans the site first and lets you select all the files you want to download. But it’s paid software, and I only need to do this on rare occasions.
What I finally got to work well was a script (actually ChatGPT wrote most of it) that scrapes a site and collects all the mp3 links into a text file. I can then feed that text file to JDownloader and download them.
It actually worked rather well.
The other software kind of worked, but it missed too much and didn’t download all of them.
Would you be okay with sharing that script (the one that populates a text file with “hits” for a given file extension)?
I might have a use for it too for batch downloads.
You don’t have to share it, and I can understand if it has references to a particular website. (Which could alternatively be “settable” as a variable.)
I’d like to play with that, because I think you could pass that text file to a command-line tool like wget or fetch or curl and have next to no interaction.
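For example, if the list ends up as one URL per line in a file like links.txt (just an assumed name here), something like this should need next to no interaction:

# download every URL listed in links.txt, saving into downloads/
wget -i links.txt -P downloads/

# or the curl equivalent, one request per line
xargs -n 1 curl -O < links.txt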
I can share it here. It’s a work in progress, really. There are some errors that pop up from lynx, but the script seems to skip over them and continue anyway.
Right now it will not crawl a link if its domain does not match the original domain you gave it.
Usage is crawl.sh mywebsite.com
#!/bin/bash
#set -x

# Check if a URL is provided as an argument
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <URL>"
    exit 1
fi

# Set base URL
BASE_URL=$1

# Output file for links
OUTPUT_FILE="downloadable_links.txt"
if [ ! -f "$OUTPUT_FILE" ]; then
    touch "$OUTPUT_FILE"
fi

# File to keep track of visited URLs
VISITED_FILE="visited_urls.txt"

# Always reset visited file
: > "$VISITED_FILE"

crawl() {
    local url="$1"

    # Check if the URL has already been visited
    if grep -Fxq "$url" "$VISITED_FILE"; then
        return
    fi

    # Mark this URL as visited
    echo "$url" >> "$VISITED_FILE"

    # Get MP3 links from the current URL and extract full URLs
    lynx -dump -nonumbers -force_html "$url" | grep -Eo "http[^ ]+\.mp3" | while read -r mp3_url; do
        # Check if the MP3 link is already in the output file
        if ! grep -Fxq "$mp3_url" "$OUTPUT_FILE"; then
            echo "$mp3_url" >> "$OUTPUT_FILE"
            echo "added $mp3_url to download list"
        fi
    done

    # Get all links to follow
    if [[ $url == *${BASE_URL}* ]]; then
        lynx -dump -nonumbers -force_html "$url" | grep -Eo "$BASE_URL/[^ ]+" | sort -u | while read -r link; do
            crawl "$link"
        done
    fi
}

# Start crawling from the base URL
crawl "$BASE_URL"

# Count the number of collected links and print the result
link_count=$(wc -l < "$OUTPUT_FILE")
echo "Finished. Collected $link_count links."
It should also let you continue where you left off in case the script times out or fails for some reason.
I’m now at a point where you can specify the URL as the first argument, and the file extension you are looking for as the second.
It also now only follows a link if it doesn’t have a “.*” extension (i.e. isn’t a file), which is nice because a huge mp3 file can take a long time to “crawl”. Files just get added to the list instead.
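Roughly, the relevant bits look something like this (a simplified sketch of those two changes, not the full script):

# take the URL as $1 and the extension as $2 (defaulting to mp3)
BASE_URL=$1
EXT=${2:-mp3}

# inside crawl(): match the requested extension instead of hard-coding .mp3
lynx -dump -nonumbers -force_html "$url" | grep -Eo "http[^ ]+\.${EXT}"

# and only recurse into links whose last path segment has no dot (i.e. not a file)
if [[ "${link##*/}" != *.* ]]; then
    crawl "$link"
fi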
I’m hoping to get to a point where one can specify multiple extensions as well as keywords to ignore when crawling.
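I’m guessing that could be done with an alternation in the grep pattern plus a small skip list, something like this (untested sketch, names made up):

# untested sketch: multiple extensions and keywords to ignore
EXTS="mp3|m4a|flac"           # hypothetical extension list, joined with |
IGNORE="logout|share|print"   # hypothetical keywords to skip while crawling

# match any of the listed extensions
grep -Eo "http[^ ]+\.(${EXTS})"

# skip links containing any ignore keyword before recursing
if ! grep -Eq "$IGNORE" <<< "$link"; then
    crawl "$link"
fi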
ChatGPT is helpful, but I do have to polish most of it fairly heavily.