For anyone who’s ever wanted to keep their own copies of the mp3 audio files hosted on a website, this question is for you.
Is there a way to download every audio file on a website? This would include subdirectories.
I used the DownThemAll extension to do this, and it works except for one flaw: when the site has a link that goes to a separate page before you can open the audio file, it won’t work.
So I need a tool that can detect media in the subdirectories of a site and download it.
I think I understand. You can try the --no-parent flag, but I think you need to toy with the --level and --cut-dirs flags as well: wget -r --level=1 -nH --cut-dirs=1 --no-parent -A '*.mp3' <URL>
Wget doesn’t seem ideal here, as the site’s navigation requires HTML parsing to get to the files.
I’m using HTTrack, which works great, as well as Cyotek WebCopy. Both let you filter and download specific file extensions. They do download a chunk of the actual site, because they are website copy tools, but removing those parts is easy once everything is downloaded.
Internet Download Manager works best: it scans the site first and lets you select all the files you want to download. But it’s paid software, and I only need to do this on rare occasions.
What I finally got to work well was a script (actually ChatGPT wrote most of it) that scrapes a site and collects all the mp3 links into a text file. I can then feed that text file to JDownloader and download them.
It actually worked rather well.
The other software kind of worked, but it missed too much and didn’t download all of them.
Would you be okay with sharing that script (the one that populates a text file with “hits” for a given file extension)?
I might have a use for it too for batch downloads.
You don’t have to share it, and I can understand if it has references to a particular website. (Which could alternatively be “settable” as a variable.)
I’d like to play with that, because I think you could pass that text file to a command-line tool like wget or fetch or curl and have next to no interaction.
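For example, if the list ends up as one URL per line in a file like links.txt (just an assumed name here), something like this should need next to no interaction:

# download every URL listed in links.txt, saving into downloads/
wget -i links.txt -P downloads/

# or the curl equivalent, one request per line
xargs -n 1 curl -O < links.txt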
I can share it here. It’s a work in progress, really. There are some errors that pop up from lynx, but the script seems to skip over them and continue anyway.
Right now it will not crawl a link if its domain does not match the original domain you gave it.
Usage is crawl.sh mywebsite.com
#!/bin/bash
#set -x

# Check if a URL is provided as an argument
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <URL>"
    exit 1
fi

# Set base URL
BASE_URL=$1

# Output file for links
OUTPUT_FILE="downloadable_links.txt"
if [ ! -f "$OUTPUT_FILE" ]; then
    touch "$OUTPUT_FILE"
fi

# File to keep track of visited URLs
VISITED_FILE="visited_urls.txt"

# Always reset visited file
: > "$VISITED_FILE"

crawl() {
    local url="$1"

    # Check if the URL has already been visited
    if grep -Fxq "$url" "$VISITED_FILE"; then
        return
    fi

    # Mark this URL as visited
    echo "$url" >> "$VISITED_FILE"

    # Get MP3 links from the current URL and extract full URLs
    lynx -dump -nonumbers -force_html "$url" | grep -Eo "http[^ ]+\.mp3" | while read -r mp3_url; do
        # Check if the MP3 link is already in the output file
        if ! grep -Fxq "$mp3_url" "$OUTPUT_FILE"; then
            echo "$mp3_url" >> "$OUTPUT_FILE"
            echo "added $mp3_url to download list"
        fi
    done

    # Get all links to follow
    if [[ $url == *${BASE_URL}* ]]; then
        lynx -dump -nonumbers -force_html "$url" | grep -Eo "$BASE_URL/[^ ]+" | sort -u | while read -r link; do
            crawl "$link"
        done
    fi
}

# Start crawling from the base URL
crawl "$BASE_URL"

# Count the number of collected links and print the result
link_count=$(wc -l < "$OUTPUT_FILE")
echo "Finished. Collected $link_count links."
It should also let you continue where you left off in case the script times out or fails for some reason.
I’m now at a point where you can specify the URL as the first argument, and the file extension you are looking for as the second.
It also now only follows a link if it doesn’t have a “.*” extension (i.e. isn’t a file), which is nice because a huge mp3 file can take a long time to “crawl”. Files just get added to the list instead.
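Roughly, the relevant bits look something like this (a simplified sketch of those two changes, not the full script):

# take the URL as $1 and the extension as $2 (defaulting to mp3)
BASE_URL=$1
EXT=${2:-mp3}

# inside crawl(): match the requested extension instead of hard-coding .mp3
lynx -dump -nonumbers -force_html "$url" | grep -Eo "http[^ ]+\.${EXT}"

# and only recurse into links whose last path segment has no dot (i.e. not a file)
if [[ "${link##*/}" != *.* ]]; then
    crawl "$link"
fi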
I’m hoping to get to a point where one can specify multiple extensions as well as keywords to ignore when crawling.
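I’m guessing that could be done with an alternation in the grep pattern plus a small skip list, something like this (untested sketch, names made up):

# untested sketch: multiple extensions and keywords to ignore
EXTS="mp3|m4a|flac"           # hypothetical extension list, joined with |
IGNORE="logout|share|print"   # hypothetical keywords to skip while crawling

# match any of the listed extensions
grep -Eo "http[^ ]+\.(${EXTS})"

# skip links containing any ignore keyword before recursing
if ! grep -Eq "$IGNORE" <<< "$link"; then
    crawl "$link"
fi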
ChatGPT is helpful, but I do have to polish most of it fairly heavily.