
HTML to Markdown (html2md)

HTML to Markdown conversion tool.

Tested with fiquipedia.es

A bit of history

This is a fork of the gsites2md project, a tool to migrate Google Sites pages to Markdown. The original project includes some features that apply only to fiquipedia.es, so I forked it to provide a more generic, site-independent tool.

Running on the command line

Converts an HTML file, or a folder and its contents, into Markdown files.

Execution:
	python HTML2mdCLI.py -s <input_file_or_folder> -d <destination_path>
	
where:
	-h, --help: Print this help.
	-s, --source <source_path>: (Mandatory) Source file or folder.
	-d, --dest <dest_path>: (Mandatory) Destination file or folder.
	-u, --url: (Optional) Use the page title, the level-1 header, or the last section of the URL as the link description (only applied when a link's URL and its description are identical). NOTE: this option can be slow.
	-t, --timeout <seconds>: (Optional) Timeout, in seconds, for the connections used in link validation, e.g. "2" seconds. The default is unlimited.
	-m, --multiline: (Optional) Support multiline content in table cells. (WARNING: Google Sites may use internal tables in the HTML that do not look like tables to the user. Use at your own risk!)
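For example, to convert a downloaded copy of a site with URL descriptions enabled and a 5-second link-validation timeout (the paths below are only illustrative):

	python HTML2mdCLI.py -s ./www.fiquipedia.es -d ./output -u -t 5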

Setting up your development environment

These are some recommended readings for setting up a local development environment using PyCharm:

Download a copy of a website

This application needs a local copy of a website to use as input. The source HTML will be converted to Markdown.

Prerequisites on Linux (Ubuntu/Debian)

Install 'wget'

$ sudo apt-get install wget

Prerequisites on Mac

Install Homebrew

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install 'wget'

$ brew install wget

Using 'wget' to download a local copy of a website

wget --content-disposition --recursive -p http://www.fiquipedia.es
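Depending on the site, a couple of additional standard wget options can be useful (optional, and not required by html2md): --convert-links rewrites links in the downloaded pages so they work locally, and --level limits the recursion depth for a quicker test run:

wget --content-disposition --recursive -p --convert-links --level=2 http://www.fiquipedia.es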

URL parameters in file names downloaded by wget

If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to honor that header for the final filename is as simple as:

wget --content-disposition

Otherwise, you can run the following script to strip the URL parameters that wget appends to file names:

#!/bin/bash
# Strip URL query strings (everything from the first '?') from the names
# of all files under the folder passed as the first argument.
find "$1" -type f | while IFS= read -r i
do
    output="${i%%\?*}"
    if [ "$i" != "$output" ]
    then
        mv "$i" "$output"
    else
        echo "Skipping $i"
    fi
done
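
To use it, save the script to a file (the name below is just an example), make it executable, and pass the download folder as its only argument:

$ chmod +x remove_url_params.sh
$ ./remove_url_params.sh www.fiquipedia.es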