HTML to Markdown conversion tool.
Tested with fiquipedia.es
This is a fork of the gsites2md project, a tool to migrate Google Sites pages to Markdown. The original project includes some features that only apply to fiquipedia.es, so I decided to fork it to provide a more generic, site-independent tool.
Converts an HTML file or folder (and its contents) into Markdown.
Execution:
python HTML2mdCLI.py -s <input_file_or_folder> -d <destination_path>
where:
-h, --help: Print this help
-s, --source <source_path>: (Mandatory) source file or folder
-d, --dest <dest_path>: (Mandatory) destination file or folder
-u, --url: (Optional) Use the page title, the level 1 header or the last section of the URL as the URL description (only when the URL link and its description are the same). NOTE: This option can be slow.
-t, --timeout <seconds>: (Optional) Timeout, in seconds, for link validation connections, e.g. "2" seconds. The default is unlimited.
-m, --multiline: (Optional) Support multiline content in table cells. (WARNING: Google Sites may use internal tables in the HTML that do not look like tables to the user. Use at your own risk!)
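When the converter has to be launched from another Python script (for example, to convert several sites in a row), the options listed above can be assembled programmatically. The following is only a sketch: the helper name `build_html2md_command` and the example paths are made up for illustration and are not part of this project; only the flags come from the help text above.

```python
def build_html2md_command(source, dest, url=False, timeout=None, multiline=False):
    """Return an argument list for HTML2mdCLI.py (hypothetical helper)."""
    # Mandatory options: source and destination.
    command = ["python", "HTML2mdCLI.py", "-s", str(source), "-d", str(dest)]
    if url:
        # -u: use page title / level 1 header / last URL section as description.
        command.append("-u")
    if timeout is not None:
        # -t: timeout in seconds for link validation connections.
        command += ["-t", str(timeout)]
    if multiline:
        # -m: support multiline content in table cells.
        command.append("-m")
    return command


if __name__ == "__main__":
    # Example paths are assumptions, not paths required by the tool.
    print(" ".join(build_html2md_command("www.fiquipedia.es", "markdown", timeout=2)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the folder that contains HTML2mdCLI.py.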
These are some recommended readings in order to set up a local environment using PyCharm;
This application needs a local copy of a website to use as input. The source HTML will be converted to Markdown.
$ apt-get install wget
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
$ brew install wget
wget --content-disposition --recursive -p http://www.fiquipedia.es
If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:
wget --content-disposition
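To illustrate what that header carries, here is a small sketch that extracts the suggested file name from a Content-Disposition value using Python's standard `email.message` module. The function name and the sample header values are assumptions for illustration; this is not how wget or this project handle the header internally.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the file name advertised by a Content-Disposition value, or None."""
    message = Message()
    message["Content-Disposition"] = header_value
    # get_filename() reads the "filename" parameter of Content-Disposition.
    return message.get_filename()


if __name__ == "__main__":
    # Sample header values, made up for this example.
    print(filename_from_content_disposition('attachment; filename="index.html"'))
    print(filename_from_content_disposition("inline"))
```

When the server sends no file name in the header, the function returns `None` and the downloaded file keeps the name derived from the URL, query string included.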
If the server does not send that header, run the following script, passing the download folder as its first argument, to remove the URL parameters that wget appends to the file names:
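#!/bin/bash
# Remove the URL parameters ("?...") that wget appends to file names.
# Usage: pass the download folder as the first argument.
find "$1" -type f -print0 | while IFS= read -r -d '' i
do
    # Keep everything before the first "?"
    output="${i%%\?*}"
    if [ "$i" != "$output" ]
    then
        mv -- "$i" "$output"
    else
        echo "Skipping $i"
    fi
done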
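For readers who prefer to stay in Python, the same clean-up can be sketched with `pathlib`. This is an illustrative alternative, not part of the project: the function name `strip_query_from_filenames` is hypothetical, it only looks at the file name (not at folder names), and unlike the shell script it skips a file when the target name already exists instead of overwriting it.

```python
from pathlib import Path


def strip_query_from_filenames(root):
    """Rename files under root so each name ends before its first '?'.

    Returns a list of (old_path, new_path) tuples for the renamed files.
    """
    renamed = []
    # sorted() materializes the listing so renames do not disturb the walk.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or "?" not in path.name:
            continue
        new_name = path.name.split("?", 1)[0]
        if not new_name:
            # The name starts with '?': nothing sensible to rename to.
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Assumption: keep the existing file rather than overwrite it.
            continue
        path.rename(target)
        renamed.append((path, target))
    return renamed


if __name__ == "__main__":
    import sys

    for old, new in strip_query_from_filenames(sys.argv[1]):
        print(f"{old} -> {new}")
```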