# Multi-threaded Link Checker

Let us use our new knowledge to create a multi-threaded link checker. It should
start at a webpage and check that links on the page are valid. It should
recursively check other pages on the same domain and keep doing this until all
pages have been validated.

For this, you will need an HTTP client such as [`reqwest`][1]. Create a new
Cargo project and add `reqwest` as a dependency with:

```shell
$ cargo new link-checker
$ cd link-checker
$ cargo add --features blocking,rustls-tls reqwest
```

> If `cargo add` fails with `error: no such subcommand`, then please edit the
> `Cargo.toml` file by hand. Add the dependencies listed below.

You will also need a way to find links. We can use [`scraper`][2] for that:

```shell
$ cargo add scraper
```
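To get a feel for the `scraper` API before wiring everything together, here is
a minimal sketch of pulling `href` attributes out of a page. The function name
`print_links` and the inline HTML are only illustrative; the `extract_links`
helper included in the program below does the real work.

```rust,compile_fail
use scraper::{Html, Selector};

fn print_links(html: &str) {
    // Parse the raw HTML into a queryable document.
    let document = Html::parse_document(html);
    // A CSS selector matching every anchor element.
    let selector = Selector::parse("a").unwrap();
    for element in document.select(&selector) {
        // Not every <a> carries an href attribute, so this is an Option.
        if let Some(href) = element.value().attr("href") {
            println!("{href}");
        }
    }
}

fn main() {
    print_links(r#"<a href="https://www.google.org/">Google.org</a>"#);
}
```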
Finally, we'll need some way of handling errors. We will use [`thiserror`][3]
for that:

```shell
$ cargo add thiserror
```
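For a taste of what `thiserror` buys you, here is a sketch of one possible
error type for the link checker. The variant names are only a suggestion, not
necessarily what your solution will need; the point is that `#[derive(Error)]`
writes the `std::error::Error` boilerplate, `#[error(...)]` generates the
`Display` implementation, and `#[from]` lets `?` convert the underlying errors
for you.

```rust,compile_fail
use thiserror::Error;

// A possible error type for the link checker; the names are illustrative.
#[derive(Error, Debug)]
enum Error {
    // #[from] lets `?` convert a reqwest::Error into this variant.
    #[error("request error: {0}")]
    ReqwestError(#[from] reqwest::Error),
    #[error("bad http response: {0}")]
    BadResponse(String),
}

// With the #[from] conversion in place, `?` works directly on reqwest calls.
fn fetch(url: &str) -> Result<String, Error> {
    let response = reqwest::blocking::get(url)?;
    Ok(response.text()?)
}
```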
The `cargo add` calls will update the `Cargo.toml` file to look like this:

```toml
[dependencies]
reqwest = { version = "0.11.12", features = ["blocking", "rustls-tls"] }
scraper = "0.13.0"
thiserror = "1.0.37"
```

You can now download the start page. Try with a small site such as
`https://www.google.org/`.

Your `src/main.rs` file should look something like this:

```rust,compile_fail
{{#include link-checker.rs:setup}}

{{#include link-checker.rs:extract_links}}

fn main() {
    let start_url = Url::parse("https://www.google.org").unwrap();
    let response = get(start_url).unwrap();
    match extract_links(response) {
        Ok(links) => println!("Links: {links:#?}"),
        Err(err) => println!("Could not extract links: {err:#}"),
    }
}
```
Run the code in `src/main.rs` with

```shell
$ cargo run
```

## Tasks
* Use threads to check the links in parallel: send the URLs to be checked to a
  channel and let a few threads check the URLs in parallel (a rough sketch of
  this pattern follows the list).
* Extend this to recursively extract links from all pages on the
  `www.google.org` domain. Put an upper limit of 100 pages or so, so that you
  don't end up being blocked by the site.
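Here is a minimal sketch of the fan-out for the first task, assuming you keep
the blocking `reqwest` client. The worker count, the `(Url, bool)` result type
and the single seed URL are arbitrary choices for illustration; feeding the
links you extract back into the URL channel, staying on the `www.google.org`
domain and stopping after roughly 100 pages are left for you to add.

```rust,compile_fail
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

use reqwest::blocking::Client;
use reqwest::Url;

fn main() {
    // One channel carries URLs to the workers, another carries results back.
    let (url_tx, url_rx) = mpsc::channel::<Url>();
    let (result_tx, result_rx) = mpsc::channel::<(Url, bool)>();

    // An mpsc receiver cannot be cloned, so the workers share it via a mutex.
    let url_rx = Arc::new(Mutex::new(url_rx));

    for _ in 0..4 {
        let url_rx = Arc::clone(&url_rx);
        let result_tx = result_tx.clone();
        thread::spawn(move || {
            let client = Client::new();
            loop {
                // The guard is a temporary dropped at the end of this
                // statement, so the lock is not held while fetching.
                let url = match url_rx.lock().unwrap().recv() {
                    Ok(url) => url,
                    Err(_) => break, // channel closed: no more work
                };
                // A real checker would also look at the response status.
                let ok = client.get(url.clone()).send().is_ok();
                let _ = result_tx.send((url, ok));
            }
        });
    }
    // Drop the main thread's sender clones so the channels can close.
    drop(result_tx);

    url_tx.send(Url::parse("https://www.google.org").unwrap()).unwrap();
    drop(url_tx);

    // The iterator ends when every worker has exited and dropped its sender.
    for (url, ok) in result_rx {
        println!("{url}: {}", if ok { "ok" } else { "broken" });
    }
}
```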
[1]: https://docs.rs/reqwest/
[2]: https://docs.rs/scraper/
[3]: https://docs.rs/thiserror/