# Multi-threaded Link Checker

Let us use our new knowledge to create a multi-threaded link checker. It should
start at a webpage and check that links on the page are valid. It should
recursively check other pages on the same domain and keep doing this until all
pages have been validated.

For this, you will need an HTTP client such as [`reqwest`][1]. Create a new
Cargo project and add `reqwest` as a dependency with:

```shell
$ cargo new link-checker
$ cd link-checker
$ cargo add --features blocking,rustls-tls reqwest
```

> If `cargo add` fails with `error: no such subcommand`, then please edit the
> `Cargo.toml` file by hand. Add the dependencies listed below.

You will also need a way to find links. We can use [`scraper`][2] for that:

```shell
$ cargo add scraper
```
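To get a feel for the `scraper` API before wiring everything together, here is
a minimal sketch of pulling `href` attributes out of a page. The function name
`print_links` and the inline HTML are only illustrative; the `extract_links`
helper included in the program below does the real work.

```rust,compile_fail
use scraper::{Html, Selector};

fn print_links(html: &str) {
    // Parse the raw HTML into a queryable document.
    let document = Html::parse_document(html);
    // A CSS selector matching every anchor element.
    let selector = Selector::parse("a").unwrap();
    for element in document.select(&selector) {
        // Not every <a> carries an href attribute, so this is an Option.
        if let Some(href) = element.value().attr("href") {
            println!("{href}");
        }
    }
}

fn main() {
    print_links(r#"<a href="https://www.google.org/">Google.org</a>"#);
}
```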
Finally, we'll need some way of handling errors. We will use [`thiserror`][3]
for that:

```shell
$ cargo add thiserror
```
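For a taste of what `thiserror` buys you, here is a sketch of one possible
error type for the link checker. The variant names are only a suggestion, not
necessarily what your solution will need; the point is that `#[derive(Error)]`
writes the `std::error::Error` boilerplate, `#[error(...)]` generates the
`Display` implementation, and `#[from]` lets `?` convert the underlying errors
for you.

```rust,compile_fail
use thiserror::Error;

// A possible error type for the link checker; the names are illustrative.
#[derive(Error, Debug)]
enum Error {
    // #[from] lets `?` convert a reqwest::Error into this variant.
    #[error("request error: {0}")]
    ReqwestError(#[from] reqwest::Error),
    #[error("bad http response: {0}")]
    BadResponse(String),
}

// With the #[from] conversion in place, `?` works directly on reqwest calls.
fn fetch(url: &str) -> Result<String, Error> {
    let response = reqwest::blocking::get(url)?;
    Ok(response.text()?)
}
```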
The `cargo add` calls will update the `Cargo.toml` file to look like this:

```toml
[dependencies]
reqwest = { version = "0.11.12", features = ["blocking", "rustls-tls"] }
scraper = "0.13.0"
thiserror = "1.0.37"
```

You can now download the start page. Try with a small site such as
`https://www.google.org/`.

Your `src/main.rs` file should look something like this:

```rust,compile_fail
{{#include link-checker.rs:setup}}

{{#include link-checker.rs:extract_links}}

fn main() {
    let start_url = Url::parse("https://www.google.org").unwrap();
    let response = get(start_url).unwrap();
    match extract_links(response) {
        Ok(links) => println!("Links: {links:#?}"),
        Err(err) => println!("Could not extract links: {err:#}"),
    }
}
```
Run the code in `src/main.rs` with

```shell
$ cargo run
```

## Tasks
* Use threads to check the links in parallel: send the URLs to be checked to a
  channel and let a few threads check the URLs in parallel (a rough sketch of
  this pattern follows the list).
* Extend this to recursively extract links from all pages on the
  `www.google.org` domain. Put an upper limit of 100 pages or so, so that you
  don't end up being blocked by the site.
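Here is a minimal sketch of the fan-out for the first task, assuming you keep
the blocking `reqwest` client. The worker count, the `(Url, bool)` result type
and the single seed URL are arbitrary choices for illustration; feeding the
links you extract back into the URL channel, staying on the `www.google.org`
domain and stopping after roughly 100 pages are left for you to add.

```rust,compile_fail
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

use reqwest::blocking::Client;
use reqwest::Url;

fn main() {
    // One channel carries URLs to the workers, another carries results back.
    let (url_tx, url_rx) = mpsc::channel::<Url>();
    let (result_tx, result_rx) = mpsc::channel::<(Url, bool)>();

    // An mpsc receiver cannot be cloned, so the workers share it via a mutex.
    let url_rx = Arc::new(Mutex::new(url_rx));

    for _ in 0..4 {
        let url_rx = Arc::clone(&url_rx);
        let result_tx = result_tx.clone();
        thread::spawn(move || {
            let client = Client::new();
            loop {
                // The guard is a temporary dropped at the end of this
                // statement, so the lock is not held while fetching.
                let url = match url_rx.lock().unwrap().recv() {
                    Ok(url) => url,
                    Err(_) => break, // channel closed: no more work
                };
                // A real checker would also look at the response status.
                let ok = client.get(url.clone()).send().is_ok();
                let _ = result_tx.send((url, ok));
            }
        });
    }
    // Drop the main thread's sender clones so the channels can close.
    drop(result_tx);

    url_tx.send(Url::parse("https://www.google.org").unwrap()).unwrap();
    drop(url_tx);

    // The iterator ends when every worker has exited and dropped its sender.
    for (url, ok) in result_rx {
        println!("{url}: {}", if ok { "ok" } else { "broken" });
    }
}
```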
[1]: https://docs.rs/reqwest/
[2]: https://docs.rs/scraper/
[3]: https://docs.rs/thiserror/