# Multi-threaded Link Checker
Let us use our new knowledge to create a multi-threaded link checker. It should
start at a webpage and check that links on the page are valid. It should
recursively check other pages on the same domain and keep doing this until all
pages have been validated.

For this, you will need an HTTP client such as [`reqwest`][1]. Create a new
Cargo project and add `reqwest` as a dependency with:
```shell
$ cargo new link-checker
$ cd link-checker
$ cargo add --features blocking,rustls-tls reqwest
```
> If `cargo add` fails with `error: no such subcommand`, then please edit the
> `Cargo.toml` file by hand. Add the dependencies listed below.
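
If you have not used `reqwest` before, a minimal sketch of a blocking GET
request looks like this, assuming the `blocking` feature enabled above (this
snippet is not part of the exercise code):

```rust,compile_fail
fn main() {
    // A blocking GET request; the `blocking` feature lets us avoid async here.
    let response = reqwest::blocking::get("https://www.google.org").unwrap();
    println!("Fetched {} bytes", response.text().unwrap().len());
}
```
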
You will also need a way to find links. We can use [`scraper`][2] for that:
```shell
$ cargo add scraper
```
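
To give a feel for the `scraper` API, a rough sketch of pulling `href`
attributes out of a page could look like this; the `anchor_hrefs` helper is
only for illustration, and the exercise uses the included `extract_links`
function shown below:

```rust,compile_fail
use scraper::{Html, Selector};

// Illustrative helper (not part of the exercise): collect the `href`
// attribute of every `<a>` element in an HTML document.
fn anchor_hrefs(html: &str) -> Vec<String> {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a").unwrap();
    document
        .select(&selector)
        .filter_map(|element| element.value().attr("href"))
        .map(String::from)
        .collect()
}

fn main() {
    let html = r#"<a href="https://www.google.org/">Google.org</a>"#;
    println!("{:?}", anchor_hrefs(html));
}
```
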
Finally, we'll need some way of handling errors. We use [`thiserror`][3] for that:
```shell
$ cargo add thiserror
```
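
To show how `thiserror` is typically used, an error type for this kind of
program could look like the sketch below; the variant names are illustrative,
and the included setup code defines its own type:

```rust,compile_fail
use thiserror::Error;

// Illustrative error type: `#[error(...)]` generates the `Display` impl and
// `#[from]` generates a `From<reqwest::Error>` impl so `?` can convert errors.
#[derive(Error, Debug)]
enum Error {
    #[error("request error: {0}")]
    ReqwestError(#[from] reqwest::Error),
    #[error("bad http response: {0}")]
    BadResponse(String),
}
```
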
The `cargo add` calls will update the `Cargo.toml` file to look like this:
```toml
[dependencies]
reqwest = { version = "0.11.12", features = ["blocking", "rustls-tls"] }
scraper = "0.13.0"
thiserror = "1.0.37"
```
You can now download the start page. Try with a small site such as
`https://www.google.org/`.

Your `src/main.rs` file should look something like this:
```rust,compile_fail
{{#include link-checker.rs:setup}}
{{#include link-checker.rs:extract_links}}

fn main() {
    let start_url = Url::parse("https://www.google.org").unwrap();
    let response = get(start_url).unwrap();
    match extract_links(response) {
        Ok(links) => println!("Links: {links:#?}"),
        Err(err) => println!("Could not extract links: {err:#}"),
    }
}
```
Run the code in `src/main.rs` with
```shell
$ cargo run
```
## Tasks
* Use threads to check the links in parallel: send the URLs to be checked to a
  channel and let a few threads check the URLs in parallel. A rough sketch of
  this fan-out pattern is shown after the list.
* Extend this to recursively extract links from all pages on the
  `www.google.org` domain. Put an upper limit of 100 pages or so, so that you
  don't end up being blocked by the site.
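
One possible starting point for the first task is sketched below. The worker
threads share a single `mpsc` receiver behind a mutex, and the hypothetical
`check_link` function stands in for the real fetching and link extraction; the
full exercise also needs to collect results and feed newly found links back
into the channel.

```rust,compile_fail
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

use reqwest::Url;

// Hypothetical checker: in the exercise this would fetch the page and
// extract its links; here it only reports whether the request succeeded.
fn check_link(url: &Url) -> bool {
    reqwest::blocking::get(url.clone()).is_ok()
}

fn main() {
    let (sender, receiver) = mpsc::channel::<Url>();
    // An `mpsc::Receiver` cannot be cloned, so the workers share it via a mutex.
    let receiver = Arc::new(Mutex::new(receiver));

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let receiver = Arc::clone(&receiver);
            thread::spawn(move || loop {
                // Hold the lock only while receiving the next URL.
                let url = receiver.lock().unwrap().recv();
                match url {
                    Ok(url) => println!("{url}: ok = {}", check_link(&url)),
                    Err(_) => break, // The sender was dropped: no more work.
                }
            })
        })
        .collect();

    sender.send(Url::parse("https://www.google.org").unwrap()).unwrap();
    drop(sender); // Close the channel so the workers exit.
    for worker in workers {
        worker.join().unwrap();
    }
}
```
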
[1]: https://docs.rs/reqwest/
[2]: https://docs.rs/scraper/
[3]: https://docs.rs/thiserror/