open...: crawlers

18 January 2008

For People with Large Data Sets

This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.

19 November 2007

What's a Paglo?

That was my first question to Brian de Haaff, CEO of the eponymous company. This is what he said (more or less):

Francisco Paglo was a virtually unknown Italian explorer who first set sail as a lookout on Cadamosto's expedition to the Gambia River in 1455. Upon completion of a distance learning course in creative writing, he published a stirring account of the exploration from his viewpoint in the crow's nest, which was widely published throughout Europe. It ultimately caught the eye of Prince Henry the Navigator who was a Portuguese royal prince, soldier, and patron of explorers. Prince Henry summoned Paglo, and thanks to his generous funding, sent him on an expedition around Africa's Cape of Good Hope in 1460 to trade for spices in India. A storm pushed him off his target, and he finally dropped anchor in what is now known as New Zealand.

He never did set foot in India, but in New Zealand he remains a hero for bringing the country its first sheep, and his birthday (April 1) is celebrated every year with giant mutton pies. A growing movement has petitioned the government to officially establish the day as a national holiday, Dandy Mutton Day, in reverent appreciation for Paglo. On the eve of March 31 each year, children leave tiny bales of hay in their family rooms, hoping for the safe return of his ghost to their home and a flock of sheep for their family. Those who have been good the preceding year and have prepared fresh bales receive a bowl of lamb stew and freshly-knit wool socks and sweaters from their parents. But poor behavior and unkempt bales are frowned upon as a sign of disrespect, and these unfortunate kids receive a clump of manure.

And this is what the company does:

Paglo is a search engine for IT that specializes in searching the complex and varied data of IT networks, and in returning rich data reports in table and chart formats, as well as simple text hit lists.

As someone who was smitten with search engines ever since the early days of Lycos, WWWW and Inktomi, I was naturally highly receptive to this approach. Search has become the optic through which we see the digital world; applying it not just to traditional information, but also to corporate IT data is eminently sensible.

Things only got better when I found out that the search engine crawler was open source (GNU GPL to be precise). This makes a lot of sense. It means that people can add extra features to it, through plugins, to allow discovery of all kinds of new and wacky hardware and software; it also means that people are more likely to trust it to wander around their intranets, gathering a lot of extremely sensitive information.

That information is encrypted and sent back to Paglo, where it is stored on the company's servers as a searchable index of your IT assets. Now, obviously security is paramount here. I also worry about people turning up with a subpoena: after all, those search indexes will provide extremely useful information about unlicensed copies of software and the like. Paglo, not surprisingly, doesn't think this will be a problem.

There are other interesting aspects of Paglo, including its use of what it calls "social solving":

We do this by allowing all users to save their search queries and publish them for anyone’s use. The elegance here is that you can immediately access any query that’s been saved and made public, and run it against your own data. (Only the query syntax is published. The data itself, of course, is private to each user.) This is especially helpful when you need a query that searches out a complex relationship – such as between users and the applications they have installed on their desktops – and you do not know where to start. The permutations are endless, but since the core concept is the same, any saved query can be used against any set of network data.
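To make the idea concrete, here is a minimal Python sketch of a shared query being run against two separate, private data sets. It is purely illustrative: the `SavedQuery` class, the record fields and the sample inventories are all hypothetical, and none of this is Paglo's actual query syntax or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record type: one row of discovered network data.
Record = dict[str, str]


@dataclass(frozen=True)
class SavedQuery:
    """A published query: only the logic is shared, never anyone's data."""
    name: str
    predicate: Callable[[Record], bool]

    def run(self, records: list[Record]) -> list[Record]:
        # Apply the shared logic to whatever data the caller supplies.
        return [r for r in records if self.predicate(r)]


# A query saved and made public by one user (hypothetical example).
users_with_app = SavedQuery(
    name="users with Firefox installed",
    predicate=lambda r: r.get("application") == "Firefox",
)

# Two other users each run it against their own private inventory.
alice_inventory = [
    {"user": "alice", "application": "Firefox"},
    {"user": "bob", "application": "Emacs"},
]
carol_inventory = [
    {"user": "carol", "application": "Firefox"},
]

alice_hits = users_with_app.run(alice_inventory)
carol_hits = users_with_app.run(carol_inventory)
```

The point of the design is that the query object carries no data of its own, so publishing it reveals nothing about the network it was first written for.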

But in many ways, the most interesting aspect of Paglo is its business model:

We are maniacally focused on delivering the most value, for the most users, as quickly as possible. To achieve this, we are removing barriers to getting started (like complex installation and cost) and making the service convenient to use. Our experience and the history of the Internet tell us that lots and lots of thrilled users of a free service are much more valuable than a handful of paying customers. If we are successful, you will love Paglo, use it daily, and tell your colleagues and friends.

Yup, that means that they don't have one, but they're really, really sure that if everyone uses them, they can find one. Of course, that's precisely what Google did, so there are precedents - but no guarantees. Let's hope the final business plan proves more credible than the explanation of the company name.

27 March 2006

Searching for an Answer

I have always been fascinated by search engines. Back in March 1995, I wrote a short feature about the new Internet search engines - variously known as spiders, worms and crawlers at the time - that were just starting to come through:

As an example of the scale of the World-Wide Web (and of the task facing Web crawlers), you might take a look at Lycos (named after a spider). It can be found at the URL http://lycos.cs.cmu.edu/. At the time of writing its database knew of a massive 1.75 million URLs.

(1.75 million URLs - imagine it.)

A few months later, I got really excited by a new, even more amazing search engine:

The latest pretender to the title of top Web searcher is called Alta Vista, and comes from the computer manufacturer Digital. It can be found at http://www.altavista.digital.com/, and as usual costs nothing to use. As with all the others, it claims to be the biggest and best and promises direct access to every one of 8 billion words found in over 16 million Web pages.

(16 million pages - will the madness never end?)

My first comment on Google, in November 1998, by contrast, was surprisingly muted:

Google (home page at http://google.stanford.edu/) ranks search result pages on the basis of which pages link to them.

(Google? - it'll never catch on.)

I'd thought that my current interest in search engines was simply a continuation of this story, a historical relict, bolstered by the fact that Google's core services (not some of its mickey-mouse ones like Google Video - call that an interface? - or Google Finance - is this even finished?) really are of central importance to the way I and many people now work online.

But upon arriving at this page on the OA Librarian blog, all became clear. Indeed, the title alone explained why I am still writing about search engines in the context of the opens: "Open access is impossible without findability."

Ah. Of course.

Update: Peter Suber has pointed me to an interesting essay of his looking at the relationship between search engines and open access. Worth reading.

Creative Commons CC0 — “No Rights Reserved”
