Spidering Hacks

I fielded a couple of questions this week about search engine safe URLs, both along the lines of a) how do you create them? and b) are they even worth it? I’ve written before about how you can create them using Apache, but one thing I didn’t mention is that writing your own spider, or at least attempting to, is a great first step toward understanding why search engine safe URLs are important. To that end, I’d suggest the “Spidering Hacks” book that O’Reilly just released as a great starting point. The book uses Perl quite extensively, but it’s the process that matters. I’ve picked up “Programming Spiders, Bots, and Aggregators in Java” at Barnes and Noble quite a few times as well, but have never pulled the trigger.

If you’d rather read code, you can download the spider/indexing engine I’ve been working on (was working on!) to get some kind of idea of what goes into a spider.
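To give a feel for the problem without downloading anything, here is a minimal sketch in Python (not the Perl from the book, and not my engine). It pulls the links out of a page and sorts them by whether they carry a query string. The no-query-string policy, the names LinkExtractor and split_links, and the example.com URLs are all assumptions made up for illustration; real spiders vary in how they treat such URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def split_links(html, base_url):
    """Return (followed, skipped) under the assumed no-query-string policy."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    followed, skipped = [], []
    for link in parser.links:
        # Assumption: this toy spider refuses any URL with a query string.
        if urlparse(link).query:
            skipped.append(link)
        else:
            followed.append(link)
    return followed, skipped


# Hypothetical page with one "safe" link and one query-string link.
PAGE = """
<a href="/books/spidering-hacks/">safe</a>
<a href="/index.cfm?id=42">unsafe</a>
"""

followed, skipped = split_links(PAGE, "http://example.com/")
print("followed:", followed)
print("skipped:", skipped)
```

Under that assumed policy the second page never gets indexed, which is the whole argument for rewriting it into a path-style URL.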

Notes on Peer-To-Peer: Harnessing the Power of Disruptive Technologies

Peer-To-Peer (amazon, oreilly) is an old book by internet standards (published in March of 2001), but chock full of interesting thoughts and perspectives.

· On Gnutella, did you know that you can watch what other people are searching for? The book has a screenshot of Gnutella client v0.56; I use Gnucleus, where you can do the same thing by clicking Tools -> Statistics -> Log. Betcha you didn’t know that an entire company was founded on that idea, did you?

· footnote: Chaffing and Winnowing: Confidentiality without Encryption by Ronald L. Rivest

· On the ‘small-world’ model and the importance of bridges: “.. The key to understanding the result lies in the distribution of links within social networks. In any social grouping, some acquaintances will be relatively isolated and contribute few new contacts, whereas others will have more wide-ranging connections and be able to serve as bridges between far-flung social clusters. These bridging vertices play a critical role in bringing the network closer together… It turns out that the presence of even a small number of bridges can dramatically reduce the lengths of paths in a graph, …” Reference: “Collective dynamics of ‘small-world’ networks”, published in Nature (download the PDF).

· Publius looks like some cool software

· Crowds: “Crowds is a system whose goals are similar to that of mix networks but whose implementation is quite different. Crowds is based on the idea that people can be anonymous when they blend into a crowd. As with mix networks, Crowds users need not trust a single third party in order to maintain their anonymity. A crowd consists of a group of web surfers all running the Crowds software. When one crowd member makes a URL request, the Crowds software on the corresponding computer randomly chooses between retrieving the requested document or forwarding the request to a randomly selected member of the crowd….” Read more about Crowds at the official site.

· Tragedy of the Commons. This idea was mentioned in chapter 16 on Accountability and is talked about in various other books I’ve read, but I’m not sure that I ever recorded the source. The idea came from Garrett Hardin in a paper written in 1968 called “The Tragedy of the Commons”, which you can read on his site.

· On accountability and how we really *are* all just six degrees apart. Read the PGP Web of Trust statistics if you don’t believe it.

· On reputation systems, specifically Advogato’s Trust Metric

· On reputation scoring systems, a good system “… will possess many of the following qualities:

  • Accurate for long-term performance: The system reflects the confidence (the likelihood of accuracy) of a given score. It can also distinguish between a new entity of unknown quality and an entity with bad long-term performance.
  • Weighted toward current behavior: The system recognizes and reflects recent trends in entity performance. For instance, an entity that has behaved well for a long time but suddenly goes downhill is quickly recognized and no longer trusted.
  • Efficient: It is convenient if the system can recalculate a score quickly. Calculations that can be performed incrementally are important.
  • Robust against attacks: The system should resist attempts of any entity or entities to influence scores other than by being more honest or having higher quality.
  • Amenable to statistical evaluation: It should be easy to find outliers and other factors that can make the system rate scores differently.
  • Private: No one should be able to learn how a given rater rated an entity except the rater himself.
  • Smooth: Adding any single rating or small number of ratings doesn’t jar the score much.
  • Understandable: It should be easy to explain to people who use these scores what they mean — not only so they know how the system works, but so they can evaluate for themselves what the score implies.
  • Verifiable: A score under dispute can be supported with data.”

· On reputation scoring systems: “Vulnerabilities from overly simple scoring systems are not limited to “toy” systems like Instant Messenger. Indeed, eBay suffers from a similar problem. In eBay, the reputation score for an individual is a linear combination of good and bad ratings, one for each transaction. Thus, a vendor who has performed dozens of transactions and cheats on only 1 out of every 4 customers will have a steadily rising reputation, whereas a vendor who is completely honest but has done only 10 transactions will be displayed as less reputable. As we have seen, a vendor could make a good profit (and build a strong reputation!) by being honest for several small transactions and then being dishonest for a single large transaction.”

· The book was written when Reputation Technologies was still a distinct company, but I thought this list of Reputation and Asset Management vendors was interesting in that reputation is something that is becoming more and more important. For instance, when was the last time you purchased something from eBay where the vendor had a bad rating? Never, right? Did you ever stop to think about how the vendor in question got a bad rating? Since when is eBay a good judge of someone’s character? Why do we trust eBay’s reputation algorithms?

· On the optimal size of an organization: “Business theorists have observed that the ability to communicate broadly and deeply through the Internet at low cost is driving a process whereby large businesses break up into a more competitive system of smaller component companies. They call this process ‘deconstruction.’ This process is an example of Coase’s Law, which states that other things being equal, the cost of a transaction — negotiating, paying, dealing with errors or fraud — between firms determines the optimal size of the firm. When business transactions between firms are expensive, it’s more economical to have larger firms, even though larger firms are considered less efficient because they are slower to make decisions. When transactions are cheaper, small firms can replace the larger integrated entity.”

· “Why Johnny Can’t Encrypt: A Usability Evaluation of PGP 5.0 (pdf)” from Alma Whitten
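The arithmetic behind the eBay note above is easy to check. Here is a small Python sketch; the quote only says the score is “a linear combination of good and bad ratings”, so the +1/-1 weights, the function names, and the two example vendors (40 transactions with 1 in 4 bad, versus 10 clean ones) are my own assumptions for illustration.

```python
def linear_score(good, bad):
    """Net feedback: +1 per good rating, -1 per bad rating (assumed weights)."""
    return good - bad


def positive_fraction(good, bad):
    """Share of ratings that are good; one possible alternative measure."""
    total = good + bad
    return good / total if total else 0.0


# Hypothetical vendors, not real data.
cheater = {"good": 30, "bad": 10}  # 40 transactions, cheats on 1 in 4
honest = {"good": 10, "bad": 0}    # 10 transactions, never cheats

cheater_score = linear_score(**cheater)
honest_score = linear_score(**honest)

print("linear score:", cheater_score, "vs", honest_score)
print("positive fraction:",
      positive_fraction(**cheater), "vs", positive_fraction(**honest))
```

With these numbers the cheater shows a linear score of 20 against the honest vendor’s 10, even though only 75% of the cheater’s ratings are good. The fraction is just one alternative, and on its own it still misses several qualities from the list above, such as confidence for new entities.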

Notes on “Things A Computer Scientist Rarely Talks About”

I picked up “Things A Computer Scientist Rarely Talks About” by Donald Knuth at Barnes & Noble a couple weeks back on a whim after spending 45 minutes looking through the fascinating science/technology section at the back of the Natick store. (sidenote: some Barnes and Nobles have fabulous science/technology/computer science/engineering sections with rows and rows of books… and some have “JavaScript for Dummies”. Why is that?)

It’s not a book about computer science but is rather the transcribed text of his series of public lectures about interactions between faith and computer science (which you can view online). Couple quotes I deemed noteworthy for one reason or another:

· On page 28 he talks about how he used randomization when grading papers while teaching at Stanford. Reminder to read up on “zero knowledge proofs” sometime.

· The basis of his lectures was a book he wrote called “3:16 Bible Texts Illuminated” which aimed to gain insight into the Bible by taking 59 random snapshots (verses) and studying them in detail. His son was inspired indirectly by this book: “… to start up the H-20 project, which is designed to answer the question ‘What is Massachusetts?’ … He and my daughter have a book of maps of Massachusetts at a large scale; they live fairly near campus, at coordinates H-20 in the relevant map of Cambridge. So they’re going to try and visit H-20 on all the other pages of their book. That should give terrific insights into the real nature of Massachusetts.”

· On learning: “… I learned that the absolute best way to find out what you don’t understand is to try to express something in your own words. If I had been operating only in input mode, looking at other translations but not actually trying to output the thoughts they expressed, I would never have come to grips with the many shades of meaning that lurk just below the surface. In fact, I would never have realized that such shades of meaning even exist, if I had just been inputting. The exercise of producing output, trying to make a good translation by yourself, is a tremendous help to your education.”

· A quote from Peter Gomes at the beginning of his book called “The Good Book”: “… The notion that [the texts of the Bible] have meaning and integrity, intention, contexts and subtexts, and that they are part of an enormous history of interpretation that has long involved some of the greatest thinkers in the history of the world, is a notion often lost on those for whom the text is just one more of the many means the church provides to massage the egos of its members.”

· One of the questions asked about Douglas Hofstadter’s book “Le Ton Beau de Marot: In Praise of the Music of Language”.

· “My experience suggests that the optimum way to run a research think tank would be to take people’s nice offices away from them and to make them live in garrets, and even to insist that they do non-researchy things. That’s a strange way to run a research center, but it might well be true that the imposition of such constraints would bring out maximum creativity.” — after mentioning that he was able to come up with several relatively important ideas (attribute grammars, Knuth-Bendix completion, LL(k) parsing) during the “most hectic year of his life”.

· On aesthetics according to C. S. Peirce: “Aesthetics deals with things that are admirable; ethics deals with things that are right or wrong; logic deals with things that are true or false.”

· “Somehow the whole idea of art and aesthetics and beauty underlies all the scientific work I do. Whatever I do, I try to do it in a way that has some elegance; I try to create something that I think is beautiful. Instead of just getting a job done, I prefer to do my work in a way that pleases me in as many senses as possible…. I like especially to be associated with art, in the sense of making things of beauty.”

· Planet Without Laughter: “.. It’s a marvelous parable on many levels, about the limits of rationality. You can read it to get insight about all religions, and about the question of form over substance in religion.”

· Eugene Wigner, a Princeton physicist: “It is good that the completion of our scientific work is an unattainable ideal. Striving toward it is attracting many of us, and gives much pleasure and satisfaction… If science were completed, the satisfaction which research, the furthering of human knowledge, had provided, would disappear. Also, even more men would strive for power and domination…. We know that there are facts and insights which we cannot communicate to animals — no animal is familiar, for instance, with the associative law of multiplication… Is it not possible that our understanding of nature also has limitations?… I hope that, even if this should be true, we will be able to continue the extension of our knowledge indefinitely, … even if the limit thereof will always remain widely separated from the complete knowledge and understanding of nature.”

· On artificial life: “… the Game of Life illustrates the power of evolutionary mechanisms. Stable configurations arise out of random soup, usually very quickly; and many of those configurations have properties analogous to biological organisms.”

· Stuart Sutherland, in the 1996 edition of the International Dictionary of Psychology: “Consciousness: The having of perceptions, thoughts and feelings; awareness. The term is impossible to define except in terms that are unintelligible without a grasp of what consciousness means. Consciousness is a fascinating but elusive phenomenon: it is impossible to specify what it is, what it does, or why it evolved. Nothing worth reading has ever been written on it.”

Mobile Usability: How Nokia Changed the Face of the Mobile Phone

This book slapped me in the face last night as I was walking through the computer books section of B&N: Mobile Usability: How Nokia Changed the Face of the Mobile Phone. As described on Amazon, the book “.. explains the philosophies and working methods by which Nokia revolutionized product usability, written by current and former Nokia employees. Includes practical guidance on how to provide maximum usability to all end-users.” Looks like a great read for mobile device software designers and programmers.

Emergence: From Chaos to Order

I finished Emergence: From Chaos to Order a couple weeks ago. Snippets that I want to remember include:

On model building: “For most of us model building starts at an early age. As children we use building blocks to generate concrete realizations of our imagination — castles and space stations. This facility for recombining standard objects to make new items carries over into later occupations. A watchmaker uses familiar mechanisms — gear wheels, spring, pinions and so on — to generate marvels of timekeeping, and a scientist does the same thing at a more abstract level, generating complex objects, such as molecules, from simpler objects, atoms. By selecting the building blocks and the ways of recombining them, we set up the rules that make rule-governed systems comprehensible. A well conceived model will exhibit the complexity and emergent phenomena of the system being modeled, but with much of the detail sheared away.” [pg 12]

In relation to the checker playing model, the emergent consequences of weight changing: “Subgoals: Though it seems that the evaluation function makes no provisions for subgoals, in fact it does, providing subtle direction when there is no clear path to a win or an obvious advantage. Anticipating the opponent: The checkersplayer must impute a strategy to the opponent if it is to anticipate the opponent’s actions; the valuation function can serve as a guide to the opponent’s likely response. Toward minimax: The valuation function only indirectly minimizes the maximum damage the opponent can inflict (minimax), yet it captures important elements of this idea. Bootstrapping: The checkersplayer can improve its performance by playing against itself. Lookahead: Knowing the rules of the game, the checkersplayer can look ahead several moves using its model of the other player, changing weights on the basis of the anticipated outcomes.” [pg 68-69]

On anticipating the opponent: “An undesirable outcome can only occur because a) the checkersplayer has made a bad play, or b) the opponent has made a good play. In either case, the checkersplayer is well advised to assign a low (possibly negative) value to configurations that lie in that direction. On the other hand, a desirable outcome presents an ambiguous situation. The desirable outcome may occur because of good play on the part of the checkersplayer, but it can also occur because of poor play on the part of the opponent. If the outcome is the result of an opponent’s poor play, then it is unwise to make any adjustments. That line of play is unlikely to recur, either because the opponent has learned or because a different player plays a better game.” [pg 70-71]

On neural nets and transmitter molecules: “Transmitter molecules diffuse across the synaptic gap to the surface of the receiving neuron; if enough transmitter molecules from enough synapses accumulate at the surface of a neuron, that neuron fires. In doing so, it effectively removes the molecules from the synaptic gap. If this happens repeatedly, the synapse increases its ability to produce the transmitter molecule…. The synapse has increased its effectiveness (weight)…” [pg 95]

On the mechanisms of a neural net: “Variable threshold. A neuron’s threshold decreases as the time since it last fired increases. This decreasing threshold makes the neuron increasingly sensitive to incoming pulses when it remains quiescent over an extended period. The variable threshold allows the neuron to act as a frequency modulator, firing at a rate that reflects the average synapse-weighted strength of the impulses firing on its surface. Fatigue. A neuron that fires at a high rate over an extended period has its threshold steadily incremented — in effect the whole variable threshold curve is translated upward. Contrariwise, a neuron that fires at a low rate over an extended period of time has its threshold steadily decremented. Fatigue eventually forces a neuron’s firing rate back to a ‘normal’ or ‘set-point’ level: no neuron can continue to fire at a rate above or below this set-point. Hebb’s rule. If neuron x fires at time t and neuron y fires at time t+1, then any synapses that x’s axon make at y are strengthened. Contrariwise, if x fires at time t and y does not fire at time t+1, the same synapses are weakened.” [pg 108]
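Hebb’s rule as quoted is simple enough to write down directly. This is a rough Python sketch of just that one rule (not the variable threshold or fatigue mechanisms); the dictionary layout, the name hebb_update, and the step size of 0.25 are my own illustrative assumptions, not anything taken from the book.

```python
def hebb_update(weights, fired_at_t, fired_at_t_plus_1, rate=0.25):
    """Apply one step of Hebb's rule as quoted above.

    weights[x][y] is the strength of the synapse from neuron x to neuron y.
    fired_at_t and fired_at_t_plus_1 are sets of neuron names.
    Returns a new weights dict; the input is left unchanged.
    """
    updated = {x: dict(targets) for x, targets in weights.items()}
    for x, targets in weights.items():
        if x not in fired_at_t:
            # The rule as quoted only covers the case where x fired at time t.
            continue
        for y in targets:
            if y in fired_at_t_plus_1:
                updated[x][y] += rate  # x fired, then y fired: strengthen
            else:
                updated[x][y] -= rate  # x fired, y stayed quiet: weaken
    return updated


# Hypothetical three-neuron net: a feeds b and c, b feeds c.
weights = {"a": {"b": 0.5, "c": 0.5}, "b": {"c": 0.5}}
new_weights = hebb_update(weights, fired_at_t={"a"}, fired_at_t_plus_1={"b"})
print(new_weights)
```

Here a fired and b followed, so the a-to-b synapse goes up to 0.75; c stayed quiet, so a-to-c drops to 0.25; b did not fire at time t, so b-to-c is untouched.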

On emergence: “… The behavior of an ant colony is not the simple sum of the behaviors of a group of average ants. The coupled interactions of the ants provide a coherence to the nest that far exceeds anything predictable in terms of simple summations. Emergence is above all a product of coupled, context-dependent interactions. Technically these interactions, and the resulting system, are nonlinear. The behavior of the overall system cannot be obtained by summing the behavior of its constituent parts. We can no more truly understand strategies in a board game by compiling statistics of the movements of its pieces than we can understand the behavior of any ant colony in terms of averages.” [pg 121-122]

On cultivating innovation: “Practice. A part of the answer to all who strive within a discipline, be it tennis, piano playing, writing poetry, or building a scientific model. The answer lies within the word itself: discipline. Only when you are so familiar with the elements (building blocks) of your discipline that you no longer have to think about how they are combined, do you enter the creative phase. If you are a tennis player and you have to concentrate on the elements of each stroke, you will have little appreciation of the flow of the game — your opponent’s strengths, weaknesses and strategy. If you play the piano and have to concentrate on fingering, you will not hear the flow of the music, the ‘long line’. Local concerns drive out global perceptions, and so it is with other disciplines.” [pg 211-212]

More on innovation: “Once a set of building blocks has been chosen, innovation depends on selection from among the plethora of potential combinations. The possibilities are so numerous that the same building blocks can be used over and over again without seriously impairing the chances for original discoveries. Think of the standard building blocks provided by words in a language, or folk themes in music. The key to handling this complexity is the discovery of salient patterns in the tree of combinations. Creative individuals exhibit talent for such selection, but the mechanisms they employ are largely unknown.” [pg 217-218]

Prisoner’s Dilemma

Finished Prisoner’s Dilemma: John von Neumann, Game Theory and the Puzzle of the Bomb today, a rainy miserable Saturday just like every other Saturday the last couple weekends here in Massachusetts. When will the weather get better? As before, I like to post interesting quotes:

“… Jacob Bronowski wrote in 1973, ‘You must see that in a sense all science, all human thought, is a form of play. Abstract thought is the neoteny of the intellect, by which man is able to carry out activities which have no immediate goal in order to prepare himself for long-term strategies and plans.'” [pg 39]

“It’s no exaggeration to say that society is founded on cooperation. Whether to litter — leave a tip — shoplift — stop and help someone — lie — conserve electricity — etc., etc. — all are dilemmas of individual gain and the common good. Some commentators have speculated that irrational cooperation is the cornerstone of society, and without it life would be, as Hobbes put it, ‘solitary, poor, nasty, brutish, and short.'” [pg 227]

On the TIT FOR TAT strategy that Robert Axelrod used in the iterated prisoner’s dilemma game: “… one of the more surprising findings was that TIT FOR TAT won without ever exploiting another strategy. ‘We tend to compare our scores to other people’s scores,’ he explained. ‘But that’s not the way to get a good score. TIT FOR TAT can’t beat anybody, but it still wins the tournament. That’s a very bizarre idea. You can’t win a chess tournament by never beating anybody.’” [pg 241]
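The “can’t beat anybody, but still wins” idea is easy to see in a toy tournament. Below is a small Python sketch, not a reconstruction of Axelrod’s actual tournament: the field of four strategies, the 10-round matches, the lack of self-play, and the commonly used 5/3/1/0 payoffs are all my own assumptions, since the quote gives none of them.

```python
from itertools import combinations

# Payoffs as (first player, second player). C = cooperate, D = defect.
# Assumed values: temptation 5, reward 3, punishment 1, sucker 0.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(mine, theirs):
    """Cooperate first, then copy the opponent's previous move."""
    return theirs[-1] if theirs else "C"


def suspicious_tit_for_tat(mine, theirs):
    """Defect first, then copy the opponent's previous move."""
    return theirs[-1] if theirs else "D"


def grim(mine, theirs):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in theirs else "C"


def always_defect(mine, theirs):
    return "D"


def play_match(a, b, rounds=10):
    """Play a fixed number of rounds and return both total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = a(hist_a, hist_b)
        move_b = b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def round_robin(strategies, rounds=10):
    """Every strategy plays every other once; no self-play."""
    totals = {s.__name__: 0 for s in strategies}
    head_to_head = {}
    for a, b in combinations(strategies, 2):
        score_a, score_b = play_match(a, b, rounds)
        totals[a.__name__] += score_a
        totals[b.__name__] += score_b
        head_to_head[(a.__name__, b.__name__)] = (score_a, score_b)
    return totals, head_to_head


FIELD = [tit_for_tat, grim, always_defect, suspicious_tit_for_tat]
totals, head_to_head = round_robin(FIELD)
print(totals)
print(head_to_head)
```

In this hand-picked field TIT FOR TAT ties two of its matches (30-30 and 25-25) and loses the third 9-14, yet its total of 64 beats grim (52), suspicious TIT FOR TAT (48), and always defect (38). The result depends heavily on who is in the field; add enough unconditional cooperators and always defect comes out on top instead.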

Pages 249 & 250 describe real-life examples of iterated prisoner’s dilemmas in the lives of stickleback fish (read more about that here) and the vampire bat of South America.

Gandhi’s Truth

I finished “Gandhi’s Truth: On The Origins of Militant Nonviolence” this morning. I won’t try to summarize what Erik Erikson wrote over 450 pages, but here are a couple quotes I found worth remembering:

“… I must reduce myself to zero. So long as a man does not of his own free will put himself last among his fellow creatures, there is no salvation for him. Ahimsa is the farthest limit of humility.” [pg 59]

“… I will not forget the consternation which I caused in some of Gandhi’s old friends when I asked them to stand up and show me how tall he was as compared with them. It became clear that, while in fact small, he seemed immeasurable. The passing of such a pervasive light leaves the dark even darker and the once-enlightened suddenly forlorn. For the numinous person has the strange power to make the participant feel part of him and yet also feel augmented in himself; and both of these augmentations are apt to wane when the great moment is over.” [pg 63]

Heinrich Zimmer summarizes the meaning of dharma as: “The correct manner of dealing with every life problem that arises, therefore, is indicated by the laws (dharma) of the caste (varna) to which one belongs, and of the particular stage of life (asrama) that is proper to one’s age. One is not free to choose; one belongs to a species — a family, guild, and craft, a group, a denomination. And since this circumstance not only determines to the last detail the regulations for one’s public and private conduct, but also represents (according to this all inclusive and pervasive, unyielding pattern of integration) the real ideal of one’s present natural character, one’s concern as a judging and acting entity must be only to meet every life problem in a manner befitting the role one plays…” [pg 75-76]

“… In all of Gandhi’s utterances … two themes stood out, new in the independence movement: never start what you have not clearly circumscribed in your own mind or what you are not ready to suffer for to the very end.” [pg 89]

In regards to the desire Gandhi had to be untainted and unsmudged: “Mere character could be, as it were, a cold chimney, nothing more than an encasement. A fireplace is not worth more than the fire it can hold and warmth it can generate; and a man like Gandhi, I would surmise, early knew that he had to contain a superior energy of destructive, as well as benevolent, forces..” [pg 101]

On the difference between Indians and Westerners: “… the very qualities of Indians count for defects in South Africa. The Indians are disliked in South Africa for their simplicity, patience, perseverance, frugality, and otherworldliness. Westerners are enterprising, impatient, engrossed in multiplying their material wants and in satisfying them, fond of good cheer, anxious to save physical labor and prodigal in habits.” [pg 190]

An illuminating quote in relation to the chaos that enveloped Baghdad in the days following the initial US occupation: “The point is that excess and riot follow repression and suppression when the moral restraints are lifted, precisely because of the autocratic and blind nature of those restraints…. nonviolence, inward and outward, can become a true force only where ethics replaces moralism. And ethics, to me, is marked by an insightful assent to human values, whereas moralism is blind obedience; and ethics is transmitted with informed persuasion, rather than enforced with absolute interdicts.” [pg 251]

On our identities, who we are: “For membership in a nation, in a class, or in a caste is one of those elements of an individual’s identity which at the very minimum comprise what one is never not, as does membership in one of the two sexes or in a given race. What one is never not establishes the life space within which one may hope to become uniquely and affirmatively what one is — and then to transcend that uniqueness by way of a more inclusive humanity.” [pg 266]

The Day I Turned Uncool

Finished “The Day I Turned Uncool” [official site] [amazon] by Dan Zevin last night. Dan is from Jersey, now resides in Cambridge. 192 pages of fun. Read it if you’re starting to realize you’re not 21 anymore.

On another (completely and totally unrelated) note, I started reading “Gandhi’s Truth: On The Origins Of Militant Nonviolence” [amazon] tonight, a book by Erik Erikson (a resident of Stockbridge here in MA for a time). It is described aptly in an Amazon review as “…an introduction to the challenges of poverty, religious difference, and ethnic tensions we all must accept and try to deal with as we head into the everchanging 21st century.”

All this and I’m cognizant of Geoff’s post a couple days ago about Goog and off-topic posts. I really enjoy reading Goog and I think it’s a great tool (one obviously not written in a short weekend). I was not one of the people who emailed him complaining that there were too many posts not related to MX technology. With that said, I think that Goog’s primary benefit is not that I can see what things other people are doing with MX technology… Go subscribe to an email list if you want announcements, bug fixes and people talking about the nuances of the ‘this’ scope versus the variables scope or go get yourself an RSS reader and compile your own list of interesting blogs. I for one enjoy pseudo off-topic posts. Anywho…