Computing: Fun contest: How many different words does “Don Quixote” have?

by worldysnews

MEXICO CITY (process.com.mx).–Don Quixote, written by Miguel de Cervantes and published in two parts in 1605 and 1615, is a monumental work of Spanish literature. There are printed versions in paperback, in hardcover, in deluxe editions, and so on. Of course, it has also been available in electronic form for a while: Don Quixote can be read as PDF, as ePub, and as plain text with no special formatting.

An interesting challenge is to find out how many different words were used in Don Quixote. So let’s pose this playful challenge formally: take the digital file of Don Quixote (which can be downloaded from this site: ) and produce a list of all the different words in Cervantes’ work, arranged alphabetically, each with its frequency. Which word did the one-armed man of Lepanto write most often? Which words did he write only once? These questions keep me awake at night.
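Once the list exists, both questions can be answered with a single pass over it. Here is a minimal sketch in Python, assuming the list is already available as (word, count) pairs. The function names and the sample data are made up for illustration and are not counts taken from the book.

```python
def most_frequent(pairs):
    """Return the (word, count) pair with the highest count, or None if empty.

    On a tie, the pair seen first wins.
    """
    best = None
    for word, count in pairs:
        if best is None or count > best[1]:
            best = (word, count)
    return best


def written_once(pairs):
    """Return every word whose count is exactly 1, in the order given."""
    return [word for word, count in pairs if count == 1]


# Hypothetical sample data, only to show the calls.
sample = [("de", 3), ("hidalgo", 1), ("la", 2), ("mancha", 1)]
print(most_frequent(sample))
print(written_once(sample))
```

In a long text there will probably be many words that occur only once, so the second function returns a list rather than a single word.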

Two criteria will be taken into account here. The first is speed: which program can produce this list (which must be saved in a text file) fastest. The second is correctness: the program must give the frequency of every word used. The result must show one word per line with its frequency next to it, that is, the number of times it occurred in the text. The two criteria are considered together. For example, someone may deliver a program that retrieves all the words and their frequencies and takes 2 minutes (say), while someone else delivers a program that is very fast but does not produce all the words or the right counts. The second program is wrong. Obviously, the winner here will be the one who delivers the result closest to what it should be.
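To make the expected output concrete, here is a small sketch in Python of how a text could be cut into words and how the report could be written. The challenge does not define exactly what counts as a word, so this sketch assumes a word is a maximal run of letters, compared without regard to case. The single space between word and frequency, and the names split_words and write_report, are also assumptions.

```python
import io


def split_words(text):
    """Yield lowercase words, where a word is a maximal run of letters."""
    current = []
    for ch in text:
        if ch.isalpha():          # accented letters and ñ count as letters
            current.append(ch.lower())
        elif current:             # a non-letter ends the current word
            yield "".join(current)
            current = []
    if current:                   # flush a word that ends the text
        yield "".join(current)


def write_report(pairs, out):
    """Write one 'word frequency' line per pair to an open text file."""
    for word, count in pairs:
        out.write(f"{word} {count}\n")


words = list(split_words("En un lugar de la Mancha, de cuyo nombre..."))
print(words)

# An in-memory buffer stands in for the real output file.
buffer = io.StringIO()
write_report([("de", 2), ("en", 1)], buffer)
print(buffer.getvalue(), end="")
```

A different definition of a word (for example, keeping hyphenated forms together) would change the counts, so two correct programs can disagree unless they agree on this rule first.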

The usual restrictions apply: no library may be used to search, sort, or perform any other task that is part of the challenge. In other words, only the language as it is defined may be used, together with, for example, its input/output libraries.

It should be noted that the challenge is not easy, because before starting we must decide what data structure to use. According to Word (and including the Project Gutenberg disclaimers), Don Quixote contains 384,262 words. Suppose, just to illustrate, that all of them were different (which is not true) and that each word takes about 10 characters on average (a reasonable assumption). We would then need a structure holding 384,262 × 10 ≈ 3.8 million characters, that is, about 4 MBytes, which is not very demanding. To that we must add the count for each word, which has to be stored somehow. Please don’t even think about declaring a 4 MByte array to hold words and numbers. Although it can be done and the compiler will not protest, it is not an acceptable programming technique. In my opinion, the right structure is a tree of records, each holding a word and its frequency. The advantage of this structure is that the sorted list comes out simply by traversing the tree in a particular way (for a binary search tree, an in-order traversal), so no explicit sort is needed. But these are just tips; you do not have to follow them if you do not want to.
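The tree idea can be sketched as follows. This is one possible reading of it, not the author's program: an unbalanced binary search tree in Python, with iterative insertion and an in-order traversal that uses an explicit stack. The names Node, add_word and in_order are made up for the example, and words are compared by plain string order, which is not the same as Spanish dictionary order for accented letters.

```python
class Node:
    """One record of the tree: a word, its frequency, and two children."""
    __slots__ = ("word", "count", "left", "right")

    def __init__(self, word):
        self.word = word
        self.count = 1
        self.left = None
        self.right = None


def add_word(root, word):
    """Insert word, or add 1 to its count if present. Returns the root."""
    if root is None:
        return Node(word)
    node = root
    while True:
        if word == node.word:
            node.count += 1
            return root
        if word < node.word:
            if node.left is None:
                node.left = Node(word)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(word)
                return root
            node = node.right


def in_order(root):
    """Yield (word, count) pairs in ascending word order, without sorting."""
    stack = []
    node = root
    while stack or node is not None:
        while node is not None:      # go as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()           # smallest word not yet visited
        yield node.word, node.count
        node = node.right


root = None
line = "en un lugar de la mancha de cuyo nombre no quiero acordarme"
for w in line.split():
    root = add_word(root, w)

for word, count in in_order(root):
    print(word, count)
```

Because the tree is not balanced, its speed depends on the order in which the words arrive. Ordinary prose tends to give a reasonably bushy tree, but input that is already sorted would degrade it into a linked list.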

Those who program in Python, Ruby or any other interpreted language will find it hard to beat the others on speed. This does not mean that they cannot win the challenge, but they are clearly at a disadvantage compared with programs compiled to native x86 code. I have already written my own solution to this challenge, and it takes less than 2 seconds to find the words of Don Quixote and count their frequencies. Do you think you can do better?
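For anyone who wants to compare their own running time against that figure, here is a small timing sketch in Python. How the contest itself measures time is not stated, so this is only an assumption: it measures wall-clock time around a single function call. The helper timed and the stand-in workload count_letters are made up for the example.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def count_letters(text):
    """Stand-in workload: count the letters in text, one character at a time."""
    total = 0
    for ch in text:
        if ch.isalpha():
            total += 1
    return total


result, seconds = timed(count_letters, "En un lugar de la Mancha" * 1000)
print(f"{result} letters in {seconds:.4f} s")
```

Reading the file and writing the report are part of the job too, so a fair measurement would wrap the whole program, not just the counting step.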

The prize? A mug with the Walrus logo for the best solution. More small prizes, such as t-shirts, may be added (we are working on that), and we appeal to anyone who wants to donate a symbolic prize to make these challenges more attractive. The mug applies only to programmers who live in CDMX (sending a mug to the provinces or to other countries is absurdly expensive). Contestants from other countries or from elsewhere in Mexico will instead receive a USB memory stick of at least 16 GBytes, sent by certified mail. And yes, I know these are not big prizes, but this is what there is for the moment. Obviously, the winner will be announced here and will even get their fifteen minutes of fame.

The results are final. It should be noted that this contest simply seeks to encourage programming and to show that it can be fun. It is a good-faith contest. The winner gives their source code to the community; that is, open source is promoted. Programs copied from the web, or that smack of plagiarism, may be eliminated without further consideration, and the decision cannot be appealed. The point of these challenges is for programmers to be encouraged to solve them, not to look for ways to cheat. So sharpen your programming skills and have a good time trying to solve the proposed problem!

Please send programs that solve the challenge, including the source code, to my email: morsa@la-morsa.com


2024-04-16 23:55:49
