Build an ASCII chart of the most commonly used words in a given text

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purposes).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different "words" in the a-z and A-Z ranges: (don and t).

  • Optionally (it's too late to formally change the specification now), you may choose to drop all single-letter "words" (which could also shorten the ignore list).

Parse a given text (read a file specified via command-line arguments or piped in; assume us-ascii) and build a word frequency chart for it with the following characteristics:

  • Display the chart (also see the example below) for the 22 most commonly used words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possibly differing bar and word lengths: e.g., the second most common word could be a lot longer than the first while not differing that much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent). A sketch of this scaling step follows this list.
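
For clarity, here is a minimal sketch of that scaling rule in Python (not part of the original challenge; draw_chart and its variable names are made up for illustration): the word imposing the tightest constraint fixes the scale factor, and every bar is then sized with that same factor.

 # Hypothetical illustration of the bar-scaling rule; 'top' is assumed to be
 # a list of (word, count) pairs for the 22 most frequent words, highest first.
 def draw_chart(top):
     # A line is '|' + bar + '| ' + word + ' ', i.e. bar + len(word) + 4 chars,
     # so a word of length L leaves at most 76 - L columns for underscores.
     scale = min((76.0 - len(w)) / c for w, c in top)   # tightest word wins
     lines = [' ' + '_' * int(top[0][1] * scale)]       # closes the first bar on top
     for w, c in top:
         lines.append('|' + '_' * int(c * scale) + '| ' + w + ' ')
     return '\n'.join(lines)

In the first example below the scale is driven by ('she', 553); in the second, the long word superlongstringstring provides the tightest constraint, which is why every bar shrinks.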

An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |___________________________________________| that
 |____________________________________| as
 |________________________________| her
 |_____________________________| with
 |_____________________________| at
 |____________________________| s
 |____________________________| t
 |__________________________| on
 |__________________________| all
 |_______________________| this
 |_______________________| for
 |_______________________| had
 |_______________________| but
 |______________________| be
 |_____________________| not
 |____________________| they
 |____________________| so


FYI: these are the frequencies the chart above is built upon:

 [('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that', 330),
  ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t', 218),
  ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('but', 175),
  ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check whether you implemented the complete spec): replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

  _______________________________________________________________
 |_______________________________________________________________| she
 |_______________________________________________________| superlongstringstring
 |____________________________________________________| said
 |______________________________________________| alice
 |________________________________________| was
 |_____________________________________| that
 |_______________________________| as
 |____________________________| her
 |_________________________| with
 |_________________________| at
 |_________________________| s
 |________________________| t
 |_______________________| on
 |______________________| all
 |____________________| this
 |____________________| for
 |____________________| had
 |____________________| but
 |___________________| be
 |__________________| not
 |_________________| they
 |_________________| so

The winner:

The shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

 Language Relaxed Strict
 ========= ======= ======
 GolfScript 130 143
 Perl 185
 Windows PowerShell 148 199
 Mathematica 199
 Ruby 185 205
 Unix Toolchain 194 228
 Python 183 243
 Clojure 282
 Scala 311
 Haskell 333
 Awk 336
 R 298
 Javascript 304 354
 Groovy 321
 Matlab 404
 C# 422
 Smalltalk 386
 PHP 450
 F# 452
 TSQL 483 507

The numbers represent the length of the shortest solution in a given language. "Strict" refers to a solution that implements the complete spec (draws |____| bars, closes the topmost bar with a ____ line on top, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to mean various solutions that use the traditional *nix shell plus a combination of tools (such as grep, tr, sort, uniq, head, perl, awk).

LabVIEW: 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

LabVIEW code

results

The program flows from left to right:

LabVIEW code, explained

Ruby 1.9, 185 characters

(heavily based on the other Ruby solutions)

 w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
 k,l=w[0]
 puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command-line switches like the other solutions, you can simply pass the filename as an argument (i.e. ruby1.9 wordfrequency.rb Alice.txt).

Since I use character literals here, this solution only works in Ruby 1.9.

Edit: replaced semicolons with line breaks for "readability". :P

Edit 2: Shtééf pointed out that I forgot the trailing space; fixed.

Edit 3: removed the trailing space again ;)

GolfScript, 177 175 173 167 164 163 144 131 130 characters

Slow - 3 minutes for the sample text (130)

 {32|.123%971{;)}if}/]2/{~~\;}$22< .0=~:2;,76\-:1'_':0*' '\@{" |"\~1*2/0*'| '@}/ 

Explanation:

 {           #loop through all characters
  32|.       #convert to uppercase and duplicate
  123%97<    #determine if is a letter
  n@if       #return either the letter or a newline
 }%          #return an array (of ints)
 ]''*        #convert array to a string with magic
 n%          #split on newline, removing blanks (stack is an array of words now)
 "oftoitinorisa" #push this string
 2/          #split into groups of two, ie ["of" "to" "it" "in" "or" "is" "a"]
 -           #remove any occurrences from the text
 "theandi"3/-#remove "the", "and", and "i"
 $           #sort the array of words
 (1@         #takes the first word in the array, pushes a 1, reorders stack
             #the 1 is the current number of occurrences of the first word
 {           #loop through the array
  .3$>1{;)}if#increment the count or push the next word and a 1
 }/
 ]2/         #gather stack into an array and split into groups of 2
 {~~\;}$     #sort by the latter element - the count of occurrences of each word
 22<         #take the first 22 elements
 .0=~:2;     #store the highest count
 ,76\-:1     #store the length of the first line
 '_':0*' '\@ #make the first line
 {           #loop through each word
  " |"\~     #start drawing the bar
  1*2/0      #divide by zero
  *'| '@     #finish drawing the bar
 }/
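
For readers who do not speak GolfScript, the counting trick used above (sort the words first, then walk the sorted list and bump a counter while the same word repeats, instead of using a hash map) corresponds roughly to this Python sketch (my paraphrase, not part of the answer):

 def count_by_runs(words):
     # 'words' is the filtered word list; sorting groups equal words together,
     # so a single pass can tally runs without any dictionary lookups.
     pairs = []
     for w in sorted(words):
         if pairs and pairs[-1][0] == w:
             pairs[-1][1] += 1          # same word as the previous entry
         else:
             pairs.append([w, 1])       # a new run starts with count 1
     return sorted(pairs, key=lambda p: -p[1])[:22]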

"Correcto" (con suerte). (143)

 {32|.123%971{;)}if}/]2/{~~\;}$22< ..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{" |"\~1*^/0*'| '@}/ 

Less slow - half a minute. (162)

 '"'/' ':S*n/S*'"#{%q '\+" .downcase.tr('^a-z',' ')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22< .0=~:2;,76\-:1'_':0*S\@{" |"\~1*2/0*'| '@}/ 

Output viewable in the revision logs.

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

 ~ % wc -c wfg
 209 wfg
 ~ % cat wfg
 egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
 ~ % # usage:
 ~ % sh wfg < 11.txt

hm, just saw above: sort -nr -> sort -n and then head -> tail => 208 🙂
update2: erm, of course the above is silly, since it would then be reversed. So, 209.
update3: optimized the exclusion regex -> 206

 egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

just for fun, here's a perl-only version (much faster):

 ~ % wc -c pgolf
 204 pgolf
 ~ % cat pgolf
 perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
 ~ % # usage:
 ~ % sh pgolf < 11.txt

Set-based Transact SQL solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks have been added to avoid scroll bars; only the last line break is required.

 DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A', SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING (@,i+1,1)FROM N WHERE i1 AND W NOT IN('the','and','of','to','it', 'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+ REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @ 

Readable version

 DECLARE @ VARCHAR(MAX), @F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',SINGLE_BLOB)x; /* Loads text file from path C:\WINDOWS\system32\A */ /*Recursive common table expression to generate a table of numbers from 1 to string length (and associated characters)*/ WITH N AS (SELECT 1 i, LEFT(@,1)L UNION ALL SELECT i+1, SUBSTRING(@,i+1,1) FROM N WHERE i1 AND W NOT IN('the', 'and', 'of' , 'to' , 'it' , 'in' , 'or' , 'is') GROUP BY W ORDER BY C /*Just noticed this looks risky as it relies on the order of evaluation of the variables. I'm not sure that's guaranteed but it works on my machine :-) */ SELECT @F=MIN(($76-LEN(W))/-C), @ =' ' +REPLICATE('_',-MIN(C)*@F)+' ' FROM # SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @ 

Output

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| You |____________________________________________________________| said |_____________________________________________________| Alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| at |_____________________________| with |__________________________| on |__________________________| all |_______________________| This |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| So |___________________| very |__________________| what 

And with the long string

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |____________________________________________________| said |______________________________________________| Alice |________________________________________| was |_____________________________________| that |_______________________________| as |____________________________| her |_________________________| at |_________________________| with |_______________________| on |______________________| all |____________________| This |____________________| for |____________________| had |____________________| but |___________________| be |__________________| not |_________________| they |_________________| So |________________| very |________________| what 

Ruby 207 213 211 210 207 203 201 200 characters

An improvement on Anurag's, incorporating the suggestion from rfusca. Also removes the argument to sort and some other minor golfing.

 w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Run as:

 ruby GolfedWordFrequencies.rb < Alice.txt 

Edit: put 'puts' back; it needs to be there to avoid getting quotes in the output.
Edit2: changed File -> IO
Edit3: removed /i
Edit4: removed the parentheses around (f*1.0), recounted
Edit5: use string addition for the first line; expand s in place.
Edit6: made m a float, removed the 1.0. EDIT: doesn't work, it changes lengths. EDIT: no worse than before
Edit7: use STDIN.read.

Mathematica (297 284 248 244 242 199 characters) Pure functional

and a test of Zipf's law

Look Ma... no vars, no hands, .. no head

Edit 1> some short words defined (284 chars)

 f[x_, y_] := Flatten[Take[x, All, y]]; BarChart[f[{##}, -1], BarOrigin -> Left, ChartLabels -> Placed[f[{##}, 1], After], Axes -> None ] & @@ Take[ SortBy[ Tally[ Select[ StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]], !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&] ], Last], -22] 

Some explanations

 Import[]                # Get The File
 ToLowerCase[]           # To Lower Case :)
 StringSplit[STRING, RegularExpression["\\W+"]]
                         # Split By Words, getting a LIST
 Select[LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
                         # Select from LIST except those words in LIST_TO_AVOID
                         # Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test
 Tally[LIST]             # Get the LIST {word,word,..} and produce another {{word,counter},{word,counter}...}
 SortBy[LIST, Last]      # Get the list produced by Tally and sort by counters
                         # Note that counters are the LAST element of {word,counter}
 Take[LIST, -22]         # Once sorted, get the biggest 22 counters
 BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
                         # Get the list produced by Take as input and produce a bar chart
 f[x_, y_] := Flatten[Take[x, All, y]]
                         # Auxiliary to get the list of the first or second elements of a list of lists x_, depending upon y
                         # So f[{##}, -1] is the list of counters
                         # and f[{##}, 1] is the list of words (labels for the chart)

Output

alt text http://i49.tinypic.com/2n8mrer.jpg

Mathematica isn't well suited to golfing, and that's solely because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

Testing Zipf's law

Zipf's law predicts that, for natural language text, the plot of Log(Rank) vs Log(occurrences) follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it is NOT the "Z" in the LZW algorithm.)

In our text, we can test it with the following:

  f[x_, y_] := Flatten[Take[x, All, y]]; ListLogLogPlot[ Reverse[f[{##}, -1]], AxesLabel -> {"Log (Rank)", "Log Counter"}, PlotLabel -> "Testing Zipf's Law"] & @@ Take[ SortBy[ Tally[ StringSplit[ToLowerCase[b], RegularExpression["\\W+"]] ], Last], -1000] 

The result is (pretty nicely linear):

alt text http://i46.tinypic.com/33fcmdk.jpg

Edit 6> (242 chars)

Refactored the regex (the Select function is no longer used)
Dropping 1-char words
More efficient definition for the function "f"

 f = Flatten[Take[#1, All, #2]]&; BarChart[ f[{##}, -1], BarOrigin -> Left, ChartLabels -> Placed[f[{##}, 1], After], Axes -> None] & @@ Take[ SortBy[ Tally[ StringSplit[ToLowerCase[Import[i]], RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]] ], Last], -22] 

Edit 7 -> 199 characters

 BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@ Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i, RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22] 
  • Replaced f with Transpose and Slot arguments (#1 / #2).
  • We don't need any nasty brackets (use f@x instead of f[x] where possible)

C# - 510 451 436 446 434 426 422 characters (minified)

Not that short, but now probably correct! Note that the previous version didn't show the first line of the bars, didn't scale the bars correctly, downloaded the file instead of getting it from stdin, and didn't include all the required C# verbosity. You could easily shave off many strokes if C# didn't need so much extra cruft. Maybe PowerShell could do better.

 using C=System.Console;   // alias for Console
 using System.Linq;        // for Split, GroupBy, Select, OrderBy, etc.
 class Class               // must define a class
 {
     static void Main()    // must define a Main
     {
         // split into words
         var allwords = System.Text.RegularExpressions.Regex.Split(
             // convert stdin to lowercase
             C.In.ReadToEnd().ToLower(),
             // eliminate stopwords and non-letters
             @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
             .GroupBy(x => x)              // group by words
             .OrderBy(x => -x.Count())     // sort descending by count
             .Take(22);                    // take first 22 words
         // compute length of longest bar + word
         var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));
         // prepare text to print
         var toPrint = allwords.Select(x => new {
             // remember bar pseudographics (will be used in two places)
             Bar = new string('_',(int)(x.Count()/lendivisor)),
             Word = x.Key
         }).ToList();                      // convert to list so we can index into it
         // print top of first bar
         C.WriteLine(" " + toPrint[0].Bar);
         toPrint.ForEach(x =>              // for each word, print its bar and the word
             C.WriteLine("|" + x.Bar + "| " + x.Word));
     }
 }

422 characters with lendivisor inlined (which makes it 22 times slower), in the following form (newlines used for selected spaces):

 using System.Linq;using C=System.Console;class M{static void Main(){var a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}

Perl, 237 229 209 characters

(Updated once more to beat the Ruby version with some more dirty golf tricks, replacing split/[^a-z]/,lc with lc=~/[a-z]+/g, and eliminating an empty-string check elsewhere. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '' alice.txt. Since the whole script is on one line, writing it as a one-liner should pose no difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution, since removing ,lc (for lowercasing) requires adding A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two characters by using a literal newline in place of \n. However, I haven't written the sample above that way, since it's "clearer" (ha!) this way.


Here's a mostly correct, but not remotely short enough, perl solution:

 use strict;
 use warnings;

 my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
 my %count = ();
 $count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
 my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
 my $widest = 76 - (length $sorted[0]);
 print " " . ("_" x $widest) . "\n";
 foreach (@sorted) {
     my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
     print "|" . ("_" x $width) . "| $_ \n";
 }

The following is about as short as it can get while remaining relatively readable. (392 characters.)

 %short = map { $_ => 1 } qw/the and of to a i it in or is/;
 %count;
 $count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
 @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
 $widest = 76 - (length $sorted[0]);
 print " " . "_" x $widest . "\n";
 print "|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

Windows PowerShell, 199 characters

 $x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
 filter f($w){' '+'_'*$w
 $x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
 f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but is included here for readability.)

(The current code and my test files are available in my SVN repository. I hope my test cases catch most of the common errors (bar length, problems with regex matching, and a few others).)

Assumptions

  • US-ASCII as input. It would probably get weird with Unicode.
  • At least two non-stop-words in the text

History

Relaxed version (137), since that's apparently being counted separately now:

 ($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name} 
  • doesn't close the first bar
  • doesn't take the word length of the first word into account

One-character variations in bar lengths compared to other solutions are due to PowerShell using rounding instead of truncation when converting floating-point numbers to integers. Since the task only required proportional bar lengths, though, this should be fine.

Compared to other solutions, I took a slightly different approach to determining the longest bar length: I simply try each width and take the highest one at which no line is longer than 80 characters (see the sketch below).

An explained earlier version can be found here.
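
The trial-and-error width search mentioned above, f(76..1|?{!((f $_)-match'.'*80)})[0], can be paraphrased in Python like this (my sketch of the idea, not a port of the PowerShell; render is a hypothetical helper that formats the whole chart for a given maximum bar width):

 def widest_fitting(top, render):
     # top: (word, count) pairs, most frequent first.
     # render(top, width): returns the chart as a string, giving the most
     # frequent word a bar 'width' underscores long and scaling the rest.
     for width in range(76, 0, -1):                     # widest chart first
         lines = render(top, width).splitlines()
         if all(len(line) <= 80 for line in lines):     # first width that fits
             return width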

Ruby, 215, 216, 218, 221, 224, 236, 237 characters

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut any more :)

update 2: played a dirty golf trick. Changed each to map to save 1 character :)

update 3: changed File.read to IO.read, +2. Array.group_by wasn't very fruitful, changed to reduce, +6. The case-insensitive check isn't needed after lowercasing with downcase before the regex, +1. Sorting in descending order is easily done by negating the value, +6. Total savings +15

update 4: [0] instead of .first, +3. (@Shtééf)

update 5: expand the variable l in place, +1. Expand the variable s in place, +2. (@Shtééf)

update 6: use string addition instead of interpolation for the first line, +2. (@Shtééf)

 w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: went through a lot of hoopla to detect the first iteration inside the loop, using instance variables. All I gained is +1, although maybe there's potential. Preserving the previous version, because I think this one is black magic. (@Shtééf)

 (IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

 string = File.read($_).downcase
 words = string.scan(/[a-z]+/i)
 allowed_words = words - %w{the and of to a i it in or is}
 sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
 highest_frequency = sorted_words.first
 highest_frequency_count = highest_frequency[1]
 highest_frequency_word = highest_frequency[0]
 word_length = highest_frequency_word.size
 widest = 76 - word_length
 puts " #{'_' * widest}"
 sorted_words.each do |word, freq|
   width = (freq * 1.0 / highest_frequency_count) * widest
   puts "|#{'_' * width}| #{word} "
 end

Usage:

 echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb 

Output:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

Python 2.x, latitudinarian approach = 227 183 chars

 import sys,re
 t=re.split('\W+',sys.stdin.read().lower())
 r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
 for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion ( the, and, of, to, a, i, it, in, or, is ) – plus it also excludes the two infamous “words” s and t from the example – and I threw in for free the exclusion for an, for, he . I tried all concatenations of those words against corpus of the words from Alice, King James’ Bible and the Jargon file to see if there are any words that will be mis-excluded by the string. And that is how I ended with two exclusion strings: itheandtoforinis and andithetoforinis .
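
The substring trick is easy to check in isolation; the following small sketch (mine, not part of the submission) shows why w not in 'andithetoforinis' drops exactly the intended words while leaving the frequent real words from Alice alone:

 stop = 'andithetoforinis'
 # every ignored word is a substring of the blob...
 assert all(w in stop for w in
            ['the', 'and', 'of', 'to', 'a', 'i', 'it', 'in', 'or', 'is',
             's', 't', 'an', 'for', 'he'])
 # ...while the frequent real words are not, so they are kept
 assert not any(w in stop for w in ['she', 'you', 'said', 'alice', 'was'])

This mirrors the check described above against the word lists from Alice, the King James Bible and the Jargon File.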

PS. borrowed from other solutions to shorten the code.

 =========================================================================== she ================================================================= you ============================================================== said ====================================================== alice ================================================ was ============================================ that ===================================== as ================================= her ============================== at ============================== with =========================== on =========================== all ======================== this ======================== had ======================= but ====================== be ====================== not ===================== they ==================== so =================== very =================== what ================= little 

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus used. Per one of the most popular lists ( http://en.wikipedia.org/wiki/Most_common_words_in_English , http://www.english-for-students.com/Frequently-Used-Words.html , http://www.sporcle.com/games/common_english_words.php ), top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So question is why or was included in the problem’s ignore list, where it’s ~30th in popularity when the word that (8th most used) is not. etc, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).

Alternative idea would be simply to skip the top 10 words from the result – which actually would shorten the solution (elementary – have to show only the 11th to 32nd entries).
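
That variant would look roughly like this (a hypothetical sketch, not one of the submitted solutions): count everything, then chart ranks 11 through 32.

 import re, sys
 from collections import Counter

 words = re.findall('[a-z]+', sys.stdin.read().lower())
 top = Counter(words).most_common(32)[10:]   # drop the 10 most common words
 # 'top' now holds the 22 entries (ranks 11..32) to be charted as before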


Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

 import sys,re
 t=re.split('\W+',sys.stdin.read().lower())
 r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
 h=min(9*l/(77-len(w))for l,w in r)
 print'',9*r[0][0]/h*'_'
 for l,w in r:print'|'+9*l/h*'_'+'|',w

I take an issue with the somewhat random choice of the 10 words to exclude the, and, of, to, a, i, it, in, or, is so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is < "Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does "adjustment" for the lengths of all the top words, so none of them will overflow in a degenerate case.

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |_____________________________________________________| said |______________________________________________| alice |_________________________________________| was |______________________________________| that |_______________________________| as |____________________________| her |__________________________| at |__________________________| with |_________________________| s |_________________________| t |_______________________| on |_______________________| all |____________________| this |____________________| for |____________________| had |____________________| but |___________________| be |___________________| not |_________________| they |_________________| so 

Haskell – 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

 import Data.List
 import Data.Char
 l=length
 t=filter
 m=map
 f c|isAlpha c=toLower c|0<1=' '
 h w=(-l w,head w)
 x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
 q?(g,w)=q*(77-l w)`div`g
 b x=m(x!)x
 a(l:r)=(' ':t(=='_')l):l:r
 main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

  • map f lowercases alphabetics, replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word) .
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x, where x is the list of histograms. The entire list is passed to c, so that each invocation of c can compute the scale factor for itself by calling u. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting (ascending) on -frequency places the words with the largest frequency first. Later, in the function u, two -frequency values are multiplied, which cancels the negation out.
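
Both tricks are easy to mimic outside Haskell; here is a rough Python paraphrase (my notation, using the 76 usable columns most other answers use rather than the 77 the Haskell counts) of the integer-only bar computation:

 def bar_widths(top):
     # top: (count, word) pairs; storing (-count, word) and sorting, as the
     # Haskell does, would yield descending order without a reverse step.
     # Integer division keeps everything exact: for each candidate (n, v)
     # the proportional width is c*(76-len(v))//n, and min() applies the
     # tightest constraint, so no bar plus its word can exceed the limit.
     return {w: min(c * (76 - len(v)) // n for n, v in top)
             for c, w in top}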

JavaScript 1.8 (SpiderMonkey) – 354

 x={};p='|';e=' ';z=[];c=77
 while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
 z=z.sort(function(a,b)b.c-a.c).slice(0,22)
 for each(v in z){v.r=v.c/z[0].c
 c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
 s=Array(v.r*c|0).join('_')
 if(!+k)print(e+s+e)
 print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn’t seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines….

Adding whitespace for readability:

 x={};p='|';e=' ';z=[];c=77
 while(l=readline())
   l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
     function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
   )
 z=z.sort(function(a,b) b.c - a.c).slice(0,22)
 for each(v in z){
   v.r=v.c/z[0].c
   c=c>(l=(77-v.w.length)/v.r)?l:c
 }
 for(k in z){
   v=z[k]
   s=Array(v.r*c|0).join('_')
   if(!+k)print(e+s+e)
   print(p+s+p+e+v.w)
 }

Usage: js golf.js < input.txt

Output:

  _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but.. I duno. Brainfart fixed.

Minified (abusing \n 's interpreted as a ; sometimes):

 x={};p='|';e=' ';z=[]
 readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
 z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
 for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
 if(!+k)print(e+s+e)
 print(p+s+p+e+v.w)}

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe

Minified:

 <?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

 <?php
 // Read:
 $s = strtolower(file_get_contents($argv[1]));
 // Split:
 $a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);
 // Remove unwanted words:
 $a = array_filter($a, function($x){
     return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
 });
 // Count:
 $a = array_count_values($a);
 // Sort:
 arsort($a);
 // Pick top 22:
 $a = array_slice($a,0,22);
 // Recursive function to adjust bar widths
 // according to the last requirement:
 function R($a,$F,$B){
     $r = array();
     foreach($a as $x=>$f){
         $l = strlen($x);
         $r[$x] = $b = $f * $B / $F;
         if ( $l + $b > 76 ) return R($a,$f,76-$l);
     }
     return $r;
 }
 // Apply the function:
 $c = R($a,max($a),76-strlen(key($a)));
 // Output:
 foreach ($a as $x => $f)
     echo '|',str_repeat('-',$c[$x]),"| $x\n";
 ?>

Output:

 |-------------------------------------------------------------------------| she |---------------------------------------------------------------| you |------------------------------------------------------------| said |-----------------------------------------------------| alice |-----------------------------------------------| was |-------------------------------------------| that |------------------------------------| as |--------------------------------| her |-----------------------------| at |-----------------------------| with |--------------------------| on |--------------------------| all |-----------------------| this |-----------------------| for |-----------------------| had |-----------------------| but |----------------------| be |---------------------| not |--------------------| they |--------------------| so |-------------------| very |------------------| what 

When there is a long word, the bars are adjusted properly:

 |--------------------------------------------------------| she |---------------------------------------------------| thisisareallylongwordhere |-------------------------------------------------| you |-----------------------------------------------| said |-----------------------------------------| alice |------------------------------------| was |---------------------------------| that |---------------------------| as |-------------------------| her |-----------------------| with |-----------------------| at |--------------------| on |--------------------| all |------------------| this |------------------| for |------------------| had |-----------------| but |-----------------| be |----------------| not |---------------| they |---------------| so |--------------| very 
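
The idea behind the recursive R() above can be sketched in Python (an illustration of the approach, not a port of the PHP): scale by the current reference count, and as soon as some bar plus its word would overflow the 76 usable columns, restart using that word's own budget.

 def fit(counts, ref, budget):
     # counts: {word: frequency}; the initial call mirrors the PHP:
     #   fit(counts, max(counts.values()), 76 - len(first_word))
     bars = {}
     for word, freq in counts.items():
         bars[word] = width = freq * budget // ref
         if width + len(word) > 76:                      # this word would overflow...
             return fit(counts, freq, 76 - len(word))    # ...so rescale around it
     return bars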

Python 3.1 - 245 229 characters

I guess using Counter is kind of cheating 🙂 I just read about it about a week ago, so this was the perfect chance to see how it works.

 import re,collections
 o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
 print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints:

 |____________________________________________________________________________| she |__________________________________________________________________| you |_______________________________________________________________| said |_______________________________________________________| alice |_________________________________________________| was |_____________________________________________| that |_____________________________________| as |__________________________________| her |_______________________________| with |_______________________________| at |______________________________| s |_____________________________| t |____________________________| on |___________________________| all |________________________| this |________________________| for |________________________| had |________________________| but |______________________| be |______________________| not |_____________________| they |____________________| so 

Some of the code was “borrowed” from AKX’s solution.

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

 $k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s ",'_'x$l;printf"|%s| $_ ",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s ";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s "}@e[0,0..21]

Latest version down to 189 characters:

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s ";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s "}@_[0,0..21]

This version (205 char) accounts for the lines with words longer than what would be found later.

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s ";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s ";}@e[0,0..21]

Perl: 203 202 201 198 195 208 203 / 231 chars

 $/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars ( this implementation is 231 chars ):

 $/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn’t state anywhere that this had to go to STDOUT, so I used perl’s warn() instead of print – four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 – might sleep on it. At least Perl’s now under the “shell, grep, tr, grep, sort, uniq, sort, head, perl” char count for now 😉

PS: Reddit says “Hi” 😉

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional “ignore 1-letter words” rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block – Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct – /^…$/i?1:$x{$ }++ for /^…$/||$x{$ }++ Saved three characters! => 198, broke the 200 barrier. Might sleep soon… perhaps.

Update 4: Sleep deprivation has made me insane. OK. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn’t the end of the world. Played around with perl’s regex inline eval, but having trouble getting it to both work and save chars… lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (…)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Examples:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so |___________________| very |__________________| what 

Alternative implementation in pathological case example:

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |____________________________________________________| said |______________________________________________| alice |________________________________________| was |_____________________________________| that |_______________________________| as |____________________________| her |_________________________| with |_________________________| at |_______________________| on |______________________| all |____________________| this |____________________| for |____________________| had |____________________| but |___________________| be |__________________| not |_________________| they |_________________| so |________________| very |________________| what 

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print the results.

 let a=
     stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
     |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
     |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
     |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
 let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
 let u n=String.replicate(int(float(n)*k)-2)"_"
 printfn" %s "(u(snd(Seq.nth 0 a)))
 for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):

 % app.exe < Alice.txt _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |_____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |___________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| t |____________________________| s |__________________________| on |_________________________| all |_______________________| this |______________________| had |______________________| for |_____________________| but |_____________________| be |____________________| not |___________________| they |__________________| so 

Python 2.6, 347 chars

 import re
 W,x={},"a and i in is it of or the to".split()
 [W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
 W=sorted(W.items(),key=lambda p:p[1])[:22]
 bm=(76.-len(W[0][0]))/W[0][1]
 U=lambda n:"_"*int(n*bm)
 print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here’s the word-frequency counting half of the problem in 192 bytes:

 curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22

Gawk — 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt’s challenge I desperately tweak so more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of [Matt’s JavaScript][1] solution counter challenge! 😉 and [AKX’s python][2].

The problem seems to call out for a language that implements native associative arrays, so of course I’ve chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.
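
The "find the maximum, print it, delete it" loop described above corresponds roughly to this Python sketch (mine; awk offers no ordered traversal of its associative arrays, which is what forces the repeated scans):

 def top_words(counts, n=22):
     # counts: dict mapping word -> frequency
     for _ in range(min(n, len(counts))):
         best = max(counts, key=counts.get)   # full scan per pick, O(n) each time
         yield best, counts.pop(best)         # emit and remove, like 'delete a[x]'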

Minified:

 {gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++} END{split("the and of to a i it in or is",b," "); for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e} for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2; t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t; print"|"t"| "x;delete a[x]}}

line breaks for clarity only: they are not necessary and should not be counted.


Output:

 $ gawk -f wordfreq.awk.min < 11.txt _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |____________________________________________________| alice |______________________________________________| was |__________________________________________| that |___________________________________| as |_______________________________| her |____________________________| with |____________________________| at |___________________________| s |___________________________| t |_________________________| on |_________________________| all |______________________| this |______________________| for |______________________| had |_____________________| but |____________________| be |____________________| not |___________________| they |__________________| so $ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min ______________________________________________________________________ |______________________________________________________________________| she |_____________________________________________________________| superlongstring |__________________________________________________________| said |__________________________________________________| alice |____________________________________________| was |_________________________________________| that |_________________________________| as |______________________________| her |___________________________| with |___________________________| at |__________________________| s |__________________________| t |________________________| on |________________________| all |_____________________| this |_____________________| for |_____________________| had |____________________| but |___________________| be |___________________| not |__________________| they |_________________| so 

Readable; 633 characters (originally 949):

 {
     gsub("[^a-zA-Z]"," ");
     for(;NF;NF--)
         a[tolower($NF)]++
 }
 END{
     # remove "short" words
     split("the and of to a i it in or is",b," ");
     for (w in b)
         delete a[b[w]];
     # Find the bar ratio
     d=1;
     for (w in a) {
         e=a[w]/(78-length(w));
         if (e>d) d=e
     }
     # Print the entries highest count first
     for (i=22; i; --i){
         # find the highest count
         e=0;
         for (w in a)
             if (a[w]>e)
                 e=a[x=w];
         # Print the bar
         l=a[x]/d-2;
         # make a string of "_" the right length
         t=sprintf(sprintf("%%%dc",l)," ");
         gsub(" ","_",t);
         if (i==22) print" "t;
         print"|"t"| "x;
         delete a[x]
     }
 }

Common LISP, 670 characters

I’m a LISP newbie, and this is an attempt using an hash table for counting (so probably not the most compact method).

 (flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c( make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda (kv)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test 'equal))(push(cons kv)y)))c)(setf y(sort y #'> :key #'cdr))(setf y (subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(- 76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* nf))) (write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline) (dolist(xy)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x)))))) (cond((char< = #\ax #\z)(push xw))(t(incf(gethash(concatenate 'string( reverse w))c 0))(setf w nil))))) 

It can be run, for example, with cat alice.txt | clisp -C golf.lisp.

In readable form it is:

 (flet ((r () (let ((x (read-char t nil))) (and x (char-downcase x)))))
   (do ((c (make-hash-table :test 'equal)) ; the word count map
        w y                                ; current word and final word list
        (x (r) (r)))                       ; iteration over all chars
       ((not x)
        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to" "a" "i" "it" "in" "or" "is") :test 'equal))
                       (push (cons k v) y))) c)
        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))
        ; find the scaling factor
        (let ((f (apply #'min (mapcar (lambda (x) (/ (- 76.0 (length (car x))) (cdr x))) y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
            (write-char #\Space) (outx (cdar y)) (write-char #\Newline)
            (dolist (x y)
              (write-char #\|) (outx (cdr x)) (format t "| ~a~%" (car x))))))
     ; add alphabetic to current word, and bump word counter
     ; on non-alphabetic
     (cond ((char<= #\a x #\z) (push x w))
           (t (incf (gethash (concatenate 'string (reverse w)) c 0))
              (setf w nil)))))

C (828)

It looks a lot like obfuscated code, and it uses glib for strings, lists and hashes. The character count with wc -m says 828. It does not consider single-char words. To calculate the maximum length of the bar, it considers the longest possible word among all of them, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

 #include  #define S(X)g_string_##X #define H(X)g_hash_table_##X GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);} 

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. The last two newlines are significant. It complies with the spec.

 map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/g for<>;
 $n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
 die map{$U='_'x($X{$_}/$n);" $U
 "x!$z++,"|$U| $_
 "}@w

The first line loads counts of valid words into %X .

The second line computes the minimum scaling factor so that all output lines will be <= 80 characters.

The third line (which contains two newline characters) produces the output.
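In rough Java, that second step looks like the sketch below (illustrative only; the 76 columns of slack mirror the Perl above, while other answers use 77 or 78). Every bar is count × scale underscores wide, so the word that tolerates the least scaling fixes the factor for the whole chart.

 import java.util.*;

 class ScaleSketch {
     // Illustrative only: minimum scaling factor so that every
     // bar + space + word + space stays within 80 columns.
     static double scale(Map<String, Integer> top22) {
         double best = Double.MAX_VALUE;
         for (Map.Entry<String, Integer> e : top22.entrySet()) {
             // at most (76 - word length) columns are left for this word's bar
             double allowed = (76.0 - e.getKey().length()) / e.getValue();
             best = Math.min(best, allowed);
         }
         return best;  // a word's bar is then (int)(count * best) underscores
     }
 }

The Perl keeps the reciprocal instead: $n ends up as the largest count/(76−length), and each bar is count/$n underscores.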

Java – 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742 : improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It’s only dependent on the 1st word, not on other words (yet). Found several places to shorten the code (\\s in the regex was replaced, and ArrayList was replaced by Vector). I’m now looking for a short way to remove the Commons IO dependency and to read from stdin.

  • Update 744 > 752 chars : I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.

  • Update 752 > 742 chars : I removed public and a space, made classname 1 char instead of 2 and it’s now ignoring one-letter words.

  • Update 742 > 714 chars : Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars : Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll() .


 import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:

 import java.util.*;
 class F{
   public static void main(String[]a)throws Exception{
     StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
     final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
     List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
     int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
     for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
   }
 }

Output:

  _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

It pretty much sucks that Java doesn’t have String#join() and closures (yet).

Edit by Rotsor:

I have made several changes to your solution:

  • Replaced List with a String[]
  • Reused the ‘args’ argument instead of declaring my own String array. Also used it as an argument to .toArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection-sort with early halting (only first 22 elements have to be found)
  • Aggregated some int declaration into a single statement
  • Implemented the non-cheating algorithm that finds the most limiting line of output. Implemented it without floating point (a sketch of the idea follows this list).
  • Fixed the problem of the program crashing when there were less than 22 distinct words in the text
  • Implemented a new algorithm of reading input, which is fast and only 9 characters longer than the slow one.
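The limiting-line idea without floating point looks roughly like this in Java (a sketch under my own names and an assumed column budget, not Rotsor’s exact code): compare the (budget − length) / count ratios by cross-multiplying instead of dividing.

 import java.util.*;

 class LimitSketch {
     // Illustrative only: find the output line that limits the bar width most,
     // using integer arithmetic. Word w allows at most (budget - w.length())
     // columns of bar, and bars scale with counts, so the smallest
     // (budget - length) / count ratio wins.
     static int topBarWidth(List<Map.Entry<String, Integer>> top, int budget) {
         int maxCount = top.get(0).getValue();  // sorted, most frequent first
         long bestNum = 1, bestDen = 0;         // smallest ratio seen so far
         for (Map.Entry<String, Integer> e : top) {
             long num = budget - e.getKey().length();
             long den = e.getValue();
             // num/den < bestNum/bestDen  <=>  num*bestDen < bestNum*den
             if (bestDen == 0 || num * bestDen < bestNum * den) {
                 bestNum = num;
                 bestDen = den;
             }
         }
         // widest legal bar for the most frequent word under that ratio
         return (int) (maxCount * bestNum / bestDen);
     }
 }

With the Alice frequencies and a budget of 76, “she” itself is the limiting word and the top bar comes out at 73 underscores, matching the charts in this thread.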

The condensed code is 688 711 684 characters long:

 import java.util.*;class F{public static void main(String[]l)throws Exception{Mapm=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(xx*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k 

The fast version ( 720 693 characters)

 import java.util.*;class F{public static void main(String[]l)throws Exception{Mapm=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(xx*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k 

More readable version:

 import java.util.*;class F{public static void main(String[]l)throws Exception{ Mapm=new HashMap();String w=""; int i=0,k=0,j=8,x,y,g=22; for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{ if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w=""; }} l=m.keySet().toArray(l);x=l.length;if(xx*j){i=x;j=y;}} String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_'); System.out.println(" "+s); for(k=0;k 

The version without behaviour improvements is 615 characters:

 import java.util.*;class F{public static void main(String[]l)throws Exception{Mapm=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i 

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Including the long-word adjustment. Ideas borrowed from the other solutions.

Now as a script ( a.scala ):

 val t="\\w+\\b(?< !\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22 def b(p:Int)="_"*(p*(for((w,c)< -t)yield(76.0-w.size)/c).min).toInt println(" "+b(t(0)._2)) for(p<-t)printf("|%s| %s \n",b(p._2),p._1) 

Run with

 scala -howtorun:script a.scala alice.txt 

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

Clojure 282 strict

 (let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?< !\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[kv]s](p \| v \| k))) 

Somewhat more legibly:

 (let[[[_ m]:as s](->> (slurp *in*) .toLowerCase (re-seq #"\w+\b(?< !\bthe|and|of|to|a|i[tns]?|or)") frequencies (sort-by val >) (take 22)) [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s)) p #(do (print %1) (dotimes[_(* b %2)] (print \_)) (apply println %&))] (p " " m) (doseq[[kv] s] (p \| v \| k))) 

Scala, 368 chars

First, a legible version in 592 characters:

 object Alice {
   def main(args:Array[String]) {
     val s = io.Source.fromFile(args(0))
     val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
     val freqs = words.foldLeft(Map[String, Int]())((countmap, word) => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
     val sortedFreqs = freqs.toList.sort((a, b) => a._2 > b._2)
     val top22 = sortedFreqs.take(22)
     val highestWord = top22.head._1
     val highestCount = top22.head._2
     val widest = 76 - highestWord.length
     println(" " + "_" * widest)
     top22.foreach(t => {
       val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
       println("|" + "_" * width + "| " + t._1)
     })
   }
 }

The console output looks like this:

 $ scalac alice.scala
 $ scala Alice aliceinwonderland.txt
  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| at
 |______________________________| with
 |_____________________________| s
 |_____________________________| t
 |___________________________| on
 |__________________________| all
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

We can do some aggressive minifying and get it down to 415 characters:

 object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

 $ scalac a.scala
 $ scala A aliceinwonderland.txt
  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| at
 |______________________________| with
 |_____________________________| s
 |_____________________________| t
 |___________________________| on
 |__________________________| all
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

I’m sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

 object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x< -io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}} 

Legibly, at 375 characters:

 object Alice { def main(a:Array[String]) { val t = (Map[String, Int]() /: ( for ( x < - io.Source.fromFile(a(0)).getLines y <- "(?i)\\w+\\b(? c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22) val w = 76 - t.head._1.length print (" "+"_"*w) t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print) } } 

Java – 896 chars (earlier: 931, 1233 made unreadable, 1977 “uncompressed”)


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.

 import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

“Readable”:

 import java.util.*;
 import java.io.*;
 import static java.util.regex.Pattern.*;

 class g {
   public static void main(String[] a)throws Exception {
     PrintStream o = System.out;
     Map<String,Integer> w = new HashMap();
     Scanner s = new Scanner(new File(a[0]))
       .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));

     while(s.hasNext()) {
       String z = s.next().trim().toLowerCase();
       if(z.equals("")) continue;
       w.put(z,(w.get(z) == null?0:w.get(z))+1);
     }

     List<Integer> v = new Vector(w.values());
     Collections.sort(v);
     List<String> q = new Vector();
     int i,m;
     i = m = v.size()-1;

     while(q.size()<22) {
       for(String t:w.keySet())
         if(!q.contains(t)&&w.get(t).equals(v.get(i)))
           q.add(t);
       i--;
     }

     int r = 80-q.get(0).length()-4;
     String l = String.format("%1$0"+r+"d",0).replace("0","_");
     o.println(" "+l);
     o.println("|"+l+"| "+q.get(0)+" ");
     for(i = m-1; i > m-22; i--) {
       o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
     }
   }
 }

Output of Alice:

  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| with
 |______________________________| at
 |___________________________| on
 |__________________________| all
 |________________________| this
 |________________________| for
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

Output of Don Quixote (also from Gutenberg):

  ________________________________________________________________________
 |________________________________________________________________________| that
 |________________________________________________________| he
 |______________________________________________| for
 |__________________________________________| his
 |________________________________________| as
 |__________________________________| with
 |_________________________________| not
 |_________________________________| was
 |________________________________| him
 |______________________________| be
 |___________________________| don
 |_________________________| my
 |_________________________| this
 |_________________________| all
 |_________________________| they
 |________________________| said
 |_______________________| have
 |_______________________| me
 |______________________| on
 |______________________| so
 |_____________________| you
 |_____________________| quixote