Lexical Analyzer

Lexical analyzer converts stream of input characters into a stream of tokens. The different tokens that our lexical analyzer identifies are as follows:

KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break, return, typedef, void

IDENTIFIERS: main, fopen, getch etc

NUMBERS: positive and negative integers, positive and negative floating point numbers.

OPERATORS: +, ++, -, –, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.

BRACKETS: [ ], { }, ( ).

STRINGS : Set of characters enclosed within the quotes

COMMENT LINES: Ignores single  line, multi line comments

For tokenizing into identifiers and keywords we incorporate a symbol table which initially consists of predefined keywords. The tokens are read from an input file. If the encountered token is an identifier or a keyword the lexical analyzer will look up in the symbol table to check  the existence  of the respective token. If an entry does exist then we proceed to the next  token. If not then that particular  token along with the token value is written into the symbol table. The rest of the tokens are directly displayed by writing into an output file.

The output file will consist of all the tokens present in our input file along with their respective token values.

INTRODUCTION:

Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements Scanners or lexical analyzers are usually designed to recognize keywords , operators , and identifiers , as well as integers, floating point numbers , character strings , and other  similar items that are written as part of the source program . The exact set of  tokens to be recognized of course, depends upon the programming language being used to describe it.

A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens. Following are the list of functions that lexical analyzers perform.

Removal of white space and  comment :

Many languages allow “white space” to appear between tokens. Comments can likewise be ignored by the parser and translator , so they may also be treated as white space. If white space is eliminated by the lexical analyzer, the parser will never have to consider it.

Constants :

An integer constant is a sequence of digits, integer constants can be allowed by adding productions to the grammar for expressions, or by creating a token for such constants . The job of collecting digits into integers is generally given to a lexical analyzer because numbers can be treated as single units during translation. The lexical analyzer passes both the token and attribute to the parser.

Recognizing identifiers and keywords :

Languages use identifiers as names of variables, arrays and functions. A grammar for a language often treats an identifier as a token. Many languages use fixed character strings such as begin, end , if, and so on, as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers.

SYSTEM ANALYSIS:

The operating  system used is MS-DOS.

MS-DOS : The MS-DOS is a single user, single process, single processor operating system. Due to the confinement of the device dependent code into one layer, the porting of MS-DOS has theoretically been reduced to writing  of the BIOS code for the new hardware. At the command level it provides a hierarchical file system, input output redirection, pipes and filters. User written commands can be invoked in the same way as the standard system commands, giving the appearance as if the basic system functionality has been extended.

Being a single user system, it provides only file protection and access control. The disk space is allocated in terms of a cluster of consecutive sectors. The command language of MS-DOS has been designed to allow the user to interact directly with the operating system using a CRT terminal. The principal use of the command language is to initiate the execution of commands and programs. A user program can be executed under MS-DOS by simply typing the name of the executable file of the user program at the DOS command prompt.

The programming language used here is C programming

SYSTEM DESIGN:

Process:

The lexical analyzer is the first phase of  a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. This interaction, summarized schematically in fig. a.

fig a. Process of lexical analyzer
fig a. Process of lexical analyzer

Upon receiving a “get next token “command from the parser, the lexical analyzer reads the input characters until it can identify next token.

Sometimes , lexical analyzers are divided into a cascade of two phases, the first called “scanning”, and the second “lexical analysis”.

The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more complex operations.

The lexical analyzer which we have designed takes the input  from a input file. It reads one character at a time from the input file, and continues to read until end of the file is reached. It recognizes the valid identifiers, keywords and specifies the token values of the keywords.

It also identifies the header files, #define statements, numbers, special characters, various relational and logical operators, ignores the white spaces and comments. It prints the output in a separate file specifying the line number .

BLOCK DIAGRAM:

fig b. Block diagram of the project
fig b. Block diagram of the project
Editorial Team
Editorial Team

We are a group of young techies trying to provide the best study material for all Electronic and Computer science students. We are publishing Microcontroller projects, Basic Electronics, Digital Electronics, Computer projects and also c/c++, java programs.

27 thoughts on “Lexical Analyzer

  1. Hey can plz mail me the lex code to identify different tokens from a c code..
    thanx in advance..

  2. \\ Token Separation \\

    Write a program to identify and generate the tokens present in the given input
    #include
    #include
    #include
    #include

    int key = 0;
    char expr[100];
    char cont[][20]={“CONTROLS”,”for”,”do”,”while”,”NULL”,};
    char cond[][20]={“CONDITION”,”if”,”then”,”NULL”};
    char oprt[][20]={“OPERATOR”,”+”,”-“,”*”,”/”,”%”,”<","”,”>=”,”=”,”(“,”)”,”NULL”};
    char branch[][20]={“BRANCHING”,”goto”,”jump” ,”NULL”};
    void checking(char[],char[][20]);
    void main()
    {
    int i,j,l,k,m,n;
    char sbexpr[50],txt[3];
    clrscr();
    cout<=97 && sbexpr[m]=65 && sbexpr[m]<=90)))
    {
    cout<<"\n"<<sbexpr[m]<“<<"Identifier\n";
    key = 1;
    }
    }
    }
    if(key == 0)
    {
    cout<<"\n"<<sbexpr<“<<"Address\n";
    key = 1;
    }
    }

    getch();
    }

    void checking (char expr[],char check[][20])
    {
    for(int i=1;strcmp(check[i],"NULL")!=0;i++)
    {
    if(strcmp(expr,check[i])==0)
    {
    cout<<expr<“<<check[0]<<"\n";
    key = 1;
    }
    }
    }

  3. #include
    #include
    #include
    #include

    void Open_File();
    void Demage_Lexeme();
    int Search(char[256],int);
    void analyze();
    void Skip_Comment();
    void Read_String();
    void Is_Keyword_Or_Not();
    void Is_Identifier_Or_Not();
    void Is_Operator_Or_Not();
    void Read_Number();
    void Is_Special_Or_Not();
    void Is_Comparison_Or_Not();
    void Add_To_Lexical (char[256],int,char[256]);
    void Print_ST();
    void Print_TOKEN();
    void Token_Attribute();
    truct lexical
    {
    char data[256]; //Value of token.
    int line[256]; //Line # which token appear in input
    file.
    int times; //# of times that token appear in input
    file.
    char type[256]; //Type of each token.
    struct lexical *next;
    };

    typedef struct lexical Lex;
    typedef Lex *lex;

    /****************************************************************
    File pointer for accessing the file.
    *****************************************************************/

    FILE *fp;
    FILE *st;
    FILE *token;
    char lexeme[256],ch;
    int f,flag,line=1,i=1;
    lex head=NULL,tail=NULL;

    /****************************************************************
    Array holding all keywords for checking.
    *****************************************************************/

    char
    *keywords[]={“procedure”,”is”,”begin”,”end”,”var”,”cin”,”cout”,”if”,
    “then”,”else”,”and”,”or”,”not”,”loop”,”exit”,”when”,
    “while”,”until”};

    /****************************************************************
    Array holding all arithmetic operations for checking.
    *****************************************************************/

    char arithmetic_operator[]={‘+’,’-‘,’*’,’/’};

    /****************************************************************
    Array holding all comparison operations for checking.
    *****************************************************************/

    char *comparison_operator[]={“”,”=”,”<=","”,”>=”};

    /****************************************************************
    Array holding all special for checking.
    *****************************************************************/

    char special[]={‘%’,’!’,’@’,’~’,’$’};

    /****************************************************************

    **************
    *MAIN PROGRAM*
    **************

    *****************************************************************/

    void main()
    {
    Open_File();
    analyze();
    fclose(fp);
    Print_ST();
    Print_TOKEN();
    }

    /****************************************************************
    This function open input sourse file.
    *****************************************************************/

    void Open_File()
    {

    fp=fopen(“source.txt”,”r”); //provide path for source.txt here
    if(fp==NULL)
    {
    printf(“!!!Can’t open input file – source.txt!!!”);
    getch();
    exit(0);
    }
    }

    /****************************************************************
    Function to add item to structure of array to store data and
    information of lexical items.
    *****************************************************************/

    void Add_To_Lexical (char value[256],int line,char type[256])
    {
    lex new_lex;

    if (!Search(value,line)) //When return 1 the token not found.
    {

    new_lex=malloc(sizeof(Lex));

    if (new_lex!=NULL)
    {
    strcpy(new_lex->data,value);
    new_lex->line[0]=line;
    new_lex->times=1;
    strcpy(new_lex->type,type);
    new_lex->next=NULL;

    if (head==NULL)
    head=new_lex;
    else
    tail->next=new_lex;

    tail=new_lex;
    }
    }
    }

    /****************************************************************
    Function to search token.
    *****************************************************************/

    int Search (char value[256],int line)
    {
    lex x=head;
    int flag=0;

    while (x->next!=NULL && !flag)
    {
    if (strcmp(x->data,value)==0)
    {
    x->line[x->times]=line;
    x->times++;
    flag=1;
    }
    x=x->next;
    }
    return flag;
    }

    /****************************************************************
    Function to print the ST.TXT .
    *****************************************************************/

    void Print_ST()
    {
    lex x=head;
    int j;

    if ((st=fopen(“ST.TXT”,”w”))==NULL)
    printf(“The file ST.TXT cat not open.
    “);

    else

    {
    fprintf(st,” %s %s %s
    “,”Line#”,”Lexeme”,”Type”);
    fprintf(st,” —- —— —-
    “);

    while (x!=NULL)
    {
    if ((strcmp(x->type,”num”)==0) ||
    (strcmp(x->type,”keyword”)==0) ||
    (strcmp(x->type,”identifier”)==0))
    {
    fprintf(st,” “);

    for (j=0;jtimes;j++)
    {
    fprintf(st,”%d”,x->line[j]);
    if (j!=x->times-1) //This condition to prevent the comma
    fprintf(st,”,”,x->line[j]); //”,” to not print after last line #.
    }

    fprintf(st,” %-6s %-6s
    “,x->data,x->type);
    }
    x=x->next;
    }

    fclose(st);
    }
    }

    /****************************************************************
    Function to print the TOKENS.TXT .
    *****************************************************************/

    void Print_TOKEN()
    {
    int flag=0;

    fp=fopen(“source.txt”,”r”);

    if(fp==NULL)
    {
    printf(“!!!Can’t open input file – source.txt!!!”);
    getch();
    exit(0);
    }

    else

    {
    if ((token=fopen(“TOKENS.TXT”,”w”))==NULL)
    printf(“The file ST.TXT cat not open.
    “);

    else

    {
    ch=fgetc(fp);

    while (!(feof(fp)))
    {

    if (ch==’ ‘ && !flag)
    {
    do
    ch=fgetc(fp);
    while (ch==’ ‘);

    fseek(fp,-2,1);
    ch=fgetc(fp);
    flag=1;
    }

    if (ch!=’
    ‘ && ch!=’ ‘)
    fprintf(token,”%c”,ch);

    if (ch==’
    ‘)
    {
    fprintf(token,”
    “);
    Token_Attribute();
    i++;
    flag=0;
    }

    ch=fgetc(fp);
    }
    }
    }
    fclose(fp);
    fclose(token);
    }

    /****************************************************************
    Function to put the token and atrribute in TOKENS.TXT .
    *****************************************************************/

    void Token_Attribute()
    {
    lex x=head;
    int j;

    while (x!=NULL)
    {
    if (x->line[0]==i)
    {
    fprintf(token,”token : %-4s “,x->type);

    if ((strcmp(x->type,”num”)==0) ||
    (strcmp(x->type,”keyword”)==0) ||
    (strcmp(x->type,”identifier”)==0))

    {
    fprintf(token,”attribute : line#=%-4d
    “,i);
    }

    else

    {
    fprintf(token,”attribute : %-4s
    “,x->data);
    }

    }
    x=x->next;
    }
    fprintf(token,”
    “);
    }

    /****************************************************************
    Function to create lexical analysis.
    *****************************************************************/

    void analyze()
    {

    ch=fgetc(fp); //Read character.

    while(!feof(fp)) //While the file is not end.
    {

    if(ch==’
    ‘) //Compute # of lines in source.txt
    .
    {
    line++;
    ch=fgetc(fp);
    }

    if(isspace(ch) && ch==’
    ‘ )
    {
    line++;
    ch=fgetc(fp);
    }
    if(isspace(ch) && ch!=’
    ‘ ) //The character is space.
    ch=fgetc(fp);

    if(ch==’/’ || ch=='”‘) //Function for skipping comments in the
    file
    Skip_Comment(); //and ‘”‘ with display statements.

    if(isalpha(ch)) //The character is leter.
    {
    Read_String();
    Is_Keyword_Or_Not();
    Is_Operator_Or_Not();
    Is_Identifier_Or_Not();
    }

    if(isdigit(ch)) //The character is digit.
    Read_Number();

    if (ch==’;’) //The character is semicolon.
    Add_To_Lexical(“;”,line,”semicolon”);

    if (ch==’:’) //The character is colon.
    Add_To_Lexical(“:”,line,”colon”);

    if (ch==’,’) //The character is comma.
    Add_To_Lexical(“,”,line,”comma”);

    if (ch=='(‘) //The character is parenthesis.
    Add_To_Lexical(“(“,line,”parenthesis”);

    if (ch==’)’) //The character is parenthesis.
    Add_To_Lexical(“)”,line,”parenthesis”);

    //The character is comparison_operator
    if (ch==”)
    Is_Comparison_Or_Not();

    Is_Special_Or_Not(); //After failed scaning in before cases
    //check the character is special or not.
    Demage_Lexeme();

    if(isspace(ch) && ch==’
    ‘ )
    {
    line++;
    ch=fgetc(fp);
    }
    else
    ch=fgetc(fp);
    }
    }

    /****************************************************************
    This function read all character of strings.
    *****************************************************************/

    void Read_String()
    {
    int j=0;

    do
    {
    lexeme[j++]=ch;
    ch=fgetc(fp);
    } while(isalpha(ch));

    fseek(fp,-1,1);
    lexeme[j]=’

  4. Can you tell how many tokes the following statments consists of:-

    printf(“Test of %d tokens %d”,a,b);

    printf(“TEST”);

    ??

    i am bit confused weather %d will be treated as separate tokens or not?

    And for the literal weather the quotations (“) will be counted as tokens or not?

  5. plz send me the source code in lex to identify different parts of speech by making use of symbol table.

  6. 2.Write a program in LEX to count the no of:
    (i) positive and negative integers
    (ii) positive and negative fractions.
    For C and C++ source programs

  7. would u give an answer for this

    1. Write a program in LEX to count the no of:
    (i) positive and negative integers
    (ii) positive and negative fractions.
    For C and C++ source programs
    1. Write a LEX program to recognize a valid C and C++ programs if u can include if condition, loop.

  8. Kindly send me a program that will accept expressions from the user and define the keywords, identifiers, number, relational operator and etc.

    Sample output:
    Enter expression:
    scanf(“%s”, exp);

    scanf is a keyword
    ( is a punctuation
    ” is a punctuation
    %s
    ” is a punctuation
    , is a punctuation
    exp is an identifier
    ) is a punctuation
    ; is a punctuation

    mail me on joeann_espanol@yahoo.com

    thank’z

  9. kylangan ko din po ng program n lexical analyzer.. the same program sa taas..

    sample output:
    m=1* (a+v % 3);
    m is a variable
    = is an assignment operator
    1 is a number
    ( is a punctuation
    a is a variable
    + is an arithmetic operator
    v is a variable
    % is an arithmetic operator
    3 is a number
    ) is a punctuation
    ; is a separator

Leave a Reply

Your email address will not be published. Required fields are marked *

Get the latest updates on your inbox

Be the first to receive the latest updates from Codesdoc by signing up to our email subscription.

    StudentProjects.in